edict2 multi radical search

How to search the edict2 file for entire compounds matching radicals instead of just kanji.

For identifying single kanji in a picture I use this site:

http://jisho.org/kanji/radicals/

But with compounds you can save some time: take all the possible compounds for one kanji you know and drop those whose other kanji do not match the radicals you can see. That way you don’t have to go through several pages of compounds or look up the other kanji.

For instance if you see 相違点, and you already know 点, and you clearly see the radicals: “木目” “込”, then the combination of those three elements leaves only the compound 相違点.

First get edict2. It is updated often, so save the commands.

wget -O - 'ftp://ftp.monash.edu.au/pub/nihongo/edict2.gz' | gunzip | iconv -f eucjp -t utf8 > edict2.utf
wget -O - 'ftp://ftp.monash.edu.au/pub/nihongo/radkfile.gz' | gunzip | iconv -f eucjp -t utf8 > radkfile.utf
wget -O - 'ftp://ftp.monash.edu.au/pub/nihongo/kradfile.gz' | gunzip | iconv -f eucjp -t utf8 > kradfile.utf
for strokes in $(seq 1 17); do printf '%s\t%s\n' "$strokes" "$(grep "^[$] [^ ]* $strokes\b" radkfile.utf | cut -d' ' -f2 | xargs echo)"; done > radicals_ordered.txt

The commands above also fetch the radical databases (radical to kanji and kanji to radical) and build a text file of radicals, grouped by stroke count, for lookup. It looks like this:

1 一 | 丶 ノ 乙 亅
2 二 亠 人 化 个 儿 入 ハ 并 冂 冖 冫 几 凵 刀 刈 力 勹 匕 匚 十 卜 卩 厂 厶 又 マ 九 ユ 乃
3 込 口 囗 土 士 夂 夕 大 女 子 宀 寸 小 尚 尢 尸 屮 山 川 巛 工 已 巾 干 幺 广 廴 廾 弋 弓 ヨ 彑 彡 彳 忙 扎 汁 犯 艾 邦 阡 也 亡 及 久
4 老 心 戈 戸 手 支 攵 文 斗 斤 方 无 日 曰 月 木 欠 止 歹 殳 比 毛 氏 气 水 火 杰 爪 父 爻 爿 片 牛 犬 礼 王 元 井 勿 尤 五 屯 巴 毋
5 玄 瓦 甘 生 用 田 疋 疔 癶 白 皮 皿 目 矛 矢 石 示 禹 禾 穴 立 初 世 巨 冊 母 買 牙
6 瓜 竹 米 糸 缶 羊 羽 而 耒 耳 聿 肉 自 至 臼 舌 舟 艮 色 虍 虫 血 行 衣 西
7 臣 見 角 言 谷 豆 豕 豸 貝 赤 走 足 身 車 辛 辰 酉 釆 里 舛 麦
8 金 長 門 隶 隹 雨 青 非 奄 岡 免 斉
9 面 革 韭 音 頁 風 飛 食 首 香 品
10 馬 骨 高 髟 鬥 鬯 鬲 鬼 竜 韋
11 魚 鳥 鹵 鹿 麻 亀 滴 黄 黒
12 黍 黹 無 歯
13 黽 鼎 鼓 鼠
14 鼻 齊
15
16
17 龠
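
Each row is collected from the header lines in radkfile, which start with `$` and carry the radical and its stroke count (some have an extra code after that). As a minimal sketch, assuming that layout, the same extraction can be tried on a few hand-written sample lines instead of the real file. The sample data and the function name radicals_with_strokes are made up for illustration.

```shell
#!/bin/bash
# Hand-written sample in radkfile's layout, NOT the real file:
# "$ <radical> <strokes> [code]" header lines, each followed by kanji lines.
sample='$ 一 1
一七万
$ 二 2
二仁
$ 化 2 js01
化仏
$ 口 3
口古'

# Hypothetical helper: read radkfile-style text on stdin and print the
# radicals whose header line carries the given stroke count.
radicals_with_strokes() {
    awk -v s="$1" '
        $1 == "$" && $3 == s { printf "%s%s", sep, $2; sep = " " }
        END { print "" }'
}

for strokes in 1 2 3; do
    printf '%s\t%s\n' "$strokes" \
        "$(printf '%s\n' "$sample" | radicals_with_strokes "$strokes")"
done
# prints three tab-separated rows: "1 一", "2 二 化", "3 口"
```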

Finally the script

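#! /bin/bash
# For each argument, build a group [...] of the kanji in kradfile.utf that contain all of its radicals.
kanjigroups=$(for radi in "$@"; do sed -n "$(echo "$radi" | sed 's~.~s/&/\&/\;T\;~g;s~$~p~')" kradfile.utf | cut -d' ' -f1 | xargs echo | sed 's/^/[/;s/$/]/;s/ //g'; done)
# Stop if any argument matched no kanji at all.
echo "$kanjigroups" | while read line; do [ "$line" = "[]" ] && echo a group of radicals did not match && exit 1; : ;done || exit 1
# Print the edict2 lines that match every group (T is a GNU sed command).
sed -n "$(echo "$kanjigroups" | sed 's~.*~s/&/\&/\;T\;~g;$ap' | xargs echo)" edict2.utf | sort | uniq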

For every argument, the script constructs a regexp of the form [...] holding the kanji that match the argument's radicals. If an argument has multiple radicals, only the kanji that match all of them are included.
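
The group building can be watched in slow motion on a small sample. This is only a sketch: it assumes kradfile lines in the `kanji : radical radical ...` layout, the sample decompositions are hand-written and may differ from the real file, and the function name kanji_group is made up. Unlike the script, it takes the radicals as separate words and compares whole fields with awk instead of generating a sed program.

```shell
#!/bin/bash
# Hand-written sample in kradfile's "kanji : radical radical ..." layout.
# The decompositions are illustrative and may differ from the real file.
sample='相 : 木 目
想 : 木 目 心
林 : 木
点 : 卜 口 杰'

# Hypothetical helper: read kradfile-style text on stdin and print one
# [...] group with every kanji whose radicals include all the arguments.
kanji_group() {
    local IFS=' '
    awk -v want="$*" '
        BEGIN { n = split(want, r, " "); printf "[" }
        {
            hits = 0
            # Fields 3..NF are the radicals that follow "kanji :".
            for (i = 1; i <= n; i++)
                for (f = 3; f <= NF; f++)
                    if ($f == r[i]) { hits++; break }
            if (hits == n) printf "%s", $1
        }
        END { print "]" }'
}

printf '%s\n' "$sample" | kanji_group 木 目   # prints [相想]
printf '%s\n' "$sample" | kanji_group 木 心   # prints [想]
printf '%s\n' "$sample" | kanji_group 木 金   # prints [], the case the script rejects
```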

That last bit was hard because grep can’t do
echo abc | grep a | grep b | grep c
conveniently, or can it?

So I used a cascading sed check with s…;T;p (T is a GNU sed command that skips to the end when the preceding s did not match). Here is a small example that will see ab does not have all of abc:
echo ab | sed -n "$(echo abc | sed 's~.~s/&/\&/\;T\;~g;s~$~p~')"
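
To see what the cascade expands to, the generated sed program can be printed before it is used. The sketch below does that and, for comparison, adds one possible grep-only variant: a small recursive function (all_of is a made-up name) that chains one fixed-string grep per argument.

```shell
#!/bin/bash
# Generate the cascade for the characters of "abc" and show it.
prog=$(echo abc | sed 's~.~s/&/\&/\;T\;~g;s~$~p~')
echo "$prog"                 # prints s/a/&/;T;s/b/&/;T;s/c/&/;T;p

# The final p is only reached when every s command matched.
echo ab  | sed -n "$prog"    # prints nothing
echo cab | sed -n "$prog"    # prints cab

# Hypothetical grep-only alternative: keep the input lines that contain
# every argument, by piping through one "grep -F" per argument.
all_of() {
    if [ "$#" -eq 0 ]; then
        cat
    else
        local first=$1
        shift
        grep -F -e "$first" | all_of "$@"
    fi
}

printf 'ab\ncab\nabcd\n' | all_of a b c   # prints cab and abcd
```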

Finally all the “OR” groups are “AND”ed together and checked against edict2. The check runs on the entire edict2 line, so the definitions are printed and searched too. The first is convenient, the second can produce unwanted results.
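
If hits inside the definitions get in the way, one option is to test only the headword field, which is the text before the first space on an edict2 line. A minimal sketch, assuming each group is passed as a space-separated list of kanji (the form the groups have before the script strips the spaces). The name headword_match, the example groups, and the third sample line are made up for illustration.

```shell
#!/bin/bash
# Two entries quoted in this post plus one INVENTED line whose only
# kanji match sits in the definition.
sample='帰順 [きじゅん] /(n,vs) submission/return to allegiance/
順列 [じゅんれつ] /(n) permutation/
ならび [ならび] /(n) invented sample entry (See 順列)/'

# Hypothetical helper: print the lines whose first field contains at
# least one kanji from every group. Each argument is one group.
headword_match() {
    local IFS=','
    awk -v groups="$*" '
        BEGIN { n = split(groups, g, ",") }
        {
            ok = 1
            for (i = 1; i <= n && ok; i++) {
                m = split(g[i], k, " ")
                ok = 0
                for (j = 1; j <= m; j++)
                    if (index($1, k[j]) > 0) { ok = 1; break }
            }
            if (ok) print
        }'
}

# Arbitrary example groups; prints only the 順列 entry.
printf '%s\n' "$sample" | headword_match "順 頒" "列 刑"
```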

Usage:
Save the script as radicomp, make it executable, and make sure it is in the same directory as the dictionaries.
./radicomp 木目 込 | grep 点

ALTERNATIVE

#! /bin/bash
kanjigroups=$(for radi in "$@"; do sed -n "$(echo "$radi" | sed 's~.~s/&/\&/\;T\;~g;s~$~p~')" kradfile.utf | cut -d' ' -f1 | xargs echo | sed 's/^/[/;s/$/]/;s/ //g'; done)
echo "$kanjigroups" | while read line; do [ "$line" = "[]" ] && echo a group of radicals did not match && exit 1; : ;done || exit 1
# Join the groups into one pattern, [..][..], and match it as a whole word.
grep "\b$(echo $kanjigroups | sed 's/ //g')\b" edict2.utf | sort | uniq

This script matches the arguments to kanji strictly: the compound must have exactly one kanji per argument, in the order given, and nothing more.
Example:
$ ./radicomp 川貝 刈
順列 [じゅんれつ] /(n) permutation/

So that would save some time in most cases.
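
The difference between the two variants is easier to see with ASCII stand-ins for the kanji groups. In this sketch the groups [ab] and [cd] and the word list are made up: the loose check accepts any line with one character from each group, while the strict pattern wants a word made of exactly those two characters in order.

```shell
#!/bin/bash
# Made-up ASCII stand-ins: two groups and a few "compounds".
groups='[ab]
[cd]'
words='ac
ca
xac
acd
bd
ab'

# Loose: every group must match somewhere on the line (the sed cascade).
loose=$(printf '%s\n' "$words" | sed -n 's/[ab]/&/;T;s/[cd]/&/;T;p')

# Strict: join the groups into one pattern and anchor it with \b.
pattern="\b$(printf '%s' "$groups" | tr -d '\n')\b"
strict=$(printf '%s\n' "$words" | grep "$pattern")

echo "$pattern"               # prints \b[ab][cd]\b
echo "$loose"  | xargs echo   # prints ac ca xac acd bd
echo "$strict" | xargs echo   # prints ac bd
```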

For comparison, this is what the old method returns:

$ ./radicomp 川貝 刈
ハイレベルデータリンク制御手順 [ハイレベルデータリンクせいぎょてじゅん] /(n) {comp} High-level Data Link Control/HDLC/
パケット順序制御 [パケットじゅんじょせいぎょ] /(n) {comp} packet sequencing/
割付け順番 [わりつけじゅんばん] /(n) {comp} sequential layout order/
単一型順序列型 [たんいつがたじゅんじょれつがた] /(n) {comp} sequence-of type/
呼制御手順 [こせいぎょてじゅん] /(n) {comp} call control procedure/
帰順 [きじゅん] /(n,vs) submission/return to allegiance/
文字順列 [もじじゅんれつ] /(n) {comp} character sequence/
昇順整列 [しょうじゅんせいれつ] /(n) {comp} sort (in ascending order)/
順列 [じゅんれつ] /(n) permutation/
順序列型 [じゅんじょれつがた] /(n) {comp} sequence type/
順序制御 [じゅんじょせいぎょ] /(n) {comp} sequencing/
順番列 [じゅんばんれつ] /(n) {comp} sequence/

In addition, for single kanji that don’t turn up as entries in edict2, like 待, the \b[待]\b search will fail, so you could search kanjidic too.


wget -O - 'ftp://ftp.monash.edu.au/pub/nihongo/kanjidic.gz' | gunzip | iconv -f eucjp -t utf8 > kanjidic.utf

And in the script, replace the last line with:
grep -h "\b$(echo $kanjigroups | sed 's/ //g')\b" edict2.utf kanjidic.utf | sort | uniq


3 Responses to edict2 multi radical search

  1. procyon says:

    I added a bit of fool-proofing for entering an invalid combination of radicals.
    It came down to the line:
    echo "$kanjigroups" | while read line; do [ "$line" = "[]" ] && echo a group of radicals did not match && exit 1; : ;done || exit 1

    Without it the argument itself would be ignored.

  2. procyon says:

    Alternatively, it is preferable to be stricter when searching edict2: only compounds with exactly as many kanji as there are arguments, with every argument matching one kanji, in order.
    You can do that by replacing the last line with:
    grep "\b$(echo $kanjigroups | sed 's/ //g')\b" edict2.utf | sort | uniq

    It also seems to work with just a dot.

  3. procyon says:

    Splitting things up further… For even more control you can use the script:

    #! /bin/bash
    kanjigroups=$(for radi in "$@"; do sed -n "$(echo "$radi" | sed 's~.~s/&/\&/\;T\;~g;s~$~p~')" kradfile.utf | cut -d' ' -f1 | xargs echo | sed 's/^/[/;s/$/]/;s/ //g'; done | tr -d '\n' | sed 's/ //g')
    echo "$kanjigroups" | grep -qF '[]' && echo a group of radicals did not match >&2 && echo XXX && exit 1
    echo "$kanjigroups"

    And use it like this:

    $ grep "\b$(./radical_to_regex 田)$(kana i)$(./radical_to_regex 刀)$(kana ri)\b" edict2.utf
    思い切り(P);思いっきり(P);思いっ切り;思いきり [おもいきり(思い切り;思いきり)(P);おもいっきり(思いっきり;思いっ切り)(P)] /(adv,n) with all one's strength/with all one's heart/resignation/resolution/(P)/

    Using multiple arguments to radical_to_regex will produce kanji choices in a row.

    The kana script is in ‘handy command line programs’ in the Forums.
