(Reposted from a message to the mecab-users mailing list)

I have released the CRF-trained model for mecab-ipadic.

http://code.google.com/p/mecab/downloads/detail?name=mecab-ipadic-2.7.0-20070801.model.bz2

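The model file makes two things possible:

1. Automatic cost estimation for words in a user dictionary
http://mecab.googlecode.com/svn/trunk/mecab/doc/dic.html
2. Retraining / domain adaptation of the model with a small amount of dictionary and training data
http://mecab.googlecode.com/svn/trunk/mecab/doc/learn.html#retrain

A concrete example of 2 follows. Because of a character-encoding constraint in the current training data, all files must be in EUC-JP. The current model file is also in EUC-JP. (I would very much like to switch to UTF-8 at some point.)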
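
Since every input file has to be EUC-JP, a file written in UTF-8 needs converting first. The following is a minimal Python sketch, assuming the source file is UTF-8; the function name to_euc_jp and the file names in the usage line are hypothetical and not part of MeCab.

```python
from pathlib import Path


def to_euc_jp(src, dst, src_encoding="utf-8"):
    """Re-encode a text file as EUC-JP.

    Raises UnicodeEncodeError if a character has no EUC-JP representation,
    instead of silently writing a corrupted file.
    """
    text = Path(src).read_text(encoding=src_encoding)
    # Write bytes directly so the output encoding is exactly EUC-JP.
    Path(dst).write_bytes(text.encode("euc_jp"))
```

For example, to_euc_jp("train.utf8", "train") would produce the EUC-JP file used in step 2.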

$WORK is the working directory used below.


1. Preparing mecab-ipadic and the model

% cd $WORK
% bzip2 -d mecab-ipadic-2.7.0-20070801.model.bz2
% tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
% ls
 mecab-ipadic-2.7.0-20070801 mecab-ipadic-2.7.0-20070801.model

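2. Creating the training data (file name: train)

Create the training data in the same format as MeCab's output (surface form, a tab, then the comma-separated features, with EOS closing each sentence), as shown below. Here we add the sentence-final particle 「なう」 and the auxiliary verb 「まーす」.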


京都    名詞,固有名詞,地域,一般,*,*,京都,キョウト,キョート
なう    助詞,終助詞,*,*,*,*,なう,ナウ,ナウ
EOS
ラーメン        名詞,一般,*,*,*,*,ラーメン,ラーメン,ラーメン
なう    助詞,終助詞,*,*,*,*,なう,ナウ,ナウ
EOS
行っ    動詞,自立,*,*,五段・カ行促音便,連用タ接続,行く,イッ,イッ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
き      動詞,非自立,*,*,カ変・クル,連用形,くる,キ,キ
まーす  助動詞,*,*,*,特殊・マス,基本形,まーす,マース,マース
EOS
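
Hand-written training lines are easy to get wrong, so it can help to check the file before training. The sketch below assumes the ipadic layout of nine comma-separated features after the tab; check_corpus is a hypothetical helper, not a MeCab tool, and it splits features on plain commas, so it does not handle quoted fields.

```python
def check_corpus(lines, n_features=9):
    """Return (sentence_count, errors) for a list of MeCab-format lines.

    Each token line must be "surface<TAB>f1,...,fN"; each sentence must be
    closed by a line containing only "EOS".
    """
    sentences, errors, open_tokens = 0, [], 0
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # blank lines are ignored by this checker
        if line == "EOS":
            if open_tokens == 0:
                errors.append((lineno, "EOS with no preceding tokens"))
            else:
                sentences += 1
            open_tokens = 0
            continue
        surface, tab, feature = line.partition("\t")
        if not tab or not surface:
            errors.append((lineno, "expected surface<TAB>features"))
        elif len(feature.split(",")) != n_features:
            errors.append((lineno, "expected %d features" % n_features))
        open_tokens += 1
    if open_tokens:
        errors.append((len(lines), "last sentence is not closed by EOS"))
    return sentences, errors
```

A typical call would read the file with open("train", encoding="euc_jp") and pass f.readlines(); an empty error list means every line had the expected shape.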


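3. Adding to the dictionary (mecab-ipadic-2.7.0-20070801/add.csv)
Write the new vocabulary into a new csv file under mecab-ipadic-2.7.0-20070801. The three numeric columns after the surface form (context IDs and cost) are left as 0 here; they are assigned when the dictionary is regenerated in step 5.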

なう,0,0,0,助詞,終助詞,*,*,*,*,なう,ナウ,ナウ
まーす,0,0,0,助動詞,*,*,*,特殊・マス,基本形,まーす,マース,マース
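
The rows in add.csv repeat the feature strings from the training data, with three 0 columns inserted after the surface form. A sketch of deriving such rows from the training lines, using a hypothetical helper seed_rows (not a MeCab tool); it assumes surfaces and features contain no commas that would need CSV quoting.

```python
def seed_rows(corpus_lines, new_surfaces):
    """Build add.csv rows "surface,0,0,0,features" for the given surfaces.

    Duplicate (surface, features) pairs are emitted once, in first-seen order.
    """
    seen, rows = set(), []
    for raw in corpus_lines:
        line = raw.rstrip("\r\n")
        if not line or line == "EOS":
            continue
        surface, tab, feature = line.partition("\t")
        if not tab or surface not in new_surfaces:
            continue
        if (surface, feature) not in seen:
            seen.add((surface, feature))
            # The three zeros are placeholders, as in the rows above.
            rows.append("%s,0,0,0,%s" % (surface, feature))
    return rows
```

Called with the training lines and the set {"なう", "まーす"}, this yields one row per new word even though 「なう」 occurs in two sentences.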

4. Running the training
First compile the dictionary containing the new vocabulary with mecab-dict-index, then train using the new dictionary and the new corpus.

% /usr/local/libexec/mecab/mecab-dict-index -f euc-jp -t euc-jp -d mecab-ipadic-2.7.0-20070801 -o mecab-ipadic-2.7.0-20070801
reading mecab-ipadic-2.7.0-20070801/unk.def ... 40
emitting double-array: 100% |###########################################|
mecab-ipadic-2.7.0-20070801/model.def is not found. skipped.
reading mecab-ipadic-2.7.0-20070801/Noun.adjv.csv ... 3328
reading mecab-ipadic-2.7.0-20070801/Verb.csv ... 130750
reading mecab-ipadic-2.7.0-20070801/Noun.demonst.csv ... 120
reading mecab-ipadic-2.7.0-20070801/Suffix.csv ... 1393
reading mecab-ipadic-2.7.0-20070801/Noun.others.csv ... 151
reading mecab-ipadic-2.7.0-20070801/Adj.csv ... 27210
reading mecab-ipadic-2.7.0-20070801/Conjunction.csv ... 171
reading mecab-ipadic-2.7.0-20070801/Noun.name.csv ... 34202
reading mecab-ipadic-2.7.0-20070801/Postp.csv ... 146
reading mecab-ipadic-2.7.0-20070801/Interjection.csv ... 252
reading mecab-ipadic-2.7.0-20070801/Adverb.csv ... 3032
reading mecab-ipadic-2.7.0-20070801/Adnominal.csv ... 135
reading mecab-ipadic-2.7.0-20070801/Noun.nai.csv ... 42
reading mecab-ipadic-2.7.0-20070801/Noun.csv ... 60477
reading mecab-ipadic-2.7.0-20070801/Prefix.csv ... 221
reading mecab-ipadic-2.7.0-20070801/Noun.verbal.csv ... 12146
reading mecab-ipadic-2.7.0-20070801/Postp-col.csv ... 91
reading mecab-ipadic-2.7.0-20070801/Noun.place.csv ... 72999
reading mecab-ipadic-2.7.0-20070801/Symbol.csv ... 208
reading mecab-ipadic-2.7.0-20070801/add.csv ... 2
reading mecab-ipadic-2.7.0-20070801/Others.csv ... 2
reading mecab-ipadic-2.7.0-20070801/Noun.org.csv ... 16668
reading mecab-ipadic-2.7.0-20070801/Filler.csv ... 19
reading mecab-ipadic-2.7.0-20070801/Noun.adverbal.csv ... 795
reading mecab-ipadic-2.7.0-20070801/Noun.number.csv ... 42
reading mecab-ipadic-2.7.0-20070801/Auxil.csv ... 199
reading mecab-ipadic-2.7.0-20070801/Noun.proper.csv ... 27327
emitting double-array: 100% |###########################################|
reading mecab-ipadic-2.7.0-20070801/matrix.def ... 1316x1316
emitting matrix      : 100% |###########################################|

done!

% /usr/local/libexec/mecab/mecab-cost-train -M mecab-ipadic-2.7.0-20070801.model -d mecab-ipadic-2.7.0-20070801 train new_model
Using previous model: mecab-ipadic-2.7.0-20070801.model
--cost --freq and --eta options are overwritten.
reading corpus ...
Number of sentences: 3
Number of features:  1029250
eta:                 0.00005
freq:                1
eval-size:           8
unk-eval-size:       4
threads:             1
charset:             euc-jp
C(sigma^2):          1.00000

iter=0 err=0.00000 F=1.00000 target=0.68291 diff=1.00000
iter=1 err=0.00000 F=1.00000 target=0.52948 diff=0.22467
iter=2 err=0.00000 F=1.00000 target=0.34616 diff=0.34623
iter=3 err=0.00000 F=1.00000 target=0.39982 diff=0.15501
iter=4 err=0.00000 F=1.00000 target=0.18924 diff=0.52668
iter=5 err=0.00000 F=1.00000 target=0.18608 diff=0.01672
iter=6 err=0.00000 F=1.00000 target=0.18260 diff=0.01866
iter=7 err=0.00000 F=1.00000 target=0.18253 diff=0.00039
iter=8 err=0.00000 F=1.00000 target=0.18253 diff=0.00003
iter=9 err=0.00000 F=1.00000 target=0.18252 diff=0.00001
iter=10 err=0.00000 F=1.00000 target=0.18252 diff=0.00000

Done! writing model file ...
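
In the log above, target settles at about 0.1825 and diff reaches 0.00000 at iter=10. To inspect those values from a saved log, a small parser is enough. This is only a sketch: parse_iters is a hypothetical name, and the line format is inferred solely from the output shown here.

```python
import re

# Matches lines such as:
# iter=4 err=0.00000 F=1.00000 target=0.18924 diff=0.52668
ITER_RE = re.compile(
    r"iter=(\d+) err=([\d.]+) F=([\d.]+) target=([\d.]+) diff=([\d.]+)"
)


def parse_iters(log_text):
    """Extract (iter, err, F, target, diff) tuples from the training output."""
    rows = []
    for m in ITER_RE.finditer(log_text):
        values = tuple(float(g) for g in m.groups()[1:])
        rows.append((int(m.group(1)),) + values)
    return rows
```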


5. Building the analysis dictionary
Build a new dictionary and connection matrix from the new model. The dictionary is written to the new_dic directory.

% /usr/local/libexec/mecab/mecab-dict-gen -d mecab-ipadic-2.7.0-20070801 -o new_dic -m new_model
new_model is not a binary model. reopen it as text mode...
reading mecab-ipadic-2.7.0-20070801/unk.def ... 40
reading mecab-ipadic-2.7.0-20070801/Noun.adjv.csv ... 3328
reading mecab-ipadic-2.7.0-20070801/Verb.csv ... 130750
reading mecab-ipadic-2.7.0-20070801/Noun.demonst.csv ... 120
reading mecab-ipadic-2.7.0-20070801/Suffix.csv ... 1393
reading mecab-ipadic-2.7.0-20070801/Noun.others.csv ... 151
reading mecab-ipadic-2.7.0-20070801/Adj.csv ... 27210
reading mecab-ipadic-2.7.0-20070801/Conjunction.csv ... 171
reading mecab-ipadic-2.7.0-20070801/Noun.name.csv ... 34202
reading mecab-ipadic-2.7.0-20070801/Postp.csv ... 146
reading mecab-ipadic-2.7.0-20070801/Interjection.csv ... 252
reading mecab-ipadic-2.7.0-20070801/Adverb.csv ... 3032
reading mecab-ipadic-2.7.0-20070801/Adnominal.csv ... 135
reading mecab-ipadic-2.7.0-20070801/Noun.nai.csv ... 42
reading mecab-ipadic-2.7.0-20070801/Noun.csv ... 60477
reading mecab-ipadic-2.7.0-20070801/Prefix.csv ... 221
reading mecab-ipadic-2.7.0-20070801/Noun.verbal.csv ... 12146
reading mecab-ipadic-2.7.0-20070801/Postp-col.csv ... 91
reading mecab-ipadic-2.7.0-20070801/Noun.place.csv ... 72999
reading mecab-ipadic-2.7.0-20070801/Symbol.csv ... 208
reading mecab-ipadic-2.7.0-20070801/add.csv ... 2
reading mecab-ipadic-2.7.0-20070801/Others.csv ... 2
reading mecab-ipadic-2.7.0-20070801/Noun.org.csv ... 16668
reading mecab-ipadic-2.7.0-20070801/Filler.csv ... 19
reading mecab-ipadic-2.7.0-20070801/Noun.adverbal.csv ... 795
reading mecab-ipadic-2.7.0-20070801/Noun.number.csv ... 42
reading mecab-ipadic-2.7.0-20070801/Auxil.csv ... 199
reading mecab-ipadic-2.7.0-20070801/Noun.proper.csv ... 27327
emitting new_dic/left-id.def/ new_dic/right-id.def
emitting new_dic/unk.def ... 40
emitting new_dic/Noun.adjv.csv ... 3328
emitting new_dic/Verb.csv ... 130750
emitting new_dic/Noun.demonst.csv ... 120
emitting new_dic/Suffix.csv ... 1393
emitting new_dic/Noun.others.csv ... 151
emitting new_dic/Adj.csv ... 27210
emitting new_dic/Conjunction.csv ... 171
emitting new_dic/Noun.name.csv ... 34202
emitting new_dic/Postp.csv ... 146
emitting new_dic/Interjection.csv ... 252
emitting new_dic/Adverb.csv ... 3032
emitting new_dic/Adnominal.csv ... 135
emitting new_dic/Noun.nai.csv ... 42
emitting new_dic/Noun.csv ... 60477
emitting new_dic/Prefix.csv ... 221
emitting new_dic/Noun.verbal.csv ... 12146
emitting new_dic/Postp-col.csv ... 91
emitting new_dic/Noun.place.csv ... 72999
emitting new_dic/Symbol.csv ... 208
emitting new_dic/add.csv ... 2
emitting new_dic/Others.csv ... 2
emitting new_dic/Noun.org.csv ... 16668
emitting new_dic/Filler.csv ... 19
emitting new_dic/Noun.adverbal.csv ... 795
emitting new_dic/Noun.number.csv ... 42
emitting new_dic/Auxil.csv ... 199
emitting new_dic/Noun.proper.csv ... 27327
emitting matrix      : 100% |###########################################|
copying mecab-ipadic-2.7.0-20070801/char.def to new_dic/char.def
copying mecab-ipadic-2.7.0-20070801/rewrite.def to new_dic/rewrite.def
copying mecab-ipadic-2.7.0-20070801/dicrc to new_dic/dicrc
copying mecab-ipadic-2.7.0-20070801/feature.def to new_dic/feature.def
copying new_model to new_dic/model.def

done!

6. Compiling the new dictionary

% /usr/local/libexec/mecab/mecab-dict-index -f euc-jp -t utf8 -d new_dic -o new_dic
new_dic/pos-id.def is not found. minimum setting is used
reading new_dic/unk.def ... 40
emitting double-array: 100% |###########################################|
new_dic/pos-id.def is not found. minimum setting is used
reading new_dic/Noun.adjv.csv ... 3328
reading new_dic/Verb.csv ... 130750
reading new_dic/Noun.demonst.csv ... 120
reading new_dic/Suffix.csv ... 1393
reading new_dic/Noun.others.csv ... 151
reading new_dic/Adj.csv ... 27210
reading new_dic/Conjunction.csv ... 171
reading new_dic/Noun.name.csv ... 34202
reading new_dic/Postp.csv ... 146
reading new_dic/Interjection.csv ... 252
reading new_dic/Adverb.csv ... 3032
reading new_dic/Adnominal.csv ... 135
reading new_dic/Noun.nai.csv ... 42
reading new_dic/Noun.csv ... 60477
reading new_dic/Prefix.csv ... 221
reading new_dic/Noun.verbal.csv ... 12146
reading new_dic/Postp-col.csv ... 91
reading new_dic/Noun.place.csv ... 72999
reading new_dic/Symbol.csv ... 208
reading new_dic/add.csv ... 2
reading new_dic/Others.csv ... 2
reading new_dic/Noun.org.csv ... 16668
reading new_dic/Filler.csv ... 19
reading new_dic/Noun.adverbal.csv ... 795
reading new_dic/Noun.number.csv ... 42
reading new_dic/Auxil.csv ... 199
reading new_dic/Noun.proper.csv ... 27327
emitting double-array: 100% |###########################################|
reading new_dic/matrix.def ... 1318x1318
emitting matrix      : 100% |###########################################|

done!

7. Analysis

% echo 六本木なう | mecab -d new_dic
六本木  名詞,固有名詞,地域,一般,*,*,六本木,ロッポンギ,ロッポンギ
なう    助詞,終助詞,*,*,*,*,なう,ナウ,ナウ
EOS


% echo そんなこと知ってまーす | mecab -d new_dic
そんな  連体詞,*,*,*,*,*,そんな,ソンナ,ソンナ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
知っ    動詞,自立,*,*,五段・ラ行,連用タ接続,知る,シッ,シッ
て      助詞,接続助詞,*,*,*,*,て,テ,テ
まーす  助動詞,*,*,*,特殊・マス,基本形,まーす,マース,マース
EOS


Both 「なう」 and 「まーす」 are function words, so they are lexicalized (each gets its own context ID), and the connection matrix has grown from 1316 to 1318.


% head -2 mecab-ipadic-2.7.0-20070801/matrix.def
1316 1316
0 0 -434

% head -2 new_dic/matrix.def
1318 1318
0 0 -260
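
The same comparison can be scripted. As the head output shows, the first line of matrix.def holds the two dimensions of the connection matrix; matrix_size is a hypothetical helper that reads just that line.

```python
def matrix_size(path):
    """Return the two sizes on the first line of a matrix.def file as ints."""
    with open(path, encoding="ascii") as f:
        first, second = f.readline().split()
    return int(first), int(second)
```

Comparing matrix_size("mecab-ipadic-2.7.0-20070801/matrix.def") with matrix_size("new_dic/matrix.def") gives (1316, 1316) and (1318, 1318) for the files shown here, a difference of 2 that matches the two added words.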