JACABIT Japanese term extraction system (under construction)
What is this?
This is free software for extracting Japanese terms from plain text on the basis of POS-based morphological patterns.
Copyright (c) 2002-2009 Koichi Takeuchi
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE AUTHORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Distribution of Japanese ACABIT
Japanese ACABIT 2009 (Java version; now in preparation).
Currently we distribute Japanese ACABIT 2004: j_acabit1.1.tar.gz (17 Sep 2004)
Please extract the files with:
$ tar xvfz j_acabit1.1.tar.gz
Please refer to the following instructions.
What is Japanese ACABIT1.1 (2004)?
JACABIT1.1 (2004) is a pattern-based Japanese term extraction system. The base program code came from ACABIT in 2004, which is a pattern-based French term extraction system. ACABIT was produced by Prof. Daille, and JACABIT is the result of collaborative work between Koichi Takeuchi, Kyo Kageura and Beatrice Daille. The latest versions of both ACABIT and JACABIT are now available. Please refer to them on the Web site (here will come Daille’s Web site).
How to use Japanese ACABIT1.1 (2004)
JACABIT is a POS-based term extraction system, so the user needs to install the Japanese POS tagger ChaSen.
For details, see “How to install ChaSen” below.
First of all, we would like to give a rough sketch of how to use JACABIT; later we will explain the points needed to make ChaSen and JACABIT work together (see the later section on the Japanese character code problem). JACABIT consists of two parts. The first is a preprocessor (postagger.pl) that annotates plain text with POS tags. The second is the body of JACABIT (jp_stat.pl and other definition files) that extracts terms using POS-based patterns.
- Install ChaSen (See below)
- Install a CPAN library, Unicode::Japanese
- $ ./postagger.pl inputtext > POS_tagged_text(xml)
- $ jp_stat.pl POS_tagged_text(xml)
- You will obtain the output file temp.txt, which contains term candidates with log-likelihood values.
The sample files sample_of_input_postagger.txt and sample_of_input_jp_stat.txt are in j_acabit1.0.tar.gz, so you can see the format of the input and output files.
How to install the CPAN library
- $ su // become super user.
- # perl -MCPAN -e shell
- # …. // the shell asks you some questions.
- cpan> // finally you get the cpan command prompt.
- cpan> install Unicode::Japanese
- cpan> exit
- Note: Each line of the input text for postagger.pl should contain one sentence. Japanese characters in the input text should be encoded in UTF-8.
- Note2: In the output files (temp.txt, temp2.txt, …) the Japanese character encoding is UTF-8, but in the intermediate file, i.e. POS_tagged_text(xml), the Japanese character encoding is EUC-JP. One reason for this change of encoding is the morphological analyzer ChaSen, which normally accepts only EUC-JP (recently UTF-8 has become available in ChaSen). Besides, EUC-JP caused fewer problems for pattern matching of Japanese characters in the Perl module ./jp_stat.pl at that time. If you want to read EUC-JP-coded Japanese text, please convert the encoding with the ‘lv’ or ‘nkf’ command on a Linux system, or use a Web browser (such as Firefox).
- Note3: temp.txt is not sorted by log-likelihood score. Please apply the sort command when you want to obtain sorted term candidates.
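The sorting mentioned in Note3 can also be sketched in a few lines of Python. This is only an illustration and not part of the distribution; the function name sort_by_loglikelihood is our own, and it assumes that the log-likelihood is the first whitespace-separated field of each temp.txt line, as in the examples below.

```python
# Sort JACABIT's temp.txt lines by the log-likelihood in the first column,
# highest first; equivalent in effect to a numeric reverse UNIX sort.

def sort_by_loglikelihood(lines):
    """Return lines sorted by descending log-likelihood (first field)."""
    return sorted(lines, key=lambda line: float(line.split()[0]), reverse=True)

lines = [
    "1001.433 語彙 構造 4 2 (1.386/4.852) 17 * 語彙 意味 構造 (1) * 語彙 概念 構造 (1)",
    "1002.419 サ変 名詞 1 1 (0/2.503) 34 * サ変 名詞 (1)",
]
for line in sort_by_loglikelihood(lines):
    print(line)  # the 1002.419 line now comes first
```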
Format of output file (temp.txt)
An example of output file (temp.txt) is below:
1002.419 サ変 名詞 1 1 (0/2.503) 34 * サ変 名詞 (1)
1001.433 語彙 構造 4 2 (1.386/4.852) 17 * 語彙 意味 構造 (1) * 語彙 概念 構造 (1)
1000.666 項 関係 6 2 (1.91/4.852) 14 * 項 関係 (2)
The output file logically consists of four parts, detailed in the following table.
|log-likelihood||base two morphemes||intermediate calculation results||terms with (number of occurrences)|
|1002.419||サ変 名詞||1 1 (0/2.503) 34||* サ変 名詞 (1)|
|1001.433||語彙 構造||4 2 (1.386/4.852) 17||* 語彙 意味 構造 (1) * 語彙 概念 構造 (1)|
|1000.666||項 関係||6 2 (1.91/4.852) 14||* 項 関係 (2)|
The first column is the log-likelihood (e.g. ‘1002.419’), which indicates how strongly the two base words in a term are connected. (Note that JACABIT does not sort temp.txt by log-likelihood. If you want it sorted, please apply the UNIX sort command to temp.txt.)
The second column shows the two base words “サ変 名詞” that make up the extracted term candidate “サ変 名詞”.
The third column contains intermediate numbers for the calculation of the log-likelihood (e.g. ‘1 1 (0/2.503) 34’).
The fourth column, e.g. “* サ変 名詞 (1)”, shows term candidates with their number of occurrences. In this case, “サ変 名詞” occurs once in the corpus. (“*” is a delimiter between candidates.)
In the second row, “* 語彙 意味 構造 (1) * 語彙 概念 構造 (1)” denotes two term candidates that can be considered derived from the base two words “語彙 構造”.
This is a characteristic of JACABIT: it regards all terms as derivations from base two-word terms, and then tries to connect the base two words with term candidates.
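As a reading aid, the four-part line format described above can be split programmatically. The sketch below is ours, not part of JACABIT; the function name parse_line is hypothetical, and the code assumes exactly the layout shown in the table (log-likelihood, two base morphemes, intermediate numbers, then ‘*’-delimited candidates each followed by its occurrence count).

```python
def parse_line(line):
    """Split one temp.txt line into its four logical parts (assumed layout)."""
    head, _, tail = line.partition(" * ")  # candidates start at the first '*'
    fields = head.split()
    return {
        "loglik": float(fields[0]),        # column 1: log-likelihood
        "base": fields[1:3],               # column 2: the two base morphemes
        "stats": fields[3:],               # column 3: intermediate numbers
        # column 4: '*'-delimited candidates, each ending with its count "(n)"
        "candidates": [c.strip() for c in tail.split("*")],
    }

entry = parse_line("1001.433 語彙 構造 4 2 (1.386/4.852) 17 * 語彙 意味 構造 (1) * 語彙 概念 構造 (1)")
print(entry["base"])        # ['語彙', '構造']
print(entry["candidates"])  # ['語彙 意味 構造 (1)', '語彙 概念 構造 (1)']
```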
The rules of Japanese terms in ACABIT.
Japanese ACABIT consists of a main program and grammar programs. The main grammar file is jp_def.pl. The related supplemental definitions are in jp_form.pl, jp_type.pl and jp_defgram.pl. There you can see many of the rules for Japanese terms.
The following instructions describe how to install ChaSen (EUC Japanese code version) on a Linux system. ChaSen is normally installed under /usr/local, so you need to be root. The installation steps are below:
- install libiconv if your system does not have iconv (from here)
- install the double-array program darts-0.2.tar.gz from darts version 2.0
- Note: darts version 3.0 does not work, so please use version 2.0
- download ChaSen from http://chasen.aist-nara.ac.jp/stable/chasen/chasen-2.3.3.tar.gz
- install ChaSen
- download the Japanese dictionary ipadic-2.7.0
- install dictionary.
Details are here:
2. Install Darts 2.0
- $ tar xvfz darts-0.2.tar.gz
- $ ls
- $ cd ./darts-0.2
- $ ./configure
- $ make check
- # make install
The libraries of darts are installed under /usr/local/.
3. Install ChaSen 2.3.3
- $ tar xvfz chasen-2.3.3.tar.gz
- $ cd ./chasen-2.3.3
- $ ./configure --with-darts=/usr/local/include --with-libiconv=/usr/local
- Before running make, you have to fix a line in dartsdic.cpp:
- $ vi ./lib/dartsdic.cpp
- Please edit line 180:
- wrong: (const char*)keys[size] = key.data();
- correct: keys[size] = (char *)key.data();
- $ make
- $ su
- # make install
Then ChaSen’s libraries are installed under /usr/local/.
4. Install ipadic-2.7.0
- $ tar xvfz ipadic-2.7.0.tar.gz
- $ cd ./ipadic-2.7.0
- $ ./configure
- $ make check
- # make install
5. How to use ChaSen
- $ /usr/local/bin/chasen JapaneseTextFile
Then you will see ChaSen’s results on STDOUT. The JapaneseTextFile should contain one sentence per line, and its Japanese character encoding must be EUC-JP. So if you want to try out ChaSen: 1) get a Japanese text, 2) convert it to EUC-JP encoding, 3) apply ChaSen to it, 4) save ChaSen’s results into a file, 5) view the file with Firefox.
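The encoding conversion in step 2 can be done with the ‘lv’ or ‘nkf’ commands mentioned earlier, or with a few lines of Python. The sketch below is only an illustration (the function name utf8_to_eucjp is our own):

```python
# Re-encode UTF-8 Japanese text as EUC-JP before feeding it to ChaSen.

def utf8_to_eucjp(data: bytes) -> bytes:
    """Decode UTF-8 bytes and re-encode them as EUC-JP bytes."""
    return data.decode("utf-8").encode("euc_jp")

utf8_bytes = "これが形態素解析の結果です．".encode("utf-8")
euc_bytes = utf8_to_eucjp(utf8_bytes)
print(euc_bytes.decode("euc_jp"))  # round-trips to the original sentence
```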
If we apply “これが形態素解析の結果です．” then the output of ChaSen is below.
これ コレ これ 名詞-代名詞-一般
が ガ が 助詞-格助詞-一般
形態素 ケイタイソ 形態素 名詞-一般
解析 カイセキ 解析 名詞-サ変接続
の ノ の 助詞-連体化
結果 ケッカ 結果 名詞-副詞可能
です デス です 助動詞 特殊・デス 基本形
． ． ． 記号-句点
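Each output line above lists the surface form, reading, base form and POS tag of one morpheme. A minimal sketch of reading this layout is below; the function name parse_chasen_line is ours, the code assumes whitespace-separated fields as shown above, and any extra conjugation fields (as on the です line) are simply ignored.

```python
def parse_chasen_line(line):
    """Split one ChaSen output line into surface, reading, base form and POS
    (assumes the whitespace-separated layout shown above; extra fields ignored)."""
    surface, reading, base, pos = line.split()[:4]
    return {"surface": surface, "reading": reading, "base": base, "pos": pos}

entry = parse_chasen_line("形態素 ケイタイソ 形態素 名詞-一般")
print(entry["pos"])  # 名詞-一般
```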
- K. Takeuchi, K. Kageura, T. Koyama, B. Daille and L. Romary: Construction of Grammar-based Term Extraction Model for Japanese, In Proceedings of COMPUTERM 2004, pp. 91–94, 2004.
Written by Koichi on June 16, 2009