{"id":119,"date":"2017-03-28T21:04:18","date_gmt":"2017-03-28T12:04:18","guid":{"rendered":"http:\/\/www.cl.cs.okayama-u.ac.jp\/?page_id=119"},"modified":"2017-03-28T21:04:55","modified_gmt":"2017-03-28T12:04:55","slug":"jacabit","status":"publish","type":"page","link":"https:\/\/www.cl.cs.okayama-u.ac.jp\/?page_id=119","title":{"rendered":"Japanese Acabit"},"content":{"rendered":"<p>\n<title>Japanese ACABIT system (under construction)<\/title>\n<\/p>\n<h2>JACABIT Japanese term extraction system (under construction)<\/h2>\n<h3>What is this?<\/h3>\n<p>This is free software for extracting Japanese terms from plain text on the basis of POS-based morphological patterns.<\/p>\n<h3>Copyright<\/h3>\n<p>Copyright (c) 2002-2009 Koichi Takeuchi <\/p>\n<p> Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:<\/p>\n<ul>\n<li>Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. <\/li>\n<li>Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and\/or other materials provided with the distribution. <\/li>\n<\/ul>\n<p>THIS SOFTWARE IS PROVIDED BY THE AUTHORS &#8220;AS IS&#8221; AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.<\/p>\n<h3>Distribution of Japanese ACABIT<\/h3>\n<p>Japanese ACABIT 2009 (Java version) is in preparation.<\/p>\n<p>Currently we distribute Japanese ACABIT 2004: <a href=\"pr\/j_acabit1.1.tar.gz\">j_acabit1.1.tar.gz<\/a> (17 Sep 2004) <\/p>\n<p> Please extract the files with<\/p>\n<p>$ tar xvfz j_acabit1.1.tar.gz  <br \/> Please refer to the following instructions.<\/p>\n<h3>What is Japanese ACABIT1.1 (2004)?<\/h3>\n<p>JACABIT1.1 (2004) is a pattern-based Japanese term extraction system. The base program code came from ACABIT in 2004, which is a pattern-based French term extraction system. ACABIT was developed by Prof. Daille, and JACABIT is the result of collaborative work by Koichi Takeuchi, Kyo Kageura and Beatrice Daille. The latest versions of both ACABIT and JACABIT have now been produced.  
Please refer to them on the Web site (a link to Daille&#8217;s Web site will be added here).<\/p>\n<h3>How to use Japanese ACABIT1.1 (2004)<\/h3>\n<p> JACABIT is a POS-based term extraction system, so the user needs to install the Japanese POS tagger ChaSen.<\/p>\n<p>For details, see <a href=\"#chasen\">&#8220;How to install ChaSen&#8221;<\/a>.<\/p>\n<p> First, we give a rough sketch of how to use JACABIT; the points needed to make ChaSen and JACABIT work together are explained later (see the notes on Japanese character encoding below). JACABIT consists of two parts. The first is a preprocessor (postagger.pl) that annotates plain text with POS tags. The second is the main body of JACABIT (jp_stat.pl and other definition files), which extracts terms using POS-based patterns.<\/p>\n<ol>\n<li> Install ChaSen (see below) <\/li>\n<li> Install a CPAN library, Unicode::Japanese <\/li>\n<li> $ <a href=\"data\/postagger.pl\">.\/postagger.pl<\/a> <a href=\"data\/sample_input.txt\"> inputtext <\/a> &gt;   <a href=\"data\/sample_postagged.txt\"> POS_tagged_text(xml) <\/a> <\/li>\n<li> $ <a href=\"data\/jp_stat.pl\">jp_stat.pl<\/a> <a href=\"data\/sample_postagged.txt\"> POS_tagged_text(xml)<\/a> <\/li>\n<li> You will obtain the output file  <a href=\"data\/temp.txt\">temp.txt<\/a>, which contains term candidates with log-likelihood values. <\/li>\n<\/ol>\n<p> The sample files sample_of_input_postagger.txt and sample_of_input_jp_stat.txt are included in j_acabit1.0.tar.gz, so you can see the format of the input and output files.  <\/p>\n<p> How to install the CPAN library<\/p>\n<ol>\n<li> $ su     \/\/ become superuser. <\/li>\n<li> # perl -MCPAN -e shell <\/li>\n<li> # &#8230;.  \/\/ the shell asks you some questions. <\/li>\n<li> cpan&gt;    \/\/ finally you get the cpan command prompt. <\/li>\n<li> cpan&gt; install Unicode::Japanese <\/li>\n<li> cpan&gt; exit <\/li>\n<\/ol>\n<ul>\n<li> Note: Each line of the input_text for postagger.pl should contain one sentence. 
Japanese characters in input_text should be encoded in UTF-8. <\/li>\n<li> Note2: In the output files (temp.txt, temp2.txt..) the Japanese character encoding is UTF-8, but in the intermediate file, i.e., POS_tagged_text(xml), it is EUC-JP. One reason for this change of encoding is the morphological analyzer ChaSen, which normally accepts only EUC-JP (recently UTF-8 has become available in ChaSen). In addition, EUC-JP caused fewer problems for pattern matching of Japanese characters in the Perl script .\/jp_stat.pl at that time. If you want to view EUC-JP encoded Japanese text, please convert the encoding with the &#8216;lv&#8217; or &#8216;nkf&#8217; command on a Linux system, or use a Web browser (such as Firefox). <\/li>\n<li> Note3: temp.txt is not sorted by log-likelihood score. Please apply the sort command when you want to obtain sorted term candidates. <\/li>\n<\/ul>\n<h3>Format of output file (temp.txt)<\/h3>\n<p>An example of the output file (temp.txt) is shown below: <\/p>\n<p> 1002.419 \u30b5\u5909 \u540d\u8a5e 1 1 (0\/2.503) 34 * \u30b5\u5909 \u540d\u8a5e (1) <br \/> 1001.433 \u8a9e\u5f59 \u69cb\u9020 4 2 (1.386\/4.852) 17 * \u8a9e\u5f59 \u610f\u5473 \u69cb\u9020 (1) * \u8a9e\u5f59 \u6982\u5ff5 \u69cb\u9020 (1)<br \/> 1000.666 \u9805 \u95a2\u4fc2 6 2 (1.91\/4.852) 14 * \u9805 \u95a2\u4fc2 (2) <br \/> &#8230;&#8230;..<\/p>\n<p> Each line of the output file logically consists of four parts, detailed in the following table.  
<\/p>\n<table align=\"center\">\n<tbody>\n<tr>\n<th>log-likelihood<\/th>\n<th>two base morphemes<\/th>\n<th>intermediate calculation results<\/th>\n<th>term candidates (number of occurrences)<\/th>\n<\/tr>\n<tr align=\"left\">\n<td>1002.419<\/td>\n<td>\u30b5\u5909 \u540d\u8a5e<\/td>\n<td>1 1 (0\/2.503) 34<\/td>\n<td>* \u30b5\u5909 \u540d\u8a5e (1)<\/td>\n<\/tr>\n<tr align=\"left\">\n<td>1001.433<\/td>\n<td>\u8a9e\u5f59 \u69cb\u9020<\/td>\n<td>4 2 (1.386\/4.852) 17<\/td>\n<td>* \u8a9e\u5f59 \u610f\u5473 \u69cb\u9020 (1) * \u8a9e\u5f59 \u6982\u5ff5 \u69cb\u9020 (1)<\/td>\n<\/tr>\n<tr align=\"left\">\n<td>1000.666<\/td>\n<td>\u9805 \u95a2\u4fc2<\/td>\n<td>6 2 (1.91\/4.852) 14<\/td>\n<td>* \u9805 \u95a2\u4fc2 (2)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p> The first column is the log-likelihood (e.g., &#8216;1002.419&#8217;), which indicates how strongly the two base words in a term are connected. (Note that JACABIT does not sort temp.txt by log-likelihood. If you want sorted output, please apply the UNIX sort command to temp.txt.) <\/p>\n<p> The second column gives the two base words (e.g., &#8220;\u30b5\u5909 \u540d\u8a5e&#8221;) that make up an extracted term candidate (here, &#8220;\u30b5\u5909 \u540d\u8a5e&#8221;).   <\/p>\n<p> The third column holds intermediate numbers used in the calculation of the log-likelihood (e.g., &#8216;1 1 (0\/2.503) 34&#8217;).  <\/p>\n<p> The fourth column, e.g., &#8220;* \u30b5\u5909 \u540d\u8a5e (1)&#8221;, shows the term candidates with their numbers of occurrences. In this case, &#8220;\u30b5\u5909 \u540d\u8a5e&#8221; occurs once in the corpus. (&#8220;*&#8221; is a delimiter between candidates.)  <\/p>\n<p> In the second row, &#8220;* \u8a9e\u5f59 \u610f\u5473 \u69cb\u9020 (1) * \u8a9e\u5f59 \u6982\u5ff5 \u69cb\u9020 (1)&#8221; denotes two term candidates that can be considered derived from the two base words &#8220;\u8a9e\u5f59 \u69cb\u9020&#8221;.  <\/p>\n<p> This is a characteristic of JACABIT. 
JACABIT regards all terms as derivations from two-word base terms, and then tries to connect the two base words with term candidates.  <\/p>\n<h3>The rules of Japanese terms in ACABIT<\/h3>\n<p>Japanese ACABIT consists of a main program and grammar programs. The main grammar file is <a href=\"pr\/jp_def.pl\">jp_def.pl<\/a>. The related supplemental definitions are in jp_form.pl, jp_type.pl, and jp_defgram.pl. These files contain many rules for Japanese terms.<\/p>\n<h3><a name=\"chasen\"> How to install ChaSen (June\/15\/2009)<\/a><\/h3>\n<p>The following instructions describe how to install ChaSen (EUC Japanese code version) on a Linux system. ChaSen is normally installed under \/usr\/local, so you need to be root. The installation steps are as follows:<\/p>\n<ol>\n<li> install libiconv if your system does not have iconv. <a href=\"http:\/\/www.gnu.org\/software\/libiconv\/\">from here <\/a> <\/li>\n<li> install the double-array program darts-0.2.tar.gz from <a href=\"http:\/\/chasen.org\/~taku\/software\/darts\/src\/\">Darts version 0.2<\/a>\n<ul>\n<li> Note: Darts version 0.3 does not work, so please use version 0.2 <\/li>\n<\/ul>\n<\/li>\n<li> download ChaSen from <a href=\"http:\/\/chasen.aist-nara.ac.jp\/stable\/chasen\/chasen-2.3.3.tar.gz\"> http:\/\/chasen.aist-nara.ac.jp\/stable\/chasen\/chasen-2.3.3.tar.gz<\/a> <\/li>\n<li> install ChaSen <\/li>\n<li> download the Japanese dictionary <a href=\"http:\/\/chasen.aist-nara.ac.jp\/stable\/ipadic\/ipadic-2.7.0.tar.gz\">ipadic-2.7.0<\/a> <\/li>\n<li> install the dictionary. <\/li>\n<\/ol>\n<p>Details are here:<\/p>\n<p> 2. Install Darts 0.2<\/p>\n<ul>\n<li> $ tar xvfz darts-0.2.tar.gz <\/li>\n<li> $ ls <\/li>\n<li> darts-0.2 <\/li>\n<li> $ cd .\/darts-0.2 <\/li>\n<li> $ .\/configure <\/li>\n<li> $ make check <\/li>\n<li> $ su <\/li>\n<li> # make install <\/li>\n<\/ul>\n<p>The Darts libraries are installed under \/usr\/local\/. <\/p>\n<p> 3. 
Install ChaSen 2.3.3<\/p>\n<ul>\n<li> $ tar xvfz chasen-2.3.3.tar.gz <\/li>\n<li> $ cd .\/chasen-2.3.3 <\/li>\n<li> $ .\/configure --with-darts=\/usr\/local\/include --with-libiconv=\/usr\/local\n<ul>\n<li> before running make, you have to fix a line in dartsdic.cpp. <\/li>\n<li> $ vi .\/lib\/dartsdic.cpp <\/li>\n<li> please edit line 180\n<ul>\n<li> wrong: (const char*)keys[size] = key.data(); <\/li>\n<li> correct: keys[size] = (char *)key.data(); <\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li> $ make <\/li>\n<li> $ su <\/li>\n<li> # make install <\/li>\n<\/ul>\n<p>Then ChaSen&#8217;s libraries are installed under \/usr\/local\/.  <\/p>\n<p> 4. Install ipadic-2.7.0<\/p>\n<ul>\n<li> $ tar xvfz ipadic-2.7.0.tar.gz <\/li>\n<li> $ cd .\/ipadic-2.7.0 <\/li>\n<li> $ .\/configure <\/li>\n<li> $ make check <\/li>\n<li> $ su <\/li>\n<li> # make install <\/li>\n<\/ul>\n<p>5. How to use ChaSen<\/p>\n<ul>\n<li> $ \/usr\/local\/bin\/chasen JapaneseTextFile <\/li>\n<\/ul>\n<p>Then you will see the results from ChaSen on STDOUT. The JapaneseTextFile should contain one sentence per line, and its Japanese character encoding must be EUC-JP. So, to check ChaSen: 1) get some Japanese text, 2) convert the text to EUC-JP encoding, 3) apply ChaSen to the text, 4) save ChaSen&#8217;s results into a file, and 5) view the file with Firefox. <\/p>\n<p> If we apply ChaSen to &#8220;\u3053\u308c\u304c\u5f62\u614b\u7d20\u89e3\u6790\u306e\u7d50\u679c\u3067\u3059\uff0e&#8221;, the output is as follows. 
<\/p>\n<p> \u3053\u308c   \u30b3\u30ec  \u3053\u308c    \u540d\u8a5e-\u4ee3\u540d\u8a5e-\u4e00\u822c<br \/> \u304c     \u30ac    \u304c     \u52a9\u8a5e-\u683c\u52a9\u8a5e-\u4e00\u822c<br \/> \u5f62\u614b\u7d20   \u30b1\u30a4\u30bf\u30a4\u30bd \u5f62\u614b\u7d20  \u540d\u8a5e-\u4e00\u822c<br \/> \u89e3\u6790   \u30ab\u30a4\u30bb\u30ad    \u89e3\u6790     \u540d\u8a5e-\u30b5\u5909\u63a5\u7d9a<br \/> \u306e   \u30ce    \u306e     \u52a9\u8a5e-\u9023\u4f53\u5316<br \/> \u7d50\u679c     \u30b1\u30c3\u30ab  \u7d50\u679c    \u540d\u8a5e-\u526f\u8a5e\u53ef\u80fd<br \/> \u3067\u3059     \u30c7\u30b9    \u3067\u3059    \u52a9\u52d5\u8a5e  \u7279\u6b8a\u30fb\u30c7\u30b9      \u57fa\u672c\u5f62<br \/> \uff0e      \uff0e      \uff0e      \u8a18\u53f7-\u53e5\u70b9<br \/> EOS<\/p>\n<p> Reference<\/p>\n<ul>\n<li> <a href=\"http:\/\/chasen-legacy.sourceforge.jp\/\"> Japanese morphological analyzer ChaSen&#8217;s Web site (in Japanese) <\/a> <\/li>\n<\/ul>\n<p>Reference<\/p>\n<ul>\n<li>K. Takeuchi, K. Kageura, T. Koyama, B. Daille and L. Romary: Construction of Grammar-based Term Extraction Model for Japanese, in Proceedings of COMPUTERM2004, pp.91-94, 2004. 
<\/li>\n<\/ul>\n<hr \/>\n<p><small>written by Koichi on June\/16\/2009<\/small><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Japanese ACABIT system (under construction) JACABIT Japanese term extraction system (under construction) What  &hellip; <a href=\"https:\/\/www.cl.cs.okayama-u.ac.jp\/?page_id=119\">\u7d9a\u304d\u3092\u8aad\u3080 <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"parent":92,"menu_order":20,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-119","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/www.cl.cs.okayama-u.ac.jp\/index.php?rest_route=\/wp\/v2\/pages\/119","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.cl.cs.okayama-u.ac.jp\/index.php?rest_route=\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.cl.cs.okayama-u.ac.jp\/index.php?rest_route=\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.cl.cs.okayama-u.ac.jp\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.cl.cs.okayama-u.ac.jp\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=119"}],"version-history":[{"count":1,"href":"https:\/\/www.cl.cs.okayama-u.ac.jp\/index.php?rest_route=\/wp\/v2\/pages\/119\/revisions"}],"predecessor-version":[{"id":120,"href":"https:\/\/www.cl.cs.okayama-u.ac.jp\/index.php?rest_route=\/wp\/v2\/pages\/119\/revisions\/120"}],"up":[{"embeddable":true,"href":"https:\/\/www.cl.cs.okayama-u.ac.jp\/index.php?rest_route=\/wp\/v2\/pages\/92"}],"wp:attachment":[{"href":"https:\/\/www.cl.cs.okayama-u.ac.jp\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=119"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}