
Comments (10)

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
The credit for the Mozc dictionary is included in the proprietary version,
Google Japanese Input.
We basically use both ipadic and naist-jdic, but the current dictionary is
mainly based on ipadic.
We found it difficult to use naist-jdic because of the quality issues reported
below.
http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/000418.html

I think this issue can be resolved if we fix issue 6. If that is OK, I would
like to mark this bug as a duplicate.

Original comment by [email protected] on 11 Aug 2010 at 2:02

from mozc.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
Thanks for your comments.

> The credit for the Mozc dictionary is included in the proprietary version,
> Google Japanese Input.
> We are basically using both ipadic and naist-jdic, but the current dictionary
> is mainly based on ipadic.

# you wrote "mainly based on ipadic", not "only ipadic".

I do not know which versions of ipadic and naist-jdic you use.
Would the licenses not conflict if you used a naist-jdic release made after its
license was changed?
And what does the resulting license become if these two data sets are mixed?

Nothing currently states which ipadic and naist-jdic licenses the mozc
dictionary data were generated under. I would be glad if you documented this in
the README.

> We found it difficult to use naist-jdic due to the quality issues reported
> below.
> http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/000418.html
>
> I think this issue can be resolved if we fixed the issue 6. If it is OK, i
> want to mark this bug as duplicated.

Original comment by [email protected] on 11 Aug 2010 at 8:23

from mozc.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
I think you may have misunderstood the difference between ipadic and naist-jdic.
Although naist-jdic stems from ipadic, the two dictionaries are technically
different in terms of license. credits_ja.html, included in Google Japanese
Input, has "both" license terms. As far as I know, the licenses of ipadic and
naist-jdic have not changed since their initial releases. I'm wondering why the
version information and the exact license are so important in this situation.

Anyway, here are the versions of naist-jdic and ipadic that Mozc uses:
- mecab-naist-jdic-0.4.3-20080917
- mecab-ipadic-2.7.0-20070801 (ipadic-2.7.0)

We will add an extra description to README.txt.

Thanks.

Original comment by [email protected] on 12 Aug 2010 at 5:05

from mozc.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
Hi Taku-san,
mecab-ipadic is marked as "non-free" and
mecab-naist-jdic is marked as "free".
http://packages.debian.org/search?lang=en&keywords=mecab

I think Iwamatsu-san is a maintainer of the Debian Mozc packages.
http://packages.debian.org/en/squeeze/mozc-server
Debian users love Mozc, but they can't include non-free packages in the Debian 
official ISOs.

I also found Tagoh-san's tweets.
He maintains Red Hat/Fedora Japanese packages.
http://twit411.com/tagoh
> I wonder how the license of the mozc dictionary is handled. The same as
> ipadic? #mozc
> More on the mozc dictionary license: so the Debian package is in main.
> ipadic seems to be non-free, but mozc's debian/copyright does not mention
> the dictionary. #mozc
> I confirmed that data/installer/credits_{en,ja}.html states the licenses of
> both ipadic and naist-jdic. So what does that mean?

Original comment by [email protected] on 14 Aug 2010 at 9:42

from mozc.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
I know that ipadic is marked as a "non-free" package. One of the goals of
naist-jdic is to clear up the license issue so that Debian can include it in
its official packages.

We once attempted to use naist-jdic but, unfortunately, found that it has
several critical quality issues. In short, many common words cannot be analyzed
correctly with mecab-naist-jdic. Here are the details:
http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/000418.html

Also, the quality of naist-jdic is not stable right now, which makes it hard
for us to maintain and debug conversion results and quality. Once all the
quality issues of naist-jdic are resolved, we will switch from ipadic to
naist-jdic, but we don't have any concrete plans yet.

Original comment by [email protected] on 16 Aug 2010 at 3:18

from mozc.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
Comment to 3:

Thanks for your comments.
OK, so the dictionary data is in a state where two different licenses are mixed.
# IANAL, but I think the result may be the BSD license plus the clause "As long
# as it does not violate applicable laws and regulations, the program itself or
# a modified version of the program may be freely distributed to third
# parties."

BTW, could you tell me which files are derived from ipadic? dictionary0.txt?
dictionary1.txt?

Original comment by [email protected] on 18 Aug 2010 at 2:38

from mozc.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
Both dictionary0.txt and dictionary1.txt.

We split the entire text data into two files simply because we found that not
all source code management systems can handle our large text dictionary.
We are going to split it into more files.
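
A minimal sketch of such a split, for illustration only. It is not Mozc's
actual tooling: the helper name split_into_parts and the choice of contiguous,
near-equal chunks are assumptions.

```python
from typing import List, Sequence


def split_into_parts(lines: Sequence[str], parts: int) -> List[List[str]]:
    """Split dictionary lines into `parts` contiguous, near-equal chunks.

    Hypothetical helper: the thread does not say how Mozc splits its files.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size, extra = divmod(len(lines), parts)
    chunks: List[List[str]] = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional line each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(lines[start:end]))
        start = end
    return chunks


example_lines = ["entry1", "entry2", "entry3", "entry4", "entry5"]
example_chunks = split_into_parts(example_lines, 2)
print([len(chunk) for chunk in example_chunks])  # [3, 2]
```

Concatenating the chunks in order gives back the original line sequence, so
each chunk could be written to its own file (dictionary0.txt, dictionary1.txt,
and so on) without losing or reordering entries.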

Original comment by [email protected] on 24 Aug 2010 at 7:34

from mozc.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
Hi,

I was wondering how to make mozc 100% free while keeping it BSD license
friendly with minimum effort.

My conclusions are:
 * The conversion quality issue of naist-jdic is not directly related to the
   mecab data quality which naist is working on.
 * The small scope of the data mined to create naist-jdic causes its content
   to be skewed and incomplete.  This is __the root cause__.
 * Use of ipadic is essentially equivalent to ICOT dictionary + naist-jdic.
   * ICOT dictionary providing more nouns and kanji jukugo (compounds).
   * naist-jdic providing grammatical context data.
 * We should get as good or even better results using alternative data.
   * The edict package is under a CC-SA license (FREE like BSD) and contains a
     lot of good data, although it is not exactly mecab ready.
     http://packages.debian.org/source/sid/edict
   * edict can provide pronunciation and coarse grammar assignment.
   * edict has a huge separate proper name data set, but place names (地名) and
     personal names (人名) are mixed.
 * I do not know how to sneak edict data in yet, but it looks like a dummy low
   frequency value may be better than not having the data.

Let me explain why I thought this way.

As I understand it:
 * the quality of mozc conversion using only naist-jdic is not as good as
   conversion using ipadic.
 * naist-jdic was created by manually removing the dictionary data that came
   from ICOT.
 * the ICOT dictionary data is non-free and is present in ipadic.
 * naist-jdic has more up-to-date content than the older ipadic.

Although the *quality* of naist-jdic is questioned in
http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/thread.html#418
following the discussion thread made me think about this a bit.  The concern
about 「る次」 was raised, but it was explained and was actually an improvement.
Then
http://sourceforge.jp/projects/mecab/lists/archive/users/2010-July/000424.html
was posted.  This was interesting and seemed to be directly linked to the
*quality* concern about mozc conversion.

The lack of basic words like 季節 and 奇怪 in naist-jdic, as pointed out in the
discussion, will certainly degrade conversion.

To assess the actual situation, I investigated which words are missing from
naist-jdic.

mecab-ipadic-2.7.0+20070801
   392,126 data entries with grammatical/statistical data
   173,936 unique base-form words

mecab-naist-jdic-0.6.3-20100801
   485,893 data entries with grammatical/statistical data
   180,943 unique base-form words

So naist-jdic is the larger data set, with more words.
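
Counts like these can be reproduced along the following lines. This is only a
sketch, not the script actually used: it assumes MeCab-style CSV rows with the
base form in the eleventh column (index 10), as in ipadic. The sample rows and
the name count_entries are made up for illustration.

```python
import csv
from typing import Iterable, Tuple

# Assumption: the base form sits in column index 10 of each CSV row
# (surface, left id, right id, cost, six POS/conjugation fields, base form, ...).
BASE_FORM_COLUMN = 10


def count_entries(lines: Iterable[str],
                  base_col: int = BASE_FORM_COLUMN) -> Tuple[int, int]:
    """Return (number of entries, number of unique base forms)."""
    entries = 0
    base_forms = set()
    for row in csv.reader(lines):
        if len(row) <= base_col:
            continue  # skip blank or malformed rows
        entries += 1
        base_forms.add(row[base_col])
    return entries, len(base_forms)


# Illustrative rows only; ids and costs are dummy values.
sample = [
    "走る,0,0,100,動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル",
    "走っ,0,0,100,動詞,自立,*,*,五段・ラ行,連用タ接続,走る,ハシッ,ハシッ",
    "季節,0,0,100,名詞,一般,*,*,*,*,季節,キセツ,キセツ",
]
print(count_entries(sample))  # (3, 2)
```

For the real dictionaries, the lines would come from the distributed CSV files
opened with the appropriate encoding; check the column layout of each release
before relying on index 10.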

naist-jdic adds more words than it drops.  Even where words are missing from
naist-jdic, they can be found in edict, which is a much larger dictionary:
 * compdic     12,683 computer-related words
 * edict      192,345 normal dictionary entries (kanji + reading + grammar)
 * enamdict   730,648 proper name dictionary entries (kanji + reading + grammar)
 * kanjidic     6,356 single kanji with on and kun readings etc. (not used
                      this time)

Here are some examples of words in this category (only Kanji 熟語 here):

 不可逆 乱高下 会舘 似非 傍迷惑 僻心 内冑 利高 割線 割線法 可読性 否定積
 奇怪 季節 家教 巾偏 帳面面 当事者 憶病 敬啓 数数 時期 棒縞 正反対 無神
 社会**党 社共 私供 空相場 細石 継端 脱稿 複合 軟論 ...

Since these can be found in the edict data, they become non-issues once we
figure out how to sneak the edict data in.

So here are the really missing words that I found.  Because the mecab data
comes from newspaper articles and edict was created by someone teaching the
Japanese language, words from other fields are sometimes missing, but not many.
I may have missed a few.  I went through all the kanji words with a Python
script and did sort, uniq, diff, etc., so I can say this is an almost thorough
list of missing words.
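
The comparison can be sketched roughly as below. This is not the original
script: the function names are hypothetical, the word lists are tiny samples
taken from this comment, and "kanji word" is assumed to mean a string made up
only of characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
import re
from typing import Iterable, List, Set

# Assumption: a "kanji word" consists only of CJK Unified Ideographs.
KANJI_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def kanji_words(words: Iterable[str]) -> Set[str]:
    """Keep only the words written entirely in kanji."""
    return {w for w in words if KANJI_ONLY.fullmatch(w)}


def missing_words(ipadic: Iterable[str],
                  naist_jdic: Iterable[str],
                  edict: Iterable[str] = ()) -> List[str]:
    """Kanji words in ipadic that are in neither naist-jdic nor edict."""
    return sorted(kanji_words(ipadic) - set(naist_jdic) - set(edict))


# Tiny illustrative samples, not the real dictionaries.
ipadic_sample = ["季節", "奇怪", "律速", "散漫", "フイガロ技研"]
naist_sample = ["散漫"]
edict_sample = ["季節", "奇怪"]

print(missing_words(ipadic_sample, naist_sample, edict_sample))  # ['律速']
```

Leaving out the edict argument gives the words missing from naist-jdic alone;
passing it narrows the list to words that neither source provides.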

== COMPLETELY MISSING ==

Specialist areas, such as technical words:
    律速 無向 額装 発炎筒

Some arcane letter-format wordings:
    敬呈 敬啓

Historical names:
    交趾 士爵

== PARTIALLY MISSING ==

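Corporate names:
    フイガロ技研   (「技研」 is also in jdic)
    横河トレーディング (「横河」 is also in jdic)
    (too many to list)

Compound words: the roots are in the dictionary.  id.def (prefixes) has 「全」
and 「非」, but 「半」, 「不」, 「反」, 「正」, 「副」 and 「零」 are not found.
    不可逆 非一致 非可分 零交差

Compound words: the roots are in the dictionary.  suffix.txt lacks technical
suffixes such as 「型」, 「形」, 「波」 and 「子 (し)」.
    零交差波 離散形 電信形

Misspelled entries were dropped -> the correct spellings are in the dictionary!
    不倶載天    (correct: 不倶戴天)
    散慢  (correct: 散漫)

These last ones prove that moving to naist-jdic is a good thing.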

[email protected]

Original comment by [email protected] on 30 Jul 2011 at 6:14

from mozc.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
There is no problem with MOZC using IPADIC, as explained below.  So the problem
is solved.

After a careful review of the IPADIC license, I realized that it is a free
license.  The complaint on "Legal" turned out to be a baseless claim which
should have been debated and rejected a long time ago.  I recently took action
on this.  I got the Debian FTP master to agree with me and accept IPADIC as
DFSG compliant.  Then MOZC was accepted as DFSG FREE!  Bravo!  So there is no
problem.

Original comment by [email protected] on 16 Nov 2011 at 2:52

from mozc.

GoogleCodeExporter avatar GoogleCodeExporter commented on August 23, 2024
Bravo!  I'm very glad to hear that.

Thank you very much for taking the effort.

Original comment by [email protected] on 16 Nov 2011 at 4:24

  • Changed state: Done

from mozc.
