
sudachidict's Introduction

SudachiDict

A lexicon for the Japanese tokenizer Sudachi.

Download

Click here for pre-built dictionaries.

Pre-built synonym dictionaries for Chikkar are available here.

Python packages

You can install the dictionaries for WorksApplications/SudachiPy, the Python version of Sudachi, as Python packages.

In SudachiPy v0.5.2 and later, you can specify a dictionary directly from the command line or in a program.

WARNING: The sudachipy link command is no longer available in SudachiPy v0.5.2 and later.

Please see the following links for more details on the dictionary option.

Install

pip install sudachidict_core
pip install sudachidict_small
pip install sudachidict_full

Dictionary types

Sudachi has three types of dictionaries.

  • Small: includes only the vocabulary of UniDic
  • Core: includes basic vocabulary (default)
  • Full: includes miscellaneous proper nouns
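
To see which of the three dictionaries are present in the current Python environment, something like the following minimal sketch may help. It assumes that each dictionary's importable module name matches the pip package name shown under Install; the helper name installed_dictionaries is only illustrative.

```python
from importlib.util import find_spec

# Assumption: module names match the pip package names listed under Install.
DICT_MODULES = {
    "small": "sudachidict_small",
    "core": "sudachidict_core",
    "full": "sudachidict_full",
}


def installed_dictionaries():
    """Map each dictionary type to True if its package can be imported."""
    return {
        dict_type: find_spec(module) is not None
        for dict_type, module in DICT_MODULES.items()
    }


if __name__ == "__main__":
    for dict_type, present in installed_dictionaries().items():
        print(f"{dict_type}: {'installed' if present else 'not installed'}")
```

The check only looks up the module and does not load any dictionary data, so it is cheap to run.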

Build from sources

Dictionary sources were previously hosted on Git LFS but are now hosted on S3. They will be moved back to GitHub in the future.

At the moment, you need to manually download the required files from AWS S3 and unzip them into the src/main/text directory. The Core dictionary requires the small and core files; the Full dictionary requires all three files.
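
The dependency between dictionary types and source files can be written down as a small lookup. This is only a sketch: the labels small_lex, core_lex and notcore_lex are assumed names for the three source files, and the Small dictionary is presumed to need only the small file, which the text above implies but does not state.

```python
# Assumed labels for the three source files, in the order they are layered.
SOURCES = ("small_lex", "core_lex", "notcore_lex")
DICT_TYPES = ("small", "core", "full")


def required_sources(dict_type):
    """Return the source files needed to build the given dictionary type."""
    if dict_type not in DICT_TYPES:
        raise ValueError(f"unknown dictionary type: {dict_type!r}")
    # Each type needs its own file plus every file of the smaller types.
    return SOURCES[: DICT_TYPES.index(dict_type) + 1]


if __name__ == "__main__":
    for dict_type in DICT_TYPES:
        print(dict_type, "->", ", ".join(required_sources(dict_type)))
```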

Licenses

SudachiDict by Works Applications Co., Ltd. is licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0.html).

   Copyright (c) 2017-2023 Works Applications Co., Ltd.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

This project includes UniDic and a part of NEologd.

sudachidict's People

Contributors

azu, dobatymo, eiennohito, hiroshi-matsuda-rit, kazuma-t, khiyowa, mocobeta, sorami, t-yamamura


sudachidict's Issues

downloading sudachidict_core dictionary

Hello,

After trying to install the sudachidict_core dictionary with pip, the process suddenly stops without fully downloading the dictionary. The same thing happens with the other dictionaries. My internet connection is good and stable. Maybe I'm missing something, but I wanted to let you know just in case.

Best regards

core_lex.csv and notcore_lex.csv have \u**** characters

Hello,

core_lex.csv and notcore_lex.csv have \u**** characters.
I checked them with ripgrep on Arch Linux.

rg '\\u' core_lex.csv > core_lex_broken_entries.txt
rg '\\u' notcore_lex.csv > notcore_lex_broken_entries.txt

Examples.

# core_lex_broken_entries.txt
納付書・領収\u0028納付受託\u0029証書,5133,5146,32767,納付書・領収\u0028納付受託\u0029証書,名詞,普通名詞,一般,*,*,*,ノウフショ・リョウシュウ\u0028ノウフジュタク\u0029ショウショ,納付書・領収\u0028納付受託\u0029証書,*,C,617408/506627/268971/747421/784703/617408/338603/784704/680506,1462768/268971/747421/784703/617408/338603/784704/680506,1462768/268971/747421/784703/617408/338603/784704/680506,*

# notcore_lex_broken_entries.txt
バジルドライ,5144,5671,5157,バジルドライ,名詞,固有名詞,一般,*,*,*,バジルドライ,バジル\u0028ドライ\u0029,*,C,233848/227036,233848/227036,233848/227036,*

Are they OK?

Thank you for providing a big dictionary.
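
For anyone who wants to inspect these entries outside ripgrep, the search can be reproduced in Python, together with a decoder that shows which character each literal \uXXXX sequence stands for. This is a minimal sketch for inspection only; it does not say whether the escapes are intentional, and it does not combine surrogate pairs. The helper names are illustrative.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def has_literal_escape(line):
    """Return True if the line contains a literal \\uXXXX sequence."""
    return ESCAPE.search(line) is not None


def decode_literal_escapes(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)


if __name__ == "__main__":
    sample = "バジル\\u0028ドライ\\u0029"  # field taken from the example above
    print(has_literal_escape(sample))      # True
    print(decode_literal_escapes(sample))  # バジル(ドライ)
```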

Some entries have wrong readings

I found some entries which have wrong readings, so I'd like to report them.

筋向こう,-1,-1,0,筋向こう,名詞,普通名詞,一般,*,*,*,スジムカイ,筋向かい,*,A,*,*,*,024717
洋からし,5145,5145,7380,洋からし,名詞,普通名詞,一般,*,*,*,ヨウガラシ,洋辛子,*,A,*,*,*,*
指づめ,5142,5142,7864,指づめ,名詞,普通名詞,一般,*,*,*,ユビツメ,指つめ,*,A,*,*,*,*
譲る葉,5142,5142,7864,譲る葉,名詞,普通名詞,一般,*,*,*,ユズリハ,譲り葉,*,A,*,*,*,*
夕まずめ,5142,5142,7864,夕まずめ,名詞,普通名詞,一般,*,*,*,ユウマズミ,夕まずみ,*,A,*,*,*,*
向こうはじ,5142,5142,7864,向こうはじ,名詞,普通名詞,一般,*,*,*,ムコウハシ,向こう端,*,A,*,*,*,*
妙ちくりん,5672,5672,7759,妙ちくりん,形状詞,一般,*,*,*,*,ミョウチキリン,妙ちきりん,*,A,*,*,*,*
棒ちぎれ,5145,5145,7380,棒ちぎれ,名詞,普通名詞,一般,*,*,*,ボウチギリ,棒ちぎり,*,A,*,*,*,*
増こう,5146,5146,7404,増こう,名詞,普通名詞,一般,*,*,*,ゾウスウ,増嵩,*,A,*,*,*,*
嵩ずる,982,982,9205,嵩ずる,動詞,一般,*,*,上一段-ザ行,終止形-一般,コウジル,高じる,400135,A,*,*,*,*
嵩ずる,988,988,9147,嵩ずる,動詞,一般,*,*,上一段-ザ行,連体形-一般,コウジル,高じる,400135,A,*,*,*,*
じょう油,5146,5146,10290,じょう油,名詞,普通名詞,一般,*,*,*,ショウユ,醤油,*,A,*,*,*,013415
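
To review entries like these side by side, the headword, reading and normalized form can be pulled out of each line. The sketch below assumes the column positions seen in the lines above (headword in column 0, reading in column 11, normalized form in column 12); these positions are inferred from the samples, not taken from a format specification.

```python
import csv
from typing import NamedTuple


class LexEntry(NamedTuple):
    surface: str
    reading: str
    normalized: str


def parse_lex_line(line):
    """Extract headword, reading and normalized form from one lexicon line."""
    fields = next(csv.reader([line]))
    # Column positions are an assumption based on the sample entries.
    return LexEntry(surface=fields[0], reading=fields[11], normalized=fields[12])


if __name__ == "__main__":
    line = "洋からし,5145,5145,7380,洋からし,名詞,普通名詞,一般,*,*,*,ヨウガラシ,洋辛子,*,A,*,*,*,*"
    entry = parse_lex_line(line)
    print(entry.surface, entry.reading, entry.normalized)
```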

Analysis issue with common sentence: ご迷惑をおかけして申し訳ありませんでした。

(I hope this is the correct place to report this issue.)

Using the latest core dictionary, in any mode, there is this problem (it finds the word 決して):

$ echo ご迷惑をおかけして申し訳ありませんでした。 |  java -jar sudachi-0.5.2.jar -m B
ご	接頭辞,*,*,*,*,*	御
迷惑	名詞,普通名詞,サ変形状詞可能,*,*,*	迷惑
を	助詞,格助詞,*,*,*,*	を
おか	動詞,非自立可能,*,*,五段-カ行,未然形-一般	おく
けして	副詞,*,*,*,*,*	決して
申し訳	名詞,普通名詞,一般,*,*,*	申し訳
あり	動詞,非自立可能,*,*,五段-ラ行,連用形-一般	有る
ませ	助動詞,*,*,*,助動詞-マス,未然形-一般	ます
ん	助動詞,*,*,*,助動詞-ヌ,終止形-撥音便	ず
でし	助動詞,*,*,*,助動詞-デス,連用形-一般	です
た	助動詞,*,*,*,助動詞-タ,終止形-一般	た
。	補助記号,句点,*,*,*,*	。
EOS

but this works correctly:

$ echo おかけします |  java -jar sudachi-0.5.2.jar -m B
お	接頭辞,*,*,*,*,*	御
かけ	動詞,非自立可能,*,*,下一段-カ行,連用形-一般	掛ける
し	動詞,非自立可能,*,*,サ行変格,連用形-一般	為る
ます	助動詞,*,*,*,助動詞-マス,終止形-一般	ます
EOS

Comply with Section 4 of Apache License Version 2.0

Motivation

Thank you for using the assets of NEologd.

I found a small problem while doing a technical survey.

Unfortunately, with respect to NEologd only, I feel that the current Python packages do not yet comply with the redistribution conditions described in Section 4 of the Apache License, Version 2.0.

  • http://www.apache.org/licenses/LICENSE-2.0#redistribution

    4. Redistribution.
      You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
      a. You must give any other recipients of the Work or Derivative Works a copy of this License; and
      b. You must cause any modified files to carry prominent notices stating that You changed the files; and
      c. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
      d. If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

Goal

When time permits, please include the LEGAL file not only in this repository and the sudachi-dictionary files but also in the following files.

  • Python packages
    • SudachiDict_small-YYYYMMDD.tar.gz
    • SudachiDict_core-YYYYMMDD.tar.gz
    • SudachiDict_full-YYYYMMDD.tar.gz

Error 401 when downloading dictionaries from an old version

When I install version 20200722 of the sudachidict-core package with pip, I get the following error:

Downloading the Sudachi dictionary (It may take a while) ...
[...]
 raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 401: Unauthorized

With the 20200722 version, the dictionary is downloaded from this URL:
https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/sudachi-dictionary-20200722-core.zip

But all files in this storage return an HTTP 401 error.

With the new 20201223 version, the dictionaries were migrated to S3 storage. So what is the status of the old storage? Do we have to upgrade to the new 20201223 version, or is this a temporary issue?

error creating a wheel file

I'm seeing an error when I attempt to create a whl file on Linux.

git checkout tags/v20201223
env_name=$(basename $(pwd))
ANACONDA_DIR="${HOME}/anaconda"
source $ANACONDA_DIR/etc/profile.d/conda.sh
conda create -n $env_name python=3.7 -y
conda activate $env_name
cd python/

I've tried both pip wheel . and python setup.py bdist_wheel on the latest commit and also on the latest tag, but I get this error:

Downloading the Sudachi dictionary (It may take a while) ...
Traceback (most recent call last):
  File "setup.py", line 43, in <module>
    _, _msg = urlretrieve(ZIP_URL, ZIP_NAME)
  File "/home/localstepdo/anaconda/envs/SudachiDict/lib/python3.7/urllib/request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "/home/localstepdo/anaconda/envs/SudachiDict/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/home/localstepdo/anaconda/envs/SudachiDict/lib/python3.7/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/home/localstepdo/anaconda/envs/SudachiDict/lib/python3.7/urllib/request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "/home/localstepdo/anaconda/envs/SudachiDict/lib/python3.7/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/home/localstepdo/anaconda/envs/SudachiDict/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/home/localstepdo/anaconda/envs/SudachiDict/lib/python3.7/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Invalid URI: isHexDigit

Any advice would be greatly appreciated

pip install stuck at setup.py

pip install sudachidict_full gets stuck at setup.py.

Is there any workaround?

Windows 10 x64
Python 3.10
pip: latest version

How to compile the dictionary?

Sorry for the novice question, but I ran the following command,
sh package_python.sh, and it gave the following errors/warnings:

rm: target/python/20190718: No such file or directory
cp: target/system_small.dic: No such file or directory

Where can I find system_small.dic?

notcore_lex.csv contains many Hangul terms

Some Hangul terms can be found in the notcore_lex.csv file, such as the following:

전범국,4785,4785,22000,전범국,名詞,固有名詞,一般,*,*,*,センパンコク,戦犯国,*,A,*,*,*,*
전지충이,4785,4785,22000,전지충이,名詞,固有名詞,一般,*,*,*,チョンジチュンイ,デンヂムシ,*,A,*,*,*,*
전툴라,4785,4785,22000,전툴라,名詞,固有名詞,一般,*,*,*,チョントゥラ,チョントゥラ,*,A,*,*,*,*

Are they included intentionally?
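
A quick way to list every such entry is to test the headword of each line against the Unicode Hangul blocks. This is a minimal sketch assuming the headword is the first comma-separated field, as in the lines above; the function names are illustrative.

```python
# Unicode blocks: Hangul Jamo, Hangul Compatibility Jamo, Hangul Syllables.
HANGUL_RANGES = ((0x1100, 0x11FF), (0x3130, 0x318F), (0xAC00, 0xD7A3))


def contains_hangul(text):
    """Return True if any character of text falls in a Hangul block."""
    return any(
        low <= ord(char) <= high
        for char in text
        for low, high in HANGUL_RANGES
    )


def hangul_headwords(lines):
    """Yield the headword of every lexicon line whose headword has Hangul."""
    for line in lines:
        surface = line.split(",", 1)[0]
        if contains_hangul(surface):
            yield surface


if __name__ == "__main__":
    lines = [
        "전범국,4785,4785,22000,전범국,名詞,固有名詞,一般,*,*,*,センパンコク,戦犯国,*,A,*,*,*,*",
        "洋からし,5145,5145,7380,洋からし,名詞,普通名詞,一般,*,*,*,ヨウガラシ,洋辛子,*,A,*,*,*,*",
    ]
    print(list(hangul_headwords(lines)))
```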

Normalization of すみません and すいません differs

Is this the right place to report linguistic issues with the dictionaries? Apologies if not.

Using Sudachi 0.4.3, the core dictionary version 20200722, and mode C, I noticed that すみません and すいません do not normalize to the same verb, and it seems like they should.

For すいません, the normalized verb is 済む, which seems correct:

すい	動詞,一般,*,*,五段-マ行,連用形-イ音便	済む
ませ	助動詞,*,*,*,助動詞-マス,未然形-一般	ます
ん	助動詞,*,*,*,助動詞-ヌ,終止形-撥音便	ず

For すみません, the normalized verb is すむ. It seems like it should be 済む also?

すみ	動詞,一般,*,*,五段-マ行,連用形-一般	すむ
ませ	助動詞,*,*,*,助動詞-マス,未然形-一般	ます
ん	助動詞,*,*,*,助動詞-ヌ,終止形-撥音便	ず
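
Comparisons like this one can be automated by reading the normalized form out of the tokenizer output. The sketch below assumes the layout shown in the output above, namely tab-separated columns with the surface first, the part of speech second and the normalized form third, with an optional EOS line at the end. The function name is illustrative.

```python
def normalized_forms(output):
    """Return (surface, normalized form) pairs from tab-separated output."""
    pairs = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue
        columns = line.split("\t")
        # Assumed layout: surface, part of speech, normalized form.
        if len(columns) >= 3:
            pairs.append((columns[0], columns[2]))
    return pairs


if __name__ == "__main__":
    suimasen = "すい\t動詞,一般,*,*,五段-マ行,連用形-イ音便\t済む\nEOS"
    sumimasen = "すみ\t動詞,一般,*,*,五段-マ行,連用形-一般\tすむ\nEOS"
    first = normalized_forms(suimasen)[0][1]
    second = normalized_forms(sumimasen)[0][1]
    print(first, second, "same" if first == second else "different")
```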

synonyms.txt may contain sensitive words

I found "公僕" in SudachiDict/src/main/text/synonyms.txt.
If its inclusion is intended, please close this issue.

In some cases, "公僕" can be perceived as a discriminatory term.
You may want to delete it.

快い appears to have incorrect normalized form

With the latest core dictionary:

$ echo 快い | java -jar sudachi-0.5.2.jar -m B -a
快い	形容詞,非自立可能,*,*,形容詞,終止形-一般	良い	快い	イイ	0	[]	
EOS

Note that the normalized form is 良い, not 快い as expected.

But this seems correct:

$ echo 快くない | java -jar sudachi-0.5.2.jar -m B -a
快く	形容詞,一般,*,*,形容詞,連用形-一般	快い	快い	ココロヨク	0	[]	
ない	形容詞,非自立可能,*,*,形容詞,終止形-一般	無い	ない	ナイ	0	[]	
EOS

Update the installation script for pip

$ pip install SudachiDict-core
Collecting SudachiDict-core
  Downloading SudachiDict-core-20221021.tar.gz (9.0 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: SudachiPy<0.7,>=0.5 in ./.pyenv/versions/3.11.1/lib/python3.11/site-packages (from SudachiDict-core) (0.6.6)
Installing collected packages: SudachiDict-core
  DEPRECATION: SudachiDict-core is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  Running setup.py install for SudachiDict-core ... done
Successfully installed SudachiDict-core-20221021

Is the changed normalized form (こだわる, 拘る) correct?

Thank you everyone.
I have a question about the changes in v20211220.

# v20211220
こだわる,1414,1414,12976,こだわる,動詞,一般,*,*,五段-ラ行,終止形-一般,コダワル,こだわる,59504,A,*,*,*,*
拘る,1414,1414,9931,拘る,動詞,一般,*,*,五段-ラ行,終止形-一般,コダワル,拘る,463547,A,*,*,*,*

# v20210802
こだわる,1414,1414,12976,こだわる,動詞,一般,*,*,五段-ラ行,終止形-一般,コダワル,こだわる,59516,A,*,*,*,*
拘る,1414,1414,9931,拘る,動詞,一般,*,*,五段-ラ行,終止形-一般,コダワル,こだわる,463727,A,*,*,*,*

The normalized form (語彙素) of "拘る" has been changed as follows.

"こだわる" => "拘る"

Is this change intended?
I think v20210802 is the correct one.
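
Changes of this kind can be found mechanically by comparing two releases of the lexicon. The following is a rough sketch: it assumes the normalized form sits in column 12 and identifies an entry by its headword plus the six part-of-speech columns (5 to 10), both inferred from the lines above. The function names are illustrative.

```python
import csv


def normalized_by_entry(lines):
    """Map (headword, part-of-speech columns) to the normalized form."""
    table = {}
    for fields in csv.reader(lines):
        key = (fields[0],) + tuple(fields[5:11])  # assumed identifying columns
        table[key] = fields[12]                   # assumed normalized-form column
    return table


def changed_normalized_forms(old_lines, new_lines):
    """Return sorted (headword, old form, new form) for entries that changed."""
    old = normalized_by_entry(old_lines)
    new = normalized_by_entry(new_lines)
    return sorted(
        (key[0], old[key], new[key])
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    )


if __name__ == "__main__":
    v20210802 = ["拘る,1414,1414,9931,拘る,動詞,一般,*,*,五段-ラ行,終止形-一般,コダワル,こだわる,463727,A,*,*,*,*"]
    v20211220 = ["拘る,1414,1414,9931,拘る,動詞,一般,*,*,五段-ラ行,終止形-一般,コダワル,拘る,463547,A,*,*,*,*"]
    for surface, before, after in changed_normalized_forms(v20210802, v20211220):
        print(f"{surface}: {before} => {after}")
```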
