worksapplications / sudachi.rs

Sudachi in Rust 🦀 and new generation of SudachiPy

License: Apache License 2.0

Shell 0.59% Rust 89.09% Python 10.11% Makefile 0.09% Batchfile 0.11%

sudachi nlp-libary tokenization morphological-analysis segmentation pos-tagging rust python

sudachi.rs's Introduction

sudachi.rs - English README

2023-12-14 UPDATE: 0.6.8 Release

Try it:

pip install --upgrade 'sudachipy>=0.6.8'

sudachi.rs is a Rust implementation of Sudachi, a Japanese morphological analyzer.

Japanese README | SudachiPy Documentation

TL;DR

$ git clone https://github.com/WorksApplications/sudachi.rs.git
$ cd ./sudachi.rs

$ cargo build --release
$ cargo install --path sudachi-cli/
$ ./fetch_dictionary.sh

$ echo "高輪ゲートウェイ駅" | sudachi
高輪ゲートウェイ駅  名詞,固有名詞,一般,*,*,*    高輪ゲートウェイ駅
EOS

Example

Multi-granular Tokenization

$ echo 選挙管理委員会 | sudachi
選挙管理委員会  名詞,固有名詞,一般,*,*,*        選挙管理委員会
EOS

$ echo 選挙管理委員会 | sudachi --mode A
選挙    名詞,普通名詞,サ変可能,*,*,*    選挙
管理    名詞,普通名詞,サ変可能,*,*,*    管理
委員    名詞,普通名詞,一般,*,*,*        委員
会      名詞,普通名詞,一般,*,*,*        会
EOS

Normalized Form

$ echo 打込む かつ丼 附属 vintage | sudachi
打込む  動詞,一般,*,*,五段-マ行,終止形-一般     打ち込む
        空白,*,*,*,*,*
かつ丼  名詞,普通名詞,一般,*,*,*        カツ丼
        空白,*,*,*,*,*
附属    名詞,普通名詞,サ変可能,*,*,*    付属
        空白,*,*,*,*,*
vintage 名詞,普通名詞,一般,*,*,*        ビンテージ
EOS

Wakati (space-delimited surface form) Output

$ cat lemon.txt
えたいの知れない不吉な塊が私の心を始終圧えつけていた。
焦躁と言おうか、嫌悪と言おうか――酒を飲んだあとに宿酔があるように、酒を毎日飲んでいると宿酔に相当した時期がやって来る。
それが来たのだ。これはちょっといけなかった。

$ sudachi --wakati lemon.txt
えたい の 知れ ない 不吉 な 塊 が 私 の 心 を 始終 圧え つけ て い た 。
焦躁 と 言おう か 、 嫌悪 と 言おう か ― ― 酒 を 飲ん だ あと に 宿酔 が ある よう に 、 酒 を 毎日 飲ん で いる と 宿酔 に 相当 し た 時期 が やっ て 来る 。
それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。

Setup

You need sudachi.rs, the default plugins, and a dictionary. (This crate does not include a dictionary.)

1. Get the source code

$ git clone https://github.com/WorksApplications/sudachi.rs.git

2. Download a Sudachi Dictionary

Sudachi requires a dictionary to operate. You can download a dictionary ZIP file from WorksApplications/SudachiDict (choose one of small, core, or full), unzip it, and place the system_*.dic file anywhere you like. With the default settings file, sudachi.rs expects it at resources/system.dic.

Convenience Script

Optionally, you can use the fetch_dictionary.sh shell script to download a dictionary and install it to resources/system.dic.

$ ./fetch_dictionary.sh

3. Build

$ cargo build --release

Build (bake dictionary into binary)

This feature has been removed for now and does not currently work; see #35.

Specify the bake_dictionary feature to embed a dictionary into the binary. The sudachi executable will then contain the dictionary binary. The baked dictionary is used if no dictionary is specified via a CLI option or the settings file.

You must specify the path to the dictionary file in the SUDACHI_DICT_PATH environment variable when building. SUDACHI_DICT_PATH is either relative to the sudachi.rs directory or absolute.

Example on Unix-like system:

# Download dictionary to resources/system.dic
$ ./fetch_dictionary.sh

# Build with bake_dictionary feature (relative path)
$ env SUDACHI_DICT_PATH=resources/system.dic cargo build --release --features bake_dictionary

# or

# Build with bake_dictionary feature (absolute path)
$ env SUDACHI_DICT_PATH=/path/to/my-sudachi.dic cargo build --release --features bake_dictionary

4. Install

sudachi.rs/ $ cargo install --path sudachi-cli/

$ which sudachi
/Users/<USER>/.cargo/bin/sudachi

$ sudachi -h
sudachi 0.6.0
A Japanese tokenizer
...

Usage as a command

$ sudachi -h
A Japanese tokenizer

Usage: sudachi [OPTIONS] [FILE] [COMMAND]

Commands:
  build
          Builds system dictionary
  ubuild
          Builds user dictionary
  dump

  help
          Print this message or the help of the given subcommand(s)

Arguments:
  [FILE]
          Input text file: If not present, read from STDIN

Options:
  -r, --config-file <CONFIG_FILE>
          Path to the setting file in JSON format
  -p, --resource_dir <RESOURCE_DIR>
          Path to the root directory of resources
  -m, --mode <MODE>
          Split unit: "A" (short), "B" (middle), or "C" (Named Entity) [default: C]
  -o, --output <OUTPUT_FILE>
          Output text file: If not present, use stdout
  -a, --all
          Prints all fields
  -w, --wakati
          Outputs only surface form
  -d, --debug
          Debug mode: Print the debug information
  -l, --dict <DICTIONARY_PATH>
          Path to sudachi dictionary. If None, it refer config and then baked dictionary
      --split-sentences <SPLIT_SENTENCES>
          How to split sentences [default: yes]
  -h, --help
          Print help (see more with '--help')
  -V, --version
          Print version

Output

Columns are tab separated.

Surface
Part-of-Speech Tags (comma separated)
Normalized Form

When you add the -a (--all) flag, it additionally outputs

Dictionary Form
Reading Form
Dictionary ID
- 0 for the system dictionary
- 1 and above for the user dictionaries
- -1 if a word is Out-of-Vocabulary (not in the dictionary)
Synonym group IDs
(OOV) if a word is Out-of-Vocabulary (not in the dictionary)

$ echo "外国人参政権" | sudachi -a
外国人参政権    名詞,普通名詞,一般,*,*,*        外国人参政権    外国人参政権    ガイコクジンサンセイケン      0       []
EOS

$ echo "阿quei" | sudachipy -a
阿      名詞,普通名詞,一般,*,*,*        阿      阿              -1      []      (OOV)
quei    名詞,普通名詞,一般,*,*,*        quei    quei            -1      []      (OOV)
EOS

When you add the -w (--wakati) flag, it outputs space-delimited surface forms instead.

$ echo "外国人参政権" | sudachi -m A -w
外国 人 参政 権
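
As an illustration of the default column layout described above (surface, comma-separated part-of-speech tags, normalized form, separated by tabs), here is a minimal Rust sketch that splits one output line into its fields. The OutputLine struct and parse_line function are hypothetical helpers written for this example; they are not part of sudachi.rs.

```rust
/// One line of default `sudachi` output, split into its three tab-separated columns.
/// Hypothetical helper type for this example only.
#[derive(Debug, PartialEq)]
struct OutputLine<'a> {
    surface: &'a str,
    pos: Vec<&'a str>,
    normalized: &'a str,
}

/// Parses a single output line.
/// Returns `None` for the `EOS` marker or for lines with fewer than three columns.
fn parse_line(line: &str) -> Option<OutputLine<'_>> {
    if line == "EOS" {
        return None;
    }
    let mut cols = line.split('\t');
    let surface = cols.next()?;
    let pos = cols.next()?.split(',').collect();
    let normalized = cols.next()?;
    Some(OutputLine { surface, pos, normalized })
}

fn main() {
    let line = "選挙\t名詞,普通名詞,サ変可能,*,*,*\t選挙";
    let parsed = parse_line(line).expect("three tab-separated columns");
    println!("{} -> {:?} ({})", parsed.surface, parsed.pos, parsed.normalized);
    assert!(parse_line("EOS").is_none());
}
```

With the -a flag the extra columns follow the normalized form, so the same split-on-tab approach should extend to them.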

ToDo

Out of Vocabulary handling
Easy dictionary file install & management, similar to SudachiPy
Registration to crates.io

References

Sudachi

Morphological Analyzers in Rust

Logo

Sudachi Logo
Crab illustration: Pixabay

sudachi.rs's Issues

Stateful Utf8InputText

Part of #36

Rework (and eventually replace) Utf8InputText/Utf8InputTextBuilder so that they:

Are reusable between analyses
Reuse allocated buffers for successive analyses
Expose a codepoint-based API for iterating over modified string slices
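
A rough sketch of what buffer reuse could look like, assuming a hypothetical InputBuffer type that keeps the original and the modified text in two String fields. The type, its fields, and its methods are invented for this illustration and do not reflect the actual Utf8InputText design.

```rust
/// Hypothetical reusable input buffer (names are assumptions for this sketch).
struct InputBuffer {
    original: String,
    modified: String,
}

impl InputBuffer {
    fn new() -> Self {
        InputBuffer { original: String::new(), modified: String::new() }
    }

    /// Loads new text while keeping the previously allocated capacity.
    fn reset(&mut self, text: &str) {
        self.original.clear();
        self.original.push_str(text);
        self.modified.clear();
        self.modified.push_str(text);
    }

    /// Codepoint-based iteration over the modified text: (byte offset, char).
    fn modified_chars(&self) -> impl Iterator<Item = (usize, char)> + '_ {
        self.modified.char_indices()
    }
}

fn main() {
    let mut buf = InputBuffer::new();
    buf.reset("選挙管理委員会");
    let capacity_after_first = buf.original.capacity();
    // The second, shorter input fits into the existing allocation.
    buf.reset("会");
    assert_eq!(buf.original.capacity(), capacity_after_first);
    for (offset, ch) in buf.modified_chars() {
        println!("{}: {}", offset, ch);
    }
}
```

String::clear keeps the allocation, so successive analyses only allocate when an input is longer than any seen before.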

Simplify import path for python binding

Part of #25.

With the current Python binding implementation, we need to write import sudachi.sudachi to import the module.
Add __init__.py to fix this.

ref: Hugging Face Tokenizers seems to have a script that generates it.

CI: make builds faster

Cache target and cargo dirs for quick builds
Do full daily (weekly?) builds if it is possible

Dependency Licenses

We need to go over the licenses of our dependencies and check that they are compatible.

Cargo.toml should contain URLs to package homepage (preferably github) and license for every dependency.

SudachiPy compatibility

Part of #25.
Provide the same interface as SudachiPy as far as possible.

Maybe after the Rust API becomes stable (#28).

No license

This project has no license, which means (by default) users do not have rights to use it.

Fix steps:

Add license file to repo
- https://docs.github.com/en/github/building-a-strong-community/adding-a-license-to-a-repository
Add "license" field to Cargo.toml
- https://doc.rust-lang.org/cargo/reference/manifest.html#the-license-and-license-file-fields

I suggest the Apache 2.0 license, like Sudachi.

Allow plugins

As in the Java and Python versions, let users use and add plugins.
https://github.com/WorksApplications/Sudachi#plugins

Non-allocating Trie lookup

OOV Handling

Currently, there is no OOV handling at all.

As a result, the tokenizer sometimes panics when the dictionary contains no word for the input.

e.g.,

$ echo Sudachi | sudachi
thread 'main' panicked at 'EOS isn't connected to BOS', src/lattice.rs:70:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

Add CI integration

GitHub Actions-based

Should build, run tests and examples (when they are working)

Move analyzer internals to analysis package

Right now some of them are in the root, while lattice is separated into a subpackage.

Merge lattice and analyzer into a single package.

Decrease size of Node

Nodes are copied around a lot, and Rust does not have placement semantics yet.
The easiest way to improve the analysis speed is to decrease the size of the Node structure.

It should be made at most 32 bytes (currently 264 bytes).
It means that most of the fields must go away and WordInfo must be moved out of the structure.
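
To make the size target concrete, the sketch below defines a hypothetical compact node that holds only indices and costs and checks its size with std::mem::size_of. The field names and widths are assumptions chosen for illustration; they are not the actual layout of Node in sudachi.rs.

```rust
use std::mem::size_of;

/// Hypothetical compact lattice node: only indices and costs, no `WordInfo`.
/// Field names and widths are illustrative assumptions.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
struct CompactNode {
    begin: u16,
    end: u16,
    left_id: u16,
    right_id: u16,
    cost: i16,
    word_id: u32,
    total_cost: i32,
}

fn main() {
    println!("CompactNode is {} bytes", size_of::<CompactNode>());
    // The issue's target: at most 32 bytes per node.
    assert!(size_of::<CompactNode>() <= 32);
}
```

A check like this can also live in a unit test so that the size limit is enforced whenever a field is added.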

Disable dictionary version check

The dictionary version check at src/dic/header.rs line 17 may not be needed.

        assert_eq!(header.version, 0x7366d3f18bd111e7);

Reason

When I use sudachi-dictionary-20210608-core.zip, the built sudachi command failed.

$ echo "あ" | sudachi
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `14888620106744153140`,
 right: `8315566796372513255`', src/dic/header.rs:17:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

When I commented out src/dic/header.rs line 17,

        //assert_eq!(header.version, 0x7366d3f18bd111e7);

the rebuilt sudachi command ran successfully.

$ echo "あ" | sudachi
あ	感動詞,フィラー,*,*,*,*	あー
EOS
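
An alternative to removing the check entirely would be to accept a set of known version tags and return an error instead of panicking. The sketch below assumes that the value reported in the panic message above (14888620106744153140) is a second valid system dictionary version; that assumption, the HeaderError type, and the check_version function are all hypothetical and exist only for this example.

```rust
/// Version tag expected by the original `assert_eq!`.
const SYSTEM_DICT_V1: u64 = 0x7366d3f18bd111e7;
/// Value read from sudachi-dictionary-20210608-core, taken from the panic message.
/// Treating it as a second valid version is an assumption of this sketch.
const SYSTEM_DICT_V2: u64 = 14888620106744153140;

/// Hypothetical error type for header parsing.
#[derive(Debug, PartialEq)]
enum HeaderError {
    UnknownVersion(u64),
}

/// Accepts known versions and reports anything else as an error instead of panicking.
fn check_version(version: u64) -> Result<(), HeaderError> {
    match version {
        SYSTEM_DICT_V1 | SYSTEM_DICT_V2 => Ok(()),
        other => Err(HeaderError::UnknownVersion(other)),
    }
}

fn main() {
    for &v in &[SYSTEM_DICT_V1, SYSTEM_DICT_V2, 0] {
        println!("{:#x}: {:?}", v, check_version(v));
    }
}
```

Returning a Result lets the CLI print a readable message about an unsupported dictionary rather than a backtrace.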

Update nom from v4 to v7

We currently use nom to parse the dictionary.

Plugin loader: Refactor duplicate code

Plugin-loader: Add variables

$bin - which resolves to the directory of the sudachi binary
$cfg - which resolves to the directory of the sudachi config (or config root?)
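
One possible shape for this substitution, sketched under the assumption that the variable appears only at the start of a plugin path and is followed by a path separator. The function names are hypothetical, and the directories are passed in explicitly; a real loader could obtain the binary directory from std::env::current_exe.

```rust
use std::path::{Path, PathBuf};

/// Expands `var` when it is the first path component of `raw`.
/// Returns `None` if `raw` does not start with `var` as a whole component.
fn expand_var(raw: &str, var: &str, dir: &Path) -> Option<PathBuf> {
    let rest = raw.strip_prefix(var)?;
    if rest.is_empty() {
        return Some(dir.to_path_buf());
    }
    // Require a separator so that e.g. "$binary/x" is not treated as "$bin".
    let rest = rest.strip_prefix('/').or_else(|| rest.strip_prefix('\\'))?;
    Some(dir.join(rest))
}

/// Resolves `$bin` and `$cfg` in a plugin path; other paths are returned unchanged.
fn resolve_plugin_path(raw: &str, bin_dir: &Path, cfg_dir: &Path) -> PathBuf {
    expand_var(raw, "$bin", bin_dir)
        .or_else(|| expand_var(raw, "$cfg", cfg_dir))
        .unwrap_or_else(|| PathBuf::from(raw))
}

fn main() {
    // Example directories; these paths are placeholders.
    let bin_dir = Path::new("/opt/sudachi/bin");
    let cfg_dir = Path::new("/etc/sudachi");
    for raw in &["$bin/libplugin.so", "$cfg/plugins/libother.so", "plain/libthird.so"] {
        println!("{} -> {}", raw, resolve_plugin_path(raw, bin_dir, cfg_dir).display());
    }
}
```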

Re-implement baking dictionary into sudachi binary

But first we need to design this feature better, probably it does not make sense to load user dictionaries or configuration in this setting.

Huggingface Tokenisers Integration

The easiest route to integration is probably to use the CustomPretokeniser hook in the Python bindings.
https://github.com/huggingface/tokenizers/blob/master/bindings/python/src/pre_tokenizers.rs#L547

Good things:

Python will handle dynamic library loading for us

Bad things:

Some performance overhead on python method calls
Some performance overhead on string copying

Anything else would require significant additions into the Huggingface Tokeniser framework.
Let's prototype this integration and switch to another approach if its overhead turns out to be too high.

Plan:

Add a create_pretokenizer() method to the Dictionary class (Python binding) that returns an object compatible with HuggingFace.

Depends on #166

Change Python API

relates to #52.

Some of the current Python APIs seem unnecessary (e.g. MorphemeList.split).
We may be able to provide a better way to do the same thing.

Public Rust API

We want to design public API so Sudachi would be usable like the following.
Syntax can be a bit invalid and all names are open for discussion.

let model = JapaneseModel::from_cfg("...")?;
let mut analyzer = model.new_analyzer();

for line in data {
  for sentence in analyzer.analyze(line)? {
    for token in sentence {
      println!("{}", token.surface);
    }
  }
}

Key points of API

Model should be immutable and safe to share between threads
It contains the dictionary, the connection data, and the constant data needed for preprocessing
Analyzer contains mutable data for analysis, e.g. the lattice, and tries to reuse allocations as much as possible.
In the long run, the analyzer should perform O(1) allocations

Because of Python API and lifetime considerations, Model should be a thin wrapper on Arc<RealModel> or something like that.

Layering

We have a Rust API and a Python API with different lifetime considerations.
The Rust API should use lifetimes to safeguard against misuse and mostly use references for sharing data. Python, on the other hand, can't use Rust lifetimes and should mostly use Arc for sharing data.

The design proposal here is to have pointer-generic internals with thin wrappers for API types, which mostly exist for instantiating concrete types.
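
One way to read "pointer-generic internals with thin wrappers" is a struct that is generic over any pointer type dereferencing to the dictionary, plus type aliases that pick a concrete pointer. The sketch below is only an interpretation of the proposal: Dictionary here is a stand-in with a single field, and TokenizerImpl, BorrowedTokenizer, and SharedTokenizer are names invented for this example.

```rust
use std::ops::Deref;
use std::sync::Arc;

/// Stand-in for the immutable dictionary data (hypothetical, for illustration).
struct Dictionary {
    name: String,
}

/// Pointer-generic internals: works with any pointer that derefs to `Dictionary`.
struct TokenizerImpl<D: Deref<Target = Dictionary>> {
    dict: D,
}

impl<D: Deref<Target = Dictionary>> TokenizerImpl<D> {
    fn dict_name(&self) -> &str {
        &self.dict.name
    }
}

/// Thin wrapper for the Rust API: borrows the dictionary, checked by lifetimes.
type BorrowedTokenizer<'a> = TokenizerImpl<&'a Dictionary>;
/// Thin wrapper for the Python binding: shares ownership through `Arc`.
type SharedTokenizer = TokenizerImpl<Arc<Dictionary>>;

fn main() {
    let dict = Dictionary { name: "core".to_string() };
    let borrowed: BorrowedTokenizer = TokenizerImpl { dict: &dict };
    println!("borrowed: {}", borrowed.dict_name());

    let shared: SharedTokenizer = TokenizerImpl {
        dict: Arc::new(Dictionary { name: "full".to_string() }),
    };
    println!("shared: {}", shared.dict_name());
}
```

The analysis code is written once against TokenizerImpl, and each API layer only chooses which pointer type to instantiate it with.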

API Surface (Types)

Dictionary - stores immutable data for tokenization
Tokenizer - stores mutable state for tokenization
InputBuffer - handles zero-copy input, sentence splitting and streaming of input data (eventually)
MorphemeList - analysis result of a single block of input data
Morpheme - unit of analysis result

Serializable PyTokenizer

Part of #52, relates to #38.
Make PyTokenizer serializable so that we can save and load the model.

CLI IO - Change Impls from Box<dyn Reader> to BufReader<Box<dyn Reader>>

Making buffering easily inlinable; internal writes should not have much impact.

Same for readers.
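
A minimal sketch of the wrapping described in the title, using the standard library traits Read and Write (the title's Reader is assumed to refer to these). The copy_lines function is a hypothetical stand-in for the CLI's processing loop.

```rust
use std::io::{self, BufRead, BufReader, BufWriter, Cursor, Read, Write};

/// Copies lines from `input` to `output`, buffering both sides so that the
/// dynamic dispatch through `Box<dyn ...>` happens once per buffer refill or
/// flush rather than once per small read or write.
fn copy_lines(input: Box<dyn Read>, output: Box<dyn Write>) -> io::Result<usize> {
    let reader = BufReader::new(input);
    let mut writer = BufWriter::new(output);
    let mut count = 0;
    for line in reader.lines() {
        writeln!(writer, "{}", line?)?;
        count += 1;
    }
    writer.flush()?;
    Ok(count)
}

fn main() -> io::Result<()> {
    // In the CLI the input would be a file or stdin; a cursor keeps the example self-contained.
    let input: Box<dyn Read> = Box::new(Cursor::new("高輪ゲートウェイ駅\n選挙管理委員会\n"));
    let output: Box<dyn Write> = Box::new(io::stdout());
    let n = copy_lines(input, output)?;
    eprintln!("{} lines", n);
    Ok(())
}
```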

Reorganize source code

Proposal for reorganization:

main folder: sudachi crate (rlib), does not produce binary, is library only
sudachi-cli folder: sudachi-cli (bin), produces sudachi executable
python: python bindings (cdylib), #25
plugins: contains individual plugins (cdylib) (for testing only, core plugins: #17)

Other nice things to have: API usage examples

Optimize connection cost lookup

This also takes ~30% of the whole analysis time.

The first attempt should improve how the lookup interacts with plugins that override connection costs, and fix the inefficient lookup logic.
Second try should completely remove all hashmaps from connection cost lookup.

Fix getitem of PyMorphemeList

Index access on the current PyMorphemeList is wrong:

morphemes = tokenizer.tokenize("東京都", SplitMode.A)
len(morphemes)  # == 2
morphemes[-3].surface()  # returns "都", should raise error

pyo3 automatically handles negative indices when we implement __getitem__, so we should not handle them ourselves.
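
The behavior in the example is consistent with the index being shifted twice: once by the binding layer and once more by our own code. The following pure-Rust model illustrates that reading. It does not call pyo3; binding_adjust is a hypothetical stand-in for the adjustment the issue attributes to pyo3.

```rust
/// Models the adjustment the issue attributes to the binding layer:
/// a negative index is shifted by `len` exactly once.
fn binding_adjust(index: isize, len: usize) -> isize {
    if index < 0 { index + len as isize } else { index }
}

/// Buggy lookup: shifts a still-negative index again,
/// so -3 with len 2 becomes -1 and then 1.
fn get_buggy<'a>(items: &[&'a str], adjusted: isize) -> Option<&'a str> {
    let idx = if adjusted < 0 { adjusted + items.len() as isize } else { adjusted };
    if idx < 0 { None } else { items.get(idx as usize).copied() }
}

/// Fixed lookup: only bounds-checks the already adjusted index.
fn get_fixed<'a>(items: &[&'a str], adjusted: isize) -> Option<&'a str> {
    if adjusted < 0 { None } else { items.get(adjusted as usize).copied() }
}

fn main() {
    let morphemes = ["東京", "都"];
    let adjusted = binding_adjust(-3, morphemes.len());
    println!("buggy: {:?}", get_buggy(&morphemes, adjusted));
    println!("fixed: {:?}", get_fixed(&morphemes, adjusted));
}
```

In the binding itself, the None case would map to raising IndexError.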

Python binding

Create a SudachiPy-compatible Python binding

fetch_dictionary.sh fails

Because the src/resources directory does not exist, fetch_dictionary.sh fails to move the dictionary file into it.
Putting an empty .keep file into the directory will resolve this problem.

Make plugins to be built automatically without specifying --all

Can be done by specifying dependency as

plugin = { path = "path/to/plugin" }

https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-path-dependencies

Look into MeCab provider tests

Generally, tests should not expect panics; they should check for the exact errors instead.

Fix clippy warnings

Speedup Weight Lookup

Lookup must be a single memory access.
Hashmaps are too heavy for checking.
For an inhibit-connection or weight-editing plugin, we will make a full copy of the weight matrix and modify it in place.
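
A sketch of a lookup that needs a single memory access: the costs live in one contiguous array indexed by the left and right ids. The row-major layout, the ConnectionMatrix type, and the INHIBITED constant are assumptions made for this example, not the dictionary's actual format.

```rust
/// Cost used by this sketch to mark an inhibited connection (an assumption).
const INHIBITED: i16 = i16::MAX;

/// Connection costs stored as one contiguous array, indexed by (left, right) id.
#[derive(Clone)]
struct ConnectionMatrix {
    num_right: usize,
    costs: Vec<i16>,
}

impl ConnectionMatrix {
    fn new(num_left: usize, num_right: usize) -> Self {
        ConnectionMatrix { num_right, costs: vec![0; num_left * num_right] }
    }

    /// Single indexed read, no hashing.
    #[inline]
    fn cost(&self, left: u16, right: u16) -> i16 {
        self.costs[left as usize * self.num_right + right as usize]
    }

    fn set_cost(&mut self, left: u16, right: u16, cost: i16) {
        self.costs[left as usize * self.num_right + right as usize] = cost;
    }
}

fn main() {
    let mut base = ConnectionMatrix::new(2, 3);
    base.set_cost(1, 2, 42);

    // A plugin that edits weights works on its own full copy.
    let mut edited = base.clone();
    edited.set_cost(1, 2, INHIBITED);

    println!("base: {}, edited: {}", base.cost(1, 2), edited.cost(1, 2));
}
```

Because the plugin edits a copy, the hot path never has to consult an override table.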

Stateful Tokeniser

Part of #36

It should reuse allocations between analyses.
Parts:

CI - add sanity check that the main binary works

Interactive usage does not work

Sudachi binary outputs nothing until you pass it end of file

Plugin loader: check compiler version

The Rust ABI is unstable, and using plugins compiled with a different compiler version can lead to undefined behavior (UB).

AnalyzedSentence Design

Prereq of #52

We want a pointer-generic version of the return container for the analysis result.

Rust API wants references/mut references; Python binding wants Arc.

Easy installable dictionary

Remove lifetime parameter from Morpheme

To use Morpheme in the python binding (#25), we need to remove lifetime parameter from it.

Make CharacterCategory a bitset

Part of #36
Related to #53

Right now it is a hashset of enums, which is very inefficient compared to a bitset.

Making it a bitset will reduce the computational complexity of many algorithms from O(n) (as with hashmaps) to O(1) and will remove allocations.
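
A minimal sketch of a bitset-based category set, where union, membership, and intersection are each a single integer operation with no allocation. The CategorySet type, the chosen category names, and the bit positions are illustrative assumptions.

```rust
/// Character categories as bit flags in a single `u32`.
/// Names and bit positions are illustrative assumptions.
#[derive(Clone, Copy, PartialEq, Eq, Debug, Default)]
struct CategorySet(u32);

#[allow(dead_code)]
impl CategorySet {
    const EMPTY: CategorySet = CategorySet(0);
    const KANJI: CategorySet = CategorySet(1 << 0);
    const HIRAGANA: CategorySet = CategorySet(1 << 1);
    const KATAKANA: CategorySet = CategorySet(1 << 2);
    const NUMERIC: CategorySet = CategorySet(1 << 3);

    /// Set union: one bitwise OR.
    fn union(self, other: CategorySet) -> CategorySet {
        CategorySet(self.0 | other.0)
    }

    /// True if every category in `other` is also in `self`.
    fn contains(self, other: CategorySet) -> bool {
        self.0 & other.0 == other.0
    }

    /// True if the two sets share at least one category.
    fn intersects(self, other: CategorySet) -> bool {
        self.0 & other.0 != 0
    }

    /// Number of categories in the set.
    fn len(self) -> u32 {
        self.0.count_ones()
    }
}

fn main() {
    let set = CategorySet::KANJI.union(CategorySet::NUMERIC);
    println!("contains KANJI: {}", set.contains(CategorySet::KANJI));
    println!("contains HIRAGANA: {}", set.contains(CategorySet::HIRAGANA));
    println!("size: {}", set.len());
}
```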

Registration to crates.io

"İ" does not behave the same as the Java version of Sudachi.

I noticed that SudachiPy and Sudachi behave differently: "İstanbul" is not recognized as a single token in SudachiPy, so I am reporting it.

$ echo "İstanbul" | sudachipy -a
İ       名詞,普通名詞,一般,*,*,*        I       I       アイ    0       []
        補助記号,一般,*,*,*,*   ̇       ̇               -1      []      (OOV)
stanbul 名詞,普通名詞,一般,*,*,*        stanbul stanbul         -1      []      (OOV)
EOS

$ echo "İstanbul" | sudachi -a
İstanbul        名詞,固有名詞,一般,*,*,*        Istanbul        Istanbul        Istanbul        0       [15600]
EOS

Apparently, the character normalization step passes different input to each implementation.

$ echo "İstanbul" | sudachipy -d
=== Inupt dump:
i(U+0307)stanbul

$ echo "İstanbul" | sudachi -d
=== Input dump:
istanbul

It seems that "İ" (U+0130) is converted to "i" (U+0069) followed by "◌̇" (U+0307) in Python, and to "i" (U+0069) alone in Java.
This may be due to how each programming language specifies lowercasing.
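
The two results can be reproduced in Rust, which helps show where they come from. Full Unicode lowercasing maps U+0130 to two code points, while a mapping that keeps one code point per input character yields a plain "i". The lower_single function below is an illustrative model of such a one-to-one mapping; it is an assumption for this example and not the code used by either Sudachi implementation.

```rust
/// Full Unicode lowercasing: one code point may expand to several.
fn lower_full(s: &str) -> String {
    s.to_lowercase()
}

/// One-to-one lowercasing: keeps only the first code point of each mapping.
/// Illustrative model of a per-code-point API, not Sudachi's actual code.
fn lower_single(s: &str) -> String {
    s.chars()
        .map(|c| c.to_lowercase().next().unwrap_or(c))
        .collect()
}

fn main() {
    // "\u{130}" is the precomposed capital I with dot above.
    let input = "\u{130}stanbul";
    for (name, out) in &[("full", lower_full(input)), ("single", lower_single(input))] {
        let code_points: Vec<String> =
            out.chars().map(|c| format!("U+{:04X}", c as u32)).collect();
        println!("{}: {}", name, code_points.join(" "));
    }
}
```

Whichever mapping is chosen, both implementations need to apply the same one for their tokenization to agree.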

Remove allocations from the analysis hotpath

Now malloc/free take ~30% of whole analysis time on Linux.

Add sentence splitter

ref: https://github.com/WorksApplications/Sudachi/tree/develop/src/main/java/com/worksap/nlp/sudachi/sentdetect

User-facing sentence splitter

As per #28, we want to decouple analysis from sentence splitting.

This issue tracks creating sentence splitter as API.

Trie: check usize usage

Right now the tries use usize as their internal unit, which seems wrong because its size is platform-dependent.
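
A small sketch of the alternative: store the trie units in a fixed-width type and convert to usize only when indexing. The choice of u32 and the FixedTrie type are assumptions for illustration; the appropriate width depends on the actual trie format.

```rust
use std::mem::size_of;

/// Trie storage with fixed-width units, so the layout is the same on every platform.
/// The `u32` unit width is an assumption of this sketch.
struct FixedTrie {
    units: Vec<u32>,
}

impl FixedTrie {
    /// Reads one unit; the index is widened to `usize` only at the point of access.
    fn unit(&self, index: u32) -> Option<u32> {
        self.units.get(index as usize).copied()
    }
}

fn main() {
    println!(
        "usize: {} bytes on this platform, u32: always {} bytes",
        size_of::<usize>(),
        size_of::<u32>()
    );
    let trie = FixedTrie { units: vec![10, 20, 30] };
    println!("unit 1 = {:?}", trie.unit(1));
}
```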

Change dictionary download URL

Currently, fetch_dictionary.sh does not seem to work.
https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/${DICT_NAME}.zip cannot be accessed because it requires authentication.

$ ./fetch_dictionary.sh 
Downloading a dictionary file `sudachi-dictionary-20200722-core` ...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   131  100   131    0     0    521      0 --:--:-- --:--:-- --:--:--   521
Archive:  sudachi-dictionary-20200722-core.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of sudachi-dictionary-20200722-core.zip or
        sudachi-dictionary-20200722-core.zip.zip, and cannot find sudachi-dictionary-20200722-core.zip.ZIP, period.
mv: cannot stat 'sudachi-dictionary-20200722/system_core.dic': No such file or directory

Placed a dictionary file to `src/resources/system.dic` .

When I changed the download URL to http://sudachi.s3-website-ap-northeast-1.amazonaws.com/sudachidict/${DICT_NAME}.zip,
fetch_dictionary.sh works successfully!

$ ./fetch_dictionary.sh 
Downloading a dictionary file `sudachi-dictionary-20200722-core` ...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 67.8M  100 67.8M    0     0  2899k      0  0:00:23  0:00:23 --:--:-- 3121k
Archive:  sudachi-dictionary-20200722-core.zip
   creating: sudachi-dictionary-20200722/
  inflating: sudachi-dictionary-20200722/system_core.dic  
  inflating: sudachi-dictionary-20200722/LEGAL  
  inflating: sudachi-dictionary-20200722/LICENSE-2.0.txt  

Placed a dictionary file to `src/resources/system.dic` .