
sudachi.rs's Introduction

sudachi.rs - English README


2023-12-14 UPDATE: 0.6.8 Release

Try it:

pip install --upgrade 'sudachipy>=0.6.8'


sudachi.rs is a Rust implementation of Sudachi, a Japanese morphological analyzer.

Japanese README, SudachiPy Documentation

TL;DR

$ git clone https://github.com/WorksApplications/sudachi.rs.git
$ cd ./sudachi.rs

$ cargo build --release
$ cargo install --path sudachi-cli/
$ ./fetch_dictionary.sh

$ echo "高輪ゲートウェイ駅" | sudachi
高輪ゲートウェイ駅  名詞,固有名詞,一般,*,*,*    高輪ゲートウェイ駅
EOS

Example

Multi-granular Tokenization

$ echo 選挙管理委員会 | sudachi
選挙管理委員会  名詞,固有名詞,一般,*,*,*        選挙管理委員会
EOS

$ echo 選挙管理委員会 | sudachi --mode A
選挙    名詞,普通名詞,サ変可能,*,*,*    選挙
管理    名詞,普通名詞,サ変可能,*,*,*    管理
委員    名詞,普通名詞,一般,*,*,*        委員
会      名詞,普通名詞,一般,*,*,*        会
EOS

Normalized Form

$ echo 打込む かつ丼 附属 vintage | sudachi
打込む  動詞,一般,*,*,五段-マ行,終止形-一般     打ち込む
        空白,*,*,*,*,*
かつ丼  名詞,普通名詞,一般,*,*,*        カツ丼
        空白,*,*,*,*,*
附属    名詞,普通名詞,サ変可能,*,*,*    付属
        空白,*,*,*,*,*
vintage 名詞,普通名詞,一般,*,*,*        ビンテージ
EOS

Wakati (space-delimited surface form) Output

$ cat lemon.txt
えたいの知れない不吉な塊が私の心を始終圧えつけていた。
焦躁と言おうか、嫌悪と言おうか――酒を飲んだあとに宿酔があるように、酒を毎日飲んでいると宿酔に相当した時期がやって来る。
それが来たのだ。これはちょっといけなかった。

$ sudachi --wakati lemon.txt
えたい の 知れ ない 不吉 な 塊 が 私 の 心 を 始終 圧え つけ て い た 。
焦躁 と 言おう か 、 嫌悪 と 言おう か ― ― 酒 を 飲ん だ あと に 宿酔 が ある よう に 、 酒 を 毎日 飲ん で いる と 宿酔 に 相当 し た 時期 が やっ て 来る 。
それ が 来 た の だ 。 これ は ちょっと いけ なかっ た 。

Setup

You need sudachi.rs, the default plugins, and a dictionary. (This crate does not include a dictionary.)

1. Get the source code

$ git clone https://github.com/WorksApplications/sudachi.rs.git

2. Download a Sudachi Dictionary

Sudachi requires a dictionary to operate. You can download a dictionary ZIP file from WorksApplications/SudachiDict (choose one of small, core, or full), unzip it, and place the system_*.dic file anywhere you like. With the default settings file, sudachi.rs expects the dictionary at resources/system.dic.

Convenience Script

Optionally, you can use the fetch_dictionary.sh shell script to download a dictionary and install it to resources/system.dic.

$ ./fetch_dictionary.sh

3. Build

$ cargo build --release

Build (bake dictionary into binary)

This feature is currently unimplemented and does not work; see #35.

Specify the bake_dictionary feature to embed a dictionary into the binary. The sudachi executable will then contain the dictionary binary. The baked dictionary is used if no dictionary is specified via a CLI option or the settings file.

When building, you must specify the path to the dictionary file in the SUDACHI_DICT_PATH environment variable. SUDACHI_DICT_PATH is either absolute or relative to the sudachi.rs directory.

Example on Unix-like system:

# Download dictionary to resources/system.dic
$ ./fetch_dictionary.sh

# Build with bake_dictionary feature (relative path)
$ env SUDACHI_DICT_PATH=resources/system.dic cargo build --release --features bake_dictionary

# or

# Build with bake_dictionary feature (absolute path)
$ env SUDACHI_DICT_PATH=/path/to/my-sudachi.dic cargo build --release --features bake_dictionary

4. Install

sudachi.rs/ $ cargo install --path sudachi-cli/

$ which sudachi
/Users/<USER>/.cargo/bin/sudachi

$ sudachi -h
sudachi 0.6.0
A Japanese tokenizer
...

Usage as a command

$ sudachi -h
sudachi 0.6.0
A Japanese tokenizer

USAGE:
    sudachi [FLAGS] [OPTIONS] [file]

FLAGS:
    -d, --debug      Debug mode: Print the debug information
    -h, --help       Prints help information
    -a, --all        Prints all fields
    -V, --version    Prints version information
    -w, --wakati     Outputs only surface form

OPTIONS:
    -r, --config-file <config-file>      Path to the setting file in JSON format
    -l, --dict <dictionary-path>         Path to sudachi dictionary. If None, it refer config and then baked dictionary
    -m, --mode <mode>                    Split unit: "A" (short), "B" (middle), or "C" (Named Entity) [default: C]
    -o, --output <output-file>
    -p, --resource_dir <resource-dir>    Path to the root directory of resources

ARGS:
    <file>    Input text file: If not present, read from STDIN

Output

Columns are tab separated.

  • Surface
  • Part-of-Speech Tags (comma separated)
  • Normalized Form

When you add the -a (--all) flag, it additionally outputs

  • Dictionary Form
  • Reading Form
  • Dictionary ID
    • 0 for the system dictionary
    • 1 and above for the user dictionaries
    • -1 if a word is Out-of-Vocabulary (not in the dictionary)
  • Synonym group IDs
  • (OOV) if a word is Out-of-Vocabulary (not in the dictionary)

$ echo "外国人参政権" | sudachi -a
外国人参政権    名詞,普通名詞,一般,*,*,*        外国人参政権    外国人参政権    ガイコクジンサンセイケン      0       []
EOS

$ echo "阿quei" | sudachipy -a
阿      名詞,普通名詞,一般,*,*,*        阿      阿              -1      []      (OOV)
quei    名詞,普通名詞,一般,*,*,*        quei    quei            -1      []      (OOV)
EOS
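
For downstream processing, the default three-column format described above can be read back with a few lines of Rust. This is only a sketch: `OutputLine` and `parse_line` are hypothetical names that are not part of sudachi.rs, and the code assumes the default output (without -a), where each line holds exactly the surface, the comma-separated part-of-speech tags, and the normalized form.

```rust
/// One line of `sudachi` output in the default (non `-a`) format.
/// Hypothetical helper type, not part of sudachi.rs.
#[derive(Debug, PartialEq)]
pub struct OutputLine {
    pub surface: String,
    pub pos: Vec<String>,
    pub normalized: String,
}

/// Parses one tab-separated output line.
/// Returns `None` for the `EOS` marker or for a line with fewer than three columns.
pub fn parse_line(line: &str) -> Option<OutputLine> {
    if line == "EOS" {
        return None;
    }
    let mut cols = line.split('\t');
    let surface = cols.next()?.to_string();
    // The part-of-speech column is itself comma separated.
    let pos = cols.next()?.split(',').map(str::to_string).collect();
    let normalized = cols.next()?.to_string();
    Some(OutputLine { surface, pos, normalized })
}
```

Any extra columns produced by -a are ignored by this sketch; a fuller parser would read them from the remaining items of `cols`.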

When you add the -w (--wakati) flag, it outputs space-delimited surface forms instead.

$ echo "外国人参政権" | sudachi -m A -w
外国 人 参政 権

ToDo

  • Out of Vocabulary handling
  • Easy dictionary file install & management, similar to SudachiPy
  • Registration to crates.io

References

Sudachi

Morphological Analyzers in Rust

Logo

sudachi.rs's People

Contributors

bignumorg, eiennohito, hata6502, kazuma-t, mh-northlander, sorami, tmfink, yokomotod


sudachi.rs's Issues

Public Rust API

We want to design the public API so that Sudachi can be used roughly as follows.
The snippet is a sketch rather than compilable code, and all names are open for discussion.

let model = JapaneseModel::from_cfg("...")?;
let mut analyzer = model.new_analyzer();

for line in data {
  // Each input line may contain several sentences.
  for sentence in analyzer.analyze(line)? {
    for token in sentence {
      println!("{}", token.surface);
    }
  }
}

Key points of API

  • Model should be immutable and safe to share between threads
  • It contains the dictionary, the connection data, and the constant data needed for preprocessing
  • Analyzer contains the mutable data for analysis (e.g. the lattice) and tries to reuse allocations as much as possible
  • In the long run, the analyzer should perform O(1) allocations

Because of the Python API and lifetime considerations, Model should be a thin wrapper around Arc<RealModel> or something similar.
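
The key points above could be sketched as follows. Everything here is illustrative: `RealModel`, `Model`, and `Analyzer` follow the names used in this proposal, but the fields and the toy whitespace-based "analysis" are assumptions made only to show the ownership structure (immutable data behind `Arc`, mutable scratch space reused between calls).

```rust
use std::sync::Arc;

/// Immutable data shared by all analyzers (hypothetical stand-in).
struct RealModel {
    dictionary: Vec<String>,
}

/// Thin, cheaply clonable handle; `Arc` makes it safe to share across threads.
#[derive(Clone)]
pub struct Model(Arc<RealModel>);

/// Mutable per-analysis state; the span buffer is reused between calls.
pub struct Analyzer {
    model: Model,
    buffer: Vec<(usize, usize)>,
}

impl Model {
    pub fn new(words: &[&str]) -> Self {
        let dictionary = words.iter().map(|w| w.to_string()).collect();
        Model(Arc::new(RealModel { dictionary }))
    }

    pub fn new_analyzer(&self) -> Analyzer {
        Analyzer { model: self.clone(), buffer: Vec::new() }
    }
}

impl Analyzer {
    /// Toy analysis: splits on single spaces and returns the byte spans of
    /// pieces found in the dictionary. Real lattice construction is omitted.
    pub fn analyze(&mut self, input: &str) -> &[(usize, usize)] {
        self.buffer.clear(); // keeps the capacity, so later calls need not reallocate
        let mut offset = 0;
        for piece in input.split(' ') {
            let end = offset + piece.len();
            if self.model.0.dictionary.iter().any(|w| w == piece) {
                self.buffer.push((offset, end));
            }
            offset = end + 1; // skip the space separator
        }
        &self.buffer
    }
}
```

Because `Analyzer::analyze` takes `&mut self` and returns a borrow of its own buffer, the borrow checker prevents a second analysis while the previous result is still in use, which is the kind of safeguard the Rust-side API is meant to provide.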

Layering

We have a Rust API and a Python API with different lifetime considerations.
The Rust API should use lifetimes to safeguard against misuse and should mostly share data through references. Python, on the other hand, cannot use Rust lifetimes and should mostly share data through Arc.

The design proposal is to have pointer-generic internals, with thin wrappers for the API types that mostly exist to instantiate concrete types.
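
One way to read "pointer-generic internals" is a struct that is generic over anything that dereferences to the shared data. The sketch below is an assumption about how that could look, not the agreed design; `DictData`, `TokenizerImpl`, and the two type aliases are hypothetical names.

```rust
use std::ops::Deref;
use std::sync::Arc;

/// Immutable dictionary data (hypothetical stand-in).
pub struct DictData {
    pub words: Vec<&'static str>,
}

/// Internals written once, generic over the pointer type `D`.
pub struct TokenizerImpl<D: Deref<Target = DictData>> {
    dict: D,
}

impl<D: Deref<Target = DictData>> TokenizerImpl<D> {
    pub fn new(dict: D) -> Self {
        Self { dict }
    }

    /// Returns true if `word` is present in the dictionary.
    pub fn contains(&self, word: &str) -> bool {
        let data: &DictData = &self.dict; // deref coercion works for `&T` and `Arc<T>` alike
        data.words.iter().any(|w| *w == word)
    }
}

/// Rust-facing instantiation: borrows the dictionary, checked by lifetimes.
pub type BorrowedTokenizer<'a> = TokenizerImpl<&'a DictData>;

/// Python-facing instantiation: owns a reference-counted handle.
pub type SharedTokenizer = TokenizerImpl<Arc<DictData>>;
```

The two aliases play the role of the "thin wrappers": they add no logic and only pick the concrete pointer type.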

API Surface (Types)

  • Dictionary - stores immutable data for tokenization
  • Tokenizer - stores mutable state for tokenization
  • InputBuffer - handles zero-copy input, sentence splitting and streaming of input data (eventually)
  • MorphemeList - analysis result of a single block of input data
  • Morpheme - unit of analysis result

AnalyzedSentence Design

Prereq of #52

We want a pointer-generic version of the container returned as the analysis result.

The Rust API wants references and mutable references; the Python binding wants Arc.

Reorganize source code

Proposal for reorganization:

  • main folder: sudachi crate (rlib), does not produce binary, is library only
  • sudachi-cli folder: sudachi-cli (bin), produces sudachi executable
  • python: python bindings (cdylib), #25
  • plugins: contains individual plugins (cdylib) (for testing only, core plugins: #17)

Other nice things to have: API usage examples

Disable dictionary version check

The dictionary version check at src/dic/header.rs line 17 may not be needed.

        assert_eq!(header.version, 0x7366d3f18bd111e7);

Reason

When I use sudachi-dictionary-20210608-core.zip, the built sudachi command fails.

$ echo "" | sudachi
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `14888620106744153140`,
 right: `8315566796372513255`', src/dic/header.rs:17:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

When I commented out src/dic/header.rs line 17,

        //assert_eq!(header.version, 0x7366d3f18bd111e7);

the rebuilt sudachi command ran successfully.

$ echo "" | sudachi
あ	感動詞,フィラー,*,*,*,*	あー
EOS
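
As an alternative to deleting the assertion, the check could report an error instead of panicking. The following is a minimal sketch of that idea, not the change that was made in the repository; `HeaderError` and `check_version` are hypothetical names, and only the expected constant is taken from the line quoted above.

```rust
/// Version value expected by the assertion quoted above.
pub const EXPECTED_VERSION: u64 = 0x7366d3f18bd111e7;

/// Hypothetical error type for header validation.
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    /// The header carries a version this build does not recognize.
    UnsupportedVersion(u64),
}

/// Returns an error for an unknown version instead of panicking.
pub fn check_version(version: u64) -> Result<(), HeaderError> {
    if version == EXPECTED_VERSION {
        Ok(())
    } else {
        Err(HeaderError::UnsupportedVersion(version))
    }
}
```

With this shape the caller decides whether an unknown version is fatal, for example by printing a message that names the offending value rather than aborting with a backtrace.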

Dependency Licenses

We need to go over the licenses of our dependencies and check that they are compatible.

Cargo.toml should contain, for every dependency, the URL of the package homepage (preferably GitHub) and its license.

Optimize connection cost lookup

Connection cost lookup also takes ~30% of the whole analysis time.

A first attempt should improve the interaction with plugins that override connection costs and fix the inefficient lookup logic.
A second attempt should remove all hashmaps from connection cost lookup entirely.

Speedup Weight Lookup

  1. Lookup must be a single memory access.
  2. Hashmaps are too heavy for checking.
  3. If an inhibit-connection or weight-editing plugin is used, we will make a full copy of the weight matrix and modify it in place.
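
The three points above could be sketched with a flat, row-major matrix. This is an illustration under assumptions: the `i16` cost type, the row-major layout, and the names `ConnectionMatrix` and `with_inhibited` are not taken from the sudachi.rs source.

```rust
/// Dense connection cost matrix stored as one contiguous vector.
#[derive(Clone)]
pub struct ConnectionMatrix {
    num_right: usize,
    costs: Vec<i16>,
}

impl ConnectionMatrix {
    pub fn new(num_left: usize, num_right: usize) -> Self {
        Self { num_right, costs: vec![0; num_left * num_right] }
    }

    /// Lookup is one index computation and one memory read, with no hashing.
    #[inline]
    pub fn cost(&self, left: usize, right: usize) -> i16 {
        self.costs[left * self.num_right + right]
    }

    pub fn set_cost(&mut self, left: usize, right: usize, cost: i16) {
        self.costs[left * self.num_right + right] = cost;
    }
}

/// Copies the whole matrix and edits the copy, as in point 3.
/// Inhibited pairs get the maximum cost so they are never preferred.
pub fn with_inhibited(base: &ConnectionMatrix, pairs: &[(usize, usize)]) -> ConnectionMatrix {
    let mut edited = base.clone();
    for &(left, right) in pairs {
        edited.set_cost(left, right, i16::MAX);
    }
    edited
}
```

The copy costs memory once at plugin setup time, while every lookup during analysis stays a plain array access.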

Add CI integration

GitHub Actions-based.

It should build the project and run the tests and examples (once they are working).

OOV Handling

Currently, there is no OOV handling at all.

As a result, the tokenizer sometimes panics when the dictionary contains no word for the input.

e.g.,

$ echo Sudachi | sudachi
thread 'main' panicked at 'EOS isn't connected to BOS', src/lattice.rs:70:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

"İ" does not behave the same as the Java version of Sudachi.

I noticed that SudachiPy and Sudachi behave differently: "İstanbul" is not recognized as a single token in SudachiPy, so I am reporting it.

$ echo "İstanbul" | sudachipy -a
İ       名詞,普通名詞,一般,*,*,*        I       I       アイ    0       []
        補助記号,一般,*,*,*,*   ̇       ̇               -1      []      (OOV)
stanbul 名詞,普通名詞,一般,*,*,*        stanbul stanbul         -1      []      (OOV)
EOS

$ echo "İstanbul" | sudachi -a
İstanbul        名詞,固有名詞,一般,*,*,*        Istanbul        Istanbul        Istanbul        0       [15600]
EOS

Apparently, the character normalization step passes different input to each implementation.

$ echo "İstanbul" | sudachipy -d
=== Inupt dump:
i(U+0307)stanbul

$ echo "İstanbul" | sudachi -d
=== Input dump:
istanbul

It seems that "İ" (U+0130) is converted to "i" (U+0069) followed by "◌̇" (U+0307) in Python, and to "i" (U+0069) alone in Java.
This may be due to how each programming language specifies lowercasing.
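
For a Rust port the same question arises. The sketch below only shows what the Rust standard library does with this character; `lowercase_code_points` is a hypothetical helper, and which mapping each Sudachi implementation actually applies is not established here.

```rust
/// Lowercases `text` with the standard library's Unicode mapping and
/// returns the resulting code points.
pub fn lowercase_code_points(text: &str) -> Vec<u32> {
    text.to_lowercase().chars().map(|c| c as u32).collect()
}
```

Calling `lowercase_code_points("İ")` yields two code points, U+0069 and U+0307, so `str::to_lowercase` follows the full mapping, like the SudachiPy dump above. Matching the single "i" seen in the other dump would need an explicit extra step.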

Trie: check usize usage

Right now the tries use usize as their internal unit, which seems incorrect because its size is platform dependent.
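
A fixed-width unit avoids the problem. The sketch below assumes, purely for illustration, that trie units are stored as 4-byte little-endian integers; the actual on-disk format should be checked before adopting anything like it, and `read_units` is a hypothetical helper.

```rust
/// Decodes trie units from raw bytes as fixed-width `u32` values
/// (assumed little-endian). Trailing bytes that do not fill a unit are ignored.
pub fn read_units(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

Unlike `usize`, `u32` is 4 bytes on every target, so the same dictionary bytes decode identically on 32-bit and 64-bit platforms.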

Make CharacterCategory a bitset

Part of #36
Related to #53

Right now it is a hashset of enums, which is very inefficient compared to a bitset.

Making it a bitset will bring the computational complexity of many algorithms down from the O(n) of the hash-based version to O(1), and will remove allocations.
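
A bitset version could look like the sketch below. The category names and bit positions are illustrative assumptions, not the real CharacterCategory definition, and `CategorySet` is a hypothetical name.

```rust
/// A set of character categories packed into one integer.
#[derive(Clone, Copy, PartialEq, Eq, Debug, Default)]
pub struct CategorySet(u32);

impl CategorySet {
    // Illustrative categories; one bit each.
    pub const KANJI: CategorySet = CategorySet(1 << 0);
    pub const HIRAGANA: CategorySet = CategorySet(1 << 1);
    pub const KATAKANA: CategorySet = CategorySet(1 << 2);
    pub const NUMERIC: CategorySet = CategorySet(1 << 3);

    pub const fn empty() -> Self {
        CategorySet(0)
    }

    /// Set union is a single bitwise OR.
    pub const fn union(self, other: Self) -> Self {
        CategorySet(self.0 | other.0)
    }

    /// True if every category in `other` is also in `self`.
    pub const fn contains(self, other: Self) -> bool {
        self.0 & other.0 == other.0
    }

    /// True if the two sets share at least one category.
    pub const fn intersects(self, other: Self) -> bool {
        self.0 & other.0 != 0
    }
}
```

The type is `Copy` and fits in a register, so membership tests and unions need no heap allocation at all.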

Change dictionary download URL

Currently, fetch_dictionary.sh does not seem to work.
https://object-storage.tyo2.conoha.io/v1/nc_2520839e1f9641b08211a5c85243124a/sudachi/${DICT_NAME}.zip cannot be accessed because it requires authentication.

$ ./fetch_dictionary.sh 
Downloading a dictionary file `sudachi-dictionary-20200722-core` ...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   131  100   131    0     0    521      0 --:--:-- --:--:-- --:--:--   521
Archive:  sudachi-dictionary-20200722-core.zip
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of sudachi-dictionary-20200722-core.zip or
        sudachi-dictionary-20200722-core.zip.zip, and cannot find sudachi-dictionary-20200722-core.zip.ZIP, period.
mv: cannot stat 'sudachi-dictionary-20200722/system_core.dic': No such file or directory

Placed a dictionary file to `src/resources/system.dic` .

When I changed the download URL to http://sudachi.s3-website-ap-northeast-1.amazonaws.com/sudachidict/${DICT_NAME}.zip,
fetch_dictionary.sh works successfully!

$ ./fetch_dictionary.sh 
Downloading a dictionary file `sudachi-dictionary-20200722-core` ...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 67.8M  100 67.8M    0     0  2899k      0  0:00:23  0:00:23 --:--:-- 3121k
Archive:  sudachi-dictionary-20200722-core.zip
   creating: sudachi-dictionary-20200722/
  inflating: sudachi-dictionary-20200722/system_core.dic  
  inflating: sudachi-dictionary-20200722/LEGAL  
  inflating: sudachi-dictionary-20200722/LICENSE-2.0.txt  

Placed a dictionary file to `src/resources/system.dic` .

Plugin-loader: Add variables

$bin - which resolves to the directory of the sudachi binary
$cfg - which resolves to the directory of the sudachi config (or config root?)
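
A possible resolution routine is sketched below. It assumes that a variable may appear only at the start of a path and is followed by `/` or nothing; `resolve_plugin_path` and `strip_var` are hypothetical names, and the open question of what `$cfg` should point to is left to the caller, who passes in the directory.

```rust
use std::path::{Path, PathBuf};

/// Strips a leading variable such as `$bin`, accepting only `$bin` or `$bin/...`.
fn strip_var<'a>(raw: &'a str, var: &str) -> Option<&'a str> {
    let rest = raw.strip_prefix(var)?;
    if rest.is_empty() {
        Some(rest)
    } else {
        rest.strip_prefix('/')
    }
}

/// Expands a leading `$bin` or `$cfg` in a plugin path.
/// Paths without a recognized variable are returned unchanged.
pub fn resolve_plugin_path(raw: &str, bin_dir: &Path, cfg_dir: &Path) -> PathBuf {
    if let Some(rest) = strip_var(raw, "$bin") {
        bin_dir.join(rest)
    } else if let Some(rest) = strip_var(raw, "$cfg") {
        cfg_dir.join(rest)
    } else {
        PathBuf::from(raw)
    }
}
```

Requiring a `/` after the variable name keeps a path such as `$binary/x` from being mistaken for `$bin`.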

Add build-dictionary mode

sudachi.rs currently cannot build either system or user dictionaries.
We want a CLI tool for this, as in the Java and Python versions.

In the meantime, we can still use dictionaries built by the Java or Python version.

Change Python API

Relates to #52.

Some of the current Python APIs do not seem necessary (e.g. MorphemeList.split).
We may be able to provide a better way to do the same thing.

CI: make builds faster

  1. Cache target and cargo dirs for quick builds
  2. Do full daily (weekly?) builds if it is possible

Stateful Utf8InputText

Part of #36

Rework (and eventually replace) Utf8InputText/Utf8InputTextBuilder so that they

  1. Are reusable between analyses
  2. Reuse allocated buffers for successive analyses
  3. Expose a codepoint-based API for iterating over modified string slices
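
The three requirements can be illustrated with a small reusable buffer. This is a sketch only: `ReusableInput` is a hypothetical type, and ASCII lowercasing stands in for the real normalization performed by the input text plugins.

```rust
/// Input holder that is reset for each analysis instead of being rebuilt.
#[derive(Default)]
pub struct ReusableInput {
    original: String,
    modified: String,
}

impl ReusableInput {
    pub fn new() -> Self {
        Self::default()
    }

    /// Loads new text. `String::clear` keeps the capacity, so the same
    /// allocations serve successive analyses.
    pub fn reset(&mut self, text: &str) {
        self.original.clear();
        self.original.push_str(text);
        self.modified.clear();
        // Stand-in for real normalization: ASCII lowercasing only.
        self.modified.extend(text.chars().map(|c| c.to_ascii_lowercase()));
    }

    pub fn original(&self) -> &str {
        &self.original
    }

    /// Codepoint-based view of the modified text: (byte offset, char) pairs.
    pub fn modified_chars(&self) -> impl Iterator<Item = (usize, char)> + '_ {
        self.modified.char_indices()
    }

    /// Capacity of the modified buffer, exposed to observe reuse.
    pub fn modified_capacity(&self) -> usize {
        self.modified.capacity()
    }
}
```

Yielding byte offsets alongside the characters lets callers slice the modified text without re-scanning it.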

Decrease size of Node

Nodes are copied around a lot, and Rust does not have placement semantics yet.
The easiest way to improve analysis speed is to decrease the size of the Node structure.

It should be at most 32 bytes (it is currently 264 bytes).
This means that most of the fields must go away and WordInfo must be moved out of the structure.
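
A compact layout along these lines is sketched below. The chosen fields and their widths are assumptions for illustration and may not match what the lattice really needs; `WordInfoTable` is a hypothetical side table that holds the data moved out of the node.

```rust
use std::mem::size_of;

/// Compact lattice node; heavy word data lives elsewhere and is
/// referenced through `word_id`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Node {
    pub word_id: u32,
    pub total_cost: i32,
    pub begin: u16,
    pub end: u16,
    pub left_id: u16,
    pub right_id: u16,
    pub cost: i16,
}

// Compile-time guard for the size budget stated above.
const _: () = assert!(size_of::<Node>() <= 32);

/// Word data that no longer travels with every node copy.
pub struct WordInfo {
    pub surface: String,
    pub reading: String,
}

/// Side table indexed by `Node::word_id`.
pub struct WordInfoTable {
    infos: Vec<WordInfo>,
}

impl WordInfoTable {
    pub fn new(infos: Vec<WordInfo>) -> Self {
        Self { infos }
    }

    /// Looks up the word data for a node, if its id is in range.
    pub fn get(&self, node: &Node) -> Option<&WordInfo> {
        self.infos.get(node.word_id as usize)
    }
}
```

The constant assertion makes the build fail if a later change pushes the node over the budget, so the limit does not silently regress.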

failed to fetch_dictionary.sh

Because the src/resources directory does not exist, fetch_dictionary.sh fails to move the dictionary file into it.
Putting an empty .keep file into the directory will resolve this problem.

Fix __getitem__ of PyMorphemeList

The current index access of PyMorphemeList is wrong:

morphemes = tokenizer.tokenize("東京都", SplitMode.A)
len(morphemes)  # == 2
morphemes[-3].surface()  # returns "都", should raise error

pyo3 automatically handles negative indices when we implement __getitem__, so we should not handle them ourselves.
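
One plausible reading of the bug is that the index is wrapped twice: once by the binding layer and once more by our own code. The sketch below models that in plain Rust without pyo3. Both function names are hypothetical, and the double wrap is an assumption about the cause, not something confirmed from the source.

```rust
/// Bounds check for an index that the binding layer has already adjusted.
/// A value that is still negative here is out of range.
pub fn checked_index(index: isize, len: usize) -> Option<usize> {
    if index >= 0 && (index as usize) < len {
        Some(index as usize)
    } else {
        None
    }
}

/// Models the suspected bug: adding `len` to a negative index a second time.
pub fn double_wrapped_index(index: isize, len: usize) -> Option<usize> {
    let wrapped = if index < 0 { index + len as isize } else { index };
    checked_index(wrapped, len)
}
```

With len = 2, an index of -3 becomes -1 after the first adjustment. Wrapping it again gives 1, which matches the "都" returned in the report, while `checked_index(-1, 2)` returns `None`, which the binding can turn into an index error.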

Huggingface Tokenisers Integration

The easiest way to integrate is probably to use the CustomPretokeniser hook in the Python bindings.
https://github.com/huggingface/tokenizers/blob/master/bindings/python/src/pre_tokenizers.rs#L547

Good things:

  • Python will handle dynamic library loading for us

Bad things:

  • Some performance overhead on python method calls
  • Some performance overhead on string copying

Anything else would require significant additions to the Huggingface Tokeniser framework.
Let's prototype this integration and switch to another approach if its overhead turns out to be too high.

Plan:

Add a create_pretokenizer() method to the Dictionary class (Python binding) that returns an object compatible with HuggingFace.

Depends on #166
