vaporetto's Introduction

🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer

Vaporetto is a fast and lightweight pointwise prediction-based tokenizer. This repository includes both a Rust crate that provides APIs for Vaporetto and CLI frontends.

Documentation in Japanese

Wasm Demo (takes a little time to load the model.)

A Python wrapper is also available here.

Example Usage

Try Word Segmentation

This software is implemented in Rust. Install rustc and cargo following the documentation beforehand.

Vaporetto provides three ways to generate tokenization models:

Download Distribution Model

The first is the simplest way, which is to download a model we have trained. Models are available here.

We chose bccwj-suw+unidic_pos+pron:

% wget https://github.com/daac-tools/vaporetto-models/releases/download/v0.5.0/bccwj-suw+unidic_pos+pron.tar.xz

Each download is a compressed archive containing a model file and license terms, so decompress it with the following command:

% tar xf ./bccwj-suw+unidic_pos+pron.tar.xz

To perform tokenization, run the following command:

% echo 'ヴェネツィアはイタリアにあります。' | cargo run --release -p predict -- --model path/to/bccwj-suw+unidic_pos+pron.model.zst

The following will be output:

ヴェネツィア は イタリア に あり ます 。

Notes for Vaporetto APIs

The distribution models are compressed in the zstd format. If you want to load these compressed models with the vaporetto API, you must decompress them outside of the API.

// Requires zstd crate or ruzstd crate
use std::fs::File;
use vaporetto::Model;

let reader = zstd::Decoder::new(File::open("path/to/model.zst")?)?;
let model = Model::read(reader)?;

You can also decompress the file using the unzstd command, which is bundled with modern Linux distributions.

Convert KyTea's Model

The second is also a simple way, which is to convert a model trained by KyTea. First of all, download the model of your choice from the KyTea Models page.

We chose jp-0.4.7-5.mod.gz:

% wget http://www.phontron.com/kytea/download/model/jp-0.4.7-5.mod.gz

Each file is a compressed file, so you need to decompress the downloaded model file as shown in the following command:

% gunzip ./jp-0.4.7-5.mod.gz

To convert a KyTea model into a Vaporetto model, run the following command in the Vaporetto root directory.

% cargo run --release -p convert_kytea_model -- --model-in path/to/jp-0.4.7-5.mod --model-out path/to/jp-0.4.7-5-tokenize.model.zst

Now you can perform tokenization. Run the following command:

% echo 'ヴェネツィアはイタリアにあります。' | cargo run --release -p predict -- --model path/to/jp-0.4.7-5-tokenize.model.zst

The following will be output:

ヴェネツィア は イタリア に あ り ま す 。

Train Your Model

The third way, which is mainly for researchers, is to prepare a training corpus and train your tokenization models.

Vaporetto can train from two types of corpora: fully annotated corpora and partially annotated corpora.

Fully annotated corpora are corpora in which every character boundary is annotated as either a token boundary or a position inside a token. In practice, this is text with a space inserted at every token boundary, as shown below:

ヴェネツィア は イタリア に あり ます 。
火星 猫 の 生態 の 調査 結果
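
To make the relation between spaces and boundary labels concrete, here is a minimal sketch using only the standard library. It is not part of the Vaporetto API; the function name `parse_full_annotation` is hypothetical, and the sketch assumes that tokens are separated by whitespace and that each character is one Unicode scalar value.

```rust
/// Splits a fully annotated line into its characters and one label per
/// character boundary: `true` = token boundary, `false` = inside a token.
/// (Illustrative helper, not a Vaporetto function.)
fn parse_full_annotation(line: &str) -> (Vec<char>, Vec<bool>) {
    let mut chars = Vec::new();
    let mut boundaries = Vec::new();
    for token in line.split_whitespace() {
        for (i, c) in token.chars().enumerate() {
            if !chars.is_empty() {
                // The boundary in front of the first character of a token
                // is a token boundary; all others are token-internal.
                boundaries.push(i == 0);
            }
            chars.push(c);
        }
    }
    (chars, boundaries)
}
```

For the second example line, this yields 11 characters and 10 boundary labels, with `false` only inside 火星, 生態, 調査, and 結果.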

Partially annotated corpora, on the other hand, are corpora in which only some character boundaries are annotated. Each character boundary is marked with | (token boundary), - (not a token boundary), or a space (unknown). Here is an example:

ヴ-ェ-ネ-ツ-ィ-ア|は|イ-タ-リ-ア|に|あ り ま す|。
火-星 猫|の|生-態|の|調-査 結-果
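
The alternating character/marker layout can be read with a small parser. The following is a sketch under stated assumptions, not Vaporetto's own reader: the names `BoundaryLabel` and `parse_partial_annotation` are hypothetical, tags are not handled, and each character is assumed to be one Unicode scalar value.

```rust
/// Label of a single character boundary in a partially annotated line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BoundaryLabel {
    Token,    // written as `|`
    NotToken, // written as `-`
    Unknown,  // written as a space
}

/// Parses one partially annotated line, in which characters and boundary
/// markers alternate. Returns `None` for an unexpected marker, an empty
/// line, or a line that ends with a marker.
fn parse_partial_annotation(line: &str) -> Option<(Vec<char>, Vec<BoundaryLabel>)> {
    let mut chars = Vec::new();
    let mut labels = Vec::new();
    for (i, c) in line.chars().enumerate() {
        if i % 2 == 0 {
            // Even positions hold the text characters.
            chars.push(c);
        } else {
            // Odd positions hold the boundary markers.
            labels.push(match c {
                '|' => BoundaryLabel::Token,
                '-' => BoundaryLabel::NotToken,
                ' ' => BoundaryLabel::Unknown,
                _ => return None,
            });
        }
    }
    // A well-formed line has exactly one more character than boundaries.
    if chars.len() == labels.len() + 1 {
        Some((chars, labels))
    } else {
        None
    }
}
```

Applied to the second example line, the boundary between 星 and 猫 and the one between 査 and 結 come out as `Unknown`, which is exactly the information a fully annotated corpus cannot express.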

To train a model, use the following command:

% cargo run --release -p train -- --model ./your.model.zst --tok path/to/full.txt --part path/to/part.txt --dict path/to/dict.txt --solver 5

The --tok argument specifies a fully annotated corpus, and the --part argument specifies a partially annotated corpus. You can also specify a word dictionary with the --dict argument. A word dictionary is a file that lists words line by line and can be tagged as needed:

トスカーナ
パンツァーノ
灯里/名詞-固有名詞-人名-名/アカリ
形態/名詞-普通名詞-一般/ケータイ
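
As an illustration of this file layout, the sketch below reads such a dictionary into surface forms and optional tags. It is not the trainer's actual parser: `DictEntry` and `parse_dictionary` are hypothetical names, and the sketch assumes that `/` never needs escaping inside a word or tag. Like the trainer, it rejects empty lines.

```rust
/// One dictionary entry: a surface form followed by zero or more tags.
#[derive(Debug, PartialEq, Eq)]
struct DictEntry {
    surface: String,
    tags: Vec<String>,
}

/// Parses dictionary text with one entry per line and `/`-separated fields.
/// Returns an error message for a line whose surface form is empty.
fn parse_dictionary(text: &str) -> Result<Vec<DictEntry>, String> {
    text.lines()
        .enumerate()
        .map(|(i, line)| {
            let mut fields = line.split('/');
            // The first field is the word itself; the rest are tags.
            let surface = fields.next().unwrap_or("");
            if surface.is_empty() {
                return Err(format!("line {}: empty surface form", i + 1));
            }
            Ok(DictEntry {
                surface: surface.to_string(),
                tags: fields.map(str::to_string).collect(),
            })
        })
        .collect()
}
```

With the example above, トスカーナ parses to an entry without tags, while 灯里 carries two tags: 名詞-固有名詞-人名-名 and アカリ.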

The trainer does not accept empty lines. Therefore, remove all empty lines from the corpus before training.

You can specify all arguments above multiple times.

Model Manipulation

Sometimes, your model will output different results than what you expect. For example, 外国人参政権 is split into wrong tokens in the following command. We use the --scores option to show the score of each character boundary:

% echo '外国人参政権と政権交代' | cargo run --release -p predict -- --scores --model path/to/bccwj-suw+unidic_pos+pron.model.zst
外国 人 参 政権 と 政権 交代
0:外国 -10784
1:国人 17935
2:人参 5308
3:参政 3833
4:政権 -3299
5:権と 14635
6:と政 17653
7:政権 -12705
8:権交 11611
9:交代 -5794
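
The relation between these scores and the segmentation can be sketched as follows. This is an illustration rather than Vaporetto's implementation: the function name `split_by_scores` is hypothetical, and it relies on the rule that a boundary is split when its score is positive.

```rust
/// Splits `text` into tokens, cutting at every character boundary whose
/// score is positive. `scores[i]` belongs to the boundary between
/// character `i` and character `i + 1`.
fn split_by_scores(text: &str, scores: &[i32]) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    assert_eq!(
        scores.len(),
        chars.len().saturating_sub(1),
        "expected one score per character boundary"
    );
    let mut tokens = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        current.push(c);
        // Close the token at the end of the text or at a positive boundary.
        if i == scores.len() || scores[i] > 0 {
            tokens.push(std::mem::take(&mut current));
        }
    }
    tokens
}
```

Feeding in the ten scores listed above reproduces the printed segmentation: boundaries 2 and 3 are both positive, so 参 becomes a token of its own, while the negative score at boundary 4 keeps 政権 together.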

The correct segmentation is 外国 人 参政 権. To split 外国人参政権 into the correct tokens, manipulate the model in the following steps so that the signs of the two scores inside 参政権 are inverted:

  1. Dump the dictionary with the following command:

    % cargo run --release -p manipulate_model -- --model-in path/to/bccwj-suw+unidic_pos+pron.model.zst --dump-dict path/to/dictionary.csv
    
  2. Edit the dictionary.

    The dictionary is a CSV file. Each row contains a string pattern, a corresponding weight array, and a comment in the following order:

    • word - A string pattern (usually a word)
    • weights - A weight array. When the input string contains the pattern, these weights are added to the character boundaries of the range of the pattern found.
    • comment - A comment that does not affect the behavior.

    Vaporetto splits a text when the total weight of the boundary is a positive number, so we add a new entry as follows:

     参撾,3328 -5545 3514,
     参政,3328 -5545 3514,
    +参政権,0 -10000 10000 0,参政/権
     参朝,3328 -5545 3514,
     参校,3328 -5545 3514,

    In this case, -10000 will be added between 参 and 政, and 10000 will be added between 政 and 権. Because 0 is specified at both ends of the pattern, no scores are added at those positions.

    Note that Vaporetto uses 32-bit integers for the total weight, so you have to be careful about overflow.

    In addition, the dictionary cannot contain duplicate words. If the dictionary already contains the word, edit its existing weights instead.

  3. Replace the weight data of the model file:

    % cargo run --release -p manipulate_model -- --model-in path/to/bccwj-suw+unidic_pos+pron.model.zst --replace-dict path/to/dictionary.csv --model-out path/to/bccwj-suw+unidic_pos+pron-new.model.zst
    

Now 外国人参政権 is split into correct tokens.

% echo '外国人参政権と政権交代' | cargo run --release -p predict -- --scores --model path/to/bccwj-suw+unidic_pos+pron-new.model.zst
外国 人 参政 権 と 政権 交代
0:外国 -10784
1:国人 17935
2:人参 5308
3:参政 -6167
4:政権 6701
5:権と 14635
6:と政 17653
7:政権 -12705
8:権交 11611
9:交代 -5794
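
The change in scores 3 and 4 can be reproduced with a small sketch of how a dictionary entry's weights are applied. This is a simplified model, not Vaporetto's code: the function name `add_pattern_weights` is hypothetical, and it assumes that a pattern of n characters carries n + 1 weights, one for the boundary in front of it, one for each inner boundary, and one for the boundary behind it. Because total weights are 32-bit integers, the sketch uses checked addition and returns `None` on overflow.

```rust
/// Adds the weights of every occurrence of `pattern` in `text` to `scores`,
/// where `scores[i]` is the boundary between characters `i` and `i + 1`.
/// Returns `None` if a sum would overflow `i32`.
fn add_pattern_weights(
    text: &str,
    scores: &mut [i32],
    pattern: &str,
    weights: &[i32],
) -> Option<()> {
    let chars: Vec<char> = text.chars().collect();
    let pat: Vec<char> = pattern.chars().collect();
    assert_eq!(weights.len(), pat.len() + 1, "n characters need n + 1 weights");
    assert_eq!(scores.len(), chars.len().saturating_sub(1));
    if pat.is_empty() || pat.len() > chars.len() {
        return Some(());
    }
    for start in 0..=chars.len() - pat.len() {
        if chars[start..start + pat.len()] != pat[..] {
            continue;
        }
        for (j, &w) in weights.iter().enumerate() {
            // weights[j] targets the boundary in front of character
            // `start + j`; positions outside the text have no boundary.
            if start + j == 0 {
                continue;
            }
            if let Some(score) = scores.get_mut(start + j - 1) {
                *score = score.checked_add(w)?;
            }
        }
    }
    Some(())
}
```

Starting from the original scores, applying 参政権 with the weights 0 -10000 10000 0 gives 3833 - 10000 = -6167 at boundary 3 and -3299 + 10000 = 6701 at boundary 4, matching the output above. All other boundaries are unchanged.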

Tag prediction

Vaporetto experimentally supports tagging (e.g., part-of-speech and pronunciation tags).

To train tags, add slashes and tags following each token in the dataset as follows:

  • For fully annotated corpora

    この/連体詞/コノ 人/名詞/ヒト は/助詞/ワ 火星/名詞/カセイ 人/接尾辞/ジン です/助動詞/デス
    
  • For partially annotated corpora

    ヴ-ェ-ネ-ツ-ィ-ア/名詞|は/助詞|イ-タ-リ-ア/名詞|に/助詞|あ-り ま-す
    

You can specify tag information in dictionaries as well as in corpora. When the predictor cannot predict a tag using the model, the tag specified in the dictionary is assigned to the token.

If the dataset contains tags, the train command automatically trains them.

In prediction, tags are not predicted by default, so you have to specify the --predict-tags argument to the predict command if necessary.

If you specify the --tag-scores argument, the score of each candidate calculated during tag prediction is displayed. If there is only one candidate, the score becomes 0.

% echo "花が咲く" | cargo run --release -p predict -- --model path/to/bccwj-suw+unidic_pos+pron.model.zst --predict-tags --tag-scores
花/名詞-普通名詞-一般/ハナ が/助詞-格助詞/ガ 咲く/動詞-一般/サク
花	名詞-普通名詞-一般:18613,接尾辞-名詞的-一般:-18613	ハナ:19973,バナ:-20377,カ:-20480,ゲ:-20410
が	助詞-接続助詞:-20408,助詞-格助詞:23543,接続詞:-25332	ガ:0
咲く	動詞-一般:0	サク:0
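
In this sample, the tag shown for each token is the candidate with the highest score. A minimal sketch of reading one candidate field from the `--tag-scores` output follows. The function name `best_candidate` is hypothetical, and the sketch assumes the `tag:score` layout shown above with commas between candidates and no commas inside tags.

```rust
/// Parses a candidate list such as `ハナ:19973,バナ:-20377` and returns the
/// highest-scoring candidate together with its score. Items that do not
/// have the form `tag:score` are skipped.
fn best_candidate(field: &str) -> Option<(&str, i32)> {
    field
        .split(',')
        .filter_map(|item| {
            // Split at the last `:` so that the score is always the suffix.
            let (tag, score) = item.rsplit_once(':')?;
            Some((tag, score.parse::<i32>().ok()?))
        })
        .max_by_key(|&(_, score)| score)
}
```

For が, the three candidates score -20408, 23543, and -25332, so the sketch selects 助詞-格助詞, the same tag that appears in the first output line.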

Speed Comparison of Various Tokenizers

Vaporetto is 8.7 times faster than KyTea.

Details can be found here.

Slack

We have a Slack workspace for developers and users to ask questions and discuss a variety of topics.

License

Licensed under either of

at your option.

Contribution

See the guidelines.

References

Technical details of Vaporetto are available in the following paper or the blog post:

vaporetto's Issues

error: The following required arguments were not provided: --model-out <model-out>

README.md says:

%  cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5-tokenize.model.zstd

but this happens:

# cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5-tokenize.model.zstd
    Updating crates.io index
  Downloaded cc v1.0.72
  Downloaded structopt-derive v0.4.18
  Downloaded quote v1.0.15
  Downloaded proc-macro2 v1.0.36
  Downloaded proc-macro-error v1.0.4
  Downloaded syn v1.0.86
  Downloaded bincode v1.3.3
  Downloaded anyhow v1.0.53
  Downloaded bitflags v1.3.2
  Downloaded ansi_term v0.12.1
  Downloaded unicode-segmentation v1.9.0
  Downloaded vec_map v0.8.2
  Downloaded jobserver v0.1.24
  Downloaded zstd v0.9.2+zstd.1.5.1
  Downloaded textwrap v0.11.0
  Downloaded heck v0.3.3
  Downloaded zstd-sys v1.6.2+zstd.1.5.1
  Downloaded libc v0.2.117
  Downloaded lazy_static v1.4.0
  Downloaded clap v2.34.0
  Downloaded structopt v0.3.26
  Downloaded zstd-safe v4.1.3+zstd.1.5.1
  Downloaded unicode-width v0.1.9
  Downloaded strsim v0.8.0
  Downloaded serde_derive v1.0.136
  Downloaded proc-macro-error-attr v1.0.4
  Downloaded version_check v0.9.4
  Downloaded unicode-xid v0.2.2
  Downloaded serde v1.0.136
  Downloaded atty v0.2.14
  Downloaded byteorder v1.4.3
  Downloaded daachorse v0.2.1
  Downloaded 32 crates (2.5 MB) in 1.12s
   Compiling libc v0.2.117
   Compiling proc-macro2 v1.0.36
   Compiling unicode-xid v0.2.2
   Compiling syn v1.0.86
   Compiling version_check v0.9.4
   Compiling serde_derive v1.0.136
   Compiling serde v1.0.136
   Compiling anyhow v1.0.53
   Compiling zstd-safe v4.1.3+zstd.1.5.1
   Compiling unicode-segmentation v1.9.0
   Compiling unicode-width v0.1.9
   Compiling bitflags v1.3.2
   Compiling byteorder v1.4.3
   Compiling strsim v0.8.0
   Compiling ansi_term v0.12.1
   Compiling vec_map v0.8.2
   Compiling lazy_static v1.4.0
   Compiling textwrap v0.11.0
   Compiling daachorse v0.2.1
   Compiling heck v0.3.3
   Compiling proc-macro-error-attr v1.0.4
   Compiling proc-macro-error v1.0.4
   Compiling quote v1.0.15
   Compiling atty v0.2.14
   Compiling jobserver v0.1.24
   Compiling clap v2.34.0
   Compiling cc v1.0.72
   Compiling zstd-sys v1.6.2+zstd.1.5.1
   Compiling structopt-derive v0.4.18
   Compiling structopt v0.3.26
   Compiling zstd v0.9.2+zstd.1.5.1
   Compiling bincode v1.3.3
   Compiling vaporetto v0.2.0 (/work/vae_experiments/vaporetto/vaporetto)
   Compiling convert_kytea_model v0.1.0 (/work/vae_experiments/vaporetto/convert_kytea_model)
    Finished release [optimized] target(s) in 1m 14s
     Running `target/release/convert_kytea_model --model-in jp-0.4.7-5-tokenize.model.zstd`
error: The following required arguments were not provided:
    --model-out <model-out>

USAGE:
    convert_kytea_model --model-in <model-in> --model-out <model-out>

I think the correct command is:

% cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5.mod --model-out jp-0.4.7-5-tokenize.model.zstd

Cannot deserialize predictor

I want to serialize and deserialize the predictor so that I can persist it in a file and reduce model-building time when it is used as a CLI tool.

But I get a DecodeError(UnexpectedEnd { additional: 1 }) error when deserializing.

When I change the predict_tags parameter to false, deserializing works, but I need the yomi (reading) of each token.

Is this a bug, or did I miss something?

Here's my code:

use std::fs::File;
use std::io::Read;
use vaporetto::{Model, Predictor};

fn main() {
    let file = File::open("./model.zst").unwrap();
    let mut decoder = ruzstd::StreamingDecoder::new(file).unwrap();
    let mut buffer = vec![];
    decoder.read_to_end(&mut buffer).unwrap();
    let (model, _) = Model::read_slice(&buffer).unwrap();

    // when predict_tags set to false, it works
    let predictor = Predictor::new(model, true).unwrap();

    let serialized = predictor.serialize_to_vec().unwrap();
    unsafe { Predictor::deserialize_from_slice_unchecked(&serialized).unwrap(); }
    // DecodeError(UnexpectedEnd { additional: 1 })
}

Error: InvalidModel(InvalidModelError { msg: "unsupported character type: 4" })

I need a Chinese model and found some at http://www.phontron.com/kytea/download/model/as-0.4.0-1.mod.gz , but none of the models I found on this site can be converted.

macOS 13.4.1 with M1 chip

cargo run --release -p convert_kytea_model -- --model-in as-0.4.0-1.mod --model-out as-0.4.0-1.mod.zst
    Finished release [optimized] target(s) in 0.14s
     Running `target/release/convert_kytea_model --model-in as-0.4.0-1.mod --model-out as-0.4.0-1.mod.zst`
Loading model file...
Saving model file...
Error: InvalidModel(InvalidModelError { msg: "unsupported character type: 4" })
