daac-tools / vaporetto
🛥 Vaporetto: Very accelerated pointwise prediction based tokenizer
Home Page: https://docs.rs/vaporetto
License: Apache License 2.0
Is there any plan to add a 読みがな (yomigana, i.e. reading) tag?
The pronunciation tag is good, but in some cases it would be good to have a yomi tag.
I need a Chinese model and found some at http://www.phontron.com/kytea/download/model/as-0.4.0-1.mod.gz , but none of the models found on this site can be converted.
macOS 13.4.1 with M1 chip
cargo run --release -p convert_kytea_model -- --model-in as-0.4.0-1.mod --model-out as-0.4.0-1.mod.zst
Finished release [optimized] target(s) in 0.14s
Running `target/release/convert_kytea_model --model-in as-0.4.0-1.mod --model-out as-0.4.0-1.mod.zst`
Loading model file...
Saving model file...
Error: InvalidModel(InvalidModelError { msg: "unsupported character type: 4" })
README.md says:
% cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5-tokenize.model.zstd
but this happens:
# cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5-tokenize.model.zstd
Updating crates.io index
Downloaded cc v1.0.72
Downloaded structopt-derive v0.4.18
Downloaded quote v1.0.15
Downloaded proc-macro2 v1.0.36
Downloaded proc-macro-error v1.0.4
Downloaded syn v1.0.86
Downloaded bincode v1.3.3
Downloaded anyhow v1.0.53
Downloaded bitflags v1.3.2
Downloaded ansi_term v0.12.1
Downloaded unicode-segmentation v1.9.0
Downloaded vec_map v0.8.2
Downloaded jobserver v0.1.24
Downloaded zstd v0.9.2+zstd.1.5.1
Downloaded textwrap v0.11.0
Downloaded heck v0.3.3
Downloaded zstd-sys v1.6.2+zstd.1.5.1
Downloaded libc v0.2.117
Downloaded lazy_static v1.4.0
Downloaded clap v2.34.0
Downloaded structopt v0.3.26
Downloaded zstd-safe v4.1.3+zstd.1.5.1
Downloaded unicode-width v0.1.9
Downloaded strsim v0.8.0
Downloaded serde_derive v1.0.136
Downloaded proc-macro-error-attr v1.0.4
Downloaded version_check v0.9.4
Downloaded unicode-xid v0.2.2
Downloaded serde v1.0.136
Downloaded atty v0.2.14
Downloaded byteorder v1.4.3
Downloaded daachorse v0.2.1
Downloaded 32 crates (2.5 MB) in 1.12s
Compiling libc v0.2.117
Compiling proc-macro2 v1.0.36
Compiling unicode-xid v0.2.2
Compiling syn v1.0.86
Compiling version_check v0.9.4
Compiling serde_derive v1.0.136
Compiling serde v1.0.136
Compiling anyhow v1.0.53
Compiling zstd-safe v4.1.3+zstd.1.5.1
Compiling unicode-segmentation v1.9.0
Compiling unicode-width v0.1.9
Compiling bitflags v1.3.2
Compiling byteorder v1.4.3
Compiling strsim v0.8.0
Compiling ansi_term v0.12.1
Compiling vec_map v0.8.2
Compiling lazy_static v1.4.0
Compiling textwrap v0.11.0
Compiling daachorse v0.2.1
Compiling heck v0.3.3
Compiling proc-macro-error-attr v1.0.4
Compiling proc-macro-error v1.0.4
Compiling quote v1.0.15
Compiling atty v0.2.14
Compiling jobserver v0.1.24
Compiling clap v2.34.0
Compiling cc v1.0.72
Compiling zstd-sys v1.6.2+zstd.1.5.1
Compiling structopt-derive v0.4.18
Compiling structopt v0.3.26
Compiling zstd v0.9.2+zstd.1.5.1
Compiling bincode v1.3.3
Compiling vaporetto v0.2.0 (/work/vae_experiments/vaporetto/vaporetto)
Compiling convert_kytea_model v0.1.0 (/work/vae_experiments/vaporetto/convert_kytea_model)
Finished release [optimized] target(s) in 1m 14s
Running `target/release/convert_kytea_model --model-in jp-0.4.7-5-tokenize.model.zstd`
error: The following required arguments were not provided:
--model-out <model-out>
USAGE:
convert_kytea_model --model-in <model-in> --model-out <model-out>
I think the correct command is:
% cargo run --release -p convert_kytea_model -- --model-in jp-0.4.7-5.mod --model-out jp-0.4.7-5-tokenize.model.zstd
I want to serialize and deserialize a Predictor so that I can persist it to a file and reduce model-building time when it is used as a CLI tool.
But I get a DecodeError(UnexpectedEnd { additional: 1 }) error when deserializing.
However, when I change the predict_tags parameter to false, deserializing works, but I need the yomi of each token.
Is this a bug, or am I missing something?
Here's my code:
use std::fs::File;
use std::io::Read;

use vaporetto::{Model, Predictor};

fn main() {
    let file = File::open("./model.zst").unwrap();
    let mut decoder = ruzstd::StreamingDecoder::new(file).unwrap();
    let mut buffer = vec![];
    decoder.read_to_end(&mut buffer).unwrap();
    let (model, _) = Model::read_slice(&buffer).unwrap();

    // when predict_tags set to false, it works
    let predictor = Predictor::new(model, true).unwrap();
    let serialized = predictor.serialize_to_vec().unwrap();
    unsafe { Predictor::deserialize_from_slice_unchecked(&serialized).unwrap(); }
    // DecodeError(UnexpectedEnd { additional: 1 })
}
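For the file-persistence part of this goal, here is a minimal sketch using only the Rust standard library. It does not address the DecodeError above, which in the reported code occurs in memory before any file is involved. It only shows one way to store the serialized bytes so that a truncated cache file is detected at load time, before the bytes reach an unchecked deserializer. The names write_cache and read_cache and the 8-byte length header are assumptions made for illustration; they are not part of vaporetto.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: writes `bytes` to `path`, prefixed with the payload
/// length as a little-endian u64 so truncation can be detected later.
fn write_cache(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut out = Vec::with_capacity(8 + bytes.len());
    out.extend_from_slice(&(bytes.len() as u64).to_le_bytes());
    out.extend_from_slice(bytes);
    fs::write(path, out)
}

/// Hypothetical helper: reads a file written by `write_cache` and returns the
/// payload, or an error if the header is missing or the length does not match.
fn read_cache(path: &Path) -> io::Result<Vec<u8>> {
    let data = fs::read(path)?;
    if data.len() < 8 {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "cache file is shorter than its length header",
        ));
    }
    let (header, payload) = data.split_at(8);
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(header);
    let expected = u64::from_le_bytes(len_bytes);
    if payload.len() as u64 != expected {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("expected {} payload bytes, found {}", expected, payload.len()),
        ));
    }
    Ok(payload.to_vec())
}
```

In the code above, the vector returned by predictor.serialize_to_vec() would be passed to write_cache, and the result of read_cache would be handed to the deserializer on later runs. If read_cache succeeds and deserialization still fails, the cache file can be ruled out as the cause.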