jieba-rs's Introduction

jieba-rs

🚀 Help me to become a full-time open-source developer by sponsoring me on GitHub

The Jieba Chinese Word Segmentation Implemented in Rust

Installation

Add it to your Cargo.toml:

[dependencies]
jieba-rs = "0.7"

Then you are good to go. If you are using Rust 2015, you also need to add `extern crate jieba_rs;` to your crate root.

Example

use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();
    let words = jieba.cut("我们中出了一个叛徒", false);
    assert_eq!(words, vec!["我们", "中", "出", "了", "一个", "叛徒"]);
}
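
Passing `true` as the second argument enables the HMM step, which can join characters into words that are missing from the dictionary:

let words = jieba.cut("我们中出了一个叛徒", true);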

Enabling Additional Features

  • default-dict feature enables the embedded dictionary; it is enabled by default
  • tfidf feature enables the TF-IDF keyword extractor
  • textrank feature enables the TextRank keyword extractor

[dependencies]
jieba-rs = { version = "0.7", features = ["tfidf", "textrank"] }
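
With tfidf enabled, keyword extraction looks roughly like the following. A hedged sketch: the extractor API has shifted between releases, and this assumes the TFIDF::new_with_jieba constructor and extract_tags signature from the 0.5/0.6 era.

use jieba_rs::{Jieba, KeywordExtract, TFIDF};

fn main() {
    let jieba = Jieba::new();
    let keyword_extractor = TFIDF::new_with_jieba(&jieba);
    // Extract the top 3 keywords, filtered to the given POS tags.
    let top_k = keyword_extractor.extract_tags(
        "今天纽约的天气真好啊,京华大酒店的张尧经理吃了一只北京烤鸭。",
        3,
        vec![String::from("ns"), String::from("n"), String::from("vn"), String::from("v")],
    );
    println!("{:?}", top_k);
}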

Run benchmark

cargo bench --all-features

Benchmark: Compare with cppjieba

jieba-rs bindings

License

This work is released under the MIT license. A copy of the license is provided in the LICENSE file.

jieba-rs's People

Contributors

atry, awong-dev, dcjanus, fengkx, h-a-n-a, kianmeng, messense, mno2, windoze, zh217


jieba-rs's Issues

Survey whether a DARTS (double-array trie) could replace SwissTable for the dictionary

The double-array trie is adopted in HanLP for Chinese segmentation. It has good properties and is worth benchmarking against the hashbrown implementation of SwissTable.

I did naive benchmarking with the PATRICIA trie and critbit-trie implementations I could find on crates.io, but they are much slower than hashbrown.

As for reference implementations, rust-darts is three years old without updates and probably no longer compiles.
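
A minimal sketch of how such a benchmark could be set up, assuming criterion as a dev-dependency; the DARTS side is left as a placeholder since rust-darts may not build. (std's HashMap is backed by hashbrown, so it stands in for SwissTable here.)

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::collections::HashMap;

fn bench_lookup(c: &mut Criterion) {
    // A stand-in dictionary; a real benchmark would load dict.txt.
    let dict: HashMap<&str, usize> = [("亞丁", 57), ("些许", 163)].into_iter().collect();
    c.bench_function("hashbrown_lookup", |b| {
        b.iter(|| dict.get(black_box("亞丁")))
    });
    // A second bench_function over a double-array trie would go here.
}

criterion_group!(benches, bench_lookup);
criterion_main!(benches);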

Issue with phonetic symbols

This sentence:

字母c ĉ ĝ ĥ ĵ ŝ和复元音中的ŭ分别读作ts tʃ dʒ x ʒ ʃ u̯。以捷克式IPA转写应该写成ts,tš,dž,x,ž,š,w。

is segmented into:

字母 c ĉ ĝ ĥ ĵ ŝ 和 复 元音 中 的 ŭ 分别 读作 ts t ʃ d ʒ x ʒ ʃ u ̯ 。 以 捷克 式 IPA 转写 应该 写成 ts , t š , d ž , x , ž , š , w 。

Could u̯ be kept as one token instead of being split into u and the combining ̯?

re_han custom

Currently only RE_HAN_DEFAULT and RE_HAN_CUT_ALL are supported, but I want to customize re_han. Please consider supporting this in the future.

Could the implementation logic of load_dict be changed?

I noticed that new loads the dictionary using split + collect, which is quite inefficient. Could it be replaced with a hand-written parser that avoids the allocations from split and collect? That would make new faster. A sketch follows.
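
A minimal sketch of allocation-free line parsing, assuming the usual "word freq tag" layout; parse_line is a hypothetical helper, not the crate's actual code.

// Borrows all fields from the input line; nothing is allocated.
fn parse_line(line: &str) -> Option<(&str, Option<usize>, Option<&str>)> {
    let mut fields = line.split_whitespace();
    let word = fields.next()?;
    let freq = fields.next().and_then(|f| f.parse::<usize>().ok());
    let tag = fields.next();
    Some((word, freq, tag))
}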

Using more sophisticated techniques for the DAG

Right now the DAG is Vec<SmallVec<[usize; 5]>>, so its layout in memory is basically "[usize; 5], [usize; 5], ..., [usize; 5]" (plus metadata). I think there is still room for slight improvement if we allocate in one chunk, that is, Vec::with_capacity(num_of_nodes * percentile(0.9, len_of_common_prefix)).

Since the dictionary is static, we could pre-calculate statistics from the dictionary to know percentile(0.9, len_of_common_prefix).

If a node's successor list exceeds percentile(0.9, len_of_common_prefix), we could borrow the linear-probing technique from hash tables and fall back to a linear search. Since adjacent nodes are adjacent in memory, this would probably also have a better cache-hit rate. A sketch of the layout is below.

Some of the ideas are from the SwissTable talk: https://www.youtube.com/watch?v=ncHmEUmJZf4&t=8s
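
A sketch of the single-chunk layout under the assumptions above; FlatDag and its fields are illustrative names, not existing code.

// All successor lists live in one allocation, indexed by fixed-width
// slots; overflow beyond slot_width would fall back to linear probing.
struct FlatDag {
    slot_width: usize, // e.g. percentile(0.9, len_of_common_prefix)
    slots: Vec<usize>, // num_of_nodes * slot_width entries
}

impl FlatDag {
    fn new(num_of_nodes: usize, slot_width: usize) -> Self {
        // usize::MAX marks an empty slot.
        FlatDag { slot_width, slots: vec![usize::MAX; num_of_nodes * slot_width] }
    }

    // Successor slots of `node`; callers skip usize::MAX entries.
    fn successors(&self, node: usize) -> &[usize] {
        let start = node * self.slot_width;
        &self.slots[start..start + self.slot_width]
    }
}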

Design parallelization interface for Jieba

  • Survey whether we could leverage Rayon (see the sketch after this list).
  • If Rayon is not a good fit, a worker-pool model seems like a good start, leveraging a single-producer-multiple-consumer queue implementation (crossbeam, maybe?).
  • An array of Futures could also be considered; the merge of async/await in an upcoming Rust release would be great for ergonomics as well.
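
A minimal sketch of the Rayon option, assuming rayon as a dependency and that Jieba is Sync (cut only takes &self):

use jieba_rs::Jieba;
use rayon::prelude::*;

// Cuts each sentence on Rayon's thread pool; the segmenter is shared
// immutably across worker threads.
fn cut_batch<'a>(jieba: &Jieba, sentences: &[&'a str]) -> Vec<Vec<&'a str>> {
    sentences.par_iter().map(|s| jieba.cut(s, false)).collect()
}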

Provide C API

Providing a C API would let existing cppjieba users switch over with minimal effort.
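
A sketch of the opaque-handle shape such an API could take; the function names are hypothetical. A real API would add jieba_cut and friends with C-string conversion on top of these.

use jieba_rs::Jieba;

#[no_mangle]
pub extern "C" fn jieba_new() -> *mut Jieba {
    Box::into_raw(Box::new(Jieba::new()))
}

/// # Safety
/// `handle` must come from `jieba_new` and must not be freed twice.
#[no_mangle]
pub unsafe extern "C" fn jieba_free(handle: *mut Jieba) {
    if !handle.is_null() {
        drop(Box::from_raw(handle));
    }
}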

Could an Iterator-returning API be added?

Interfaces like the various cut_xx methods all return a Vec, but that allocation is often unnecessary. I suggest adding (or changing them to) Iterator-returning variants to reduce allocations and improve performance.
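
A sketch of the intended signature; cut_iter is hypothetical, and this wrapper still allocates internally, whereas a real implementation would walk the DAG route lazily.

use jieba_rs::Jieba;

fn cut_iter<'a>(jieba: &Jieba, sentence: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    // Placeholder body: delegates to the existing allocating API.
    jieba.cut(sentence, false).into_iter()
}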

`load_dict` should support `override/replace` mode

I want to maintain a custom dictionary in some kind of dynamic config facility such as Apollo, so it can hot-reload without an app reboot. It seems that every call to load_dict merges into the existing dictionary rather than replacing it, which means words can never be deleted from the dictionary.

Can load_dict provide override/replace mode or can we have an api like reload_dict?
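
A workaround sketch in the meantime: rebuild the segmenter from the full dictionary contents to get replace semantics, at the cost of a full re-parse. This assumes the crate's exported Error type and the with_dict constructor shown in the issue below.

use jieba_rs::{Error, Jieba};
use std::io::BufReader;

fn reload(dict_bytes: &[u8]) -> Result<Jieba, Error> {
    let mut reader = BufReader::new(dict_bytes);
    Jieba::with_dict(&mut reader)
}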

Could a method for deleting dictionary entries be added?

Our project needs to add and delete entries dynamically. lib.rs already has add_word; we also need del_word.
I implemented a simple version myself, and it works well enough.

/// delete word from dict, if the word doesn't exist, return `false`
pub fn del_word(&mut self, word: &str) -> bool {
    match self.cedar.exact_match_search(word) {
        Some((word_id, _, _)) => {
            let old_freq = self.records[word_id as usize].freq;
            self.total -= old_freq;
            // self.records.remove(word_id as usize);
            // We can't remove the record outright: removal would shift the
            // word_ids of later records and leave stale indices behind.
            self.records[word_id as usize] = Record::new(0, String::new());
            self.cedar.erase(word);
            true
        }
        None => false,
    }
}

Test code:

#[test]
fn test_add_remove_word() {
    let mut jieba = Jieba::empty();
    jieba.add_word("东西", Some(1000), None);
    jieba.add_word("石墨烯", Some(1000), None);
    let words = jieba.cut("石墨烯是好东西", false);
    assert_eq!(words, vec!["石墨烯", "是", "好", "东西"]);

    // println!("{:?}", jieba.records);

    jieba.del_word("石墨烯");
    let words = jieba.cut("石墨烯是好东西", false);
    assert_eq!(words, vec!["石", "墨", "烯", "是", "好", "东西"]);

    // println!("{:?}", jieba.records);

    jieba.add_word("石墨烯", Some(1000), None);
    let words = jieba.cut("石墨烯是好东西", false);
    assert_eq!(words, vec!["石墨烯", "是", "好", "东西"]);

    // println!("{:?}", jieba.records);
}

load_dict stops if the user dict fails to parse, and no error information is printed

At lib.rs line 288:

map_err still behaves like unwrap, and you can't get information about what went wrong.

Maybe change the code from

-                let freq = parts
-                    .get(1)
-                    .map(|x| {
-                        x.parse::<usize>()
-                            .map_err(|e| Error::InvalidDictEntry(format!("{}", e)))
-                    })
-                    .unwrap_or(Ok(0))?;

to something like this, so users can see what the problem is and decide whether to continue or stop reading:

+                let freq = {
+                    if let Some(m_freq) = parts.get(1) {
+                        if let Ok(mm_freq) = m_freq.parse::<usize>() {
+                            mm_freq
+                        } else {
+                            println!("parse error {:?}", &parts);
+                            0
+                        }
+                    } else {
+                        println!("get nothing {:?}", &parts);
+                        0
+                    }
+                };
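
A middle ground keeps the early return of the original but puts the offending entry into the error, reusing the Error::InvalidDictEntry variant from the snippet above:

let freq = match parts.get(1) {
    Some(x) => x.parse::<usize>().map_err(|e| {
        Error::InvalidDictEntry(format!("bad frequency {:?} in {:?}: {}", x, parts, e))
    })?,
    None => 0,
};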

`index out of bounds` when using custom dictionary

Test Case

use jieba_rs::Jieba;

fn main() {
    use std::{fs::File, io::BufReader};
    let mut dict = BufReader::new(File::open("/home/goodguy/dict.txt.big").unwrap());
    let jieba = Jieba::with_dict(&mut dict).unwrap();
    println!("{:?}", jieba.cut("Firefox全球市佔率為35%至40%,為全球第二流行的網頁瀏覽器[17][18][19][20][21]。Firefox在某些國家還是最流行的網頁瀏覽器,如在薩摩亞、德國、厄利垂亞及古巴,Firefox市佔率分別為61.05%、38.36%、79.39%及85.93%。據Mozilla統計,截至2014年12月,Firefox在全世界擁有10億使用者[22]。", false));
}

dict.txt.big can be fetched from https://github.com/fxsjy/jieba/blob/master/extra_dict/dict.txt.big

Result

thread 'main' panicked at 'index out of bounds: the len is 33908 but the index is 67813', /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2695:10
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:197
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:211
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:474
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:381
   6: rust_begin_unwind
             at src/libstd/panicking.rs:308
   7: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
   8: core::panicking::panic_bounds_check
             at src/libcore/panicking.rs:61
   9: <usize as core::slice::SliceIndex<[T]>>::index
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2695
  10: core::slice::<impl core::ops::index::Index<I> for [T]>::index
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2552
  11: <alloc::vec::Vec<T> as core::ops::index::Index<I>>::index
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/liballoc/vec.rs:1687
  12: jieba_rs::Jieba::calc::{{closure}}
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:359
  13: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/ops/function.rs:279
  14: core::option::Option<T>::map
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/option.rs:416
  15: <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::next
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/adapters/mod.rs:570
  16: core::iter::traits::iterator::select_fold1
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/traits/iterator.rs:2599
  17: core::iter::traits::iterator::Iterator::max_by
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/traits/iterator.rs:2082
  18: jieba_rs::Jieba::calc
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:349
  19: jieba_rs::Jieba::cut_dag_no_hmm
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:421
  20: jieba_rs::Jieba::cut_internal
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:572
  21: jieba_rs::Jieba::cut
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:612
  22: testproj::main
             at src/main.rs:7
  23: std::rt::lang_start::{{closure}}
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/rt.rs:64
  24: std::panicking::try::do_call
             at src/libstd/rt.rs:49
             at src/libstd/panicking.rs:293
  25: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:85
  26: std::rt::lang_start_internal
             at src/libstd/panicking.rs:272
             at src/libstd/panic.rs:394
             at src/libstd/rt.rs:48
  27: std::rt::lang_start
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/rt.rs:64
  28: main
  29: __libc_start_main
  30: _start

Clue

Look into the dict.txt.big...

些許端 2 m
些许 163 d
些许端 2 m
些須 3 d
些须 3 d
亜 20 zg
亝 6 zg
亞 13 zg
亞 5789 j
亞丁 57 nrt
亞丁港 4 ns
亞丁灣 37 ns
亞世 6 nrt
亞世達 2 nr
亞乃夫 2 nr
亞之傑 3 nr
亞乙基 2 nr
亞伊傑 2 nr
亞伯 2 ns
亞伯奎 6 nr
亞伯拉罕 19 nrt

... when I delete all the lines after 亞 5789 j (line number 33908), the panic disappears. The result then looks like the following:

["Firefox", "全", "球", "市", "佔", "率", "為", "35", "%", "至", "40", "%", ",", "為", "全", "球", "第", "二流", "行", "的", "網", "頁", "瀏", "覽", "器", "[", "17", "]", "[", "18", "]", "[", "19", "]", "[", "20", "]", "[", "21", "]", "。", "Firefox", "在", "某", "些", "國", "家", "還", "是", "最", "流", "行", "的", "網", "頁", "瀏", "覽", "器", ",", "如", "在", "薩", "摩", "亞", "、", "德", "國", "、", "厄", "利", "垂", "亞", "及", "古", "巴", ",", "Firefox", "市", "佔", "率", "分", "別", "為", "61", ".", "05", "%", "、", "38", ".", "36", "%", "、", "79", ".", "39", "%", "及", "85", ".", "93", "%", "。", "據", "Mozilla", "統", "計", ",", "截", "至", "2014", "年", "12", "月", ",", "Firefox", "在", "全", "世界", "擁", "有", "10", "億", "使", "用", "者", "[", "22", "]", "。"]

Env

  • debian buster
  • rustc 1.36.0 stable
  • jieba-rs 0.4.9

add_word panic: attempt to subtract with overflow

use jieba_rs::Jieba;

fn main() {
    let mut jieba = Jieba::new();
    jieba.add_word("测试", Some(10), None);
}

thread 'main' panicked at 'attempt to subtract with overflow', ~/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.10/src/lib.rs:254:31

Prefer builder pattern to setters/getters in KeywordExtractConfig

From here: https://github.com/messense/jieba-rs/pull/100/files#r1560432915

@messense said

I'd remove getters (not really useful) and use builder pattern for KeywordExtractConfig, like

let config = KeywordExtractConfig::builder()
    .add_stop_word("word")
    .use_hmm(true)
    // and other options
    .build();

or without a separate builder type:

let config = KeywordExtractConfig::default()
    .add_stop_word("word")
    .use_hmm(true)
    // and other options
    ;

The builder pattern is indeed nicer here. The getters may still be wanted so that other languages can query the Rust struct without replicating its state, but the builder does seem like a definite win.

`Jieba::add_word` panics when given empty `word`

Example:

use jieba_rs::Jieba;

fn main() {
    let mut jieba = Jieba::new();
    jieba.add_word("", None, None);
}

gives

thread 'main' panicked at /path/to/cedarwood-0.4.6/src/lib.rs:302:13:
failed to insert zero-length key

Should the panic message be improved, and a note be added to the doc?
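
Until then, a caller-side guard sketch avoids the panic. This assumes add_word's (word, Option freq, Option tag) -> usize signature as in recent versions.

use jieba_rs::Jieba;

// Returns None instead of panicking on a zero-length key.
fn add_word_checked(jieba: &mut Jieba, word: &str) -> Option<usize> {
    if word.is_empty() {
        None
    } else {
        Some(jieba.add_word(word, None, None))
    }
}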

Reduce the BinaryHeap size to be `k`

To retrieve the top k, we only need k nodes in the BinaryHeap, not n.

We need to rewrite the following snippet using a min-heap:

        let mut heap = BinaryHeap::new();
        for (k, v) in ranking_vector.iter().enumerate() {
            heap.push(HeapNode {
                rank: (v * 1e10) as u64,
                word_id: k,
            })
        }

        let mut res: Vec<String> = Vec::new();
        for _ in 0..top_k {
            if let Some(w) = heap.pop() {
                res.push(unique_words[w.word_id].clone());
            }
        }
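
A minimal sketch of the bounded min-heap, using std's Reverse wrapper to turn BinaryHeap into a min-heap; the names are illustrative.

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Keeps at most k (rank, word_id) pairs; the root is always the current
// minimum, so popping after the (k+1)-th push evicts the worst candidate.
fn top_k(ranks: &[u64], k: usize) -> Vec<usize> {
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for (word_id, &rank) in ranks.iter().enumerate() {
        heap.push(Reverse((rank, word_id)));
        if heap.len() > k {
            heap.pop();
        }
    }
    let mut items: Vec<(u64, usize)> = heap.into_iter().map(|Reverse(x)| x).collect();
    items.sort_unstable_by(|a, b| b.0.cmp(&a.0)); // highest rank first
    items.into_iter().map(|(_, word_id)| word_id).collect()
}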

Edge case: "test-1"

When I tried to cut `test-1`, the result was wrong.

let words = jieba.cut_for_search("test-1", true);
assert_eq!(words, vec!["test", "-", "1"]);

// panicked at 'assertion failed: `(left == right)
// left: `["test-1"]`,
// right: `["test", "-", "1"]`'

This seems to be related to 7a520c1

Make APIs for TFIDF and TextRank that do NOT take a reference to Jieba?

I tried implementing an Elixir binding of jieba-rs here: https://github.com/awong-dev/jieba.

When it came to the TFIDF<'a> and TextRank<'a> structs, it became hard (impossible?) to provide a sensible API to Elixir, because the lifetimes require the structs to be stack-scoped.

In an ideal world, you would conceptually want to create a TFIDF/TextRank struct whose lifetime is managed by Elixir, load it up once (e.g. via add_stop_word() or even load_dict()), and then use it later as needed.

With the current setup, where jieba_rs requires TFIDF and TextRank to be bound to the stack frame they are constructed in, any wrapping API has to recreate the two structs on each call to extract_tags(). See my code here:

https://github.com/awong-dev/jieba/blob/main/native/rustler_jieba/src/lib.rs#L232

If there are not many stop words, etc., this is cheap but if there are a lot, this is very wasteful.

How would you feel about exposing something like

pub struct TFIDFState {
    idf_dict: HashMap<String, f64>,
    median_idf: f64,
    stop_words: BTreeSet<String>,
}

impl TFIDFState {
    pub fn clone(&self) -> Self {...}
}

impl<'a> TFIDF<'a> {
    pub fn new_with_jieba_and_state(jieba: &'a Jieba, state: TFIDFState) -> Self {...}
    pub fn extract_state(self) -> TFIDFState { /* Move the TFIDF data out into an owned state. */ }
    ...
}

and something similar for TextRank.

This would allow both TFIDF<'a> and TextRank<'a> be used as cheap-to-construct, thin facades with lifetimes bound to a jieba instance and not break the existing API.

Alternatively, if we're willing to break API compatibility, I wonder if TFIDF and TextRank would be better off NOT binding jieba during construction. If KeywordExtract were

fn extract_tags(
    &self,
    jieba: &Jieba,
    sentence: &str,
    top_k: usize,
    allowed_pos: Vec<String>,
) -> Vec<Keyword>

where you pass in the wanted jieba on each invocation of extract_tags, we'd avoid the lifetime coupling of both structs entirely and simplify the API.

As a bonus, it would be easy to use one KeywordExtract instance with multiple segmenters, in case you wanted to test behavior with different Jieba dictionaries.

Thoughts?

Benchmark different DAG implementations

Given that the upper bound on the DAG's node count is the number of chars (even if we ignore the concept of grapheme clusters), we have two options for representing the DAG:

  1. BTreeMap<usize, SmallVec<[usize; 5]>>
  2. Vec<SmallVec<[usize; 5]>>

I suspect the second would be much faster than the BTreeMap, but it wastes space if the graph is sparse. A heuristic approach would be nice: use Vec when the sentence is shorter than, say, 1024 characters, and BTreeMap otherwise. A sketch follows.
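
A sketch of that heuristic, with illustrative names and the smallvec crate assumed:

use smallvec::SmallVec;
use std::collections::BTreeMap;

enum Dag {
    Dense(Vec<SmallVec<[usize; 5]>>),              // short sentences
    Sparse(BTreeMap<usize, SmallVec<[usize; 5]>>), // long, sparse graphs
}

fn new_dag(char_count: usize) -> Dag {
    if char_count < 1024 {
        Dag::Dense(vec![SmallVec::new(); char_count])
    } else {
        Dag::Sparse(BTreeMap::new())
    }
}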

add_word: adding a 2-character word has no effect, while 3-character words work

let mut jieba = Jieba::new();
jieba.add_word("莞城", None, None);
let s1 = "广东省东莞市莞城区";
jieba.cut(s1, true);

This returns ["广东省", "东莞市", "莞", "城区"]; "莞城" is not recognized.

But setting the word to "莞城区" works fine.

On a related note: can the word frequency and POS tag be omitted from files loaded via load_dict? I tried with words only and it didn't seem to take effect.
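
A hedged guess, not a verified fix: the route through 东莞市 + 城区 may simply outscore one through 莞城 at the default frequency, so forcing a much higher frequency when adding might tip the cut:

// Assumption: an explicit, large frequency lets the 2-character word win.
jieba.add_word("莞城", Some(100_000), None);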
