jieba-rs's Introduction

jieba-rs

🚀 Help me to become a full-time open-source developer by sponsoring me on GitHub

The Jieba Chinese Word Segmentation Implemented in Rust

Installation

Add it to your Cargo.toml:

[dependencies]
jieba-rs = "0.7"

Then you are good to go. If you are using Rust 2015, you also need to add `extern crate jieba_rs;` to your crate root.

Example

use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();
    let words = jieba.cut("我们中出了一个叛徒", false);
    assert_eq!(words, vec!["我们", "中", "出", "了", "一个", "叛徒"]);
}
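
Passing `true` as the second argument enables the HMM step, which can join characters into words that are missing from the dictionary:

let words = jieba.cut("我们中出了一个叛徒", true);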

Enabling Additional Features

  • default-dict feature enables the embedded dictionary; it is enabled by default
  • tfidf feature enables the TF-IDF keyword extractor
  • textrank feature enables the TextRank keyword extractor

[dependencies]
jieba-rs = { version = "0.7", features = ["tfidf", "textrank"] }
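
With tfidf enabled, keyword extraction looks roughly like the following. A hedged sketch: the extractor API has shifted between releases, and this assumes the TFIDF::new_with_jieba constructor and extract_tags signature from the 0.5/0.6 era.

use jieba_rs::{Jieba, KeywordExtract, TFIDF};

fn main() {
    let jieba = Jieba::new();
    let keyword_extractor = TFIDF::new_with_jieba(&jieba);
    // Extract the top 3 keywords, filtered to the given POS tags.
    let top_k = keyword_extractor.extract_tags(
        "今天纽约的天气真好啊,京华大酒店的张尧经理吃了一只北京烤鸭。",
        3,
        vec![String::from("ns"), String::from("n"), String::from("vn"), String::from("v")],
    );
    println!("{:?}", top_k);
}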

Run benchmark

cargo bench --all-features

Benchmark: Compare with cppjieba

jieba-rs bindings

License

This work is released under the MIT license. A copy of the license is provided in the LICENSE file.

jieba-rs's People

Contributors

atry, awong-dev, dcjanus, fengkx, h-a-n-a, kianmeng, messense, mno2, windoze, zh217


jieba-rs's Issues

Survey whether a DARTS (double-array trie) could replace SwissTable for the dictionary

The double-array trie is adopted in HanLP for Chinese segmentation. It has good properties and is worth benchmarking against the hashbrown implementation of SwissTable.

I did naive benchmarking with the PATRICIA trie and critbit-trie implementations I could find on crates.io, but they are much slower than hashbrown.

As for reference implementations, rust-darts is three years old without updates and probably no longer compiles.
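
A minimal sketch of how such a benchmark could be set up, assuming criterion as a dev-dependency; the DARTS side is left as a placeholder since rust-darts may not build. (std's HashMap is backed by hashbrown, so it stands in for SwissTable here.)

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::collections::HashMap;

fn bench_lookup(c: &mut Criterion) {
    // A stand-in dictionary; a real benchmark would load dict.txt.
    let dict: HashMap<&str, usize> = [("亞丁", 57), ("些许", 163)].into_iter().collect();
    c.bench_function("hashbrown_lookup", |b| {
        b.iter(|| dict.get(black_box("亞丁")))
    });
    // A second bench_function over a double-array trie would go here.
}

criterion_group!(benches, bench_lookup);
criterion_main!(benches);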

Issue with phonetic symbols

This sentence:

字母c ĉ ĝ ĥ ĵ ŝ和复元音中的ŭ分别读作ts tʃ dʒ x ʒ ʃ u̯。以捷克式IPA转写应该写成ts,tš,dž,x,ž,š,w。

is segmented into:

字母 c ĉ ĝ ĥ ĵ ŝ 和 复 元音 中 的 ŭ 分别 读作 ts t ʃ d ʒ x ʒ ʃ u ̯ 。 以 捷克 式 IPA 转写 应该 写成 ts , t š , d ž , x , ž , š , w 。

Could u̯ be kept as one token instead of being split into u and the combining ̯?

re_han custom

Currently only RE_HAN_DEFAULT and RE_HAN_CUT_ALL are supported, but I want to customize re_han. Please consider supporting this in the future.

Could the implementation logic of load_dict be changed?

I noticed that new loads the dictionary using split + collect, which is quite inefficient. Could it be replaced with a hand-written parser that avoids the allocations from split and collect? That would make new faster. A sketch follows.
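
A minimal sketch of allocation-free line parsing, assuming the usual "word freq tag" layout; parse_line is a hypothetical helper, not the crate's actual code.

// Borrows all fields from the input line; nothing is allocated.
fn parse_line(line: &str) -> Option<(&str, Option<usize>, Option<&str>)> {
    let mut fields = line.split_whitespace();
    let word = fields.next()?;
    let freq = fields.next().and_then(|f| f.parse::<usize>().ok());
    let tag = fields.next();
    Some((word, freq, tag))
}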

Using more sophisticated techniques for the DAG

Right now the DAG is Vec<SmallVec<[usize; 5]>>, so its layout in memory is basically "[usize; 5], [usize; 5], ..., [usize; 5]" (plus metadata). I think there is still room for slight improvement if we allocate in one chunk, that is, Vec::with_capacity(num_of_nodes * percentile(0.9, len_of_common_prefix)).

Since the dictionary is static, we could pre-calculate statistics from the dictionary to know percentile(0.9, len_of_common_prefix).

If a node's successor list exceeds percentile(0.9, len_of_common_prefix), we could borrow the linear-probing technique from hash tables and fall back to a linear search. Since adjacent nodes are adjacent in memory, this would probably also have a better cache-hit rate. A sketch of the layout is below.

Some of the ideas are from the SwissTable talk: https://www.youtube.com/watch?v=ncHmEUmJZf4&t=8s
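
A sketch of the single-chunk layout under the assumptions above; FlatDag and its fields are illustrative names, not existing code.

// All successor lists live in one allocation, indexed by fixed-width
// slots; overflow beyond slot_width would fall back to linear probing.
struct FlatDag {
    slot_width: usize, // e.g. percentile(0.9, len_of_common_prefix)
    slots: Vec<usize>, // num_of_nodes * slot_width entries
}

impl FlatDag {
    fn new(num_of_nodes: usize, slot_width: usize) -> Self {
        // usize::MAX marks an empty slot.
        FlatDag { slot_width, slots: vec![usize::MAX; num_of_nodes * slot_width] }
    }

    // Successor slots of `node`; callers skip usize::MAX entries.
    fn successors(&self, node: usize) -> &[usize] {
        let start = node * self.slot_width;
        &self.slots[start..start + self.slot_width]
    }
}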

Design parallelization interface for Jieba

  • Survey whether we could leverage Rayon (see the sketch after this list).
  • If Rayon is not a good fit, a worker-pool model seems like a good start, leveraging a single-producer-multiple-consumer queue implementation (crossbeam, maybe?).
  • An array of Futures could also be considered; the merge of async/await in an upcoming Rust release would be great for ergonomics as well.
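
A minimal sketch of the Rayon option, assuming rayon as a dependency and that Jieba is Sync (cut only takes &self):

use jieba_rs::Jieba;
use rayon::prelude::*;

// Cuts each sentence on Rayon's thread pool; the segmenter is shared
// immutably across worker threads.
fn cut_batch<'a>(jieba: &Jieba, sentences: &[&'a str]) -> Vec<Vec<&'a str>> {
    sentences.par_iter().map(|s| jieba.cut(s, false)).collect()
}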

Provide C API

Providing a C API would let existing cppjieba users switch over with minimal effort.
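
A sketch of the opaque-handle shape such an API could take; the function names are hypothetical. A real API would add jieba_cut and friends with C-string conversion on top of these.

use jieba_rs::Jieba;

#[no_mangle]
pub extern "C" fn jieba_new() -> *mut Jieba {
    Box::into_raw(Box::new(Jieba::new()))
}

/// # Safety
/// `handle` must come from `jieba_new` and must not be freed twice.
#[no_mangle]
pub unsafe extern "C" fn jieba_free(handle: *mut Jieba) {
    if !handle.is_null() {
        drop(Box::from_raw(handle));
    }
}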

Could an Iterator-returning API be added?

Interfaces like the various cut_xx methods all return a Vec, but that allocation is often unnecessary. I suggest adding (or changing them to) Iterator-returning variants to reduce allocations and improve performance.
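
A sketch of the intended signature; cut_iter is hypothetical, and this wrapper still allocates internally, whereas a real implementation would walk the DAG route lazily.

use jieba_rs::Jieba;

fn cut_iter<'a>(jieba: &Jieba, sentence: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    // Placeholder body: delegates to the existing allocating API.
    jieba.cut(sentence, false).into_iter()
}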

`load_dict` should support `override/replace` mode

I want to maintain a custom dictionary in some kind of dynamic config facility such as Apollo, so it can hot-reload without an app reboot. It seems that every call to load_dict merges into the existing dictionary rather than replacing it, which means words can never be deleted from the dictionary.

Can load_dict provide override/replace mode or can we have an api like reload_dict?
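
A workaround sketch in the meantime: rebuild the segmenter from the full dictionary contents to get replace semantics, at the cost of a full re-parse. This assumes the crate's exported Error type and the with_dict constructor shown in the issue below.

use jieba_rs::{Error, Jieba};
use std::io::BufReader;

fn reload(dict_bytes: &[u8]) -> Result<Jieba, Error> {
    let mut reader = BufReader::new(dict_bytes);
    Jieba::with_dict(&mut reader)
}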

Could a method for deleting dictionary entries be added?

Our project needs to add and delete entries dynamically. lib.rs already has add_word; we also need del_word.
I implemented a simple version myself, and it works well enough.

/// delete word from dict, if the word doesn't exist, return `false`
pub fn del_word(&mut self, word: &str) -> bool {
    match self.cedar.exact_match_search(word) {
        Some((word_id, _, _)) => {
            let old_freq = self.records[word_id as usize].freq;
            self.total -= old_freq;
            // self.records.remove(word_id as usize);
            // We can't remove the record outright: removal would shift the
            // word_ids of later records and leave stale indices behind.
            self.records[word_id as usize] = Record::new(0, String::new());
            self.cedar.erase(word);
            true
        }
        None => false,
    }
}

Test code:

#[test]
fn test_add_remove_word() {
    let mut jieba = Jieba::empty();
    jieba.add_word("东西", Some(1000), None);
    jieba.add_word("石墨烯", Some(1000), None);
    let words = jieba.cut("石墨烯是好东西", false);
    assert_eq!(words, vec!["石墨烯", "是", "好", "东西"]);

    // println!("{:?}", jieba.records);

    jieba.del_word("石墨烯");
    let words = jieba.cut("石墨烯是好东西", false);
    assert_eq!(words, vec!["石", "墨", "烯", "是", "好", "东西"]);

    // println!("{:?}", jieba.records);

    jieba.add_word("石墨烯", Some(1000), None);
    let words = jieba.cut("石墨烯是好东西", false);
    assert_eq!(words, vec!["石墨烯", "是", "好", "东西"]);

    // println!("{:?}", jieba.records);
}

load_dict stops if the user dict fails to parse, and no error information is printed

At lib.rs line 288:

map_err still behaves like unwrap, and you can't get information about what went wrong.

Maybe change the code from

-                let freq = parts
-                    .get(1)
-                    .map(|x| {
-                        x.parse::<usize>()
-                            .map_err(|e| Error::InvalidDictEntry(format!("{}", e)))
-                    })
-                    .unwrap_or(Ok(0))?;

to something like this, so users can see what the problem is and decide whether to continue or stop reading:

+                let freq = {
+                    if let Some(m_freq) = parts.get(1) {
+                        if let Ok(mm_freq) = m_freq.parse::<usize>() {
+                            mm_freq
+                        } else {
+                            println!("parse error {:?}", &parts);
+                            0
+                        }
+                    } else {
+                        println!("get nothing {:?}", &parts);
+                        0
+                    }
+                };
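
A middle ground keeps the early return of the original but puts the offending entry into the error, reusing the Error::InvalidDictEntry variant from the snippet above:

let freq = match parts.get(1) {
    Some(x) => x.parse::<usize>().map_err(|e| {
        Error::InvalidDictEntry(format!("bad frequency {:?} in {:?}: {}", x, parts, e))
    })?,
    None => 0,
};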

`index out of bounds` when using custom dictionary

Test Case

use jieba_rs::Jieba;

fn main() {
    use std::{fs::File, io::BufReader};
    let mut dict = BufReader::new(File::open("/home/goodguy/dict.txt.big").unwrap());
    let jieba = Jieba::with_dict(&mut dict).unwrap();
    println!("{:?}", jieba.cut("Firefox全球市佔率為35%至40%,為全球第二流行的網頁瀏覽器[17][18][19][20][21]。Firefox在某些國家還是最流行的網頁瀏覽器,如在薩摩亞、德國、厄利垂亞及古巴,Firefox市佔率分別為61.05%、38.36%、79.39%及85.93%。據Mozilla統計,截至2014年12月,Firefox在全世界擁有10億使用者[22]。", false));
}

dict.txt.big can be fetched from https://github.com/fxsjy/jieba/blob/master/extra_dict/dict.txt.big

Result

thread 'main' panicked at 'index out of bounds: the len is 33908 but the index is 67813', /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2695:10
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:39
   1: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at src/libstd/sys_common/backtrace.rs:59
             at src/libstd/panicking.rs:197
   3: std::panicking::default_hook
             at src/libstd/panicking.rs:211
   4: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:474
   5: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:381
   6: rust_begin_unwind
             at src/libstd/panicking.rs:308
   7: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
   8: core::panicking::panic_bounds_check
             at src/libcore/panicking.rs:61
   9: <usize as core::slice::SliceIndex<[T]>>::index
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2695
  10: core::slice::<impl core::ops::index::Index<I> for [T]>::index
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/slice/mod.rs:2552
  11: <alloc::vec::Vec<T> as core::ops::index::Index<I>>::index
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/liballoc/vec.rs:1687
  12: jieba_rs::Jieba::calc::{{closure}}
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:359
  13: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/ops/function.rs:279
  14: core::option::Option<T>::map
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/option.rs:416
  15: <core::iter::adapters::Map<I,F> as core::iter::traits::iterator::Iterator>::next
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/adapters/mod.rs:570
  16: core::iter::traits::iterator::select_fold1
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/traits/iterator.rs:2599
  17: core::iter::traits::iterator::Iterator::max_by
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libcore/iter/traits/iterator.rs:2082
  18: jieba_rs::Jieba::calc
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:349
  19: jieba_rs::Jieba::cut_dag_no_hmm
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:421
  20: jieba_rs::Jieba::cut_internal
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:572
  21: jieba_rs::Jieba::cut
             at /home/goodguy/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.9/src/lib.rs:612
  22: testproj::main
             at src/main.rs:7
  23: std::rt::lang_start::{{closure}}
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/rt.rs:64
  24: std::panicking::try::do_call
             at src/libstd/rt.rs:49
             at src/libstd/panicking.rs:293
  25: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:85
  26: std::rt::lang_start_internal
             at src/libstd/panicking.rs:272
             at src/libstd/panic.rs:394
             at src/libstd/rt.rs:48
  27: std::rt::lang_start
             at /rustc/a53f9df32fbb0b5f4382caaad8f1a46f36ea887c/src/libstd/rt.rs:64
  28: main
  29: __libc_start_main
  30: _start

Clue

Look into the dict.txt.big...

些許端 2 m
些许 163 d
些许端 2 m
些須 3 d
些须 3 d
亜 20 zg
亝 6 zg
亞 13 zg
亞 5789 j
亞丁 57 nrt
亞丁港 4 ns
亞丁灣 37 ns
亞世 6 nrt
亞世達 2 nr
亞乃夫 2 nr
亞之傑 3 nr
亞乙基 2 nr
亞伊傑 2 nr
亞伯 2 ns
亞伯奎 6 nr
亞伯拉罕 19 nrt

... when I delete all the lines after 亞 5789 j (line number 33908), the panic disappears. The result then looks like the following:

["Firefox", "全", "球", "市", "佔", "率", "為", "35", "%", "至", "40", "%", ",", "為", "全", "球", "第", "二流", "行", "的", "網", "頁", "瀏", "覽", "器", "[", "17", "]", "[", "18", "]", "[", "19", "]", "[", "20", "]", "[", "21", "]", "。", "Firefox", "在", "某", "些", "國", "家", "還", "是", "最", "流", "行", "的", "網", "頁", "瀏", "覽", "器", ",", "如", "在", "薩", "摩", "亞", "、", "德", "國", "、", "厄", "利", "垂", "亞", "及", "古", "巴", ",", "Firefox", "市", "佔", "率", "分", "別", "為", "61", ".", "05", "%", "、", "38", ".", "36", "%", "、", "79", ".", "39", "%", "及", "85", ".", "93", "%", "。", "據", "Mozilla", "統", "計", ",", "截", "至", "2014", "年", "12", "月", ",", "Firefox", "在", "全", "世界", "擁", "有", "10", "億", "使", "用", "者", "[", "22", "]", "。"]

Env

  • debian buster
  • rustc 1.36.0 stable
  • jieba-rs 0.4.9

add_word panic: attempt to subtract with overflow

use jieba_rs::Jieba;

fn main() {
    let mut jieba = Jieba::new();
    jieba.add_word("测试", Some(10), None);
}

thread 'main' panicked at 'attempt to subtract with overflow', ~/.cargo/registry/src/github.com-1ecc6299db9ec823/jieba-rs-0.4.10/src/lib.rs:254:31

Prefer builder pattern to setters/getters in KeywordExtractConfig

From here: https://github.com/messense/jieba-rs/pull/100/files#r1560432915

@messense said

I'd remove getters (not really useful) and use builder pattern for KeywordExtractConfig, like

let config = KeywordExtractConfig::builder()
    .add_stop_word("word")
    .use_hmm(true)
    // and other options
    .build();

or without a separate builder type:

let config = KeywordExtractConfig::default()
    .add_stop_word("word")
    .use_hmm(true)
    // and other options
    ;

The builder pattern is indeed nicer here. The getters may still be wanted so that other languages can query the Rust struct without replicating its state, but the builder does seem like a definite win.

`Jieba::add_word` panics when given empty `word`

Example:

use jieba_rs::Jieba;

fn main() {
    let mut jieba = Jieba::new();
    jieba.add_word("", None, None);
}

gives

thread 'main' panicked at /path/to/cedarwood-0.4.6/src/lib.rs:302:13:
failed to insert zero-length key

Should the panic message be improved, and a note be added to the doc?
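
Until then, a caller-side guard sketch avoids the panic. This assumes add_word's (word, Option freq, Option tag) -> usize signature as in recent versions.

use jieba_rs::Jieba;

// Returns None instead of panicking on a zero-length key.
fn add_word_checked(jieba: &mut Jieba, word: &str) -> Option<usize> {
    if word.is_empty() {
        None
    } else {
        Some(jieba.add_word(word, None, None))
    }
}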

Reduce the BinaryHeap size to be `k`

To retrieve the top k, we only need k nodes in the BinaryHeap, not n.

We need to rewrite the following snippet using a min-heap:

        let mut heap = BinaryHeap::new();
        for (k, v) in ranking_vector.iter().enumerate() {
            heap.push(HeapNode {
                rank: (v * 1e10) as u64,
                word_id: k,
            })
        }

        let mut res: Vec<String> = Vec::new();
        for _ in 0..top_k {
            if let Some(w) = heap.pop() {
                res.push(unique_words[w.word_id].clone());
            }
        }
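
A minimal sketch of the bounded min-heap, using std's Reverse wrapper to turn BinaryHeap into a min-heap; the names are illustrative.

use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Keeps at most k (rank, word_id) pairs; the root is always the current
// minimum, so popping after the (k+1)-th push evicts the worst candidate.
fn top_k(ranks: &[u64], k: usize) -> Vec<usize> {
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for (word_id, &rank) in ranks.iter().enumerate() {
        heap.push(Reverse((rank, word_id)));
        if heap.len() > k {
            heap.pop();
        }
    }
    let mut items: Vec<(u64, usize)> = heap.into_iter().map(|Reverse(x)| x).collect();
    items.sort_unstable_by(|a, b| b.0.cmp(&a.0)); // highest rank first
    items.into_iter().map(|(_, word_id)| word_id).collect()
}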

Edge case: "test-1"

When I tried to cut `test-1`, the result was wrong.

let words = jieba.cut_for_search("test-1", true);
assert_eq!(words, vec!["test", "-", "1"]);

// panicked at 'assertion failed: `(left == right)
// left: `["test-1"]`,
// right: `["test", "-", "1"]`'

This seems to be related to 7a520c1

Make APIs for TFIDF and TextRank that do NOT take a reference to Jieba?

I tried implementing an Elixir binding of jieba-rs here: https://github.com/awong-dev/jieba.

When it came to the TFIDF<'a> and TextRank<'a> structs, it became hard (impossible?) to provide a sensible API to Elixir, because the lifetimes require the structs to be stack-scoped.

In an ideal world, you would conceptually want to create a TFIDF/TextRank struct whose lifetime is managed by Elixir, load it up once (e.g. via add_stop_word() or even load_dict()), and then use it later as needed.

With the current setup, where jieba_rs requires TFIDF and TextRank to be bound to the stack frame they are constructed in, any wrapping API has to recreate the two structs on each call to extract_tags(). See my code here:

https://github.com/awong-dev/jieba/blob/main/native/rustler_jieba/src/lib.rs#L232

If there are not many stop words, etc., this is cheap but if there are a lot, this is very wasteful.

How would you feel about exposing something like

pub struct TFIDFState {
    idf_dict: HashMap<String, f64>,
    median_idf: f64,
    stop_words: BTreeSet<String>,
}

impl TFIDFState {
    pub fn clone(&self) -> Self {...}
}

impl<'a> TFIDF<'a> {
    pub fn new_with_jieba_and_state(jieba: &'a Jieba, state: TFIDFState) -> Self {...}
    pub fn extract_state(self) -> TFIDFState { /* Move the TFIDF data out into an owned state. */ }
    ...
}

and something similar for TextRank.

This would allow both TFIDF<'a> and TextRank<'a> be used as cheap-to-construct, thin facades with lifetimes bound to a jieba instance and not break the existing API.

Alternatively, if we're willing to break API compatibility, I wonder if TFIDF and TextRank would be better off NOT binding jieba during construction. If KeywordExtract were

fn extract_tags(
    &self,
    jieba: &Jieba,
    sentence: &str,
    top_k: usize,
    allowed_pos: Vec<String>,
) -> Vec<Keyword>

where you pass in the wanted jieba on each invocation of extract_tags, we'd avoid the lifetime coupling of both structs entirely and simplify the API.

As a bonus, it would be easy to use one KeywordExtract instance with multiple segmenters, in case you wanted to test behavior with different Jieba dictionaries.

Thoughts?

Benchmark different DAG implementations

Given that the upper bound on the DAG's node count is the number of chars (even if we ignore the concept of grapheme clusters), we have two options for representing the DAG:

  1. BTreeMap<usize, SmallVec<[usize; 5]>>
  2. Vec<SmallVec<[usize; 5]>>

I suspect the second would be much faster than the BTreeMap, but it wastes space if the graph is sparse. A heuristic approach would be nice: use Vec when the sentence is shorter than, say, 1024 characters, and BTreeMap otherwise. A sketch follows.
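
A sketch of that heuristic, with illustrative names and the smallvec crate assumed:

use smallvec::SmallVec;
use std::collections::BTreeMap;

enum Dag {
    Dense(Vec<SmallVec<[usize; 5]>>),              // short sentences
    Sparse(BTreeMap<usize, SmallVec<[usize; 5]>>), // long, sparse graphs
}

fn new_dag(char_count: usize) -> Dag {
    if char_count < 1024 {
        Dag::Dense(vec![SmallVec::new(); char_count])
    } else {
        Dag::Sparse(BTreeMap::new())
    }
}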

add_word: adding a 2-character word has no effect, while 3-character words work

let mut jieba = Jieba::new();
jieba.add_word("莞城", None, None);
let s1 = "广东省东莞市莞城区";
jieba.cut(s1, true);

This returns ["广东省", "东莞市", "莞", "城区"]; "莞城" is not recognized.

But setting the word to "莞城区" works fine.

On a related note: can the word frequency and POS tag be omitted from files loaded via load_dict? I tried with words only and it didn't seem to take effect.
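
A hedged guess, not a verified fix: the route through 东莞市 + 城区 may simply outscore one through 莞城 at the default frequency, so forcing a much higher frequency when adding might tip the cut:

// Assumption: an explicit, large frequency lets the 2-character word win.
jieba.add_word("莞城", Some(100_000), None);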
