
trie-rs's Introduction

trie-rs

Memory efficient trie (prefix tree) and map library based on LOUDS.

Master API Docs | Released API Docs | Benchmark Results | Changelog


Quickstart

To use trie-rs, add the following to your Cargo.toml file:

[dependencies]
trie-rs = "0.4.2"

Usage Overview

use std::str;
use trie_rs::TrieBuilder;

let mut builder = TrieBuilder::new();  // Inferred `TrieBuilder<u8>` automatically
builder.push("すし");
builder.push("すしや");
builder.push("すしだね");
builder.push("すしづめ");
builder.push("すしめし");
builder.push("すしをにぎる");
builder.push("すし");  // Word `push`ed twice is just ignored.
builder.push("🍣");

let trie = builder.build();

// exact_match(): Find a word that exactly matches the query.
assert_eq!(trie.exact_match("すし"), true);
assert_eq!(trie.exact_match("🍣"), true);
assert_eq!(trie.exact_match("🍜"), false);

// predictive_search(): Find words which include `query` as their prefix.
let results_in_u8s: Vec<Vec<u8>> = trie.predictive_search("すし").collect();
let results_in_str: Vec<String> = trie.predictive_search("すし").collect();
assert_eq!(
    results_in_str,
    vec![
        "すし",
        "すしだね",
        "すしづめ",
        "すしめし",
        "すしや",
        "すしをにぎる"
    ]  // Sorted by `Vec<u8>`'s order
);

// common_prefix_search(): Find words that are prefixes of `query`.
let results_in_u8s: Vec<Vec<u8>> = trie.common_prefix_search("すしや").collect();
let results_in_str: Vec<String> = trie.common_prefix_search("すしや").collect();
assert_eq!(
    results_in_str,
    vec![
        "すし",
        "すしや",
    ]  // Sorted by `Vec<u8>`'s order
);
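To make the difference between the two searches concrete, here is a minimal std-only sketch of their semantics over a plain word list. The helper names `predictive` and `common_prefixes` are hypothetical and this is not how trie-rs implements the searches; it only illustrates which words each search is expected to return.

```rust
/// Words that start with `query` (the idea behind `predictive_search`).
fn predictive<'a>(words: &[&'a str], query: &str) -> Vec<&'a str> {
    let mut hits: Vec<&'a str> = words
        .iter()
        .copied()
        .filter(|w| w.starts_with(query))
        .collect();
    hits.sort_unstable(); // `str` ordering is byte-wise, like `Vec<u8>` ordering
    hits.dedup(); // a word listed twice is reported once
    hits
}

/// Words that `query` starts with (the idea behind `common_prefix_search`).
fn common_prefixes<'a>(words: &[&'a str], query: &str) -> Vec<&'a str> {
    let mut hits: Vec<&'a str> = words
        .iter()
        .copied()
        .filter(|w| query.starts_with(*w))
        .collect();
    hits.sort_unstable();
    hits.dedup();
    hits
}
```

A linear scan like this touches every word on each query, whereas a trie only walks the path spelled by the query, which is the reason to use one.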

Using with Various Data Types

TrieBuilder is implemented with generic types, as follows:

impl<Label: Ord> TrieBuilder<Label> {
    ...
    pub fn push<Arr: AsRef<[Label]>>(&mut self, word: Arr) where Label: Clone { ... }
    ...
}

In the Usage Overview example above, Label=u8 and Arr=&str. If Label does not implement Clone, use insert() (see crate::trie::TrieBuilder::insert) instead of push().
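The `Arr: AsRef<[Label]>` bound is what lets one method accept strings, vectors, and arrays. A minimal sketch of that mechanism, using a hypothetical helper `to_labels` that is not part of trie-rs:

```rust
/// Hypothetical helper (not part of trie-rs): views any `Arr` as a slice of
/// `Label` and copies it into an owned `Vec<Label>`, which needs `Label: Clone`.
fn to_labels<Label: Clone, Arr: AsRef<[Label]>>(word: Arr) -> Vec<Label> {
    word.as_ref().to_vec()
}
```

For example, `to_labels::<u8, _>("ab")` treats a `&str` as bytes, while `to_labels::<&str, _>(vec!["a", "woman"])` treats a vector of words as the label sequence.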

Here are examples with other Label and Arr types.

Label=&str, Arr=Vec<&str>

Say Label is an English word and Arr is an English phrase.

use trie_rs::TrieBuilder;

let mut builder = TrieBuilder::new();
builder.push(vec!["a", "woman"]);
builder.push(vec!["a", "woman", "on", "the", "beach"]);
builder.push(vec!["a", "woman", "on", "the", "run"]);

let trie = builder.build();

assert_eq!(
    trie.exact_match(vec!["a", "woman", "on", "the", "beach"]),
    true
);
let r: Vec<Vec<&str>> = trie.predictive_search(vec!["a", "woman", "on"]).collect();
assert_eq!(
    r,
    vec![
        ["a", "woman", "on", "the", "beach"],
        ["a", "woman", "on", "the", "run"],
    ],
);
let s: Vec<Vec<&str>> = trie.common_prefix_search(vec!["a", "woman", "on", "the", "beach"]).collect();
assert_eq!(
    s,
    vec![vec!["a", "woman"], vec!["a", "woman", "on", "the", "beach"]],
);

Label=u8, Arr=[u8; n]

Say Label is a digit of pi (= 3.14...) and Arr is a window of 10 consecutive digits.

use trie_rs::TrieBuilder;

let mut builder = TrieBuilder::<u8>::new(); // Pi = 3.14...

builder.push([1, 4, 1, 5, 9, 2, 6, 5, 3, 5]);
builder.push([8, 9, 7, 9, 3, 2, 3, 8, 4, 6]);
builder.push([2, 6, 4, 3, 3, 8, 3, 2, 7, 9]);
builder.push([6, 9, 3, 9, 9, 3, 7, 5, 1, 0]);
builder.push([5, 8, 2, 0, 9, 7, 4, 9, 4, 4]);
builder.push([5, 9, 2, 3, 0, 7, 8, 1, 6, 4]);
builder.push([0, 6, 2, 8, 6, 2, 0, 8, 9, 9]);
builder.push([8, 6, 2, 8, 0, 3, 4, 8, 2, 5]);
builder.push([3, 4, 2, 1, 1, 7, 0, 6, 7, 9]);
builder.push([8, 2, 1, 4, 8, 0, 8, 6, 5, 1]);
builder.push([3, 2, 8, 2, 3, 0, 6, 6, 4, 7]);
builder.push([0, 9, 3, 8, 4, 4, 6, 0, 9, 5]);
builder.push([5, 0, 5, 8, 2, 2, 3, 1, 7, 2]);
builder.push([5, 3, 5, 9, 4, 0, 8, 1, 2, 8]);

let trie = builder.build();

assert_eq!(trie.exact_match([5, 3, 5, 9, 4, 0, 8, 1, 2, 8]), true);

let t: Vec<Vec<u8>> = trie.predictive_search([3]).collect();
assert_eq!(
    t,
    vec![
        [3, 2, 8, 2, 3, 0, 6, 6, 4, 7],
        [3, 4, 2, 1, 1, 7, 0, 6, 7, 9],
    ],
);
let u: Vec<Vec<u8>> = trie.common_prefix_search([1, 4, 1, 5, 9, 2, 6, 5, 3, 5]).collect();
assert_eq!(
    u,
    vec![[1, 4, 1, 5, 9, 2, 6, 5, 3, 5]],
);

Trie Map Usage

To store a value with each word, use trie_rs::map::{Trie, TrieBuilder}.

use std::str;
use trie_rs::map::TrieBuilder;

let mut builder = TrieBuilder::new();  // Inferred `TrieBuilder<u8, u8>` automatically
builder.push("すし", 0);
builder.push("すしや", 1);
builder.push("すしだね", 2);
builder.push("すしづめ", 3);
builder.push("すしめし", 4);
builder.push("すしをにぎる", 5);
builder.push("すし", 6);  // Word `push`ed twice uses last value.
builder.push("🍣", 7);

let mut trie = builder.build();

// exact_match(): Find a word that exactly matches the query.
assert_eq!(trie.exact_match("すし"), Some(&6));
assert_eq!(trie.exact_match("🍣"), Some(&7));
assert_eq!(trie.exact_match("🍜"), None);

// Values can be modified.
let v = trie.exact_match_mut("🍣").unwrap();
*v = 8;
assert_eq!(trie.exact_match("🍣"), Some(&8));
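The map behaviour shown above (a value stored where each word ends, with a repeated word keeping the last value) can be sketched with a naive pointer-based trie. `NaiveTrieMap` is a hypothetical type for illustration only; trie-rs stores its nodes in LOUDS form rather than in nested maps.

```rust
use std::collections::BTreeMap;

/// Hypothetical pointer-based trie map keyed by bytes (not the trie-rs layout).
struct NaiveTrieMap<V> {
    children: BTreeMap<u8, NaiveTrieMap<V>>,
    value: Option<V>, // `Some` only at nodes where a pushed word ends
}

impl<V> NaiveTrieMap<V> {
    fn new() -> Self {
        NaiveTrieMap { children: BTreeMap::new(), value: None }
    }

    /// Stores `value` at the node reached by `word`; a repeated word overwrites.
    fn push(&mut self, word: &str, value: V) {
        let mut node = self;
        for &b in word.as_bytes() {
            node = node.children.entry(b).or_insert_with(NaiveTrieMap::new);
        }
        node.value = Some(value);
    }

    /// Returns the value stored for exactly `word`, if any.
    fn exact_match(&self, word: &str) -> Option<&V> {
        let mut node = self;
        for b in word.as_bytes() {
            node = node.children.get(b)?;
        }
        node.value.as_ref()
    }

    /// Same lookup, but hands out a mutable reference to the stored value.
    fn exact_match_mut(&mut self, word: &str) -> Option<&mut V> {
        let mut node = self;
        for b in word.as_bytes() {
            node = node.children.get_mut(b)?;
        }
        node.value.as_mut()
    }
}
```

Note that `V` carries no trait bounds here, in line with the Features note that Value does not require any traits.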

Incremental Search

For interactive applications, incremental search gives the best performance. See IncSearch (crate::inc_search::IncSearch).

use std::str;
use trie_rs::{TrieBuilder, inc_search::Answer};

let mut builder = TrieBuilder::new();  // Inferred `TrieBuilder<u8>` automatically
builder.push("ab");
builder.push("すし");
builder.push("すしや");
builder.push("すしだね");
builder.push("すしづめ");
builder.push("すしめし");
builder.push("すしをにぎる");
let trie = builder.build();
let mut search = trie.inc_search();

// Query by the byte.
assert_eq!(search.query(&b'a'), Some(Answer::Prefix));
assert_eq!(search.query(&b'c'), None);
assert_eq!(search.query(&b'b'), Some(Answer::Match));

// Reset the query to go again.
search.reset();

// For Unicode, it's easier to use .query_until().
assert_eq!(search.query_until("す"), Ok(Answer::Prefix));
assert_eq!(search.query_until("し"), Ok(Answer::PrefixAndMatch));
assert_eq!(search.query_until("や"), Ok(Answer::Match));
assert_eq!(search.query(&b'a'), None);
assert_eq!(search.query_until("a"), Err(0));

search.reset();
assert_eq!(search.query_until("ab-NO-MATCH-"), Err(2)); // No match on byte at index 2.
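The idea behind incremental search is a cursor that remembers how far into the trie the previous labels reached, so each new label costs one step instead of a fresh lookup. Below is a minimal std-only sketch; `NaiveTrie` and `NaiveIncSearch` are hypothetical types, and this local `Answer` enum only mirrors the variant names used above rather than reusing the crate's type.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Answer {
    Prefix,         // no word ends here, but some continue
    Match,          // a word ends here and nothing continues
    PrefixAndMatch, // a word ends here and others continue
}

#[derive(Default)]
struct Node {
    children: BTreeMap<u8, usize>, // label -> index into `NaiveTrie::nodes`
    terminal: bool,                // true if a word ends at this node
}

struct NaiveTrie {
    nodes: Vec<Node>, // nodes[0] is the root
}

impl NaiveTrie {
    fn from_words(words: &[&str]) -> Self {
        let mut nodes = vec![Node::default()];
        for word in words {
            let mut cur = 0;
            for &b in word.as_bytes() {
                let existing = nodes[cur].children.get(&b).copied();
                cur = match existing {
                    Some(next) => next,
                    None => {
                        nodes.push(Node::default());
                        let next = nodes.len() - 1;
                        nodes[cur].children.insert(b, next);
                        next
                    }
                };
            }
            nodes[cur].terminal = true;
        }
        NaiveTrie { nodes }
    }
}

struct NaiveIncSearch<'a> {
    trie: &'a NaiveTrie,
    cur: usize, // index of the node reached so far
}

impl<'a> NaiveIncSearch<'a> {
    fn new(trie: &'a NaiveTrie) -> Self {
        NaiveIncSearch { trie, cur: 0 }
    }

    /// Advances by one byte; on `None` the cursor stays where it was.
    fn query(&mut self, b: u8) -> Option<Answer> {
        let next = *self.trie.nodes[self.cur].children.get(&b)?;
        self.cur = next;
        let node = &self.trie.nodes[next];
        Some(match (node.terminal, node.children.is_empty()) {
            (true, true) => Answer::Match,
            (true, false) => Answer::PrefixAndMatch,
            (false, _) => Answer::Prefix,
        })
    }

    /// Moves the cursor back to the root.
    fn reset(&mut self) {
        self.cur = 0;
    }
}
```

Because a failed `query` leaves the cursor untouched, a caller can reject one keystroke and keep going, which matches the `a`, `c`, `b` sequence in the example above.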

Features

  • Generic type support: As the above examples show, trie-rs can be used for searching not only UTF-8 string but also other data types.
  • Based on louds-rs, which is fast, parallelized, and memory efficient.
  • Latest benchmark results are always accessible: trie-rs is continuously benchmarked in CI using Criterion.rs, and graphical benchmark results are published (see the Benchmark Results link above).
  • map::Trie associates a Value with each entry.
  • Value does not require any traits.
  • Label: Clone not required to create Trie<Label> but useful for many reifying search operations like predictive_search().
  • Many search operations are implemented via iterators which are lazy, require less memory, and can be short circuited.
  • Incremental search available for "online" applications, i.e., searching one Label at a time.
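For readers unfamiliar with LOUDS (level-order unary degree sequence), the encoding that the louds-rs bullet above refers to can be sketched in a few lines. This is the textbook construction, not code from louds-rs: the function name `louds_bits` and the child-list input format are assumptions made for illustration, and the input is assumed to be a non-empty tree with node 0 as the root.

```rust
use std::collections::VecDeque;

/// Encodes a tree into a LOUDS bit string. `children[i]` lists the children
/// of node `i`; node 0 is the root. Assumes `children` is non-empty.
fn louds_bits(children: &[Vec<usize>]) -> String {
    // Virtual super-root with exactly one child (the real root).
    let mut bits = String::from("10");
    let mut queue = VecDeque::from([0usize]);
    // Visit nodes in breadth-first order: one '1' per child, then a '0'.
    while let Some(node) = queue.pop_front() {
        for &child in &children[node] {
            bits.push('1');
            queue.push_back(child);
        }
        bits.push('0');
    }
    bits
}
```

A tree with n nodes takes 2n + 1 bits this way, plus whatever auxiliary index is used to navigate it, which is where the memory efficiency comes from compared with storing a pointer per edge.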

Cargo features

  • "rayon"

Enables rayon, a data parallelism library.

  • "mem_dbg"

Can determine the size in bytes of nested data structures like the trie itself.

  • "serde"

Can serialize and deserialize the trie.
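
As a sketch of how one of these features would be switched on, the Quickstart dependency line can be written in Cargo's table form. The choice of "serde" here is only an example; any of the feature names listed above can go in the array.

```toml
[dependencies]
trie-rs = { version = "0.4.2", features = ["serde"] }
```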

Acknowledgments

edict.furigana is used for benchmarking. This file is constructed with the following steps:

  1. Download edict.gz from EDICT.
  2. Convert it from original EUC into UTF-8.
  3. Translate it into CSV file with edict-to-csv.
  4. Extract field $1 for Hiragana/Katakana words, and field $3 for other (like Kanji) words.
  5. Translate Katakana into Hiragana with kana2hira.

Many thanks for these dictionaries and tools.

Versions

trie-rs uses semantic versioning.

Since the current major version is 0, a minor version update might include breaking changes to the public API (although these are carefully avoided).

Rust Version Support

trie-rs is continuously tested with these Rust versions in GitHub CI:

  • 1.75.0 with all features
  • 1.67.0 with no features
  • Latest stable version

So it is expected to work with Rust 1.75.0 and any newer versions.

Older versions may also work but are not tested or guaranteed.

Earlier Rust Version Support

If support for Rust prior to 1.67.0 is required, trie-rs 0.2.0 supports Rust 1.33.0 and later.

Contributing

Pull requests of any kind are appreciated.

License

MIT OR Apache-2.0


trie-rs's Issues

ability to clone Tries?

It looks like this isn't possible at the moment? Would really appreciate it if Trie implemented Clone

Serde support

Is there any reason why serde is not supported as an optional feature?

If no, I will be adding some derives in a pull request. Kindly let me know.

Luca

private type `KeyValue<K, V>` in public interface

In rust 1.73.0, I get this compiler error:

error[E0446]: private type `KeyValue<K, V>` in public interface
  --> trie-rs-0.2.0/src/map.rs:26:1
   |
9  | struct KeyValue<K,V>(K,
   | -------------------- `KeyValue<K, V>` declared as private
...
26 | impl<K: Clone, V: Clone> Trie<K,V> where KeyValue<K,V>: Ord + Clone {
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ can't leak private type

Serialization/deserialization not working properly

Steps to reproduce:

  1. Download words.txt
  2. Run the following program:

use std::fs::File;
use std::io::Write;
use trie_rs::TrieBuilder;

fn main() {
    let mut t = TrieBuilder::new();
    for word in std::fs::read_to_string("words.txt")
        .unwrap()
        .split_whitespace()
    {
        if !word.trim().is_empty() {
            t.push(word);
        }
    }
    let t = t.build();
    println!("Loaded");
    let mut f = File::create_new("words.toml").unwrap();
    write!(f, "{}", toml::to_string(&t).unwrap()).unwrap();
}
Loaded
thread 'main' panicked at src/main.rs:57:41:
called `Result::unwrap()` on an `Err` value: Error { inner: UnsupportedType(Some("unit")) }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Other libraries like postcard also fail, but with deserialization failing instead.

Reduce memory usage

Now that we have mem_dbg present, we can determine the memory usage. And presently the trie is larger than the dictionary in memory, which surprised me.

$ cargo test --features mem_dbg memsize -- --nocapture
running 1 test
Reading dictionary file from: /Users/shane/Projects/trie-rs/benches/edict.furigana
Read 185536 words, 3531819 bytes.
Trie size 4174393
Uncompressed size 7984707
test trie::trie_impl::search_tests::memsize ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 114 filtered out; finished in 7.17s

I fear it may be because of adding the value on the map. My idea for tackling that is to try changing the TrieLabel into an enum instead of a struct. Then a TrieLabel::Value will mean its parent is the last Label of that entry.

Number of elements in trie

Hi, I am not sure whether there is a method in the trie that returns the number of elements in the trie. Is there?

Switching from Travis to GitHub Actions

Hi - I noticed that the Travis CI link is broken. I have had several of those myself ever since Travis switched to a premium-only CI.

If that is okay for you, I will make a pull request changing the travis setup for a GitHub actions setup.

Best,
Luca

Slow predictive_search() compared to common_prefix_search()

Maybe it should not use recursion.

Benchmarking [302b8e4] Trie::predictive_search() 100 times
Benchmarking [302b8e4] Trie::predictive_search() 100 times: Warming up for 1.0000 s
Benchmarking [302b8e4] Trie::predictive_search() 100 times: Collecting 10 samples in estimated 5.2899 s (275 iterations)
Benchmarking [302b8e4] Trie::predictive_search() 100 times: Analyzing
[302b8e4] Trie::predictive_search() 100 times
                        time:   [18.633 ms 18.806 ms 19.033 ms]
                        change: [-0.8818% +0.9228% +3.2224%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking [302b8e4] Trie::common_prefix_search() 100 times
Benchmarking [302b8e4] Trie::common_prefix_search() 100 times: Warming up for 1.0000 s
Benchmarking [302b8e4] Trie::common_prefix_search() 100 times: Collecting 10 samples in estimated 5.1579 s (935 iterations)
Benchmarking [302b8e4] Trie::common_prefix_search() 100 times: Analyzing
[302b8e4] Trie::common_prefix_search() 100 times
                        time:   [5.3941 ms 5.4524 ms 5.5205 ms]
                        change: [-7.0522% -2.9814% +0.9275%] (p = 0.21 > 0.05)
                        No change in performance detected.

Map values dropped if longer path is added first

It seems that constructing a trie will only work correctly if values are inserted starting with the shortest path. Here is a modified code example from the documentation:

use trie_rs::map::Trie;

let trie = Trie::from_iter([("a", 0), ("app", 1), ("apple", 2)]);
let results: Vec<(String, &u8)> = trie.iter().collect();
assert_eq!(results, [("a".to_string(), &0u8), ("app".to_string(), &1u8), ("apple".to_string(), &2u8)]);

let trie = Trie::from_iter([("a", 0), ("apple", 2), ("app", 1)]);
let results: Vec<(String, &u8)> = trie.iter().collect();
assert_eq!(results, [("a".to_string(), &0u8), ("app".to_string(), &1u8), ("apple".to_string(), &2u8)]);

The first assert succeeds, the second fails – app is not present, it was dropped because apple was already present when it was supposed to be added.

Change source of GitHub pages

Hi Sho,

@LucaCappelletti94 got the ball rolling on converting the Travis CI setup to GitHub Actions. (Thanks, Luca!) I've given it another push to get the documentation and benchmarks generated as well. To finish that, it would help to adjust one setting on the trie-rs repository. We'd like to populate the trie-rs pages from GitHub Actions instead of the gh-pages branch. To change this, go to trie-rs Settings, then Pages, then set the source to GitHub Actions. That should be it. Thanks, Sho.

-Shane
