
ve-rs's Introduction

Ve for Rust

This is a port of the Ruby gem https://github.com/Kimtaro/ve to Rust. 🦀

The Rust version is meant to be used with https://github.com/daac-tools/vibrato/, a great and blazingly fast MeCab-compatible tokenizer, together with an IPADIC dictionary, which can be found in the same repo (under Releases).

(Support for other tokenizer crates like Lindera is planned as well and should be easy.)

I'm trying to mostly stay close to the original codebase, and use Rust ways and idioms where applicable.

In the future I'm planning to add tests comparing outputs from the Ruby and Rust versions, making sure there are no unexpected differences in logic.

Getting started

You can play around with the library as-is simply by cloning the repo and running it with cargo run. This will tokenize an example string using vibrato and then postprocess the tokens to return a more meaningful array of words.

The example code also shows in a simple way how to use this crate in your own application, provided that you're working with vibrato for tokenization.

ve-rs's People

Contributors

jannisbecker


ve-rs's Issues

Condition in `sanitize_asterisk` is incorrect resulting in lemma being discarded

Hi, firstly thank you for this crate, it has been useful for something I'm working on :)

I just wanted to point out that I believe the condition in the sanitize_asterisk function is the wrong way around.

The current code is as follows:

fn sanitize_asterisk(value: &str) -> Option<String> {
    if value.is_empty() || value == "*" {
        Some(value.into())
    } else {
        None
    }
}

To my understanding, an '*' indicates a lack of a relevant value in the MeCab format, which we would presumably want to represent as None. If not an empty string or '*', then presumably we want to keep the lemma. Instead, this code does the opposite.

I changed it to this (the same logic, but shorter and with the condition inverted):

fn sanitize_asterisk(value: &str) -> Option<String> {
    (!value.is_empty() && value != "*").then(|| value.to_string())
}

Now for example entering the input '難しかった' results in a Word instance where lemma is Some("難しい".to_string()) whereas before my change it was just None.
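As a rough illustration of the behaviour described above, here is a minimal sketch of the fixed function together with a small helper that pulls the lemma out of a feature string. It assumes the IPADIC feature layout, where the seventh comma-separated field is the base form. Note that lemma_from_feature is a hypothetical helper made up for this example; it is not part of ve-rs, and the crate may extract the lemma differently.

```rust
/// Maps MeCab's "no value" placeholder ("*") and empty fields to `None`,
/// and keeps every other value.
fn sanitize_asterisk(value: &str) -> Option<String> {
    (!value.is_empty() && value != "*").then(|| value.to_string())
}

/// Hypothetical helper (not part of ve-rs): reads the base form from an
/// IPADIC-style feature string, assuming it is the seventh
/// comma-separated field (index 6).
fn lemma_from_feature(feature: &str) -> Option<String> {
    feature.split(',').nth(6).and_then(sanitize_asterisk)
}
```

With this version, a feature string whose base-form field is 難しい yields Some("難しい".to_string()), while a field containing only '*' (or a feature string with fewer than seven fields) yields None.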

Please let me know if I'm missing something, I'm not super familiar with MeCab and such.
