fuzzywuzzy-rs's People

Contributors

iwahbe, logannc, seanpianka

fuzzywuzzy-rs's Issues

Update repository description and topics

The original's description is:

Fuzzy String Matching in Python

What are your thoughts on changing ours to the following:

Fuzzy String Matching in Rust, a port of the Python package fuzzywuzzy.

For topics, here's a list of ones off the top of my head. These could be updated in the Cargo.toml for crates.io, as well:

  • string
  • text
  • processing
  • utility
  • matching
  • fuzzy
  • wuzzy
  • extract
  • deduplicate
  • duplicates
  • text-processing

Support specifying alternative strategies for handling unicode

Opened based on the discussion in #23 and in this comment.

"Well, I'm dissatisfied with the options available for handling unicode. In the same way we allow alternative scorers, we should allow callers to choose how unicode is handled.

You could imagine a few different strategies: byte-level comparisons (essentially our original implementation sans panics), this new approach which is mostly at the 'char' level (need to double check if there are any bits that are still byte-level like match_indices), or one based on unicode normalization of grapheme clusters."
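To make the idea of pluggable strategies concrete, here is a minimal sketch of how a caller-selected unit of comparison might look. The trait `UnitStrategy`, the types `ByteLevel` and `CharLevel`, and the helper `common_prefix_len` are all hypothetical names invented for illustration; none of them come from the crate. A grapheme-cluster strategy would be a third implementation and would need an external segmentation crate, so it is left out here.

```rust
/// Hypothetical strategy trait: decides which units a string is split into
/// before any comparison happens. Not part of the crate's API.
trait UnitStrategy {
    type Unit: PartialEq;
    fn units(s: &str) -> Vec<Self::Unit>;
}

/// Compare raw UTF-8 bytes.
struct ByteLevel;

/// Compare Unicode scalar values (`char`).
struct CharLevel;

impl UnitStrategy for ByteLevel {
    type Unit = u8;
    fn units(s: &str) -> Vec<u8> {
        s.bytes().collect()
    }
}

impl UnitStrategy for CharLevel {
    type Unit = char;
    fn units(s: &str) -> Vec<char> {
        s.chars().collect()
    }
}

/// Toy "algorithm" that is generic over the strategy: the number of
/// leading units the two strings have in common.
fn common_prefix_len<S: UnitStrategy>(a: &str, b: &str) -> usize {
    let ua = S::units(a);
    let ub = S::units(b);
    ua.iter().zip(ub.iter()).take_while(|(x, y)| x == y).count()
}
```

With this sketch, `common_prefix_len::<ByteLevel>("é", "è")` is 1 because both characters share the leading UTF-8 byte 0xC3, while `common_prefix_len::<CharLevel>("é", "è")` is 0. That difference is the kind of thing a caller would be choosing between.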

Implement optimal string alignment algorithm

See rapidfuzz/RapidFuzz#13 for generous details from the author of another fuzzywuzzy-compatible project.

Essentially, partial_ratio attempts to align strings optimally, then takes the ratio of the aligned string subsets. The method of alignment used in fuzzywuzzy is... pretty bad, actually.

Implement Smith-Waterman for sequence alignment to power partial_ratio. Legacy partial_ratio behaviour will be relegated to a compatibility function (which, if there are more, might be put into a compatibility module).
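For reference, a minimal sketch of the Smith-Waterman scoring pass over chars. The function name `smith_waterman` and the scoring constants (match +2, mismatch -1, gap -1) are assumptions chosen for illustration, not values from this issue or from RapidFuzz. It only returns the best local score and where that alignment ends; a real partial_ratio would also need a traceback to recover the aligned substrings.

```rust
/// Smith-Waterman local alignment (linear gap penalty) over chars.
/// Returns (best score, end index in `a`, end index in `b`), where the
/// end indices are exclusive char offsets of the best local alignment.
fn smith_waterman(a: &str, b: &str) -> (i32, usize, usize) {
    // Illustrative scoring scheme (assumed, not from the issue).
    const MATCH: i32 = 2;
    const MISMATCH: i32 = -1;
    const GAP: i32 = -1;

    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // h[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    let mut h = vec![vec![0i32; b.len() + 1]; a.len() + 1];
    let mut best = (0i32, 0usize, 0usize);

    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let sub = if a[i - 1] == b[j - 1] { MATCH } else { MISMATCH };
            let diag = h[i - 1][j - 1] + sub;
            let up = h[i - 1][j] + GAP;
            let left = h[i][j - 1] + GAP;
            // Clamping at zero is what makes the alignment local.
            let score = diag.max(up).max(left).max(0);
            h[i][j] = score;
            if score > best.0 {
                best = (score, i, j);
            }
        }
    }
    best
}
```

For example, `smith_waterman("abc", "xxabcxx")` gives `(6, 3, 5)` under these constants: three matches in a row, ending after the third char of the first string and the fifth char of the second.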

get_matching_blocks should use MatchingStreaks

Based on #26 (hash)

I'd like to convert this function to use MatchingStreaks internally.
It might be clearer to compare low1 < streak.idx1 instead of low1 < i.

Edit: Please choose the appropriate labels; I'm not sure whether this is a bug or an enhancement on top of upstream's algorithm.
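A rough sketch of what that could look like, modelled on the general difflib-style approach (find the longest match, then recurse on the pieces to its left and right). Everything here is an assumption for illustration: the issue only mentions `streak.idx1`, so the `idx2` and `size` fields, the brute-force `find_longest_match` helper, and the absence of difflib's trailing sentinel block are my own simplifications, not the crate's code.

```rust
/// Hypothetical shape of a streak; only `idx1` is named in the issue.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MatchingStreak {
    idx1: usize,
    idx2: usize,
    size: usize,
}

/// Brute-force longest common run within a[low1..high1] and b[low2..high2].
fn find_longest_match(
    a: &[char],
    b: &[char],
    low1: usize,
    high1: usize,
    low2: usize,
    high2: usize,
) -> MatchingStreak {
    let mut best = MatchingStreak { idx1: low1, idx2: low2, size: 0 };
    for i in low1..high1 {
        for j in low2..high2 {
            let mut k = 0;
            while i + k < high1 && j + k < high2 && a[i + k] == b[j + k] {
                k += 1;
            }
            if k > best.size {
                best = MatchingStreak { idx1: i, idx2: j, size: k };
            }
        }
    }
    best
}

/// Collect matching blocks, sorted by position, using streaks throughout.
fn get_matching_blocks(a: &str, b: &str) -> Vec<MatchingStreak> {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut queue = vec![(0, a.len(), 0, b.len())];
    let mut blocks = Vec::new();

    while let Some((low1, high1, low2, high2)) = queue.pop() {
        let streak = find_longest_match(&a, &b, low1, high1, low2, high2);
        if streak.size == 0 {
            continue;
        }
        // Unmatched region to the left of the streak.
        if low1 < streak.idx1 && low2 < streak.idx2 {
            queue.push((low1, streak.idx1, low2, streak.idx2));
        }
        // Unmatched region to the right of the streak.
        let end1 = streak.idx1 + streak.size;
        let end2 = streak.idx2 + streak.size;
        if end1 < high1 && end2 < high2 {
            queue.push((end1, high1, end2, high2));
        }
        blocks.push(streak);
    }

    blocks.sort_by_key(|s| (s.idx1, s.idx2));
    blocks
}
```

The comparisons against `streak.idx1` and `streak.idx2` read more directly than unpacked tuple indices, which is the readability point raised above.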

Docs linkage

https://blog.rust-lang.org/2020/11/19/Rust-1.48.html#easier-linking-in-rustdoc

pub mod foo {
    /// Some docs for `Foo`
    ///
    /// You may want to use `Foo` with [`Bar`].
    ///
    /// [`Bar`]: ../bar/struct.Bar.html
    pub struct Foo;
}

pub mod bar {
    /// Some docs for `Bar`
    ///
    /// You may want to use `Bar` with [`Foo`].
    ///
    /// [`Foo`]: ../foo/struct.Foo.html
    pub struct Bar;
}

We do reference other functions and things in docs, it would be good to add linkage.
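For comparison, the feature announced in the linked post lets the link target be an item path rather than a relative HTML path. A sketch using the same placeholder `foo`/`bar` modules as above (these are the blog-style example names, not items from this crate):

```rust
pub mod foo {
    /// Some docs for `Foo`
    ///
    /// You may want to use `Foo` with [`Bar`].
    ///
    /// [`Bar`]: crate::bar::Bar
    pub struct Foo;
}

pub mod bar {
    /// Some docs for `Bar`
    ///
    /// You may want to use `Bar` with [`Foo`].
    ///
    /// [`Foo`]: crate::foo::Foo
    pub struct Bar;
}
```

rustdoc resolves these paths when building the docs, so the links keep working if the generated HTML layout changes.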

The name of this library might need a rethink

The phrase has been used as a derogatory term to describe a black person. [2] The term "Fuzzy Wuzzy Angels" was used by Australian soldiers during World War II to describe Papua New Guinean stretcher bearers. The term was not widely deemed to be problematic when it was used by Rudyard Kipling and British soldiers during the Sudan Campaign or by Australian soldiers in the 20th century, however many contemporary commentators deem it to be a racist slur. [3]

https://en.wikipedia.org/wiki/Fuzzy-Wuzzy
https://www.dictionary.com/browse/fuzzy-wuzzy

fuzzyrusty

Hey Logan - I was looking for a port of fuzzywuzzy, and after some googling I landed here. Any thoughts on the future of this crate?

Cheers,
SG

Road to 1.0.0

I've added a Reddit post here to try and give visibility to the crate. Hopefully Google will recommend the crate in any "rust fuzzy string matching" query!

@logannc, I think we should decide on the public API to go forward with, w.r.t. your thoughts here. We should stabilize on a public API and properly version it at 1.0.0 and beyond.

Thinking about the changes that must be made before bumping to 1.0.0, I see this list remains:

  • #15 - Explicit type definitions for matches and scoring (or is this not necessary?)
  • #19 - String alignment algorithm
  • #20 - Unicode support
  • #7 - Missing methods from the original implementation of the process module
  • #6 - Only run processor methods once in the extract_* function implementations

I think once we reach feature parity with the original fuzzywuzzy, we can bump the version. Thoughts?

Change Crate Name

I am so far unable to contact the owner of the crate on crates.io. The staff of crates.io has suggested I pick a different name if I'm unable to reach them.

If I don't hear back from @JacksonGariety in the next week, I'll change the name here to fuzzywuzzy and publish under that.

Add Opt-In Unicode Support

Essentially all of the algorithms in this crate are poorly suited to unicode because they iterate over the chars in the string instead of the grapheme clusters.

https://crates.io/crates/unicode-segmentation is the semi-official rust crate for unicode segmentation.
I don't have a good option for detecting homoglyphs yet. But homoglyph detection / custom equality on the clusters in combination with all of the usual algorithms should be what we need for full support.
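A small std-only illustration of the problem, assuming a naive position-by-position comparison (the helper `char_level_differences` is made up for this example and is not one of the crate's scorers). The same visible text can be encoded with a different number of chars, so char-level comparison reports differences a reader would not see.

```rust
/// Count positions where two strings differ when compared char by char.
/// Any extra trailing chars in the longer string each count as one difference.
fn char_level_differences(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mismatched = a.iter().zip(b.iter()).filter(|(x, y)| x != y).count();
    mismatched + a.len().abs_diff(b.len())
}
```

Here `"caf\u{e9}"` (precomposed é, 4 chars) and `"cafe\u{301}"` (e plus combining acute accent, 5 chars) both render as "café", yet `char_level_differences` reports 2. Iterating over grapheme clusters would see four units in each string, but the final clusters would still be unequal without normalization or a custom equality, which is where the custom equality mentioned above comes in.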
