logannc / fuzzywuzzy-rs
port of https://github.com/seatgeek/fuzzywuzzy
License: GNU General Public License v2.0
The only standard library uses we currently have are HashSet, for which there are excellent replacements, and std::cmp::{min, max}, which are obviously functions we can live without.
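As a rough sketch of what dropping std could look like, the snippet below uses only `core` and `alloc`, both of which are available to `no_std` crates. `BTreeSet` is just one possible stand-in for `HashSet`, and the helper names `unique_tokens` and `ordered_lengths` are hypothetical, not functions from this crate.

```rust
// Sketch only: `core` and `alloc` are both usable from a `no_std` crate.
extern crate alloc;

use alloc::collections::BTreeSet;
use core::cmp::{max, min};

/// Hypothetical helper: the unique whitespace-separated tokens of a string.
/// `BTreeSet` lives in `alloc`, so it does not need the standard library.
fn unique_tokens(s: &str) -> BTreeSet<&str> {
    s.split_whitespace().collect()
}

/// Hypothetical helper: (shorter, longer) byte lengths of two strings.
/// `min` and `max` are also exported from `core::cmp`.
fn ordered_lengths(a: &str, b: &str) -> (usize, usize) {
    (min(a.len(), b.len()), max(a.len(), b.len()))
}
```

A real `no_std` build would also need `#![no_std]` at the crate root, which is omitted here so the sketch compiles as an ordinary program.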
The original's description is:
Fuzzy String Matching in Python
What are your thoughts on changing ours to the following:
Fuzzy String Matching in Rust, a port of the Python package fuzzywuzzy.
For topics, here's a list of ones off the top of my head. These could be updated in the Cargo.toml for crates.io, as well:
Opened based on the discussion in #23 and in this comment.
"Well, I'm dissatisfied with the options available for handling unicode. In the same way we allow alternative scorers, we should allow callers to choose how unicode is handled.
You could imagine a few different strategies: byte-level comparisons (essentially our original implementation sans panics), this new approach which is mostly at the 'char' level (need to double check if there are any bits that are still byte-level like match_indices), or one based on unicode normalization of grapheme clusters."
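To make the idea of a caller-selected strategy concrete, here is a minimal sketch covering only the byte-level and char-level options. `UnicodeStrategy`, `units`, and `matching_positions` are hypothetical names and not part of the crate's API. A grapheme-cluster variant would need an external segmentation crate and is left out.

```rust
/// Hypothetical strategy selector; not part of the crate's current API.
#[derive(Clone, Copy, Debug, PartialEq)]
enum UnicodeStrategy {
    /// Compare raw UTF-8 bytes.
    Bytes,
    /// Compare Unicode scalar values (`char`).
    Chars,
}

/// Split a string into comparable units according to the chosen strategy.
fn units(s: &str, strategy: UnicodeStrategy) -> Vec<u32> {
    match strategy {
        UnicodeStrategy::Bytes => s.bytes().map(u32::from).collect(),
        UnicodeStrategy::Chars => s.chars().map(u32::from).collect(),
    }
}

/// Count the positions at which both strings hold the same unit.
fn matching_positions(a: &str, b: &str, strategy: UnicodeStrategy) -> usize {
    let (ua, ub) = (units(a, strategy), units(b, strategy));
    ua.iter().zip(ub.iter()).filter(|(x, y)| x == y).count()
}
```

With a precomposed "é" (`"h\u{e9}llo"`) compared against `"hello"`, the byte strategy sees the strings drift out of alignment after the two-byte character, while the char strategy treats it as a single substitution.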
How about MIT?
See rapidfuzz/RapidFuzz#13 for generous details from another fuzzywuzzy compatible project author.
Essentially, partial_ratio attempts to align strings optimally, then takes the ratio of the aligned string subsets. The method of alignment used in fuzzywuzzy is... pretty bad, actually.
Implement Smith-Waterman for sequence alignment to power partial_ratio. Legacy partial_ratio behaviour will be relegated to a compatibility function (which, if there are more, might be put into a compatibility module).
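For reference, a minimal sketch of Smith-Waterman local alignment scoring is shown below. It only computes the best score and where the aligned region ends, without a traceback. The scoring values (+2 match, -1 mismatch, -1 gap) and the `LocalAlignment` type are assumptions for illustration, not a design decided in this issue.

```rust
/// Best local alignment score and the exclusive end positions
/// (counted in `char`s) of the aligned region in each input.
#[derive(Debug, PartialEq)]
struct LocalAlignment {
    score: i32,
    end_a: usize,
    end_b: usize,
}

/// Smith-Waterman local alignment with a linear gap penalty.
/// The scoring constants are illustrative assumptions.
fn smith_waterman(a: &str, b: &str) -> LocalAlignment {
    const MATCH: i32 = 2;
    const MISMATCH: i32 = -1;
    const GAP: i32 = -1;

    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // h[i][j] is the best score of an alignment ending at a[i-1], b[j-1].
    let mut h = vec![vec![0i32; b.len() + 1]; a.len() + 1];
    let mut best = LocalAlignment { score: 0, end_a: 0, end_b: 0 };

    for i in 1..=a.len() {
        for j in 1..=b.len() {
            let sub = if a[i - 1] == b[j - 1] { MATCH } else { MISMATCH };
            let diag = h[i - 1][j - 1] + sub; // align a[i-1] with b[j-1]
            let up = h[i - 1][j] + GAP; // gap in b
            let left = h[i][j - 1] + GAP; // gap in a
            // Clamping at zero is what makes the alignment local.
            let cell = diag.max(up).max(left).max(0);
            h[i][j] = cell;
            if cell > best.score {
                best = LocalAlignment { score: cell, end_a: i, end_b: j };
            }
        }
    }
    best
}
```

A partial_ratio built on this would still need a traceback to recover the start of the aligned region before taking the ratio of the two aligned substrings.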
I'd like to convert this function to use MatchingStreaks internally.
It might be clearer to compare low1 < streak.idx1 instead of low1 < i.
Edit: Please choose the appropriate labels, I'm not sure whether this is a bug or an enhancement on top of the upstream's algorithm?
https://blog.rust-lang.org/2020/11/19/Rust-1.48.html#easier-linking-in-rustdoc
pub mod foo {
    /// Some docs for `Foo`
    ///
    /// You may want to use `Foo` with [`Bar`].
    ///
    /// [`Bar`]: ../bar/struct.Bar.html
    pub struct Foo;
}
pub mod bar {
    /// Some docs for `Bar`
    ///
    /// You may want to use `Bar` with [`Foo`].
    ///
    /// [`Foo`]: ../foo/struct.Foo.html
    pub struct Bar;
}
We do reference other functions and things in the docs, so it would be good to add links.
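With the intra-doc links described in the Rust 1.48 post linked above, the same two modules can refer to each other by Rust path instead of by relative HTML file. The sketch below shows both the inline form and the shorthand form. It reuses the placeholder `Foo` and `Bar` items rather than anything from this crate.

```rust
pub mod foo {
    /// Some docs for `Foo`
    ///
    /// You may want to use `Foo` with [`Bar`](crate::bar::Bar).
    pub struct Foo;
}

pub mod bar {
    /// Some docs for `Bar`
    ///
    /// You may want to use `Bar` with [`crate::foo::Foo`].
    pub struct Bar;
}
```

Because rustdoc resolves the paths itself, these links should keep working if the items are re-exported or the module layout changes.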
The phrase has been used as a derogatory term to describe a black person. [2] The term "Fuzzy Wuzzy Angels" was used by Australian soldiers during World War II to describe Papua New Guinean stretcher bearers. The term was not widely deemed to be problematic when it was used by Rudyard Kipling and British soldiers during the Sudan Campaign or by Australian soldiers in the 20th century, however many contemporary commentators deem it to be a racist slur. [3]
https://en.wikipedia.org/wiki/Fuzzy-Wuzzy
https://www.dictionary.com/browse/fuzzy-wuzzy
Hey Logan - I was looking for a port of fuzzywuzzy, and after some googling I landed here. Any thoughts on the future of this crate?
Cheers,
SG
The wisdom for a while has been "anyhow" for binaries, "thiserror" for libraries. The macros from thiserror are simple and non-magic enough that I feel comfortable using them normally.
Originally posted by @seanpianka in #28 (comment)
https://github.com/dtolnay/thiserror
We have a pretty small error API surface, but we might as well.
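For a sense of the boilerplate involved, here is a hand-written, std-only version of the kind of `Display` and `Error` implementations that a thiserror derive is meant to replace. `FuzzyError`, its variants, and `check_cutoff` are hypothetical and only illustrate the shape; they are not the crate's actual error API.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type; the variants are assumptions for illustration.
#[derive(Debug, PartialEq)]
enum FuzzyError {
    EmptyInput,
    InvalidScoreCutoff(u8),
}

// Hand-written message formatting, one arm per variant.
impl fmt::Display for FuzzyError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FuzzyError::EmptyInput => write!(f, "input string is empty"),
            FuzzyError::InvalidScoreCutoff(c) => {
                write!(f, "score cutoff {} is outside 0..=100", c)
            }
        }
    }
}

// The default method implementations are enough for a simple error.
impl Error for FuzzyError {}

/// Hypothetical validation helper that returns the error type above.
fn check_cutoff(cutoff: u8) -> Result<u8, FuzzyError> {
    if cutoff <= 100 {
        Ok(cutoff)
    } else {
        Err(FuzzyError::InvalidScoreCutoff(cutoff))
    }
}
```

With so few variants the manual version is short, which matches the "small surface" point. A derive mostly pays off as the enum grows.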
I've added a Reddit post here to try and give visibility to the crate. Hopefully Google will recommend the crate in any "rust fuzzy string matching" query!
@logannc, I think we should decide on the public API to go forward with, w.r.t. your thoughts here. We should stabilize on a public API and properly version it at 1.0.0 and beyond.
Thinking about the changes that must be made before bumping to 1.0.0, I see this list remains:
process module
extract_* function implementations
I think once we reach feature parity with the original fuzzywuzzy, we can bump the version. Thoughts?
I am so far unable to contact the owner of the crate on crates.io. The staff of crates.io has suggested I pick a different name if I'm unable to reach them.
If I don't hear back from @JacksonGariety in the next week, I'll change the name here to fuzzywuzzy and publish under that.
Essentially all of the algorithms in this crate are poorly suited to Unicode because they iterate over the chars in the string instead of the grapheme clusters.
https://crates.io/crates/unicode-segmentation is the semi-official Rust crate for Unicode segmentation.
I don't have a good option for detecting homoglyphs yet. But homoglyph detection / custom equality on the clusters in combination with all of the usual algorithms should be what we need for full support.
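One way to picture "custom equality on the clusters in combination with the usual algorithms" is an edit distance that is generic over its tokens and takes the equality as a parameter. The sketch below assumes the input has already been split into clusters (segmented by hand here, since unicode-segmentation is not used). `levenshtein_by`, `fold`, and `loose_eq` are hypothetical names, and `loose_eq` is a toy stand-in for real normalization or homoglyph detection.

```rust
/// Levenshtein distance over arbitrary tokens with caller-supplied equality.
fn levenshtein_by<T>(a: &[T], b: &[T], eq: impl Fn(&T, &T) -> bool) -> usize {
    // prev[j] is the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, x) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, y) in b.iter().enumerate() {
            let cost = if eq(x, y) { 0 } else { 1 };
            let best = (prev[j] + cost) // match or substitution
                .min(prev[j + 1] + 1) // deletion from `a`
                .min(curr[j] + 1); // insertion into `a`
            curr.push(best);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Toy folding rule (an assumption): map decomposed "e + combining acute"
/// to the precomposed "é". Real code would apply a general rule.
fn fold(s: &str) -> &str {
    if s == "e\u{301}" { "\u{e9}" } else { s }
}

/// Toy cluster equality built on `fold`.
fn loose_eq(x: &&str, y: &&str) -> bool {
    fold(x) == fold(y)
}
```

Passing `|x, y| x == y` gives the strict behaviour, while `|x, y| loose_eq(x, y)` treats the two spellings of "é" as the same cluster. The same pattern would apply to the ratio functions.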
Missing the following methods from the original process implementation:
extract
extractBests
dedupe
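As a starting point for discussion, an `extract`-style function could look roughly like the sketch below: score every choice, sort by score, and keep the top `limit`. The signature is an assumption rather than an agreed API, the processor argument of the original is omitted, and `toy_scorer` is a placeholder, not one of the crate's scorers.

```rust
/// Sketch of an `extract`-style helper: score every choice against the
/// query and return up to `limit` matches, highest score first.
fn extract<'a, F>(
    query: &str,
    choices: &[&'a str],
    scorer: F,
    limit: usize,
) -> Vec<(&'a str, u8)>
where
    F: Fn(&str, &str) -> u8,
{
    let mut scored: Vec<(&'a str, u8)> = choices
        .iter()
        .map(|&choice| (choice, scorer(query, choice)))
        .collect();
    // Stable sort: choices with equal scores keep their input order.
    scored.sort_by(|a, b| b.1.cmp(&a.1));
    scored.truncate(limit);
    scored
}

/// Placeholder scorer (an assumption): percentage of positions holding the
/// same `char`, relative to the longer string.
fn toy_scorer(a: &str, b: &str) -> u8 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 100;
    }
    let same = a.chars().zip(b.chars()).filter(|(x, y)| x == y).count();
    (same * 100 / longest) as u8
}
```

An `extractBests`-style variant could reuse the same body with an added score cutoff filter before truncation.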
See the original implementation where the processor function is only executed once if it's a known method.
I'm not sure if this is an optimization to prevent duplicate calls, or if multiple invocations could lead to some undesired state? i.e. the processor function is not idempotent.