
Comments (4)

netAction commented on August 19, 2024

Maybe the current algorithm is not that bad. But it might be better to split the words and search with any order. Treating spaces like any other character is a bad idea no matter what algorithm is used.

from fuse.

krisk commented on August 19, 2024

Interesting. Let me dig into that a little more.


jeancroy commented on August 19, 2024

That is an interesting question.

Part of the answer is that when netAction says uni is much closer to university than to FH, he intuitively uses a metric close to the number of common characters in proper order (the length of the LCS, or LLCS). However, the bitap algorithm measures edit distance. To compare the two metrics, let's look at this example.

Matches are indicated by | and simple edit operations (insert/delete) by -. In the following example we have 5 matches and an edit distance of 3.

sur-ve-y
|||  | |
surg-ery

Levenshtein edit distance would have grouped -v and g- into a single substitution operation (marked :), giving a distance of 2:

surve-y
|||:| |
surgery

Now, let's look at the case that started this thread

uni-------    uni
|||           ::
university    FH-

uni vs university: 3 matches, but 7 insertions away.
uni vs FH: 0 matches, but 2 substitutions and 1 deletion away.
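To make the two metrics concrete, here is a small sketch (not fuse's actual code; `llcs` and `levenshtein` are illustrative names) computing both scores with plain dynamic programming:

```javascript
// Length of the Longest Common Subsequence: counts characters
// shared in order (the "matches" in the figures above).
function llcs(a, b) {
  const m = a.length, n = b.length;
  let prev = new Array(n + 1).fill(0); // row for a[0..i)
  for (let i = 1; i <= m; i++) {
    const cur = [0];
    for (let j = 1; j <= n; j++) {
      cur[j] = a[i - 1] === b[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], cur[j - 1]);
    }
    prev = cur;
  }
  return prev[n];
}

// Levenshtein edit distance: insertions, deletions, substitutions.
function levenshtein(a, b) {
  const m = a.length, n = b.length;
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const cur = [i];
    for (let j = 1; j <= n; j++) {
      cur[j] = Math.min(
        prev[j] + 1,     // deletion
        cur[j - 1] + 1,  // insertion
        prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
    prev = cur;
  }
  return prev[n];
}

// LLCS prefers "university" (3 matches vs 0)...
console.log(llcs("uni", "university"), llcs("uni", "FH")); // 3 0
// ...while edit distance prefers "FH" (3 operations vs 7).
console.log(levenshtein("uni", "university"), levenshtein("uni", "FH")); // 7 3
```

So a ranking by maximum LLCS and a ranking by minimum edit distance disagree on exactly this example.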

Now, it's true that finding the sequence of characters with the maximum number of matches and finding the one with the minimum edit distance are equivalent problems when comparing the whole of string A to the whole of string B.

But given a list of possible matches, the result with the most matches (LLCS) and the result with the least distance are not the same, especially when the items have different lengths. There are many subtle differences between these two metrics, especially when matching only part of A against part of B.

Computer science usually thinks in terms of edit distance, for things like spell checkers, indexes, and file diffs. Computational biology, on the other hand, thinks in terms of LCS, to match string representations of things like proteins or DNA. One interesting thing about computational biology is that it is mainly interested in finding alignments (like the figures I used above), so it has developed solutions to the highly desirable, as has been said, highlight feature.

My theory is that LLCS is well suited for auto-complete, while distance might be better for auto-correct.
For example, in an auto-complete scenario the part that will be completed should not introduce a large penalty (...versity incurs a penalty of 7 and prevents uni from matching university!)

To test this, I started by writing a fuse searcher, using a nice O(m+n) bit-parallel LCS algorithm to replace bitap's O(m*n) edit distance. (It has since become its own project, without dependencies.) The project also supports free word ordering and result highlighting (including free word order in highlights), and falls back to another algorithm when the bit representation does not fit in an int32.
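The project's actual implementation isn't shown in this thread; one classic bit-parallel LLCS recurrence (Allison–Dix) is sketched below, limited here to patterns of about 30 characters so everything fits in one JavaScript int32 — longer patterns would need the fallback mentioned above:

```javascript
// Bit-parallel LLCS: one pass over the text, a constant number of
// bit operations per text character when the pattern fits in a
// single machine word.
function llcsBitParallel(pattern, text) {
  const m = pattern.length;
  if (m === 0 || m > 30) throw new Error("pattern must be 1..30 chars");
  // Pattern alphabet: for each character, a bitmask of the
  // positions where it occurs in the pattern.
  const masks = {};
  for (let i = 0; i < m; i++) {
    masks[pattern[i]] = (masks[pattern[i]] || 0) | (1 << i);
  }
  const all = (1 << m) - 1;
  let v = all; // each bit cleared to 0 records one more matched character
  for (const ch of text) {
    const u = v & (masks[ch] || 0);
    v = ((v + u) | (v - u)) & all;
  }
  // LLCS = number of zero bits among the m pattern positions.
  let score = 0;
  for (let i = 0; i < m; i++) if (!(v & (1 << i))) score++;
  return score;
}

console.log(llcsBitParallel("uni", "university")); // 3
console.log(llcsBitParallel("survey", "surgery")); // 5
```

Building `masks` once per pattern plays the same role as bitap's pattern alphabet, which is why the two fit in the same architectural slot.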

I have made a demo here

To compare, I used the fuse book dataset as the small dataset (now would be the right time to ask permission to use it, if it's not a bit late). The difference between O(m+n) and O(m*n) can be seen by comparing against the timed fuse example. It's a bit of an apples-to-oranges comparison, but in the fuse example there is a steady increase of about 1ms per character of the query, which is about the time needed to do one pass over the data. This demo algorithm does not show that behaviour (or will show a steady increase per word instead of per character, because I do one pass for each token).

PS: This is my first try at releasing an open source project, so comments are more than welcome.


jeancroy commented on August 19, 2024

Oh, it can definitely be done. It would speed up the starting-position search, and slow down matching each word of the query against each word of the item.

It's just that there's quite a lot of plumbing required:

Split the query into space-separated words.
this.pattern and this.patternAlphabet would become arrays of patterns and pattern alphabets.

analyzeText should split the item value into words, apply searcher.search on each of those words, and collate the best word match in each case.

search could either manage the array of patterns itself, or be given a single pattern/alphabet as an argument and be looped over by its caller. (I opted for the second option, because I have found that "for each query word, find the best item word, add the best score" gives better results than "for each item word, find the best query word, add the best score".)

Lastly, this update would probably change the expected answers of some unit tests.
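The plumbing described above can be sketched roughly like this — `scoreWord` and `scoreItem` are illustrative names standing in for fuse's real internals (searcher.search, analyzeText), with a plain DP LLCS as the per-word scorer:

```javascript
// Illustrative per-word scorer: LLCS via dynamic programming.
// In fuse this role would be played by searcher.search.
function scoreWord(queryWord, itemWord) {
  const m = queryWord.length, n = itemWord.length;
  let prev = new Array(n + 1).fill(0);
  for (let i = 1; i <= m; i++) {
    const cur = [0];
    for (let j = 1; j <= n; j++) {
      cur[j] = queryWord[i - 1] === itemWord[j - 1]
        ? prev[j - 1] + 1
        : Math.max(prev[j], cur[j - 1]);
    }
    prev = cur;
  }
  return prev[n];
}

// Free word order: split both strings into space-separated words,
// then, for each query word, find the best-matching item word and
// sum the best scores (the loop order preferred above).
function scoreItem(query, item) {
  const queryWords = query.toLowerCase().split(/\s+/).filter(Boolean);
  const itemWords = item.toLowerCase().split(/\s+/).filter(Boolean);
  let total = 0;
  for (const q of queryWords) {
    let best = 0;
    for (const w of itemWords) best = Math.max(best, scoreWord(q, w));
    total += best;
  }
  return total;
}

console.log(scoreItem("uni berlin", "Technical University of Berlin")); // 9
```

Because each query word is scored against item words independently, the word order of the query no longer matters, which is exactly the behaviour netAction asked for.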

