I need to fuzzy find a string (one word) in a lightly garbled document (OCR:ed with ma


Fuzzy find string in document about fuzzywuzzy HOT 4 OPEN

seatgeek commented on June 23, 2024

Fuzzy find string in document

from fuzzywuzzy.

Comments (4)

bernardosulzbach commented on June 23, 2024

> I could also have a running window of len(word) letters over the doc and match against that, which should work but seems terribly inefficient.

How big is the document? This shouldn't be THAT slow.
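
For reference, the running-window idea quoted above could be sketched roughly as follows. This is only an illustration, not fuzzywuzzy's API: `fuzzy_find` and its `slack` parameter are hypothetical names, and the standard library's `difflib.SequenceMatcher` stands in for a fuzzywuzzy scorer so the snippet runs without extra dependencies. It also assumes that lowercasing does not change the text length, which holds for ASCII input.

```python
from difflib import SequenceMatcher


def fuzzy_find(word, doc, slack=1):
    """Return (score, start, end) of the best fuzzy match for `word` in `doc`.

    Hypothetical helper: slides windows of len(word) - slack to
    len(word) + slack characters over the document and scores each window
    with difflib's similarity ratio (0.0 to 1.0).
    """
    word = word.lower()
    text = doc.lower()  # assumes lowercasing keeps the length (true for ASCII)
    best = (0.0, -1, -1)
    min_len = max(1, len(word) - slack)
    for size in range(min_len, len(word) + slack + 1):
        for start in range(len(text) - size + 1):
            window = text[start:start + size]
            score = SequenceMatcher(None, word, window).ratio()
            if score > best[0]:
                best = (score, start, start + size)
    return best


doc = "Starring Arnold Schwarzeneger as the T-800"
score, start, end = fuzzy_find("schwarzenegger", doc)
print(round(score, 2), doc[start:end])
```

The cost grows roughly with the document length times the number of window sizes tried, so for a single word in a single document this is usually tolerable, which matches the point made in the reply above.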


gurgeh commented on June 23, 2024

In practice I can go with the running window. I just wondered:

1. Is there a better way? (OK, so probably not then.)
2. Is this a feature that you think suits this module? Are you interested in a pull request?


josegonzalez commented on June 23, 2024

1. Probably not.
2. Yes, in a separate module, with tests.


harrisniall commented on June 23, 2024

Speaking generally, it probably depends on how often you intend to do this. If you have a lot of documents, it may be worth looking into a bag-of-words approach that uses TF-IDF to index n-grams and then searching your documents through that index. This would account for garbled text and misspellings (e.g. a search for "schwarzeneger" should yield the correct "schwarzenegger").

There are a number of modules in Python that would allow you to do this, some more suitable than others depending on your application. For example, Whoosh seems to have a number of features that allow you to do what you describe above.
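
To make the n-gram idea a bit more concrete, here is a minimal pure-Python sketch of a TF-IDF index over character trigrams, ranked by cosine similarity. It is not how Whoosh or any other library implements this; `NgramIndex`, `ngrams`, and the sample documents are all made up for illustration, and a real search library would add tokenization, persistence, and faster lookups.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramIndex:
    """Toy TF-IDF index over character n-grams (hypothetical helper)."""

    def __init__(self, docs, n=3):
        self.n = n
        counts = [Counter(ngrams(d, n)) for d in docs]
        # Document frequency: in how many documents each n-gram occurs.
        df = Counter(g for c in counts for g in c)
        # Smoothed IDF so n-grams found in every document keep some weight.
        self.idf = {g: math.log((1 + len(docs)) / (1 + k)) + 1
                    for g, k in df.items()}
        self.vectors = [self._weigh(c) for c in counts]

    def _weigh(self, counts):
        # n-grams never seen at indexing time are dropped from the vector.
        vec = {g: tf * self.idf[g] for g, tf in counts.items() if g in self.idf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {g: w / norm for g, w in vec.items()}

    def search(self, query):
        """Return (cosine similarity, document index) pairs, best first."""
        q = self._weigh(Counter(ngrams(query, self.n)))
        scores = [(sum(w * vec.get(g, 0.0) for g, w in q.items()), i)
                  for i, vec in enumerate(self.vectors)]
        return sorted(scores, reverse=True)


docs = [
    "Arnold Schwarzenegger stars in The Terminator",
    "Sylvester Stallone stars in Rocky",
    "Sigourney Weaver stars in Alien",
]
index = NgramIndex(docs)
results = index.search("schwarzeneger")
print(docs[results[0][1]])
```

Because the misspelled query still shares most of its trigrams with the correct spelling, the first document ranks highest. The index is built once, so this pays off mainly when many documents are searched repeatedly.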
