Comments (4)
I could also have a running window of len(word) letters over the doc and match against that, which should work but seems terribly inefficient.
How big is the document? This shouldn't be THAT slow.
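For reference, the running-window idea can be sketched roughly as below. This is only an illustration under stated assumptions: it uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer so that it runs without extra dependencies, and the helper name `best_window_match` is hypothetical, not part of fuzzywuzzy. A scorer such as `fuzz.ratio` could presumably be swapped in.

```python
from difflib import SequenceMatcher


def best_window_match(word, doc):
    """Slide a window of len(word) characters over doc.

    Returns (score, start, text) for the best-scoring window, or
    (0.0, -1, "") if the word is empty or longer than the document.
    NOTE: hypothetical helper; difflib is a stand-in for a fuzzy scorer.
    """
    n = len(word)
    if n == 0 or len(doc) < n:
        return (0.0, -1, "")

    matcher = SequenceMatcher(autojunk=False)
    # SequenceMatcher caches details about seq2, so set the word once.
    matcher.set_seq2(word)

    best = (0.0, -1, "")
    for start in range(len(doc) - n + 1):
        window = doc[start:start + n]
        matcher.set_seq1(window)
        score = matcher.ratio()
        # Strict comparison keeps the earliest window on ties.
        if score > best[0]:
            best = (score, start, window)
    return best
```

The cost is one comparison per character position in the document, which is why it feels wasteful, but for a single moderately sized document that is usually tolerable.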
from fuzzywuzzy.
In practice I can go with the running window. I just wondered:
- Is there a better way? (OK, so probably not then)
- Is this a feature that you think suits this module? Are you interested in a pull request?
- Probably not.
- Yes, in a separate module, with tests.
Speaking generally, it probably depends on how often you intend to do this. If you have a lot of documents, it may be worth looking into a bag-of-words approach that uses TF-IDF to index n-grams and search your documents. This would account for garbled text and misspellings (e.g. a search for "schwarzeneger" should yield the correct "schwarzenegger").
There are a number of Python modules that would allow you to do this, some more suitable than others depending on your application. For example, Whoosh seems to have a number of features that allow you to do what you describe above.
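To make the n-gram suggestion concrete, here is a minimal sketch of a character-trigram TF-IDF index with cosine-similarity search. It uses only the standard library rather than Whoosh, and every name in it (`char_ngrams`, `build_index`, `search`) as well as the smoothed IDF formula are assumptions chosen for illustration, not an existing API.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def _normalize(vec):
    """Scale a sparse vector (dict) to unit length; empty dict if all zero."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {g: w / norm for g, w in vec.items()} if norm else {}


def build_index(docs, n=3):
    """Build (idf, vectors): per-n-gram IDF weights and one unit vector per doc."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    total = len(docs)
    # Smoothed IDF (an assumption; other weightings work too).
    idf = {g: math.log((1 + total) / (1 + k)) + 1.0 for g, k in df.items()}
    vectors = [_normalize({g: tf * idf[g] for g, tf in c.items()}) for c in counts]
    return idf, vectors


def search(query, idf, vectors, n=3):
    """Return (doc_index, cosine_score) pairs, best match first."""
    q_counts = Counter(char_ngrams(query, n))
    # n-grams never seen in the indexed documents are ignored.
    q = _normalize({g: tf * idf[g] for g, tf in q_counts.items() if g in idf})
    scores = [sum(w * v.get(g, 0.0) for g, w in q.items()) for v in vectors]
    return sorted(enumerate(scores), key=lambda p: p[1], reverse=True)
```

A misspelled query still shares most of its trigrams with the correct spelling, so it should rank the right document first:

```python
docs = [
    "arnold schwarzenegger stars in the film",
    "a recipe for apple pie",
    "weather report for tuesday",
]
idf, vectors = build_index(docs)
ranked = search("schwarzeneger", idf, vectors)
```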