Comments (7)
For distance=2, length=n my approach gives average O(n) solution, not O(n^2). I generate insertions at distance=1 (linear), and i go to level 2 only if my bloom-filter say that there is such delete there. So I don't see why it should be significantly reduced.
In the best case (assuming there is at least one suggestion within MaxEditDistance), you would still have n*26 + (n+1)*26 insertions/dictionary lookups per delete. Those minimum 269 dictionary lookups per delete (n=5, distance=2) is one reason for a performance reduction. Even if not visible in Big O notation, constant factors may heavily influence performance. The second reason is that probably multiple level1 deletes will exist, and you will have to to iterate level2 multiple times as well.
Also my benchmarks performed in python, which is rather slow.
That's why I benchmarked a c# port of Norvig's algorithm against the c# implementation of SymSpell.
That should exclude the performance impact of the implementation language, and just compare the algorithmical difference.
from jamspell.
It use n-gram language model (word based, 3-gram), and select candidate with the highest score. Also it is optimized for speed (modified symspell algorithm) and memory consumption (bloom-filter & perfect hash). I will add some description in README later. Here is an article (russian) with detailed explanation, habrahabr.ru/post/346618.
from jamspell.
Спасибо, надо почитать.
Closing this issue.
It is also interesting to see if JamSpell is going to have a "hyped", yet more related n-gram language model in some RNN architecture.
P.S.: Your library is popular and most recommended for Spell checking problem in ODS #nlp channel community
from jamspell.
It is also interesting to see if JamSpell is going to have a "hyped", yet more related n-gram language model in some RNN architecture.
I want to try LSTM in the future.
from jamspell.
Hi, this is Wolf, the author of the original SymSpell algorithm. Unfortunately, my Russian is limited, so I used Google Translate to read your interesting habrahabr post. I hope I got it right:
"... the index from the SymSpell algorithm took up a lot of space. ... But if a bloom filter says that such a deletion is in the index - we can restore the original word by performing insertions to the delete and checking them in the index. The performance of the resulting solution has practically not slowed down, and the memory used has decreased very significantly. ..."
While a bloom filter of deletes indeed takes much less space than storing the deletes and the pointers to the original words, I believe that the performance will be significantly reduced by performing the insertions.
A single delete of length=n with maximum edit distance=2 requires 26(n+1)26(n+2) insertions and checks in the dictionary (for 26 letters in latin alphabet). For length=5 this results in 28,392 dictionary lookups per delete. And you have n(n-1)/2 deletes for each word for edit distance=2.
Also, the algorithm becomes language dependent, as the character set of the characters to be inserted is different in Latin, Cyrillic, Georgian, Tibetan or Chinese
In your tests, JamSpell seems to be 3..4 times faster than Norvig. In my benchmark the original SymSpell is 1000x faster than Norvig for maximum edit distance=2.
Btw, the memory usage of the recent SymSpell versions has been significantly reduced by prefix indexing.
from jamspell.
Hi, thanks a lot for your feedback! In my implementation the bottleneck now is generating sentences with candidates and getting language model predictions. Originally I was using Norvig's approach, but it was very slow, especially on long words. Your algorithm helped me to improve performance.
A single delete of length=n with maximum edit distance=2 requires 26(n+1)26(n+2) insertions and checks in the dictionary (for 26 letters in latin alphabet). For length=5 this results in 28,392 dictionary lookups per delete. And you have n(n-1)/2 deletes for each word for edit distance=2.
For distance=2, length=n my approach gives average O(n) solution, not O(n^2). I generate insertions at distance=1 (linear), and i go to level 2 only if my bloom-filter say that there is such delete there. So I don't see why it should be significantly reduced.
Also, the algorithm becomes language dependent, as the character set of the characters to be inserted is different in Latin, Cyrillic, Georgian, Tibetan or Chinese
Agree, it could be an issue for languages with lot's of characters.
In your tests, JamSpell seems to be 3..4 times faster than Norvig. In my benchmark the original SymSpell is 1000x faster than Norvig for maximum edit distance=2.
I think it's not correct to compare your library with my, your library doesn't consider words surroundings. Also my benchmarks performed in python, which is rather slow. I would be glad to add symspell to my benchmarks too, but, as far as I know - it don't have any python bindings.
Btw, the memory usage of the recent SymSpell versions has been significantly reduced by prefix indexing.
I will look at it. I tried to use suffix tree - it reduced memory usage but it still required a lot of it. Bloom filter is much more compact.
from jamspell.
Can we use some neural language model instead of n-gram language model for candidate suggestions?
from jamspell.
Related Issues (20)
- unable to install in mac and also in colab HOT 2
- Grammar check HOT 1
- Hangs up when calling TSpellCorrector.LoadLangModel HOT 8
- Train custom model on Windows
- Spanish model HOT 2
- 4Gb of RAM on 1K-characters word HOT 4
- GetCandidates shows candidates but FixFragment doesn't correct the sentence HOT 3
- evaluate.py stops itself while running HOT 1
- Update to SWIG 4 HOT 7
- Jamspell Pro HOT 3
- Unable to install JamSpell via python3.8.5 even after installing vsc++ and swig. HOT 1
- training on large data set is failing HOT 3
- try to install to windows but not running HOT 1
- What are minimum requirements? HOT 1
- Dockerizing Jamspell web app HOT 1
- How to train custom model? HOT 1
- Segmentation fault with JamspellPro
- Minimum token repetition rate for training
- Is it yet alive? HOT 2
- Exposed bing api key in Jamspell pro
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from jamspell.