aarondandy / wecantspell.hunspell
A port of Hunspell v1 for .NET and .NET Standard
Home Page: https://www.nuget.org/packages/WeCantSpell.Hunspell/
License: Other
When the issue hunspell/hunspell#760 is resolved, restore the tests for allcaps.aff. Also, can_read_allcaps_dic will need tests added manually for the new words.
Do you have an easy way to tip/donate? Our company found this solution incredibly helpful 👍
Thanks for taking the time to create and maintain it ^_^
I've found a few cases where a misspelled word is suggested. This issue does not occur in the native C++ version.
But if I take a word like abjurers and put a typo in so that it is now abmurers, one of the suggested words is abjureers, which is not an English word. Here are a few more examples:
| Word | Misspelled Word | Bad Suggestion |
|---|---|---|
| epoxied | epooied | poiseed |
| jewelries | ewelries | jewelrys |
| squabbles | suabbles | squabblees |
The file loading performance is pretty terrible, maybe pipelines will help a bit? Then again, maybe not...
Like the "NetCore" projects for example.
This is more of a question, but I'd like to use this in a project I'm working on. From what I can tell WordList.Check is designed to check single words.
Are there any recommendations on what tool to use to break sentences up into words that should be checked? A naive way would be to just use string.Split(), but I'd like to see if there's a tool that can automatically handle numbers, currency, and sentence punctuation. I've been looking at some NLP tools but am wondering if you've used anything in particular.
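Short of a full NLP toolkit, a regular expression can serve as a lightweight tokenizer. The pattern below is only a sketch: it extracts runs of letters (allowing one apostrophe for contractions) and deliberately skips numbers, currency, and punctuation. Real tokenizers handle many more edge cases.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class Tokenizer
{
    // Matches a run of letters, optionally followed by an
    // apostrophe plus more letters (e.g. "didn't").
    static readonly Regex WordPattern = new Regex(@"\b\p{L}+(?:'\p{L}+)?\b");

    static void Main()
    {
        var text = "Bob paid $5.00 for 3 apples, didn't he?";
        var words = WordPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value);
        // Each token could then be passed to WordList.Check individually.
        Console.WriteLine(string.Join(" ", words));
    }
}
```

Using `\p{L}` rather than `[a-z]` keeps the approach usable for accented and non-Latin dictionaries.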
See if the new ref features in C# 7 can help with performance and reduce copies in the code.
I am trying to make a spell checker for Kurdish. The problem is, Kurdish relies a lot on infixes (mostly because of clitic pronouns). I'd appreciate it if you provide any guidance on what's the best approach for a language like that.
That's great news, I'd rather create a Nuspell dictionary than a custom library on my own.
I have noticed that Hunspell uses very little memory and is quite fast. So if I want to create a custom library for Kurdish, I want to know which algorithms Hunspell uses.
Here is a general idea of what I am trying to accomplish:
Consider this word: Bexshin (Forgiving). It can come in these forms:
Instead of a list of words, we can have a list of patterns like so: Bi{pronoun}bexsh{pronoun}
More examples:
Eat can be represented as: Dexo{pronoun}
Work can be represented as: Kar{pronoun}dekrid
So I need an algorithm that can very quickly tell me which patterns match most closely; then I can expand only those patterns and, based on the Levenshtein distance to the input word, give back a list of suggestions.
I know that I can read the source code, and I will. But it'd make my job much easier if you gave me a few leads on which algorithms can be useful based on your experience.
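For illustration only, here is a brute-force sketch of the pattern idea described above. The pronoun list and pattern strings are invented placeholders, not real Kurdish data, and a production version would need an index over patterns rather than expanding everything.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PatternSuggester
{
    // Hypothetical placeholders standing in for real clitic pronouns
    // and word patterns.
    static readonly string[] Pronouns = { "im", "it", "in" };
    static readonly string[] Patterns = { "bi{pronoun}bexsh", "dexo{pronoun}" };

    // Expand each {pronoun} placeholder into concrete candidates.
    static IEnumerable<string> Expand(string pattern) =>
        Pronouns.Select(p => pattern.Replace("{pronoun}", p));

    // Classic dynamic-programming Levenshtein distance.
    static int Distance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (var i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (var j = 0; j <= b.Length; j++) d[0, j] = j;
        for (var i = 1; i <= a.Length; i++)
            for (var j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        var input = "bimbexsh";
        var best = Patterns.SelectMany(Expand)
            .OrderBy(c => Distance(input, c))
            .Take(3);
        Console.WriteLine(string.Join(", ", best));
    }
}
```

The interesting engineering problem is the `OrderBy` step: avoiding full expansion is exactly where a trie or n-gram index over the patterns would pay off.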
Some of the build targets may be redundant. Will have to do some experimentation with different platforms to see what builds NuGet selects for them:
The 451 build should probably target 45, and the 461 build may not even need to exist. The netstandard1.1 build may be redundant with the PCL build. The PCL build may be harder to build in the future though... I'm not sure if people would even use the PCL or netstandard1.1 builds in the future.
I'm trying to use WeCantSpell to create an autocorrect feature for a project I'm working on. I call WordList.Suggest every time a letter is added to or removed from the word I'm writing, but the results are generated very slowly: the more letters I feed into the method, the more time it takes to compute a result. Is there any way I can make it run faster? Is there a way to limit the number of words to retrieve as suggestions?
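One mitigation that needs no library changes is to debounce: only run the expensive suggest call after the user pauses typing, and cancel stale requests. A sketch, with the suggest call abstracted behind a delegate so the example stays self-contained (in real use the delegate would wrap something like `wordList.Suggest`):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class SuggestDebouncer
{
    private CancellationTokenSource _cts = new CancellationTokenSource();
    private readonly Func<string, IEnumerable<string>> _suggest;

    public SuggestDebouncer(Func<string, IEnumerable<string>> suggest) =>
        _suggest = suggest;

    // Waits for a quiet period before running the suggest call.
    // If another keystroke arrives first, the pending call is
    // cancelled and Task.Delay throws TaskCanceledException,
    // which the caller should swallow.
    public async Task<IEnumerable<string>> SuggestAsync(string word, int delayMs = 300)
    {
        _cts.Cancel();
        _cts = new CancellationTokenSource();
        var token = _cts.Token;
        await Task.Delay(delayMs, token);
        return await Task.Run(() => _suggest(word), token);
    }
}
```

This keeps at most one suggest computation in flight per input field, which usually matters more than shaving time off any single call.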
I'm trying to use this package in a UWP app, but it works only when building the project in Debug mode.
When building in Release mode the build fails with the following error:
ILT0038: 'WeCantSpell.Hunspell.WordEntryDetail' is a value type with a default constructor. Value types with default constructors are not currently supported. Consider using an explicit initialization function instead.
I believe that to resolve this error, the default parameterless constructor on the WordEntryDetail value type needs to be removed.
As mentioned in #5, the HunspellDictionary class does not really do much. Is it helping or hurting? Maybe there is a better design that could be thought up here.
StringSlice is over. Span is what's hot.
Because NetCore is not really a good library target anymore and because NetStandard may itself be a moving target (who knows?) I am going to rename this repository. The whole reason I started work on this port was to support another spelling related project I am working on. Because of this I think I will just turn that name (WeCantSpell) into an umbrella project containing both this Hunspell port as well as the tool that would consume it. Naming is hard, and ... whatever I'm tired of trying to pick a good one :) Nothing with respect to functionality will change but people referencing the pre-release packages will have to swap out some namespaces and a package.
Hello,
I didn't find any ticket or test about adding custom words to a loaded dictionary.
Is it possible?
Or does one need to re-create a new dictionary by merging?
Thanks,
Hervé
The method WordList.QuerySuggest.Phonet has two nested loops that query a PhoneTable for entries that have rules starting with a given character. Indexing these entries in the table by the first rule character may have a positive impact on performance.
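A sketch of what that index might look like. `Rule` is a hypothetical stand-in for the real PhoneTable entry type, not the library's actual type; the point is just grouping entries by first rule character so a lookup replaces a scan.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PhoneRuleIndex
{
    // Stand-in for the real phonetic rule entry type.
    public class Rule
    {
        public string Pattern;
        public string Replacement;
    }

    private readonly Dictionary<char, List<Rule>> _byFirstChar;

    // Build the index once, at table construction time.
    public PhoneRuleIndex(IEnumerable<Rule> rules) =>
        _byFirstChar = rules
            .Where(r => !string.IsNullOrEmpty(r.Pattern))
            .GroupBy(r => r.Pattern[0])
            .ToDictionary(g => g.Key, g => g.ToList());

    // O(1) lookup of all rules starting with a given character,
    // replacing the inner scan over the whole table.
    public IReadOnlyList<Rule> StartingWith(char c) =>
        _byFirstChar.TryGetValue(c, out var list)
            ? (IReadOnlyList<Rule>)list
            : Array.Empty<Rule>();
}
```

Since the table is immutable after load, the one-time grouping cost is amortized across every Phonet call.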
During the initial development of this port, a regular old dictionary of type Dictionary<string, T> was used to store various words, affixes, and their associated details. This really helped speed up development of the code initially and even has some performance strengths in specific cases.
It is clear to me now that the choice to use Dictionary<,> was probably wrong for a lot of things, as it is not a very good data structure when you want multiple related results for a query. When profiled, this is where the code spends most of its time.
I would like to see how the use of a trie or a sorted collection impacts performance in the library. Looking back at the Hunspell source and some e-mails discussing design, I think the original source uses something like a sorted linked list as the main storage for root words.
The following locations in the code may benefit from swapping in a better data structure:
The methods SuffixCollection.GetMatchingAffixes and PrefixCollection.GetMatchingAffixes both look for affixes that begin or "end" with a certain string of text. This part of the code could probably benefit greatly from having a list of Affix<> that can be indexed by word instead of having all AffixEntries confusingly nested into groups. It may also make a case for the internal Affix<> type to be converted to a reference type. The GetMatchingWithDotAffixes method is related as well, but it is often a cold path, so optimization there should focus on code size.
The method MultiReplacementTable.FindLargestMatchingConversion, while small, may make many calls to a dictionary, as it is a loop that is itself called from within a loop. Its responsibility is to find the longest matching entry for a substring.
The method PatternSet.Check searches for a pattern entry whose text value is a subset of another.
While I'm not sure that a trie or even a sorted list would have an impact on the WordList.EntriesByRoot collection, keeping the entries sorted may have the beneficial effect of placing related roots near each other in memory.
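As one possible direction for the locations above, a sorted array plus binary search gives contiguous, cache-friendly prefix lookups. This is a sketch of the idea, not the library's implementation:

```csharp
using System;

static class SortedPrefixLookup
{
    // Returns the contiguous range of sortedKeys that share `prefix`.
    // Because the keys are ordinally sorted, all matches sit in one
    // run starting at the lower bound of the prefix itself, which is
    // also what keeps related roots adjacent in memory.
    public static ArraySegment<string> WithPrefix(string[] sortedKeys, string prefix)
    {
        int lo = LowerBound(sortedKeys, prefix);
        int hi = lo;
        while (hi < sortedKeys.Length &&
               sortedKeys[hi].StartsWith(prefix, StringComparison.Ordinal))
            hi++;
        return new ArraySegment<string>(sortedKeys, lo, hi - lo);
    }

    // First index whose key is >= value (classic binary search).
    static int LowerBound(string[] keys, string value)
    {
        int lo = 0, hi = keys.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (string.CompareOrdinal(keys[mid], value) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```

Compared with Dictionary<string, T>, this trades O(1) exact lookup for O(log n) range queries, which fits the "multiple related results for a query" access pattern described above.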
Is there a way to get all the words that start with a specific string?
If so, the license is the GNU LGPL, instead of the v1 tri-license.
Need to get a contributing file set up; maybe a nice .md file documenting how things in the port map to the origin would help too.
It seems that while Dictionary works, it is not a very performant tool for this purpose. Explore other options for both building a word list and later searching it.
I am looking at a new major version bump where I plan to remove .NET 3.5 support and add in net standard 2.0 support. Net45 and netstandard1.3 support will remain. I am curious if anybody would be impacted by these changes. I think I am going to do it anyway, but it helps me understand better if it is worth spending energy on fixes on a legacy v2 branch.
Need to get an icon for the project so it can be easier to identify and stand out.
The code may have drifted away from the Hunspell origin... maybe not much due to the v2 work.
I really dislike the methods related to WordList.Query.CompoundCheckWordSearch
and would like to see them cleaned up while preserving existing performance numbers. The method used to be much larger, but warm methods were split up for performance reasons to reduce conditional checks and branching. The negative impact of this refactoring is that there are now multiple methods dangling out there. I would really like to see something that reduces the amount of code used to solve this problem, improving readability while maintaining the current performance. If no good solution can be found, refactoring these methods into a new private type may be beneficial.
CompoundCheckWordSearch
CompoundCheckWordSearchMultiDetailWithWords
CompoundCheckWordSearchMultiDetailScpdFlags
CompoundCheckWordSearchMultiDetail
CompoundCheckWordSearchCompoundOnlyDetailScpd
CompoundCheckWordSearchCompoundOnlyDetail
What I want is simple:
I would like to obtain all the words that can be composed from a given word.
E.g., given
make/UAGS
in the us.dic file, I want to obtain all the words that can be derived from this word/suffix combination,
e.g. the results would be: made, making, makes, etc.
After updating NBench, the tests are failing to load NHibernate to run the performance comparisons. Also, while I appreciate what NBench does, I want to try out BenchmarkDotNet (sp?) and see if it is easier to run repeatedly and whether it can also deliver consistent numbers.
Hello,
I like this library, it is very useful. But one small thing to ask, if I may: could the library be signed?
Thanks.
Hi there, is it possible to use this port with the Unity3D game engine? I tried to implement it, but I failed. I guess the problem is Mono 2.0/3.5 and the Unity Android platform.
Because this is a port of the original library, I don't want to get too creative with the algorithms. For suggest, especially, these brute-force nested loops can really hurt, as the x64 compilers don't put the same amount of up-front effort into optimizations. Maybe something like Levenshtein distance, which I don't even know if I spelled correctly, could be of use without changing the results?
If something can be improved there, the benefits could impact #33, #40, and #43 .
Learning resources from https://github.com/Turnerj/Quickenshtein#learning-resources
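One trick from those resources that applies here is a threshold-capped, two-row Levenshtein: suggest loops rarely care about distances beyond a small bound, so the computation can bail out as soon as no cell in a row can stay within it. A sketch, not the library's code:

```csharp
using System;

static class BoundedLevenshtein
{
    // Levenshtein distance with O(min-row) early exit and only two
    // rows of storage. Returns max + 1 as soon as the distance is
    // provably above `max`, so callers can cheaply reject candidates.
    public static int Distance(string a, string b, int max)
    {
        // Length difference alone is a lower bound on the distance.
        if (Math.Abs(a.Length - b.Length) > max) return max + 1;

        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.Min(rowMin, curr[j]);
            }
            // Distances only grow down the matrix, so once every cell
            // exceeds max, the final answer must too.
            if (rowMin > max) return max + 1;
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }
}
```

Because rejected candidates are rejected identically with or without the cap, this kind of pruning keeps the suggest results unchanged, which matches the constraint above.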
Hello
Here is my code (it's pretty basic):
var dictionaryFr = WordList.CreateFromFiles(ressources + "\\fr-toutesvariantes.dic", ressources + "\\fr-toutesvariantes.aff");
for (int i = 0; i < 10; i++)
{
List<string> suggestList = dictionaryFr.Suggest("Systemes").ToList();
System.Diagnostics.Debug.WriteLine(suggestList.Count);
}
And this is what I'm getting :
0
3
3
3
3
0
3
0
3
3
So sometimes I get suggestions, sometimes I don't 😢. Please help!
I'm using version 3.0.1 of the NuGet package, and my code runs on .NET Framework 4.5.
Because of the large gap between .NET 4.5 and .NET 6, I'm not sure yet if I want to continue supporting .NET versions that are out of support or that may soon be out of support. I'm doing some quick analysis of which versions should be supported by exploring GitHub and NuGet to see what is out there. Based on that, and my tolerance for adding in shims, I think I will be able to determine a new list of target frameworks for a future 4.x release containing possibly breaking changes. If anybody out there has any feedback, and is actually paying attention to this project, please let me know which framework versions you are still targeting.
Hello,
I'm using library for spellcheck and suggestions. Nothing fancy in initialization:
hunspell = WordList.CreateFromStreams(dictionaryStream, affixStream);
hunspell.Suggest(word);
It's running on net6.0: <TargetFrameworks>netstandard2.0;net6.0</TargetFrameworks>
Here are my affix and dictionary files: en.zip (I'm pretty sure I got them from open sources, so it's safe to share)
Sometimes I get a System.IndexOutOfRangeException when trying to call .Suggest().
I can't reproduce it locally with the inputs that produced the exception. It has happened in the production/stage system over multiple days: in the past 14 days and 400k calls, it has happened 7 times so far.
Here is a stack trace that I received:
[14:53:56.828](154) Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
at WeCantSpell.Hunspell.WordList.QuerySuggest.LeftCommonSubstring(String s1, String s2)
at WeCantSpell.Hunspell.WordList.QuerySuggest.NGramSuggest(List`1 wlst, String word, CapitalizationType capType)
at WeCantSpell.Hunspell.WordList.QuerySuggest.Suggest(String word)
It happened for the following suggestion requests: nth, gua, gua, colo, leav, Jira, o
I've just updated to version 3.1.1 to see if it will help, and I have high hopes for 4.0. I'll report back later with my findings.
The method WordList.QuerySuggest.NGram, through the two methods it calls, NGramWeightedSearch and NGramNonWeightedSearch, does a series of brute-force substring checks that are pretty expensive. These two methods could likely benefit from some kind of better algorithm for these contains checks, even if extra allocations may be required. This will hopefully have a positive impact on the suggest performance, which is pretty bad at the moment.
We are using your tool in my enterprise, and automatically testing in a pool of computers.
We've found that the first search algorithm always fails on some computers, making our tests flaky. At first we thought it might be related to the workload on those machines, but tinkering with the time-limit parameters didn't change a thing. In the end, we realized it always fails on the same machines and works fine on the rest.
The only point in common we've found on those machines is that they all have Intel processors from the E5-26xx family (2640, 2650, 2680).
Can you shed some light about this issue, or share a thought?
Thanks.
There are a few places that could really use some updated C# code.
Hello. I have a text: "Hello, my name is Bob. How do you do?"
I use Split(' ') and call Check for all the words. But how can I ignore the commas, dots, and question marks in the Check method?
Maybe this library has properties for this? Or must I use regex?
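A minimal way to strip punctuation before checking, assuming whitespace splitting is sufficient (the library itself does not tokenize text), is to trim the punctuation characters from each token:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var text = "Hello, my name is Bob. How do you do?";
        var words = text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Trim leading/trailing punctuation from each token.
            .Select(w => w.Trim(',', '.', '?', '!', ';', ':'))
            .Where(w => w.Length > 0);
        // Each surviving token could then be passed to WordList.Check.
        Console.WriteLine(string.Join("|", words));
        // → Hello|my|name|is|Bob|How|do|you|do
    }
}
```

Trimming only the edges preserves internal punctuation such as apostrophes in contractions, which Check generally needs to see.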
(For all I know, this is from hunspell proper, so maybe this feedback is in the wrong place.)
tl;dr: Libraries should be as pure as possible, because purity maximizes flexibility, composability, and testability, and greatly decreases the maintenance burden on the library author.
I would suggest an API like:
var checker = new HunspellDictionary(dictionaryWords /* ISet<string> */, affixContent /* string */);
bool notOk = checker.Check("teh");
var suggestions = checker.Suggest("teh");
bool ok = checker.Check("the");
As it is now, this library thinks in terms of filesystems for affix and dictionary files. But what if the data source is a database? Or a web service endpoint? This is especially relevant for people who might consume this in an ASP.NET Core web application.
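For what it's worth, the existing WordList.CreateFromStreams overload (used elsewhere in these issues) already decouples loading from the filesystem: any source that can produce a Stream works, including a database blob or an HTTP response body. A sketch using in-memory content; the tiny .dic/.aff strings are illustrative only:

```csharp
using System;
using System.IO;
using System.Text;
// using WeCantSpell.Hunspell;  // NuGet package assumed

class Program
{
    static void Main()
    {
        // Minimal dictionary: a word count line followed by one word.
        var dicBytes = Encoding.UTF8.GetBytes("1\nthe\n");
        // Minimal affix file: just the encoding declaration.
        var affBytes = Encoding.UTF8.GetBytes("SET UTF-8\n");

        using (var dic = new MemoryStream(dicBytes))
        using (var aff = new MemoryStream(affBytes))
        {
            var wordList = WordList.CreateFromStreams(dic, aff);
            Console.WriteLine(wordList.Check("the"));
        }
    }
}
```

So the filesystem coupling is a convenience layer rather than a hard requirement, though a fully in-memory constructor as proposed above would still be simpler for the database/web-service case.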
ISet.Comparer could be used for the GetHashCode and Equals implementations where it makes sense to do so.
A bunch of stuff goes away afterwards:
CulturedStringComparer
Utf16StringLineReader
HunspellLineReaderExtensions
IHunspellLineReader
DynamicEncodingLineReader
StaticEncodingLineReader
HunspellDictionary
System.IO dependencies
AffixReader gets considerably simpler. This sets you up to delete (or delegate to the framework) a bunch of other stuff:
CharacterSet -> HashSet<char>
ArrayWrapper, ArrayComparer -> use HashSet<T>
Deduper
EncodingEx
And avoids bugs around:
By maintaining purity, all of your operations are CPU-bound, so the need for async disappears. (Or rather, it shifts to the application developer, who may want to run it on a threadpool thread, but that would be their choice to make.)
If users want to construct a dictionary from another source or an affix from another source it should be much easier to do so. The API should be simplified to a point where one method call can be used to construct a dictionary from a set of words efficiently. There should also be some documentation to show this. From #5
Look to the tests for the implementation and throw something in the readme so others can use more than NetApp1.1 provides.
There are some public methods that are either terrible or not very useful and should be removed. While it's for the best, this would be a breaking change and would need to be planned carefully.