aarondandy / wecantspell.hunspell
A port of Hunspell v1 for .NET and .NET Standard
Home Page: https://www.nuget.org/packages/WeCantSpell.Hunspell/
License: Other
When the issue hunspell/hunspell#760 is resolved, restore the tests for allcaps.aff. Also, can_read_allcaps_dic will need tests added manually for the new words.
Do you have an easy way to tip/donate? Our company found this solution incredibly helpful 👍
Thanks for taking the time to create and maintain it ^_^
I've found a few cases where a misspelled word is suggested. This issue does not occur in the native C++ version.
But if I take a word like abjurers and put a typo in so that it is now abmurers, one of the suggested words is abjureers, which is not an English word. Here are a few more examples:
| Word | Misspelled Word | Bad Suggestion |
|---|---|---|
| epoxied | epooied | poiseed |
| jewelries | ewelries | jewelrys |
| squabbles | suabbles | squabblees |
The file loading performance is pretty terrible, maybe pipelines will help a bit? Then again, maybe not...
Like the "NetCore" projects for example.
This is more of a question, but I'd like to use this in a project I'm working on. From what I can tell WordList.Check is designed to check single words.
Are there any recommendations on what tool to use to break sentences up into words that should be checked? A naive way would be to just use string.Split(), but I'd like to see if there's a tool that can automatically handle numbers, currency, and sentence punctuation. I've been looking at some NLP tools but am wondering if you've used anything in particular.
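Short of a full NLP toolkit, a regular expression can serve as a lightweight tokenizer. The pattern below is only a sketch: it extracts runs of letters (allowing one apostrophe for contractions) and deliberately skips numbers, currency, and punctuation. Real tokenizers handle many more edge cases.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class Tokenizer
{
    // Matches a run of letters, optionally followed by an
    // apostrophe plus more letters (e.g. "didn't").
    static readonly Regex WordPattern = new Regex(@"\b\p{L}+(?:'\p{L}+)?\b");

    static void Main()
    {
        var text = "Bob paid $5.00 for 3 apples, didn't he?";
        var words = WordPattern.Matches(text)
            .Cast<Match>()
            .Select(m => m.Value);
        // Each token could then be passed to WordList.Check individually.
        Console.WriteLine(string.Join(" ", words));
    }
}
```

Using `\p{L}` rather than `[a-z]` keeps the approach usable for accented and non-Latin dictionaries.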
See if the new ref features in C# 7 can help with performance and reduce copies in the code.
I am trying to make a spell checker for Kurdish. The problem is, Kurdish relies a lot on infixes (mostly because of clitic pronouns). I'd appreciate it if you provide any guidance on what's the best approach for a language like that.
That's great news, I'd rather create a Nuspell dictionary than a custom library on my own.
I have noticed that Hunspell uses very little memory and is quite fast. So if I want to create a custom library for Kurdish, I want to know which algorithms Hunspell uses.
Here is a general idea of what I am trying to accomplish:
Consider this word: Bexshin (Forgiving). It can come in these forms:
Instead of a list of words, we can have a list of patterns like so: Bi{pronoun}bexsh{pronoun}
More examples:
Eat can be represented as: Dexo{pronoun}
Work can be represented as: Kar{pronoun}dekrid
So I need an algorithm that can very quickly tell me which patterns match most closely; then I can expand only those patterns and, based on the Levenshtein distance to the input word, give back a list of suggestions.
I know that I can read the source code, and I will. But it'd make my job much easier if you gave me a few leads on which algorithms can be useful based on your experience.
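For illustration only, here is a brute-force sketch of the pattern idea described above. The pronoun list and pattern strings are invented placeholders, not real Kurdish data, and a production version would need an index over patterns rather than expanding everything.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PatternSuggester
{
    // Hypothetical placeholders standing in for real clitic pronouns
    // and word patterns.
    static readonly string[] Pronouns = { "im", "it", "in" };
    static readonly string[] Patterns = { "bi{pronoun}bexsh", "dexo{pronoun}" };

    // Expand each {pronoun} placeholder into concrete candidates.
    static IEnumerable<string> Expand(string pattern) =>
        Pronouns.Select(p => pattern.Replace("{pronoun}", p));

    // Classic dynamic-programming Levenshtein distance.
    static int Distance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (var i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (var j = 0; j <= b.Length; j++) d[0, j] = j;
        for (var i = 1; i <= a.Length; i++)
            for (var j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    static void Main()
    {
        var input = "bimbexsh";
        var best = Patterns.SelectMany(Expand)
            .OrderBy(c => Distance(input, c))
            .Take(3);
        Console.WriteLine(string.Join(", ", best));
    }
}
```

The interesting engineering problem is the `OrderBy` step: avoiding full expansion is exactly where a trie or n-gram index over the patterns would pay off.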
Some of the build targets may be redundant. Will have to do some experimentation with different platforms to see what builds NuGet selects for them:
The 451 build should probably target 45, and the 461 build may not even need to exist. The netstandard1.1 build may be redundant with the PCL build. The PCL build may be harder to build in the future though... I'm not sure if people would even use the PCL or netstandard1.1 builds in the future.
I'm trying to use WeCantSpell to create an autocorrect feature for a project I'm working on. I call WordList.Suggest every time a letter is added to or removed from the word I'm writing, but the results are generated very slowly: the more letters I feed into the method, the more time it takes to compute a result. Is there any way I can make it run faster? Is there a way to limit the number of words to retrieve as suggestions?
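One mitigation that needs no library changes is to debounce: only run the expensive suggest call after the user pauses typing, and cancel stale requests. A sketch, with the suggest call abstracted behind a delegate so the example stays self-contained (in real use the delegate would wrap something like `wordList.Suggest`):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class SuggestDebouncer
{
    private CancellationTokenSource _cts = new CancellationTokenSource();
    private readonly Func<string, IEnumerable<string>> _suggest;

    public SuggestDebouncer(Func<string, IEnumerable<string>> suggest) =>
        _suggest = suggest;

    // Waits for a quiet period before running the suggest call.
    // If another keystroke arrives first, the pending call is
    // cancelled and Task.Delay throws TaskCanceledException,
    // which the caller should swallow.
    public async Task<IEnumerable<string>> SuggestAsync(string word, int delayMs = 300)
    {
        _cts.Cancel();
        _cts = new CancellationTokenSource();
        var token = _cts.Token;
        await Task.Delay(delayMs, token);
        return await Task.Run(() => _suggest(word), token);
    }
}
```

This keeps at most one suggest computation in flight per input field, which usually matters more than shaving time off any single call.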
I'm trying to use this package in a UWP app, but it works only when building the project in Debug mode.
When building in Release mode the build fails with the following error:
ILT0038: 'WeCantSpell.Hunspell.WordEntryDetail' is a value type with a default constructor. Value types with default constructors are not currently supported. Consider using an explicit initialization function instead.
I believe that to resolve this error, the default parameterless constructor on the WordEntryDetail value type needs to be removed.
As mentioned in #5, the HunspellDictionary class does not really do much. Is it helping or hurting? Maybe there is a better design that could be thought up here.
StringSlice is over. Span is what's hot.
Because NetCore is not really a good library target anymore and because NetStandard may itself be a moving target (who knows?) I am going to rename this repository. The whole reason I started work on this port was to support another spelling related project I am working on. Because of this I think I will just turn that name (WeCantSpell) into an umbrella project containing both this Hunspell port as well as the tool that would consume it. Naming is hard, and ... whatever I'm tired of trying to pick a good one :) Nothing with respect to functionality will change but people referencing the pre-release packages will have to swap out some namespaces and a package.
Hello,
I didn't find any ticket or test about adding custom words to a loaded dictionary.
Is it possible?
Or does one need to re-create a new dictionary by merging?
Thanks,
Hervé
The method WordList.QuerySuggest.Phonet has two nested loops that query a PhoneTable for entries that have rules starting with a given character. Indexing these entries in the table by the first rule character may have a positive impact on performance.
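A sketch of what that index might look like. `Rule` is a hypothetical stand-in for the real PhoneTable entry type, not the library's actual type; the point is just grouping entries by first rule character so a lookup replaces a scan.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PhoneRuleIndex
{
    // Stand-in for the real phonetic rule entry type.
    public class Rule
    {
        public string Pattern;
        public string Replacement;
    }

    private readonly Dictionary<char, List<Rule>> _byFirstChar;

    // Build the index once, at table construction time.
    public PhoneRuleIndex(IEnumerable<Rule> rules) =>
        _byFirstChar = rules
            .Where(r => !string.IsNullOrEmpty(r.Pattern))
            .GroupBy(r => r.Pattern[0])
            .ToDictionary(g => g.Key, g => g.ToList());

    // O(1) lookup of all rules starting with a given character,
    // replacing the inner scan over the whole table.
    public IReadOnlyList<Rule> StartingWith(char c) =>
        _byFirstChar.TryGetValue(c, out var list)
            ? (IReadOnlyList<Rule>)list
            : Array.Empty<Rule>();
}
```

Since the table is immutable after load, the one-time grouping cost is amortized across every Phonet call.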
During the initial development of this port, a regular old dictionary of type Dictionary<string, T> was used to store various words, affixes, and their associated details. This really helped speed up development of the code initially and even has some performance strengths in specific cases.
It is clear to me now that the choice to use Dictionary<,> was probably wrong for a lot of things, as it is not a very good data structure when you want multiple related results for a query. When profiled, this is where the code spends most of its time.
I would like to see how the use of a trie or a sorted collection impacts performance in the library. Looking back at the Hunspell source and some e-mails discussing design, I think the original source uses something like a sorted linked list as the main storage for root words.
The following locations in the code may benefit from swapping in a better data structure:
The methods SuffixCollection.GetMatchingAffixes and PrefixCollection.GetMatchingAffixes both look for affixes that begin or "end" with a certain string of text. This part of the code could probably benefit greatly from having a list of Affix<> that can be indexed by word instead of having all AffixEntries confusingly nested into groups. It may also make a case for the internal Affix<> type to be converted to a reference type. The GetMatchingWithDotAffixes method is related as well, but it is often a cold path, so optimization there should focus on code size.
The method MultiReplacementTable.FindLargestMatchingConversion, while small, may make many calls to a dictionary, as it is a loop that is itself called from within a loop. Its responsibility is to find the longest matching entry for a substring.
The method PatternSet.Check searches for a pattern entry whose text value is a subset of another.
While I'm not sure that a trie or even a sorted list would have an impact on the WordList.EntriesByRoot collection, keeping the entries sorted may have the beneficial effect of placing related roots near each other in memory.
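As one possible direction for the locations above, a sorted array plus binary search gives contiguous, cache-friendly prefix lookups. This is a sketch of the idea, not the library's implementation:

```csharp
using System;

static class SortedPrefixLookup
{
    // Returns the contiguous range of sortedKeys that share `prefix`.
    // Because the keys are ordinally sorted, all matches sit in one
    // run starting at the lower bound of the prefix itself, which is
    // also what keeps related roots adjacent in memory.
    public static ArraySegment<string> WithPrefix(string[] sortedKeys, string prefix)
    {
        int lo = LowerBound(sortedKeys, prefix);
        int hi = lo;
        while (hi < sortedKeys.Length &&
               sortedKeys[hi].StartsWith(prefix, StringComparison.Ordinal))
            hi++;
        return new ArraySegment<string>(sortedKeys, lo, hi - lo);
    }

    // First index whose key is >= value (classic binary search).
    static int LowerBound(string[] keys, string value)
    {
        int lo = 0, hi = keys.Length;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (string.CompareOrdinal(keys[mid], value) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```

Compared with Dictionary<string, T>, this trades O(1) exact lookup for O(log n) range queries, which fits the "multiple related results for a query" access pattern described above.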
Is there a way to get all the words that start with a specific string?
If so, the license is the GNU LGPL, instead of the v1 tri-license.
Need to get a contributing file set up; maybe a nice .md file documenting how things in the port map to the origin would help too.
It seems that while Dictionary works, it is not a very performant tool for this purpose. Explore other options for both building a word list and later searching it.
I am looking at a new major version bump where I plan to remove .NET 3.5 support and add in net standard 2.0 support. Net45 and netstandard1.3 support will remain. I am curious if anybody would be impacted by these changes. I think I am going to do it anyway, but it helps me understand better if it is worth spending energy on fixes on a legacy v2 branch.
Need to get an icon for the project so it can be easier to identify and stand out.
The code may have drifted away from the Hunspell origin... maybe not much due to the v2 work.
I really dislike the methods related to WordList.Query.CompoundCheckWordSearch
and would like to see them cleaned up while preserving existing performance numbers. The method used to be much larger, but warm methods were split up for performance reasons to reduce conditional checks and branching. The negative impact of this refactoring is that there are now multiple methods dangling out there. I would really like to see something that reduces the amount of code used to solve this problem, improving readability while maintaining the current performance. If no good solution can be found, refactoring these methods into a new private type may be beneficial.
CompoundCheckWordSearch
CompoundCheckWordSearchMultiDetailWithWords
CompoundCheckWordSearchMultiDetailScpdFlags
CompoundCheckWordSearchMultiDetail
CompoundCheckWordSearchCompoundOnlyDetailScpd
CompoundCheckWordSearchCompoundOnlyDetail
What I want is simple:
I would like to obtain all the words that can be composed from a given word.
E.g., given
make/UAGS
in the us.dic file, I want to obtain all the words that can be derived from this word/suffix combination,
e.g. the results would be: made, making, makes, etc.
After updating NBench, the tests are failing to load NHibernate to run the performance comparisons. Also, while I appreciate what NBench does, I want to try out BenchmarkDotNet (sp?) and see if it is easier to run repeatedly and whether it can also deliver consistent numbers.
Hello,
I like this library, it is very useful. But one small thing to ask, if I may: could the library be signed?
Thanks.
Hi there, is it possible to use this port with the Unity3D game engine? I tried to implement it, but I failed. I guess the problem is Mono 2.0/3.5 and the Unity Android platform.
Because this is a port of the original library, I don't want to get too creative with the algorithms. For suggest, especially, these brute-force nested loops can really hurt, as the x64 compilers don't put the same amount of up-front effort into optimizations. Maybe something like Levenshtein distance, which I don't even know if I spelled correctly, could be of use without changing the results?
If something can be improved there, the benefits could impact #33, #40, and #43 .
Learning resources from https://github.com/Turnerj/Quickenshtein#learning-resources
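One trick from those resources that applies here is a threshold-capped, two-row Levenshtein: suggest loops rarely care about distances beyond a small bound, so the computation can bail out as soon as no cell in a row can stay within it. A sketch, not the library's code:

```csharp
using System;

static class BoundedLevenshtein
{
    // Levenshtein distance with O(min-row) early exit and only two
    // rows of storage. Returns max + 1 as soon as the distance is
    // provably above `max`, so callers can cheaply reject candidates.
    public static int Distance(string a, string b, int max)
    {
        // Length difference alone is a lower bound on the distance.
        if (Math.Abs(a.Length - b.Length) > max) return max + 1;

        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.Min(rowMin, curr[j]);
            }
            // Distances only grow down the matrix, so once every cell
            // exceeds max, the final answer must too.
            if (rowMin > max) return max + 1;
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }
}
```

Because rejected candidates are rejected identically with or without the cap, this kind of pruning keeps the suggest results unchanged, which matches the constraint above.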
Hello
Here is my code (it's pretty basic):
var dictionaryFr = WordList.CreateFromFiles(ressources + "\\fr-toutesvariantes.dic", ressources + "\\fr-toutesvariantes.aff");
for (int i = 0; i < 10; i++)
{
List<string> suggestList = dictionaryFr.Suggest("Systemes").ToList();
System.Diagnostics.Debug.WriteLine(suggestList.Count);
}
And this is what I'm getting :
0
3
3
3
3
0
3
0
3
3
So sometimes I get suggestions, sometimes I don't 😢. Please help!
I'm using version 3.0.1 of the NuGet package, and my code runs on .NET Framework 4.5.
Because of the large gap between .NET 4.5 and .NET 6, I'm not sure yet if I want to continue supporting .NET versions that are out of support or that may soon be out of support. I'm doing some quick analysis of which versions should be supported by exploring GitHub and NuGet to see what is out there. Based on that, and my tolerance for adding in shims, I think I will be able to determine a new list of target frameworks for a future 4.x release containing possibly breaking changes. If anybody out there has any feedback, and is actually paying attention to this project, please let me know which framework versions you are still targeting.
Hello,
I'm using library for spellcheck and suggestions. Nothing fancy in initialization:
hunspell = WordList.CreateFromStreams(dictionaryStream, affixStream);
hunspell.Suggest(word);
It's running on net6.0: <TargetFrameworks>netstandard2.0;net6.0</TargetFrameworks>
Here are my affix and dictionary files: en.zip (I'm pretty sure I got them from open sources, so it's safe to share)
Sometimes I get a System.IndexOutOfRangeException when trying to call .Suggest().
I can't reproduce it locally with the inputs that produced the exception. It has happened in the production/stage system over multiple days: in the past 14 days and 400k calls, it has happened 7 times so far.
Here is a stack trace that I received:
[14:53:56.828](154) Exception: System.IndexOutOfRangeException: Index was outside the bounds of the array.
at WeCantSpell.Hunspell.WordList.QuerySuggest.LeftCommonSubstring(String s1, String s2)
at WeCantSpell.Hunspell.WordList.QuerySuggest.NGramSuggest(List`1 wlst, String word, CapitalizationType capType)
at WeCantSpell.Hunspell.WordList.QuerySuggest.Suggest(String word)
It happened for the following suggestion requests: nth, gua, gua, colo, leav, Jira, o
I've just updated to version 3.1.1 to see if it will help, and I have high hopes for 4.0. I'll report back later with my findings.
The method WordList.QuerySuggest.NGram, through the two methods it calls, NGramWeightedSearch and NGramNonWeightedSearch, does a series of brute-force substring checks that are pretty expensive. These two methods could likely benefit from some kind of better algorithm for these contains checks, even if extra allocations may be required. This will hopefully have a positive impact on the suggest performance, which is pretty bad at the moment.
We are using your tool in my enterprise, and automatically testing in a pool of computers.
We've found that the first search algorithm always fails on some computers, making our tests flaky. At first we thought it might be related to the workload on those machines, but tinkering with the time-limit parameters didn't change a thing. In the end, we realized it always fails on the same machines and works fine on the rest.
The only point in common we've found on those machines is that they all have Intel processors from the E5-26xx family (2640, 2650, 2680).
Can you shed some light about this issue, or share a thought?
Thanks.
There are a few places that could really use some updated C# code.
Hello. I have a text: "Hello, my name is Bob. How do you do?"
I use Split(' ') and call Check for all the words. But how can I ignore the commas, dots, and question marks in the Check method?
Maybe this library has properties for this? Or must I use regex?
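A minimal way to strip punctuation before checking, assuming whitespace splitting is sufficient (the library itself does not tokenize text), is to trim the punctuation characters from each token:

```csharp
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        var text = "Hello, my name is Bob. How do you do?";
        var words = text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Trim leading/trailing punctuation from each token.
            .Select(w => w.Trim(',', '.', '?', '!', ';', ':'))
            .Where(w => w.Length > 0);
        // Each surviving token could then be passed to WordList.Check.
        Console.WriteLine(string.Join("|", words));
        // → Hello|my|name|is|Bob|How|do|you|do
    }
}
```

Trimming only the edges preserves internal punctuation such as apostrophes in contractions, which Check generally needs to see.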
(For all I know, this is from hunspell proper, so maybe this feedback is in the wrong place.)
tl;dr: Libraries should be as pure as possible, because purity maximizes flexibility, composability, and testability, and greatly decreases the maintenance burden on the library author.
I would suggest an API like:
var checker = new HunspellDictionary(dictionaryWords /* ISet<string> */, affixContent /* string */);
bool notOk = checker.Check("teh");
var suggestions = checker.Suggest("teh");
bool ok = checker.Check("the");
As it is now, this library thinks in terms of filesystems for affix and dictionary files. But what if the data source is a database? Or a web service endpoint? This is especially relevant for people who might consume this in an ASP.NET Core web application.
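For what it's worth, the existing WordList.CreateFromStreams overload (used elsewhere in these issues) already decouples loading from the filesystem: any source that can produce a Stream works, including a database blob or an HTTP response body. A sketch using in-memory content; the tiny .dic/.aff strings are illustrative only:

```csharp
using System;
using System.IO;
using System.Text;
// using WeCantSpell.Hunspell;  // NuGet package assumed

class Program
{
    static void Main()
    {
        // Minimal dictionary: a word count line followed by one word.
        var dicBytes = Encoding.UTF8.GetBytes("1\nthe\n");
        // Minimal affix file: just the encoding declaration.
        var affBytes = Encoding.UTF8.GetBytes("SET UTF-8\n");

        using (var dic = new MemoryStream(dicBytes))
        using (var aff = new MemoryStream(affBytes))
        {
            var wordList = WordList.CreateFromStreams(dic, aff);
            Console.WriteLine(wordList.Check("the"));
        }
    }
}
```

So the filesystem coupling is a convenience layer rather than a hard requirement, though a fully in-memory constructor as proposed above would still be simpler for the database/web-service case.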
ISet.Comparer could be used for the GetHashCode and Equals implementations where it makes sense to do so.
A bunch of stuff goes away afterwards:
CulturedStringComparer
Utf16StringLineReader
HunspellLineReaderExtensions
IHunspellLineReader
DynamicEncodingLineReader
StaticEncodingLineReader
HunspellDictionary
System.IO dependencies
AffixReader gets considerably simpler. This sets you up to delete (or delegate to the framework) a bunch of other stuff:
CharacterSet -> HashSet<char>
ArrayWrapper, ArrayComparer -> use HashSet<T>
Deduper
EncodingEx
And avoids bugs around:
By maintaining purity, all of your operations are CPU-bound, so the need for async disappears. (Or rather, it shifts to the application developer, who may want to run it on a threadpool thread, but that would be their choice to make.)
If users want to construct a dictionary from another source or an affix from another source it should be much easier to do so. The API should be simplified to a point where one method call can be used to construct a dictionary from a set of words efficiently. There should also be some documentation to show this. From #5
Look to the tests for the implementation and throw something in the readme so others can use more than NetApp1.1 provides.
There are some public methods that are either terrible or not very useful and should be removed. While it's for the best, this would be a breaking change and would need to be planned carefully.