curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.

License: MIT License

C# 100.00%

nlp natural-language-processing artificial-intelligence embeddings csharp machine-learning natural-language-understanding ai

catalyst's Issues

Corpus?

Can you point to the Universal Dependencies data you used? Or, I'm guessing, will it be included in the Corpus project? I'm really excited to be able to try training.

Thanks
Dave G

Add a quick Dependency Parsing example to the readme.

Is your feature request related to a problem? Please describe.

I am having trouble figuring out how dependency parsing works. I found the AveragePerceptronDependencyParser and added it to the NLP pipeline after instantiating it with FromStoreAsync(Language.English, Version.Latest, "") but I don't know how to access its output. In particular, the DependencyType property on IToken looked promising, but always seemed to be the empty string.

Describe the solution you'd like

It would be nice to see a couple of quick examples for how to work with it such as extracting the root verb, subject, and object of a sentence.

Describe alternatives you've considered

Additional context

Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.Add(await AveragePerceptronDependencyParser.FromStoreAsync(Language.English, Version.Latest, ""));
nlp.ProcessSingle(doc);

Thanks for your work on what looks like a very promising library!

Entity Recognition models: Option to replace previous entities

How to use pretrained model from disk

Lazy loading with auto-download is a nice feature, but it takes too long when I want to quickly run a program that applies a Catalyst NLP feature to a small text. Is there any way to download the pretrained model in advance and load it up front in the program (ideally within about one second)?

Exception for some language in EntityRecognition sample

Bug description
When I change the language model in samples/EntityRecognition/Program.cs to one of
Croatian, Danish, Serbian, Swedish, Arabic or Indonesian,
I get this exception:

"HResult=0x80131500
Message=Error occurred while reading from the stream.
...
...
at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
at Catalyst.Models.AveragePerceptronEntityRecognizer.<FromStoreAsync>d__24.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Catalyst.Samples.EntityRecognition.Program.<AveragePerceptronEntityRecognizerAndPatternSpotterSample>d__1.MoveNext() in d:\dev\catalyst-master-2021-05-03\samples\EntityRecognition\Program.cs:line 49
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.GetResult()
at Catalyst.Samples.EntityRecognition.Program.<Main>d__0.MoveNext() in d:\dev\catalyst-master-2021-05-03\samples\EntityRecognition\Program.cs:line 35
Inner Exception 1:
NullReferenceException: Object reference not set to an instance of an object."

There is no such error for the following languages:
English, German, Spanish, Portuguese, Polish, Italian, French.

Steps to reproduce the behavior:

Go to this file:
https://github.com/curiosity-ai/catalyst/blob/master/samples/EntityRecognition/Program.cs
Change these lines from "English" to "Croatian":

//Initialize the English built-in models
Catalyst.Models.Croatian.Register();
...
//Create a new pipeline for the english language, and add the WikiNER model to it
Console.WriteLine("Loading models... This might take a bit longer the first time you run this sample, as the models have to be downloaded from the online repository");
var nlp = await Pipeline.ForAsync(Language.Croatian);
nlp.Add(await AveragePerceptronEntityRecognizer.FromStoreAsync(language: Language.Croatian, version: Version.Latest, tag: "WikiNER"));
...

I used these PackageReference entries in the sample project EntityRecognition:

<PackageReference Include="Catalyst" Version="1.0.16767" />  
<PackageReference Include="Catalyst.Models.English" Version="1.0.17127" />	<!--True-->	  	  
<PackageReference Include="Catalyst.Models.Croatian" Version="1.0.17127" />   <!--False-->

P.S.
Thank you for a very useful and wonderful C# NLP library (Catalyst).

Add Corpus project

Refactor tokenizer to support language-specific rules for infix / prefix

Cannot reproduce code execution of the LanguageDetection sample

Describe the bug
Referencing the latest NuGet package, Catalyst 1.0.16767, and essentially putting the code of the sample for Language Detection, throws an exception.
The exception: {"Unable to load one or more of the requested types.\r\nA ByRef-like type cannot be used as the type for an instance field in a non-ByRef-like type."}

my code is as follows:

Essentially the same as the sample, but I have tried the sample locally, updated the package to match, and it works...
I am also using .NET 5, the same as the sample.

The relevant part of the stack trace is:
at System.Reflection.RuntimeModule.GetTypes(RuntimeModule module)
at System.Reflection.RuntimeModule.GetTypes()
at System.Reflection.Assembly.GetTypes()
at Mosaik.Core.ObjectStore.<>c.b__28_1(Assembly a)
at System.Linq.Enumerable.SelectManySingleSelectorIterator`2.MoveNext()
at System.Linq.Set`1.UnionWith(IEnumerable`1 other)
at System.Linq.Enumerable.UnionIterator`1.FillSet()
at System.Linq.Enumerable.UnionIterator`1.ToArray()
at System.Linq.Enumerable.ToArray[TSource](IEnumerable`1 source)
at Mosaik.Core.ObjectStore.CreateDerivedFromModelDictionary()
at Mosaik.Core.ObjectStore.<>c.<.cctor>b__32_0()
at System.Lazy`1.ViaFactory(LazyThreadSafetyMode mode)
at System.Lazy`1.ExecutionAndPublication(LazyHelper executionAndPublication, Boolean useDefaultConstructor)
at System.Lazy`1.CreateValue()
at System.Lazy`1.get_Value()
at Mosaik.Core.ObjectStore.TryGetFormerNames(String name, String[]& formerNames)
at Catalyst.Models.FastText.d__42.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at Catalyst.Models.FastText.<FromStoreAsync_Internal>d__41.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Catalyst.Models.FastTextLanguageDetector.<FromStoreAsync>d__5.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()

Could it be an issue with getting the language files from the store? If so, where is it documented how to do this?

To Reproduce
Create a new project, add reference to Catalyst, paste the code and run.

Expected behavior
The code runs without errors and the detected language is returned.

Add example projects

Bug / Broken Build - StarSpace and FastText.cs ref Argument Errors

Hi,
Excited to try this project but when I pull the solution and build I get build errors in StarSpace.cs and FastText.cs.

Tried to sort output below.

Dave G

1>src\Models\Embeddings\StarSpace\StarSpace.cs(603,38,603,48): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(603,54,603,67): error CS1615: Argument 2 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(611,35,611,45): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(611,47,611,63): error CS1503: Argument 2: cannot convert from 'float' to 'float[]'
1>src\Models\Embeddings\StarSpace\StarSpace.cs(615,41,615,51): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(615,57,615,70): error CS1615: Argument 2 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(849,55,849,67): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(849,73,849,86): error CS1615: Argument 2 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(856,59,856,71): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(856,77,856,90): error CS1615: Argument 2 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(901,35,901,49): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(901,51,901,71): error CS1503: Argument 2: cannot convert from 'float' to 'float[]'

1>src\Models\Embeddings\StarSpace\StarSpace.cs(906,45,906,59): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(906,65,906,80): error CS1615: Argument 2 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(928,80,928,83): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(928,89,928,92): error CS1615: Argument 2 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(928,122,928,125): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\StarSpace\StarSpace.cs(928,131,928,134): error CS1615: Argument 2 may not be passed with the 'ref' keyword

FAST TEXT
1>src\Models\Embeddings\FastText\FastText.cs(935,35,935,49): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\FastText\FastText.cs(935,51,935,70): error CS1503: Argument 2: cannot convert from 'float' to 'float[]'
1>src\Models\Embeddings\FastText\FastText.cs(955,31,955,45): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\FastText\FastText.cs(955,47,955,66): error CS1503: Argument 2: cannot convert from 'float' to 'float[]'
1>src\Models\Embeddings\FastText\FastText.cs(978,31,978,45): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\FastText\FastText.cs(978,47,978,72): error CS1503: Argument 2: cannot convert from 'float' to 'float[]'

1>src\Models\Embeddings\FastText\FastText.cs(1195,38,1195,44): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\FastText\FastText.cs(1203,31,1203,37): error CS1615: Argument 1 may not be passed with the 'ref' keyword
1>src\Models\Embeddings\FastText\FastText.cs(1203,39,1203,40): error CS1503: Argument 2: cannot convert from 'float' to 'float[]'

WARNINGS...
1>src\Models\Special\HyphenatedWordsCapturer.cs(19,40,19,51): warning CS1998: This async method lacks 'await' operators and will run synchronously. Consider using the 'await' operator to await non-blocking API calls, or 'await Task.Run(...)' to do CPU-bound work on a background thread.
1>src\Models\Normalizer\UpperCaseNormalizer.cs(19,40,19,51): warning CS1998: This async method lacks 'await' operators and will run synchronously. Consider using the 'await' operator to await non-blocking API calls, or 'await Task.Run(...)' to do CPU-bound work on a background thread.
1>src\Models\Normalizer\RemovePunctuationNormalizer.cs(20,40,20,51): warning CS1998: This async method lacks 'await' operators and will run synchronously. Consider using the 'await' operator to await non-blocking API calls, or 'await Task.Run(...)' to do CPU-bound work on a background thread.
1>src\Models\Normalizer\NumberToWordNormalizer.cs(20,40,20,51): warning CS1998: This async method lacks 'await' operators and will run synchronously. Consider using the 'await' operator to await non-blocking API calls, or 'await Task.Run(...)' to do CPU-bound work on a background thread.
1>src\Models\Normalizer\LowerCaseNormalizer.cs(19,40,19,51): warning CS1998: This async method lacks 'await' operators and will run synchronously. Consider using the 'await' operator to await non-blocking API calls, or 'await Task.Run(...)' to do CPU-bound work on a background thread.
1>src\Base\CharSpanExtensions.cs(188,47,188,49): warning CS0168: The variable 'ex' is declared but never used
1>src\Models\Normalizer\FoldToAsciiNormalizer.cs(23,40,23,51): warning CS1998: This async method lacks 'await' operators and will run synchronously. Consider using the 'await' operator to await non-blocking API calls, or 'await Task.Run(...)' to do CPU-bound work on a background thread.
1>src\Models\Embeddings\StarSpace\StarSpace.cs(223,17,223,27): warning CS0219: The variable 'impatience' is assigned but its value is never used
1>src\Models\Embeddings\StarSpace\StarSpace.cs(224,19,224,33): warning CS0219: The variable 'best_valid_err' is assigned but its value is never used

How to store models locally?

On the readme it says
"When using the new model packages, you can usually remove this line from your code: Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));, or replace it with Storage.Current = new DiskStorage("catalyst-models") if you are storing your own models locally."

But it is not really clear how to download and locate the models. Is it enough to add references to the NuGet packages, or does something more need to be done?

Describe the solution you'd like
A clear and concise description of how to do this is pretty much needed if we want to have a solution that does not need internet connectivity.

Describe alternatives you've considered
Could it be possible to have this clarified? If so, I volunteer to write a sample for the samples section ;) - you know, it's good to give back! ;)
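
A minimal sketch of an offline setup, using only calls that appear elsewhere in these issues. It assumes the Catalyst and Catalyst.Models.English NuGet packages are referenced, and that calling Register() from the model package is enough to make the English models available without the online repository. The readme quote above suggests this, but it is an assumption here, not a confirmed answer.

```csharp
using System;
using System.Threading.Tasks;
using Catalyst;
using Mosaik.Core;

class Program
{
    static async Task Main()
    {
        // Register the models shipped inside the Catalyst.Models.English package.
        Catalyst.Models.English.Register();

        // Local-only storage: there is no OnlineRepositoryStorage wrapper here,
        // so (by assumption) no download is attempted.
        Storage.Current = new DiskStorage("catalyst-models");

        var nlp = await Pipeline.ForAsync(Language.English);
        var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
        nlp.ProcessSingle(doc);

        Console.WriteLine("Processed one document without the online repository.");
    }
}
```

According to the readme quote, the Storage.Current line can usually be dropped entirely when only the packaged models are used.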

Models and data not loading

Describe the bug
I have a WinForms .NET 5 application that uses Catalyst according to the documentation. However, when I use the code that automatically loads the data, nothing happens. The download seems to start (it creates some directories) but never downloads the data. I can wait for an hour in the debugger and the code doesn't return.

When running the samples from the repository (with the same code), the data is downloaded as expected.
I copied the "catalyst-models" folder into my solution and have it copied to the debug output; after that, loading with FastTextLanguageDetector.FromStoreAsync(Language.Any, Version.Latest, ""); works.
However, pipeline = Pipeline.For(Language.English); never returns.

To Reproduce
This is the code that produces the problem

        public void Init()
        {
            Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
            var t = FastTextLanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");
            languageDetector = t.WaitResult();
            
            pipeline = Pipeline.For(Language.English);
            initCalled = true;
        }

As this code is practically the same as in the samples, I really don't see the problem.
What I have is

A WinForms .NET 5.0 application
References a .NET 5.0 library project
The library project has the Catalyst nuget packages installed (1.0.16767)
The Init function above is called in the constructor of the "detector" class.
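
One possible explanation, which is an assumption and not confirmed in this thread, is a deadlock caused by blocking on asynchronous work from the WinForms UI thread: WaitResult() and the synchronous Pipeline.For both hold the thread that awaited continuations may need in order to resume. The sketch below avoids blocking altogether. The class name Detector and the method name InitAsync are hypothetical; the Catalyst calls are the ones already shown in these issues.

```csharp
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

public class Detector
{
    private FastTextLanguageDetector languageDetector;
    private Pipeline pipeline;

    public bool InitCalled { get; private set; }

    // Await this from an async event handler (for example the form's Load event)
    // instead of blocking inside a constructor.
    public async Task InitAsync()
    {
        Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));

        // Awaiting releases the UI thread while models are downloaded and loaded.
        languageDetector = await FastTextLanguageDetector.FromStoreAsync(Language.Any, Version.Latest, "");
        pipeline = await Pipeline.ForAsync(Language.English);

        InitCalled = true;
    }
}
```

If the hang persists with this shape, the blocking-call theory can be ruled out and the cause lies elsewhere.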

The output window shows the following log information from Catalyst

[14:56:16 INF] [LOAD] [FastTextLanguageDetectorData-"Any"-v0] (1 B) from '..\\Models\--\FastTextLanguageDetectorData\v000000\model-FastTextLanguageDetector-v000000.bin'
[14:56:16 INF] [LOAD] [FastTextData-Version-"Any"-v-1] (1 B) from '..\\Models\--\FastTextData-Version\v-000001\model-language-detector-v-000001.bin'
[14:56:16 INF] [LOAD] [FastTextData-Version-"Any"-v-1] (1 B) from '..\\Models\--\FastTextData-Version\v-000001\model-language-detector-v-000001.bin'
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\5.0.6\System.Runtime.CompilerServices.Unsafe.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.
[14:56:17 INF] [LOAD] [FastTextData-"Any"-v0] (15.4 MB) from '..\\Models\--\FastTextData\v000000\model-language-detector-v000000.bin'
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\5.0.6\System.Resources.Writer.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\5.0.6\System.Collections.NonGeneric.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App\5.0.6\System.Configuration.ConfigurationManager.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.
[14:56:19 INF] [B] Initializing Entries
[14:56:22 INF] [E] Initializing Entries in 2.8300 seconds at 413,653 oper/s, total of 1,170,682 operations
[14:56:22 INF] [LOAD] [SentenceDetectorModel-Version-"English"-v-1] (1 B) from '..\\Models\en\SentenceDetectorModel-Version\v-000001\model-v-000001.bin'
[14:56:22 INF] [LOAD] [SentenceDetectorModel-Version-"English"-v-1] (1 B) from '..\\Models\en\SentenceDetectorModel-Version\v-000001\model-v-000001.bin'
"GarbageDetection.exe" (CoreCLR: clrhost): Loaded "C:\Program Files\dotnet\shared\Microsoft.NETCore.App\5.0.6\System.Security.Cryptography.Csp.dll". Skipped loading symbols. Module is optimized and the debugger option "Just My Code" is enabled.

Adding Catalyst to a UWP app causes OutOfMemoryException compiling.

Just adding Catalyst to a UWP app (tried versions 17763, 18362) causes the error "Framework resource extraction failed. Exception of type 'System.OutOfMemoryException' was thrown." when compiling.

To Reproduce
Create a UWP app in Visual Studio 2019, add Catalyst through NuGet. Add the using, and possibly a use of it. Build for either x64 or x86.

Expected behavior
Compilation works.

Chinese support?

Can Catalyst support Chinese?

Transfer training code

In Language enum Luxembourgish is both 83 and 86

Language
Luxembourgish

Describe the bug
In Language enum Luxembourgish is both 83 and 86

83: Limburgan_Limburger_Limburgish,
86: Luxembourgish_Letzeburgesch,

Investigate porting newer CLD3 model

Google published a new language detection model that supersedes their former CLD2 model: https://github.com/google/cld3

Might be interesting as a lighter alternative to our current FastText model.

Pattern Spotter: Add min/max number of tokens for Multiple to match

Create examples using dotnetfiddle.net

Example: https://dotnetfiddle.net/qxW9eh

Unified Verb Index Support? (and Roadmap?)

Hi,
Is there any chance there is already "Unified Verb Index" support lurking or in the roadmap? If not, any chance it might make it into a roadmap or future release?

Thanks,
Dave G

Write documentation

Tutorials

Is Sentiment Analysis supported?

Is Sentiment Analysis supported?
Maybe there are some examples?

Add support for lemmatization

Non-capture patterns for Pattern Spotter

non-capture patterns for Pattern Spotter (i.e. must match, but don't capture as entity)

Sentence detection broken?

Describe the bug
Sentence detection only reacts to punctuation

To Reproduce
Use the sentence detector on unpunctuated sentences (not separated).

Expected behavior
Sentence detector should detect when a sentence ends in the absence of punctuation

Also note that when training sentence detector models from scratch using UD, the resulting models are extremely small: about 2 KB for English or German.

Mosaik.Core source missing?

When I click the source repository link for the NuGet package www.nuget.org/packages/Mosaik.Core/, which Catalyst depends on, the repository is missing. Did it move? I'm not sure what Mosaik.Core even does, but it's kind of a big problem if one of the dependencies for this project just dropped off the face of the earth. I tried using NuGet's feature to contact the package owner, but I got no response.

How to get embedding matrix from StarSpace

I would appreciate if you could give an example of the code required to use StarSpace, particularly when mapping a bag of words to a bag of tags, as originally described in:

https://github.com/facebookresearch/StarSpace#tagspace-word--tag-embeddings

Indeed, I'm having trouble when using StarSpace...

Suppose that I have a TXT file where each line contains a set of words that are semantically related to a tag (with the prefix "__label__"), as in:

decorate dress garnish adorn beautify embellish __label__decorate
knife cutlery cutter eat silverware butcher carve __label__knife
etc...

Here, one of the first questions is:

What should the input file format be? The default one in the original StarSpace is:

word1 word2 word3... [tab] __label__label1

Is this right? In my case, each line contains from 2 to 300 words and only one label.
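
For reference, the line layout quoted above from the original StarSpace can be read with a few lines of plain C#. This is only a sketch of the file format, not Catalyst's own loader: LabeledLine and Parse are hypothetical names, and whether Catalyst's StarSpace expects the same layout is not confirmed in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LabeledLine
{
    private const string LabelPrefix = "__label__";

    // Splits one line into its words and its labels (tokens starting with "__label__").
    public static (List<string> Words, List<string> Labels) Parse(string line)
    {
        var tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

        var words = tokens
            .Where(t => !t.StartsWith(LabelPrefix, StringComparison.Ordinal))
            .ToList();

        var labels = tokens
            .Where(t => t.StartsWith(LabelPrefix, StringComparison.Ordinal))
            .Select(t => t.Substring(LabelPrefix.Length))
            .ToList();

        return (words, labels);
    }

    public static void Main()
    {
        var (words, labels) = Parse("knife cutlery cutter eat silverware butcher carve\t__label__knife");
        Console.WriteLine($"{words.Count} words, label: {string.Join(",", labels)}");
        // prints "7 words, label: knife"
    }
}
```

The parser accepts either a tab or a space before the label, so it works for both variants shown in the question.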

With respect to the code, the goal is to get the label-embedding matrix generated from the input file, i.e. we should be able to get the vector corresponding to each label. As we work with unigrams and we expect to have vectors of 100 dimensions, the initial code could be as follows:

        languages.registerLanguage("English");
        Pipeline nlp = await Pipeline.ForAsync(languages.English);

        IEnumerable<IDocument> docs = GetDocsFromSingleFile(file); //this method converts each line of the file into an IDocument object
        IEnumerable<IDocument> parsed = nlp.Process(docs);

        StarSpace ss = new StarSpace(languages.lang, 0, "starspace-model", StarSpace.ModelType.TagSpace);
        ss.Data.TrainWordEmbeddings = true;
        ss.Data.Dimensions = 100;
        ss.Data.WordNGrams = 1;
        ss.Data.InputType = "LabeledDocuments";
        ss.Train(parsed);
        ...

Is this code right? I have just tried it, and an error is raised while training the model:

Exception thrown:
System.ArgumentOutOfRangeException: 'Specified argument was out of the range of valid values.'
Call Stack:
Mosaik.Core.dll!Mosaik.Core.ThreadSafeFastRandom.ThrowMaxValueOutOfRange()

By the way, how can I get the label-embedding matrix after training the model?

Thank you, and congratulations on your work.

Training a pre-trained NER model with new data

Hi,

I was trying to re-train a custom trained NER model. But while running the function to train the loaded model, I get "null reference exception" as the dictionary, MapEntityTypeToTag (in AveragePerceptronEntityRecognizer class) is null. This dictionary seems to be only initialised when you train a fresh model. Is there any other way to load a trained model and then retrain it using new samples?

To Reproduce

Storage.Current = new DiskStorage(<Path to directory where NER model was saved during training>);
var documents = new List<IDocument>();
documents.AddRange(ReadFile(<path to file containing new data>));

var model = await AveragePerceptronEntityRecognizer.FromStoreAsync(language: language, version: version, tag: <ModelTag>);
model.Train(documents);  //getting null reference exception here
await model.StoreAsync();

Additional context
I was able to train a fresh model and then test it on some sample data. But loading the trained model and re-training it with a new dataset (or even with the same dataset used for the original training) causes the error.

Trained data model as the Stanford NLP .Net

Hi,

Currently, the trained models do not recognise organisations properly, for example in the following text:
"Centre for Dermatology Research, Manchester Academic Health Science Centre and NIHR Manchester Biomedical Research Centre, University of Manchester, Manchester, UK"

When running the entity recognition model with Catalyst, "Centre for Dermatology Research" is not recognised as an organisation, and the same happens for many others.

If we pass the same full text to the Stanford NLP .NET demo site:
http://corenlp.run/
the organisations are recognised correctly, along with city, state and country.

If Catalyst could ship a default trained model comparable to the one in Stanford NLP .NET, it would be a great feature for the Catalyst NER project.

Currently Stanford NLP .NET does not support .NET Core, and there is no plan for it, at least not as of now.

I would definitely add that I am literally in love with the Catalyst project; it's so simple and easy to run.
I wish I had some way to train on my own data, just like with Stanford NLP .NET.

Keep up the good work; it really helps people like me who are noobs in machine learning programming.

Cheers,
Syd

Help with AveragePerceptronEntityRecognizer for Danish

Hi,

Maybe I'm just a noob, but I can't figure out how to get or create a model-WikiNER-v000000.binz for the Danish AveragePerceptronEntityRecognizer.

Can you help with this? Thank you! :-)

Reorganize language-specific code into partial classes

Regex timeout exception on processing document text

Describe the bug
Processing a slightly larger document text throws a RegexTimeoutException.

To Reproduce
Extract the text from the attached document. Run a process pipeline against the extracted text.

Expected behavior
The text should be processed correctly and no exception should be thrown.

Screenshots

Additional context
Document used as a source (extracted with Apache Tika) is attached.
Also linked is the extracted text.
Cillian_Murphy.pdf
https://paste.ee/p/trAmx

Edit: the bug is not a showstopper. It seems that it's internally handled somewhere, but I still wonder if it's normal behavior.

.NET Standard 2.0 is actually not supported

The readme claims support for .NET Standard 2.0.

Package Catalyst 1.0.16767 is not compatible with netstandard2.0 (.NETStandard,Version=v2.0). Package Catalyst 1.0.16767 supports:

net5.0 (.NETCoreApp,Version=v5.0)
netcoreapp3.1 (.NETCoreApp,Version=v3.1)
netstandard2.1 (.NETStandard,Version=v2.1)

Setup models repository & auto-download

Transfer test code

Setup tests on Azure DevOps

PartOfSpeech long names ?

Hi, congratulations on the incredible work.
I have a very dumb but blocking question: where can I find the long names of the PartOfSpeech values?
I don't understand what values like X, ADP, SCONJ, INTJ and so on mean.

Thanks
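
The values mentioned (X, ADP, SCONJ, INTJ) match the Universal Dependencies universal part-of-speech tag set. Assuming Catalyst's PartOfSpeech enum follows that set (an assumption, since the enum itself is not shown in this thread), a small lookup table gives the long names. PosNames and Describe are hypothetical helper names, and the table uses plain strings so it does not depend on Catalyst.

```csharp
using System;
using System.Collections.Generic;

public static class PosNames
{
    // Long names of the 17 Universal Dependencies universal POS tags.
    private static readonly Dictionary<string, string> LongNames = new Dictionary<string, string>
    {
        ["ADJ"] = "adjective",
        ["ADP"] = "adposition",
        ["ADV"] = "adverb",
        ["AUX"] = "auxiliary",
        ["CCONJ"] = "coordinating conjunction",
        ["DET"] = "determiner",
        ["INTJ"] = "interjection",
        ["NOUN"] = "noun",
        ["NUM"] = "numeral",
        ["PART"] = "particle",
        ["PRON"] = "pronoun",
        ["PROPN"] = "proper noun",
        ["PUNCT"] = "punctuation",
        ["SCONJ"] = "subordinating conjunction",
        ["SYM"] = "symbol",
        ["VERB"] = "verb",
        ["X"] = "other",
    };

    // Returns the long name for a tag, or "unknown tag" if it is not in the table.
    public static string Describe(string tag) =>
        LongNames.TryGetValue(tag, out var name) ? name : "unknown tag";

    public static void Main()
    {
        foreach (var tag in new[] { "X", "ADP", "SCONJ", "INTJ" })
        {
            Console.WriteLine($"{tag}: {Describe(tag)}");
        }
    }
}
```

With a Catalyst token, something like Describe(token.POS.ToString()) should work under the same assumption about the enum's member names.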

Repo/Project for NER Models?

Can you please post repo that shows how you are training NER? It seems like you are using WikiNER data... but are you using anything else? Will you please share a repo/add project that duplicates how the "included" models are built?

The single biggest holdup for me in committing to using Catalyst going forward is the ability to see how you are doing NER training for the models that are included.

Code support for Pattern Matching matches post-processing

"Collection was modified; enumeration operation may not execute" thrown by await FastTextLanguageDetector.FromStoreAsync in .NET Core 3.1

The following code throws an InvalidOperationException with the message "Collection was modified; enumeration operation may not execute." on the line that calls await FastTextLanguageDetector.FromStoreAsync when the application targets .NET Core 3.1.

However, it works fine when targeting .NET 5!

using System;
using System.IO;
using System.Threading.Tasks;
using Catalyst;
using Catalyst.Models;
using Mosaik.Core;
using Version = Mosaik.Core.Version;

namespace CatalystSimilarityExample
{
    class Program
    {
        static async Task Main()
        {
            const string modelFolderName = "catalyst-models";
            Storage.Current = new OnlineRepositoryStorage(new DiskStorage(modelFolderName));
            var languageDetector = await FastTextLanguageDetector.FromStoreAsync(
                Language.Any,
                Version.Latest,
                ""
            );
        }
    }
}

The stack trace shows this:

at System.ThrowHelper.ThrowInvalidOperationException_InvalidOperation_EnumFailedVersion()
at System.Collections.Generic.Dictionary`2.KeyCollection.Enumerator.MoveNext()
at Catalyst.Models.FastText.CompactSupervisedModel()
at Catalyst.Models.FastTextLanguageDetector.<FromStoreAsync>d__5.MoveNext()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at CatalystSimilarityExample.Program.<Main>d__0.MoveNext() in C:\\Users\\Dan\\source\\repos\\ParallelLinqExample\\CatalystSimilarityExample\\Program.cs:line 17
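
This is the exception Dictionary<TKey, TValue> throws when the dictionary is modified while one of its enumerators is still running. Whether that is what happens inside FastText.CompactSupervisedModel is an assumption based only on the stack trace above. The set of dictionary operations that invalidate a running enumerator has been relaxed across .NET releases, which might explain why .NET 5 behaves differently. A pattern that is safe on every runtime is to enumerate a snapshot of the keys, as in this sketch (SafeUpdate and DoubleAll are hypothetical names):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SafeUpdate
{
    // Doubles every value. The keys are copied to a list first, so the dictionary
    // is never modified while one of its own enumerators is active.
    public static void DoubleAll(Dictionary<string, int> counts)
    {
        foreach (var key in counts.Keys.ToList())
        {
            counts[key] = counts[key] * 2;
        }
    }

    public static void Main()
    {
        var counts = new Dictionary<string, int> { ["a"] = 1, ["b"] = 2 };
        DoubleAll(counts);
        Console.WriteLine($"a={counts["a"]}, b={counts["b"]}"); // prints "a=2, b=4"
    }
}
```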

Catalyst.Training Details Request: OntoText & UD Version

Hi,
I'm trying to add the closest match UD resources and Ontonotes resources to run WikiNERTraining.

Can you point me to which US English UD files you are using? Is it UD_English-EWT?

And which Ontonotes data? Is it CoNLL-formatted and/or version 5.0 (like https://github.com/ontonotes/conll-formatted-ontonotes-5.0/tree/master/conll-formatted-ontonotes-5.0/data )?

Thanks!

Help needed in the pattern composition

mp => mp.Add(
    new PatternUnit(P.Single().WithTokens(quadriTokens)),
    new PatternUnit(P.Single().WithLength(1, 2).HasNumeric()),
    new PatternUnit(P.SingleOptional().WithTokens(colTokens)),
    new PatternUnit(P.SingleOptional().WithLength(1, 2).HasNumeric())
)

I would like to be able to detect this kind of pattern:
a letter (A, B, C, D, E, F, G, H)
1 or 2 numeric digits
optionally "col" or "col."
optionally 1 or 2 numeric digits.
Examples:
"A01", "A01 col. 1", "A01 col.1" or "A01 col 1"
I'm able to make it work when there is only the first PatternUnit, so "A",
but I'm not able to make it recognize "A01", even after removing the optional units.
Where is my mistake?
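
One possible cause, which is a guess and not verified against Catalyst's tokenizer, is that "A01" is kept as a single token, so a letter unit followed by a separate numeric unit never lines up. As an alternative that does not depend on tokenization, the same shape can be matched on the raw text with a regular expression. This is plain System.Text.RegularExpressions, not the Pattern Spotter API, and CodeSpotter and Find are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CodeSpotter
{
    // A letter A-H, 1 or 2 digits, then optionally "col" or "col."
    // followed by an optional 1 or 2 digit number.
    private static readonly Regex Pattern = new Regex(
        @"\b[A-H]\d{1,2}(?!\d)(?:\s+col\b\.?(?:\s*\d{1,2}(?!\d))?)?",
        RegexOptions.Compiled);

    // Returns every matching span of the input text, in order of appearance.
    public static List<string> Find(string text) =>
        Pattern.Matches(text).Cast<Match>().Select(m => m.Value).ToList();

    public static void Main()
    {
        foreach (var hit in Find("See A01, A01 col. 1, A01 col.1 and A01 col 1."))
        {
            Console.WriteLine(hit);
        }
    }
}
```

The (?!\d) lookaheads stop the pattern from matching the first two digits of a longer number such as "A012".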

Regex Support in Pattern.

Hi,

Is there any way I can add a regex to get entities (in PatternUnit or Spotter)? I've tested it with WithTokens, but apparently it only works with strings and I haven't found another way that would let me work with regex.

Maybe something like:

new PatternUnit(Catalyst.PatternUnitPrototype.Single().WithRegex(regex))

Or is there some abstraction that would let me supply my own function that returns a bool?

Thanks,
Cheers!

Is it possible to implement a skills extractor with catalyst using Named Entity Recognition?

Is your feature request related to a problem? Please describe.
Basically I am looking to build a service that can extract skills from a job ad.

Describe the solution you'd like
A sample text would be: "We are looking for backend developers who are proficient in C#"
and I'd like to extract "backend" and "C#" from it.

Describe alternatives you've considered
I got parts of it working with spaCy in python, but I'd like a .NET implementation.

German POS tagging marks alphabet letters as NOUN

In more than 20% of cases, the POS tagger for German marks single alphabet letters as NOUN in Twitter text.
In comparison, CoreNLP does not make this mistake.
I am using the online trained models:
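using System.Collections.Generic;
using System.Linq;
using Catalyst;
using Mosaik.Core;

public class CatalystAnalyzer
{
    // Assumed declaration (not shown in the original snippet): one cached pipeline per language.
    private readonly Dictionary<Language, Pipeline> nlpSet = new Dictionary<Language, Pipeline>();

    public CatalystAnalyzer()
    {
        // Download models on demand and cache them on disk.
        Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
    }

    public List<string> GetNouns(string text, string language)
    {
        // Map the language name to the Catalyst language enum.
        Language lang;
        switch (language.ToLower())
        {
            case "german":
                lang = Language.German;
                break;
            case "english":
                lang = Language.English;
                break;
            case "french":
                lang = Language.French;
                break;
            case "spanish":
                lang = Language.Spanish;
                break;
            default:
                lang = Language.Any;
                break;
        }

        // Reuse the cached pipeline for this language, or create and cache it.
        if (!nlpSet.TryGetValue(lang, out var nlp))
        {
            nlp = Pipeline.For(lang);
            nlpSet.Add(lang, nlp);
        }

        var doc = new Catalyst.Document(text, lang);
        nlp.ProcessSingle(doc);

        // Keep only the tokens tagged as NOUN.
        var tokens = doc.Spans.SelectMany(s => s.Tokens);
        var stopwords = new NLPToolsLib.StopWords();
        var aspects = tokens.Where(s => s.POS == PartOfSpeech.NOUN).Select(s => s.Value).ToList();
        List<string> result = new List<string>();

        // Drop stop words before returning the nouns.
        foreach (var aspect in aspects)
        {
            if (!stopwords.isStopWord(aspect, LanguageDetection.GetLangIsoCode(language)))
                result.Add(aspect);
        }
        return result;
    }
}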

Allow loading models from a given stream

Hi !

Right now, loading models for the various algorithms is handled via the IStorage interface, which handles versioning etc.
However, a common case, at least for us, is that we already have a storage abstraction in place that cannot be used to implement the full IStorage interface (so bridging is not an option).

For read-only use, would it be possible to add an option to read models directly from a stream? Versioning etc. would then be handled on our side.

How to create our own model?

Is your feature request related to a problem? Please describe.
My enterprise is considering using your great library to analyze texts. We're talking about a care home environment, just to clarify. So I was wondering (and can't see it documented anywhere) how we could create new types of tags, for instance "meds", or extend existing ones, for example adding "room", "toilet", and so on to locations.

Describe the solution you'd like
An explanation on how to create and expand the tagging dicts.

After carefully reading the closely related issue 45, I take away a few points:

No model retraining; it's better to increase the dataset and retrain from scratch.
How to train a model (code here: https://github.com/curiosity-ai/catalyst/blob/master/Catalyst.Training/src/TrainWikiNER.cs )
Models are trained in different ways.

So what I am actually asking for is a general guide to training a model: which method to use, where to get datasets, and how to store them locally or create a NuGet package (OK, that last part is probably out of scope for Catalyst).

Add support for TensorFlow models using TensorFlow.Net

add similar support as pytorch-transformers, i.e.:

BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.
GPT (from OpenAI) released with the paper Improving Language Understanding by Generative Pre-Training by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever.
GPT-2 (from OpenAI) released with the paper Language Models are Unsupervised Multitask Learners by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
Transformer-XL (from Google/CMU) released with the paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context by Zihang Dai*, Zhilin Yang*, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov.
XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le.
XLM (from Facebook) released together with the paper Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau.

Benchmarking Information

Can you please add some more information comparing Catalyst with spaCy?
For example, is the accuracy the same as spaCy v2?
