winkjs / wink-eng-lite-model

English lite language model for wink-nlp.

License: MIT License

nlp natural-language-processing model english winkjs winknlp tokenization sentence-boundary-detection negation-handling pos-tagging

wink-eng-lite-model's Introduction

wink-eng-lite-model

English lite language model for winkNLP

This is a pre-trained English language model for the winkjs NLP package — winkNLP. The lite model package has a size of ~890KB, which expands to about 2.4MB after installation. It is an open-source language model, released under the MIT license.

It contains models for the following NLP tasks:

Tokenization
Token's Feature Extraction
Sentence Boundary Detection
Negation Handling
POS tagging
Automatic mapping of British spellings to American
Named Entity Recognition
Sentiment Analysis
Custom Entities Definition
Stemming using Porter Stemmer Algorithm V2
Lemmatization
Readability statistics computation

Getting Started

Installation

The model must be installed along with wink-nlp:

# Install wink-nlp
npm install wink-nlp --save
# Install wink-eng-lite-model
node -e "require( 'wink-nlp/models/install' )" wink-eng-lite-model

Example

We start by requiring the wink-nlp package and the wink-eng-lite-model. Then we instantiate wink-nlp using the language model:

// Load "wink-nlp" package.
const winkNLP = require('wink-nlp');
// Load the English language model (lite version).
const model = require('wink-eng-lite-model');
// Instantiate wink-nlp.
const nlp = winkNLP(model);

// Code for Hello World!
var text = 'Hello   World!';
var doc = nlp.readDoc(text);
console.log(doc.out());
// -> Hello   World!

Documentation

Learn how to use this model with winkNLP from the following resources:

Overview — introduction to winkNLP.
Concepts — everything you need to know to get started.
API Reference — explains usage of APIs with examples.
Release history — version history along with the details of breaking changes, if any.

About model

Performance

winkNLP processes raw text at over 525,000 tokens per second with this model, when benchmarked using "Ch 13 of Ulysses by James Joyce" on a 2.2 GHz Intel Core i7 machine with 16GB RAM. The benchmark covered the entire NLP pipeline: tokenization, sentence boundary detection, negation handling, sentiment analysis, part-of-speech tagging, and named entity extraction.
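Throughput figures like this depend heavily on the hardware and the input text. The sketch below shows one way to measure tokens per second on your own machine. `measureThroughput` and the whitespace tokenizer are hypothetical stand-ins written for illustration; they are not part of winkNLP and this is not the benchmark script used for the figure above.

```javascript
// Hypothetical helper: runs `processFn` over `text` several times and
// reports tokens per second. `processFn` must return the number of
// tokens it processed.
function measureThroughput(processFn, text, runs = 5) {
  let totalTokens = 0;
  const start = process.hrtime.bigint();
  for (let i = 0; i < runs; i += 1) {
    totalTokens += processFn(text);
  }
  const elapsedNs = Number(process.hrtime.bigint() - start);
  // Guard against a zero reading on very fast runs.
  const elapsedSeconds = Math.max(elapsedNs, 1) / 1e9;
  return { totalTokens, elapsedSeconds, tokensPerSecond: totalTokens / elapsedSeconds };
}

// Stand-in tokenizer (an assumption for illustration): splits on whitespace.
const whitespaceTokenCount = (text) => text.split(/\s+/).filter(Boolean).length;

const sample = 'The quick brown fox jumps over the lazy dog. '.repeat(1000);
const result = measureThroughput(whitespaceTokenCount, sample);
console.log(Math.round(result.tokensPerSecond), 'tokens/second');
```

To time winkNLP itself, the stand-in could be swapped for something like `(text) => nlp.readDoc(text).tokens().length()`, assuming `nlp` was instantiated as in the example earlier in this README.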

Tokenization

While the model is trained to process English text, it can also tokenize text containing other languages such as Hindi, French and German. Such tokens are tagged as X (foreign word) during POS tagging.

POS Tagging

The model follows the Universal POS tag standard. It delivers an accuracy of ~94.7% on a subset of the WSJ corpus, which includes tokenization of raw text prior to POS tagging.

Named Entity Recognition (NER)

The model is trained to detect CARDINAL, DATE, DURATION, EMAIL, EMOJI, EMOTICON, HASHTAG, MENTION, MONEY, ORDINAL, PERCENT, TIME, and URL.

Sentiment Analysis

It delivers an F-score of ~84.5% when validated using the Amazon Product Review Sentiment Labelled Sentences Data Set from the UCI Machine Learning Repository.
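For readers unfamiliar with the metric, here is a minimal sketch of how an F-score is computed from classification counts. It assumes the figure refers to the balanced F1 score, which this README does not state. The counts are hypothetical and chosen only for illustration; they are not the model's validation results.

```javascript
// Computes precision, recall and F1 from confusion counts for one class.
// Assumes at least one true positive, so no denominator is zero.
function fScore({ truePositives, falsePositives, falseNegatives }) {
  const precision = truePositives / (truePositives + falsePositives);
  const recall = truePositives / (truePositives + falseNegatives);
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

// Hypothetical counts, not taken from the model's evaluation.
const scores = fScore({ truePositives: 80, falsePositives: 20, falseNegatives: 10 });
console.log(scores.f1.toFixed(4));
// -> 0.8421
```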

Storage Structure

The model is contained in the standard NPM tarball format. You can find it under the latest release. The model is stored in the form of trained data in JSON and binary formats. Apart from the data, there is a tiny fraction of JS glue code, which is primarily used during model loading.

Need Help?

If you spot a bug and the same has not yet been reported, raise a new issue.

About wink

Wink is a family of open source packages for Natural Language Processing, Machine Learning and Statistical Analysis in NodeJS. The code is thoroughly documented for easy human comprehension and has a test coverage of ~100% for reliability to build production grade solutions.

Copyright & License

It is licensed under the terms of the MIT License.

wink-eng-lite-model's Issues

NER for fractions

The entity name patterns used to recognize numeric entities appear to be lacking: I can't recognize fractions, and decimals also don't seem to be picked up correctly. For example, to extract the parts of a sentence that indicate ingredients, I'm using this custom pattern:

const patterns = [
  { name: 'number', patterns: [ '[|PERCENT] [|CARDINAL] [|ORDINAL]' ] },
  { name: 'measurement', patterns: [ 'cup' ] },
  { name: 'food', patterns: [ 'butter' ] },
  { name: 'adjective', patterns: [ 'ADJ' ] },
  { name: 'adverbPhrase', patterns: [ '[|ADV] [|VERB]' ] }
];

Then I try these strings:

"1/2 cup yellow butter, softened." ---> 1/2 was not picked up:

  [Object: null prototype] { value: 'cup', type: 'measurement' },
  [Object: null prototype] { value: 'yellow', type: 'adjective' },
  [Object: null prototype] { value: 'butter', type: 'food' },
  [Object: null prototype] { value: 'softened', type: 'adverbPhrase' }

".5 cup yellow butter, softened." ---> .5 was changed to 5
[Object: null prototype] { value: '5', type: 'number' },

But 0.5 worked, and "one half" also worked.

I expected NUM to be defined, but apparently you are not using that one. https://universaldependencies.org/u/pos/all.html#al-u-pos/NUM
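Since "0.5" is reported to work, one possible workaround is to normalize such numbers before the text reaches `nlp.readDoc()`. The sketch below is a preprocessing step only and does not change how the model recognizes entities. `normalizeNumbers` is a hypothetical helper, and it deliberately leaves date-like strings such as "1/2/2020" untouched. Mixed numbers such as "1 1/2" are not handled, and fractions like "1/3" become long decimals.

```javascript
// Rewrites simple fractions ("1/2") and leading-dot decimals (".5") into
// plain decimals ("0.5") so they match the form reported to work.
function normalizeNumbers(text) {
  return text
    // "1/2" -> "0.5"; skips division by zero and date-like "1/2/2020".
    .replace(/(?<![\d/.])(\d+)\/(\d+)(?![\d/]|\.\d)/g, (match, num, den) => {
      const denominator = Number(den);
      return denominator === 0 ? match : String(Number(num) / denominator);
    })
    // ".5" -> "0.5" when the dot is not preceded by a word character or dot.
    .replace(/(?<![\w.])\.(\d+)/g, '0.$1');
}

console.log(normalizeNumbers('1/2 cup yellow butter, softened.'));
// -> 0.5 cup yellow butter, softened.
console.log(normalizeNumbers('.5 cup yellow butter, softened.'));
// -> 0.5 cup yellow butter, softened.
```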

npm package wink-eng-lite-model

The wink-nlp install script does not work with pnpm monorepos...

❯ node -e "require( 'wink-nlp/models/install' )" wink-eng-lite-model

npm uninstall https://github.com/winkjs/wink-eng-lite-model/releases/download/1.3.1/wink-eng-lite-model-1.3.1.tgz
npm ERR! code EUNSUPPORTEDPROTOCOL
npm ERR! Unsupported URL Type "workspace:": workspace:*

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/brian/.npm/_logs/2022-08-11T14_46_21_368Z-debug-0.log
node:child_process:926
    throw err;
    ^

Error: Command failed: npm uninstall https://github.com/winkjs/wink-eng-lite-model/releases/download/1.3.1/wink-eng-lite-model-1.3.1.tgz
    at checkExecSyncError (node:child_process:851:11)
    at Object.execSync (node:child_process:923:15)
    at Object.<anonymous> (/project/node_modules/.pnpm/[email protected]/node_modules/wink-nlp/models/install.js:53:14)
    at Module._compile (node:internal/modules/cjs/loader:1120:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1174:10)
    at Module.load (node:internal/modules/cjs/loader:998:32)
    at Module._load (node:internal/modules/cjs/loader:839:12)
    at Module.require (node:internal/modules/cjs/loader:1022:19)
    at require (node:internal/modules/cjs/helpers:102:18) {
  status: 1,
  signal: null,
  output: [ null, null, null ],
  pid: 882155,
  stdout: null,
  stderr: null
}

Node.js v18.7.0

Distributing the wink-eng-lite-model as an npm package should fix this issue.
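Until then, a possible workaround is to skip the install script and declare the release tarball directly as a dependency. The sketch below builds such an entry from the URL pattern visible in the log above. `modelTarballUrl` is a hypothetical helper, and it is an assumption that other versions follow the same pattern and that pnpm resolves the URL cleanly in a given workspace; neither has been verified here.

```javascript
// Builds the release tarball URL, following the pattern seen in the log above.
// Assumption: every release follows this naming scheme.
function modelTarballUrl(version) {
  const base = 'https://github.com/winkjs/wink-eng-lite-model/releases/download';
  return `${base}/${version}/wink-eng-lite-model-${version}.tgz`;
}

// Dependency entry that could be merged by hand into package.json.
const dependencyEntry = { 'wink-eng-lite-model': modelTarballUrl('1.3.1') };
console.log(JSON.stringify(dependencyEntry, null, 2));
```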

A certain non-natural language string makes readDoc freeze

Hi.
Thanks for the nice tool!

Please consider the code below. It freezes on my machine at the nlp.readDoc call.

const winkNLP = require('wink-nlp')
const model = require('wink-eng-lite-model')
const nlp = winkNLP(model)
const content = '138375720109463900845220131105025504431resources094639008452'
nlp.readDoc(content)

I use Node.js v16 and:

    "wink-eng-lite-model": "https://github.com/winkjs/wink-eng-lite-model/releases/download/1.3.0/wink-eng-lite-model-1.3.0.tgz",
    "wink-nlp": "^1.7.1"
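The cause of the freeze is not established in this report. As a defensive measure, an application could screen input for chunks that resemble the one above before calling `nlp.readDoc()`. The sketch below is only a heuristic: `findSuspiciousChunks` and its length threshold are hypothetical, and whether skipping such chunks actually avoids the freeze is an untested assumption.

```javascript
// Returns whitespace-separated chunks that are long and mix digits with
// letters, like the string in the report above.
function findSuspiciousChunks(text, maxLength = 30) {
  return text
    .split(/\s+/)
    .filter((chunk) => chunk.length > maxLength && /\d/.test(chunk) && /[a-z]/i.test(chunk));
}

const reported = '138375720109463900845220131105025504431resources094639008452';
console.log(findSuspiciousChunks(reported).length);
// -> 1
console.log(findSuspiciousChunks('Thanks for the nice tool!').length);
// -> 0
```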

Wikitext headings are considered part of the sentence

In Wikitext, headings are marked up between multiple = characters and are separated from the text by new lines. When breaking the text into sentences, wink-nlp doesn't treat the heading as a separate sentence.

const text = `He spoke of a five-year freeze in domestic spending, eliminating 
tax breaks for oil companies and reversing tax cuts for the wealthiest Americans, 
banning congressional earmarks, and reducing healthcare costs. He promised the 
United States would have one million electric vehicles on the road by 2015 and 
be 80% reliant on \"clean\" electricity.\n\n\n==== LGBT rights ====\nOn October 
8, 2009, Obama signed the Matthew Shepard and James Byrd Jr. Hate Crimes 
Prevention Act, a measure that expanded the 1969 United States federal hate-crime 
law to include crimes motivated by a victim's actual or perceived gender, sexual 
orientation, gender identity, or disability.On October 30, 2009, Obama lifted the 
ban on travel to the United States by those infected with HIV, which was celebrated 
by Immigration Equality.On December 22, 2010, Obama signed the Don't Ask, Don't 
Tell Repeal Act of 2010, which fulfilled a key promise made in the 2008 
presidential campaign to end the Don't ask, don't tell policy of 1993 that had 
prevented gay and lesbian people from serving openly in the United States Armed 
Forces. In 2016, the Pentagon also ended the policy that barred transgender 
people from serving openly in the military.`;
const doc = nlp.readDoc( text );

console.log( doc.sentences().itemAt(2).out() );

The output for this was:

==== LGBT rights ====
On October 8, 2009, Obama signed the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act, a measure that expanded the 1969 United States federal hate-crime law to include crimes motivated by a victim's actual or perceived gender, sexual orientation, gender identity, or disability.

The expected outcome would be that ==== LGBT rights ==== and the rest of the text are in two separate sentences. This might be too specific a use case to actually solve for.
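One way to work around this outside the library is to split the Wikitext into heading and body segments first, then pass only the body segments to `nlp.readDoc()`. The sketch below is a preprocessing helper, not something winkNLP provides. `splitWikitext` is a hypothetical name, and the heading pattern covers only the simple `== Title ==` form with matching runs of `=`.

```javascript
// Splits Wikitext into heading and text segments.
// A heading is a whole line of the form "== Title ==" with matching "=" runs.
function splitWikitext(text) {
  const headingPattern = /^(={1,6})\s*(.+?)\s*\1\s*$/;
  const segments = [];
  let body = [];
  // Emits the accumulated body lines as one text segment, if non-empty.
  const flushBody = () => {
    const value = body.join('\n').trim();
    if (value) segments.push({ type: 'text', value });
    body = [];
  };
  for (const line of text.split('\n')) {
    const match = line.match(headingPattern);
    if (match) {
      flushBody();
      segments.push({ type: 'heading', level: match[1].length, value: match[2] });
    } else {
      body.push(line);
    }
  }
  flushBody();
  return segments;
}

const wikitext = 'First paragraph.\n\n\n==== LGBT rights ====\nOn October 8, 2009, Obama signed the act.';
console.log(splitWikitext(wikitext));
```

Each segment of type `text` can then be read as its own document, so a heading never ends up inside a sentence.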

Possible to use in browser?

I realize this package is meant for Node.js, but I was wondering how hard it would be to use in the browser. The first error that I get when trying to bundle it is related to this package:

error - ./node_modules/wink-eng-lite-model/dist/load-ner-model.js:1:0
Module not found: Can't resolve 'fs'

It seems like it would be possible to rewrite that file to directly require() the JSON file instead of lazy loading it, but I'm guessing the lazy loading was done on purpose?