
ngrammodel's Introduction

Ngram Language Model


The Ngram package provides basic n-gram functionality for Pharo. This includes the AINgram class as well as String and SequenceableCollection extensions that allow you to split text into unigrams, bigrams, trigrams, etc. Basically, this is just a simple utility for splitting texts into sequences of words. This project also provides an n-gram language model (AINgramModel) and a text generator built on top of it (see the examples below).

Installation

To install the packages of NgramModel, go to the Playground (Ctrl+OW) in your Pharo image and execute the following Metacello script (select it and press the Do-it button or Ctrl+D):

Metacello new
  baseline: 'AINgramModel';
  repository: 'github://pharo-ai/NgramModel/src';
  load

How to depend on it?

If you want to add this project as a dependency of your own project, include the following lines in your baseline method:

spec
  baseline: 'AINgramModel'
  with: [ spec repository: 'github://pharo-ai/NgramModel/src' ].

If you are new to baselines and Metacello, check out the Baselines tutorial on Pharo Wiki.

What are n-grams?

An n-gram is a sequence of n elements, usually words. The number n is called the order of the n-gram. The concept of n-grams is widely used in natural language processing (NLP). A text can be split into n-grams - sequences of n consecutive words. Consider the following text:

I do not like green eggs and ham

We can split it into unigrams (n-grams with n=1):

(I), (do), (not), (like), (green), (eggs), (and), (ham)

Or bigrams (n-grams with n=2):

(I do), (do not), (not like), (like green), (green eggs), (eggs and), (and ham)

Or trigrams (n-grams with n=3):

(I do not), (do not like), (not like green), (like green eggs), (green eggs and), (eggs and ham)

And so on (tetragrams, pentagrams, etc.).

Applications

N-grams are widely applied in language modeling. For example, take a look at the implementation of an n-gram language model in Pharo.

Structure of n-gram

Each n-gram can be separated into:

  • last word - the last element in a sequence;
  • history (context) - n-gram of order n-1 with all words except the last one.

Such a separation is useful in probabilistic modeling, when we want to estimate the probability of a word given the n-1 previous words (see n-gram language model).
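
For example (using the AINgram accessors described below), the tetragram (green eggs and ham) decomposes into the history (green eggs and) and the last word ham - exactly what is needed to estimate P(ham | green eggs and):

tetragram := #(green eggs and ham) asNgram.
tetragram history. "n-gram(green eggs and)"
tetragram last. "ham"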

Ngram class

This package provides only one class - AINgram, which models an n-gram.

Instance creation

You can create an n-gram from any SequenceableCollection:

trigram := AINgram withElements: #(do not like).
tetragram := #(green eggs and ham) asNgram.

Or by explicitly providing the history (an n-gram of lower order) and the last element:

hist := #(green eggs and) asNgram.
w := 'ham'.

ngram := AINgram withHistory: hist last: w.

You can also create a zerogram - an n-gram of order 0. It is an empty sequence with no history and no last word:

AINgram zerogram.

Accessing

You can access the order of an n-gram, its history, and its last element:

tetragram. "n-gram(green eggs and ham)"
tetragram order. "4"
tetragram history. "n-gram(green eggs and)"
tetragram last. "ham"

String extensions

TODO
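
In the meantime, here is a minimal sketch of the intended usage, assuming the unigrams, bigrams, and trigrams selectors implied by the package description above (note that, as reported in the issues below, character-level bigrams of a String are not supported):

text := 'I do not like green eggs and ham'.
text unigrams.  "assumed selector: n-grams of order 1"
text bigrams.   "assumed selector: n-grams of order 2"
text trigrams.  "assumed selector: n-grams of order 3"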

Example of text generation

1. Loading the Brown corpus

file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.
brown := file contents.

2. Training a 2-gram language model on the corpus

model := AINgramModel order: 2.
model trainOn: brown.

3. Generating text of 100 words

At each step the model selects the top 5 words that are most likely to follow the previous words and returns a random word from those five (this randomness ensures that the generator does not get stuck in a cycle).
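
A hedged sketch of this selection step, using hypothetical probabilities and plain Pharo collections rather than the actual AINgramTextGenerator internals:

"Hypothetical candidate words with their estimated probabilities."
candidates := Dictionary newFromPairs: {
	'eggs' . 0.4 . 'ham' . 0.3 . 'and' . 0.1 . 'green' . 0.1 . 'not' . 0.05 . 'do' . 0.05 }.

"Keep the five most probable candidates..."
topFive := (candidates associations asSortedCollection: [ :a :b | a value > b value ]) first: 5.

"...and pick one of them at random."
nextWord := topFive atRandom key.

The generator itself is used as follows: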

generator := AINgramTextGenerator new model: model.
generator generateTextOfSize: 100.

Result:

100 words generated by a 2-gram model trained on Brown corpus

 educator cannot describe and edited a highway at private time ``
 Fallen Figure Technique tells him life pattern more flesh tremble 
 with neither my God `` Hit ) landowners began this narrative and 
 planted , post-war years Josephus Daniels was Virginia years 
 Congress with confluent , jurisdiction involved some used which 
 he''s something the Lyle Elliott Carter officiated and edited and
 portents like Paradise Road in boatloads . Shipments of Student 
 Movement itself officially shifted religions of fluttering soutane .
 Coolest shade which reasonably . Coolest shade less shaky . Doubts 
 thus preventing them proper bevels easily take comfort was

100 words generated by a 3-gram model trained on Brown corpus

 The Fulton County purchasing departments do to escape Nicolas Manas .
 But plain old bean soup , broth , hash , and cultivated in himself , 
 back straight , black sheepskin hat from Texas A & I College and 
 operates the institution , the antipathy to outward ceremonies hailed 
 by modern plastic materials -- a judgment based on displacement of his 
 arrival spread through several stitches along edge to her paper for 
 further meditation . `` Hit the bum '' ! ! Fort up ! ! Fort up ! ! 
 Kizzie turned to similar approaches . When Mrs. Coolidge for

100 words generated by a 3-gram model trained on Pharo source code corpus

This model was trained on a corpus composed from the source code of 85,000 Pharo methods tokenized at the subtoken level (composite names like OrderedCollection were split into subtokens: ordered, collection).

 super initialize value holders . ( aggregated series := ( margins if nil
 if false ) text styler blue style table detect : [ uniform drop list input . 
 export csv label : suggested file name < a parametric function . | phase 
 <num> := bit thing basic size >= desired length ) ascii . space width + 
 bounds top - an event character : d bytes : stream if absent put : answers )
 | width of text . status value := dual value at last : category string := 
 value cos ) abs raised to n number of

Warning

Training the model on the entire Pharo corpus and generating 100 words can take over 10 minutes, so start with a smaller exercise: train a 2-gram model on the Brown corpus (it is the smallest one) and generate just 10 words.
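
For example, reusing the snippets above:

file := 'pharo-local/iceberg/pharo-ai/NgramModel/Corpora/brown.txt' asFileReference.
model := AINgramModel order: 2.
model trainOn: file contents.
generator := AINgramTextGenerator new model: model.
generator generateTextOfSize: 10.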

ngrammodel's People

Contributors

jecisc, juliendelplanque, myroslavarm, olekscode


ngrammodel's Issues

Vocabulary needs to also be shortened in #removeNgramsWithCountsLessThan:

In this method we are deleting n-grams and reducing history counts; I think the vocabulary needs to be cleaned up too (when a word's history count becomes zero, for instance).

The main idea of this method is to get rid of tokens and their sequences that we find irrelevant, in order to speed up reading from file or lookup within the model. In this case, always keeping all the vocabulary entries defeats the purpose.

Create bigram from String

I tried to get the bigrams of a String with letters as bigram units, but I get an empty Collection as a result:

#('Nelson') bigrams.
#Nelson bigrams.
'Nelson' bigrams.

I expected the output to be:

Ne el ls so on

Is this supported, or am I missing something?

#ngramCounts method should be added to NgramModel

I think it's nice that we have everything wrapped inside a single n-gram object, but some missing functionality makes it difficult to get to some of the properties from the outside, such as accessing the n-gram count info.

Update README

Remove the information about text generation (maybe keep it as an application example).
Add instructions on how to use the library, how to configure smoothing, and how to work with files.
