ttezel / bayes Goto Github PK

Naive-Bayes Classifier for node.js

JavaScript 100.00%

bayes's Issues

Not taking custom tokenizer?

I'm rusty on my JS so I'm probably doing something dumb here, but I can't get your classifier to take a custom tokenizer.

const classifier = bayes({'tokenizer': tokenizer});

var tokenizer = function (text) {
  var rgxPunctuation = /[^(a-zA-Z)+\s]/g

  var sanitized = text.replace(rgxPunctuation, ' ').toLowerCase();

  return sanitized.split(/\s+/)
}

If I put a console.log in there, it's clear it's not getting executed.
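
One possible cause, assuming the two statements appear in the real script in the order shown: `var tokenizer` is hoisted, but its value is still `undefined` at the moment `bayes(...)` is called, so the function is never handed over. The sketch below reproduces that with a hypothetical stand-in, `makeClassifier`, which is not the library's code; it assumes that a missing `tokenizer` option silently falls back to a default.

```javascript
// Hypothetical stand-in for the library's factory; it is assumed here that a
// missing `tokenizer` option silently falls back to a default.
function makeClassifier(options) {
  const defaultTokenizer = (text) => text.split(/\s+/);
  const opts = options || {};
  return { tokenizer: opts.tokenizer || defaultTokenizer };
}

// Order from the report: the option is read before the assignment has run.
function reportedOrder() {
  const classifier = makeClassifier({ tokenizer: tokenizer }); // undefined here
  var tokenizer = function (text) {
    return text.toLowerCase().split(/\s+/);
  };
  return classifier.tokenizer === tokenizer;
}

// Reordered: define the tokenizer first, then pass it in.
function fixedOrder() {
  const tokenizer = function (text) {
    return text.toLowerCase().split(/\s+/);
  };
  const classifier = makeClassifier({ tokenizer: tokenizer });
  return classifier.tokenizer === tokenizer;
}

console.log(reportedOrder(), fixedOrder()); // false true
```

Separately, inside a character class such as `[^(a-zA-Z)+\s]` the parentheses and `+` are literal characters, so they are kept rather than stripped; `[^a-zA-Z\s]` may be closer to what was intended.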

Use of plain objects prevents tokens or categories named "constructor"

The "vocabulary", "docCount", "wordCount", "wordFrequencyCount" and "categories" data structures in the classifier are defined as {}, so each of them inherits a "constructor" property from Object.prototype. This causes problems for documents containing the word "constructor" as well as for categories with that name. The solution is to use Object.create(null), as is already done elsewhere in the existing code.
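
A small sketch of the difference, using a hypothetical `count` helper rather than the classifier's own code: with a plain object the lookup finds the inherited function, while a prototype-less object behaves like an empty map.

```javascript
// A plain object literal inherits from Object.prototype, so "constructor"
// resolves to a function even though nothing was stored under that key.
const plain = {};
const bare = Object.create(null); // no prototype, so no inherited keys

// Hypothetical counting helper, for illustration only.
function count(table, word) {
  table[word] = (table[word] || 0) + 1;
  return table[word];
}

const plainResult = count(plain, 'constructor'); // function + 1 becomes a string
const bareResult = count(bare, 'constructor');   // a real count

console.log(typeof plainResult, bareResult); // string 1
```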

Classifier does not work when text contains "constructor" as a token

The problem is this line: https://github.com/ttezel/bayes/blob/master/lib/naive_bayes.js#L248

Naivebayes.prototype.frequencyTable = function (tokens) {
  var frequencyTable = {}

  tokens.forEach(function (token) {
    if (!frequencyTable[token])
      frequencyTable[token] = 1
    else
      frequencyTable[token]++
  })

  return frequencyTable
}

When token is "constructor", frequencyTable[token] is always truthy, because every plain object in JavaScript inherits a constructor property. Therefore frequencyTable[token]++ runs on that inherited function, and the result is NaN.

To fix this, we need to check if (!frequencyTable.hasOwnProperty(token)) instead. This overwrites (shadows) the constructor property, but we do not need it on this object anyway.
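
A standalone sketch of that check, written as a free function rather than the library's prototype method. It calls `Object.prototype.hasOwnProperty` indirectly so that a token named "hasOwnProperty" cannot break the lookup.

```javascript
// Standalone version of the counting step with the own-property check
// proposed above; this is an illustration, not the library's method.
function frequencyTable(tokens) {
  const table = {};
  const hasOwn = Object.prototype.hasOwnProperty;

  tokens.forEach(function (token) {
    if (!hasOwn.call(table, token)) {
      table[token] = 1; // first occurrence, shadows any inherited property
    } else {
      table[token]++;
    }
  });

  return table;
}

const table = frequencyTable(['constructor', 'a', 'constructor']);
console.log(table.constructor, table.a); // 2 1
```

A token such as `__proto__` would still not be counted correctly with a plain object, since assigning a number to it is ignored; a table created with Object.create(null) avoids that case as well.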

Method to clear classifier

Would it be too much trouble to make a public method that clears all the learned phrases?

Possible to return multiple categories?

In one of the examples you ask whether a news article is about technology, politics, or sports.

What if it's an article about robots playing football?

In this case I would think the categories should be technology & sports.

Can the current code return multiple categories?

Thank you.
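
One way this could be approached, sketched under the assumption that a per-category score (for example a log-probability) is available for the text; the issue does not say whether the library exposes one. The helper `categoriesWithin` and the scores are made up for illustration.

```javascript
// Hypothetical helper: `scores` maps category name to a log-probability-like
// score (higher is better). Returns every category within `margin` of the best.
function categoriesWithin(scores, margin) {
  const entries = Object.entries(scores).sort((a, b) => b[1] - a[1]);
  if (entries.length === 0) return [];
  const best = entries[0][1];
  return entries
    .filter(([, score]) => best - score <= margin)
    .map(([name]) => name);
}

// Made-up scores for illustration only.
const scores = { technology: -12.1, sports: -12.6, politics: -19.4 };
console.log(categoriesWithin(scores, 1.0)); // [ 'technology', 'sports' ]
```

The margin is a tuning knob: a value of 0 gives back only the top category (plus exact ties), and larger values admit more labels.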

UTF-8 support

Sadly, this does not support UTF-8. The problem lies here:

getWords : function(doc) {
  if (_(doc).isArray()) {
    return doc;
  }
  var words = doc.split(/\W+/);
  return _(words).uniq();
}

doc.split(/\W+/);

does not seem to work for UTF-8, because \W treats every non-ASCII character as a non-word character.

Here is an example in a Cyrillic-script language (like Russian):

"Надежда за обич еп.36 Тест".split(/\W+/);

This returns:

[ "", "36", "" ]

Instead, it should return something like this:

[ "Надежда", "за", "обич", "еп", "36", "Тест"]

I was looking for a fix, but ended up here:
http://stackoverflow.com/questions/280712/javascript-unicode-regexes
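
A possible fix on runtimes that support Unicode property escapes in regular expressions (the `u` flag) is to split on anything that is not a letter or digit in any script. This is a sketch with a hypothetical function name, `getWordsUnicode`, not a patch to the code quoted above, and it leaves out the array check and de-duplication.

```javascript
// Split on runs of characters that are not letters, digits, or underscore in
// any script. Requires a runtime with Unicode property escapes (the `u` flag).
function getWordsUnicode(doc) {
  // filter(Boolean) drops empty strings produced by leading/trailing separators.
  return doc.split(/[^\p{L}\p{N}_]+/u).filter(Boolean);
}

console.log(getWordsUnicode('Надежда за обич еп.36 Тест'));
// [ 'Надежда', 'за', 'обич', 'еп', '36', 'Тест' ]
```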

Support async tokenizer?

In some languages (such as Chinese) it is difficult to segment text into tokens, and a lot of work is needed to implement a better tokenizer. For this reason, the tokenizer is sometimes implemented in another programming language, or even in another service (in a microservices architecture). In that case, the classifier needs to support an async tokenizer so that tokens can be requested from another service.

PR: #21
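
To illustrate the request, here is a minimal sketch of a learn step that awaits its tokenizer. `AsyncTokenCounter` and `remoteTokenizer` are hypothetical names invented for this example; they are not the library's API and not necessarily what the PR implements.

```javascript
// Hypothetical minimal counter, not the library's API: it only shows how a
// learn step can await a tokenizer that returns a Promise.
class AsyncTokenCounter {
  constructor(tokenizer) {
    this.tokenizer = tokenizer;
    this.counts = new Map(); // category -> Map(token -> count)
  }

  async learn(text, category) {
    // Works for sync and async tokenizers, since await accepts plain values.
    const tokens = await this.tokenizer(text);
    const table = this.counts.get(category) || new Map();
    for (const token of tokens) {
      table.set(token, (table.get(token) || 0) + 1);
    }
    this.counts.set(category, table);
    return this;
  }
}

// Stand-in for a remote segmentation service; a real one would make a network
// call. Here each character is simply treated as one token.
function remoteTokenizer(text) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(Array.from(text)), 10);
  });
}

new AsyncTokenCounter(remoteTokenizer)
  .learn('你好你', 'greeting')
  .then((c) => console.log(c.counts.get('greeting').get('你'))); // 2
```

Because `learn` returns a Promise, callers have to await it (or chain `.then`) before classifying, which is the main API change an async tokenizer implies.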

How well will this handle Chinese?

how to use word vectors?

fromJson and custom tokenizer

Allow for empty classification JSONs
