Naive-Bayes-Text-Classifier-Implementation-from-Scratch

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups, each corresponding to a different topic. The 20 newsgroups are: alt.atheism, comp.graphics, comp.os.ms-windows.misc, sci.med, comp.sys.ibm.pc.hardware, sci.space, comp.sys.mac.hardware, soc.religion.christian, comp.windows.x, misc.forsale, talk.politics.guns, rec.autos, talk.politics.mideast, rec.motorcycles, talk.politics.misc, rec.sport.baseball, talk.religion.misc, rec.sport.hockey, sci.crypt, sci.electronics.

This processed version contains 18824 documents, divided into two subsets: training (11269 documents) and testing (7505 documents).

There are six files: map.csv, train label.csv, train data.csv, test label.csv, test data.csv, and vocabulary.txt. vocabulary.txt contains all distinct words and other tokens in the 18824 documents. train data.csv and test data.csv are formatted as "docIdx, wordIdx, count", where docIdx is the document id, wordIdx is the word id (corresponding to an entry in vocabulary.txt), and count is the frequency of the word in the document. train label.csv and test label.csv are simply lists of label ids indicating which newsgroup each document belongs to. map.csv maps label ids to label names.
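The "docIdx, wordIdx, count" layout described above can be read into a nested dictionary. The following is only a minimal sketch, not the repository's own loader: the function name `parse_doc_word_counts` is hypothetical, and it assumes the files have no header row and that all three fields are integers.

```python
import csv
from collections import defaultdict

def parse_doc_word_counts(lines):
    """Parse rows of "docIdx, wordIdx, count" into {docIdx: {wordIdx: count}}.

    Assumption: no header row, and every field is an integer.
    """
    docs = defaultdict(dict)
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        doc_idx, word_idx, count = (int(field) for field in row)
        docs[doc_idx][word_idx] = count
    return dict(docs)

# Tiny made-up sample in the same three-column layout (not real data).
sample = ["1,1,4", "1,3,2", "2,2,1"]
docs = parse_doc_word_counts(sample)
print(docs)  # {1: {1: 4, 3: 2}, 2: {2: 1}}
```

Because `csv.reader` accepts any iterable of lines, an open file object can be passed in place of the sample list.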

For each target value ωj (each newsgroup):
  • Calculate the class prior P(ωj).
  • Calculate n: the total number of words in all documents in class ωj (i.e., total length).
  • For each word wk in Vocabulary:
    - Calculate nk: the number of times word wk occurs in all documents in class ωj.
    - Calculate the maximum likelihood estimator: PMLE(wk|ωj) = nk / n
    - Calculate the Bayesian estimator (the Laplace estimate): PBE(wk|ωj) = (nk + 1) / (n + |Vocabulary|)
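The training steps above can be sketched as follows. This is an illustrative version under stated assumptions, not the code in NaiveBayes.py: the function name `train_naive_bayes` and the dictionary-based inputs are hypothetical, word ids are assumed to run from 1 to |Vocabulary|, and every class is assumed to contain at least one word (otherwise n would be zero).

```python
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, vocab_size):
    """docs: {docIdx: {wordIdx: count}}, labels: {docIdx: labelId}.

    Returns (priors, p_mle, p_be), where p_mle[c][w] and p_be[c][w] hold the
    two estimates of P(w | c) for word ids 1..vocab_size.
    """
    class_doc_counts = Counter(labels.values())
    word_counts = defaultdict(Counter)  # n_k for every class
    for doc_idx, counts in docs.items():
        word_counts[labels[doc_idx]].update(counts)

    priors, p_mle, p_be = {}, {}, {}
    for c, n_docs in class_doc_counts.items():
        priors[c] = n_docs / len(labels)          # P(class)
        n = sum(word_counts[c].values())          # total words in class c
        p_mle[c] = {w: word_counts[c][w] / n
                    for w in range(1, vocab_size + 1)}
        p_be[c] = {w: (word_counts[c][w] + 1) / (n + vocab_size)
                   for w in range(1, vocab_size + 1)}
    return priors, p_mle, p_be

# Made-up toy data: three documents, two classes, three vocabulary words.
toy_docs = {1: {1: 2, 2: 1}, 2: {1: 1}, 3: {3: 3}}
toy_labels = {1: 1, 2: 1, 3: 2}
priors, p_mle, p_be = train_naive_bayes(toy_docs, toy_labels, vocab_size=3)
print(p_mle[1][3], p_be[1][3])  # unseen word: 0.0 under MLE, 1/7 under Laplace
```

In the toy data, word 3 never occurs in class 1, so its maximum likelihood estimate is exactly zero while the Laplace estimate stays positive.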

NaiveBayes.py implements Naive Bayes text classification on the training dataset using Bayesian estimation. testing.py implements Naive Bayes text classification on the testing dataset using both Bayesian estimation and maximum likelihood estimation.
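For illustration, classifying one document with a trained model might look like the sketch below. It is not taken from NaiveBayes.py or testing.py: the function name `predict` is hypothetical, the computation is done in log space (a common choice to avoid underflow, assumed here rather than confirmed by the source), and `cond_prob` is assumed to hold a probability for every word id that appears in the document.

```python
import math

def predict(doc_counts, priors, cond_prob):
    """Return the class maximizing log P(c) + sum_k count_k * log P(w_k | c)."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w, count in doc_counts.items():
            p = cond_prob[c][w]
            # A zero probability makes the whole class posterior zero.
            score += count * math.log(p) if p > 0 else -math.inf
        # If every class scores -inf, the first class seen is returned.
        if best_class is None or score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up probabilities for two classes over a two-word vocabulary.
toy_priors = {1: 0.5, 2: 0.5}
toy_cond = {1: {1: 0.7, 2: 0.3}, 2: {1: 0.2, 2: 0.8}}
print(predict({1: 3, 2: 1}, toy_priors, toy_cond))  # 1
print(predict({2: 2}, toy_priors, toy_cond))        # 2
```

Either set of estimates (maximum likelihood or Bayesian) can be passed as `cond_prob`, which makes it easy to compare the two on the same documents.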

With Bayesian estimation, the accuracy is 94.72% on the training data and 85.4% on the testing data. With maximum likelihood estimation, the accuracy on the testing data is 23%.
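The source does not explain the gap between the two estimators. One well-known contributing factor is that the maximum likelihood estimate is zero for any word that never occurs in a class, which zeroes out that class's likelihood for the whole document. The sketch below shows this effect on made-up counts; the names `class_counts` and `log_likelihood` are hypothetical and the numbers are not from the data set.

```python
import math

vocab_size = 4
# Made-up n_k values per class (word id -> count); not real data.
class_counts = {"A": {1: 3, 2: 1}, "B": {3: 2, 4: 2}}

def log_likelihood(doc, counts, smooth):
    """Sum of count * log P(w | class), with or without Laplace smoothing."""
    n = sum(counts.values())
    total = 0.0
    for w, c in doc.items():
        nk = counts.get(w, 0)
        p = (nk + 1) / (n + vocab_size) if smooth else nk / n
        total += c * math.log(p) if p > 0 else -math.inf
    return total

# Mostly class-A words, plus one word that class A never saw in training.
doc = {1: 2, 3: 1}
mle = {c: log_likelihood(doc, k, smooth=False) for c, k in class_counts.items()}
be = {c: log_likelihood(doc, k, smooth=True) for c, k in class_counts.items()}
print(mle)                  # both classes score -inf, so no class is preferred
print(max(be, key=be.get))  # A
```

With smoothing, the single unseen word only lowers class A's score instead of eliminating the class, so the document is still assigned to A.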

Contributors

venkateshmohan
