
Aiexercise2

This is a 3rd-year project for the Athens University of Economics and Business (AUEB) Artificial Intelligence course. The task is to implement the ID3 and Naive Bayes algorithms and use the train and test data from the following dataset: https://ai.stanford.edu/~amaas/data/sentiment/.

You can find info about the dataset here: https://keras.io/api/datasets/imdb/

The first step is to find the most commonly used words in the whole dataset of reviews and then filter them using ENTROPY. After filtering the words we create a dictionary that contains the most negatively or positively charged words, and use it to determine whether a review is negative or positive (we don't care about the neutral reviews in the dataset).
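A minimal sketch of what such an entropy-based filter could look like, assuming each review is already tokenized into a set of words (the names and the exact selection procedure below are illustrative, not the repository's code):

```python
import math
from collections import Counter

def entropy(p):
    # Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p).
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0
    return -(p * math.log2(p) + q * math.log2(q))

def information_gain(word, reviews, labels):
    # How much knowing whether `word` appears reduces the entropy of the label
    # (labels: 1 = positive review, 0 = negative review).
    n = len(reviews)
    h_prior = entropy(sum(labels) / n)
    with_word = [lab for rev, lab in zip(reviews, labels) if word in rev]
    without_word = [lab for rev, lab in zip(reviews, labels) if word not in rev]
    h_after = 0.0
    for subset in (with_word, without_word):
        if subset:
            h_after += len(subset) / n * entropy(sum(subset) / len(subset))
    return h_prior - h_after

def build_dictionary(reviews, labels, num_keywords):
    # Keep the most frequent words as candidates (the 5x margin is arbitrary),
    # then rank them by information gain and keep the top `num_keywords`.
    counts = Counter(word for rev in reviews for word in rev)
    candidates = [w for w, _ in counts.most_common(5 * num_keywords)]
    candidates.sort(key=lambda w: information_gain(w, reviews, labels), reverse=True)
    return candidates[:num_keywords]
```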

The algorithms read 0-1 vectors for each review: 0 means that the specific word in the dictionary was not present in the review and 1 means that the word was present. In the Naive Bayes algorithm we always use Laplace estimators.
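For reference, a minimal sketch of Naive Bayes over such 0-1 vectors with Laplace (add-one) estimators could look like the following (illustrative only; the names are not taken from the repository):

```python
import math

def train_naive_bayes(vectors, labels):
    # vectors: list of 0/1 lists (one per review); labels: 1 = positive, 0 = negative.
    n = len(vectors)
    n_words = len(vectors[0])
    docs = {0: 0, 1: 0}
    counts = {0: [0] * n_words, 1: [0] * n_words}
    for vec, lab in zip(vectors, labels):
        docs[lab] += 1
        for i, bit in enumerate(vec):
            counts[lab][i] += bit
    model = {}
    for c in (0, 1):
        prior = docs[c] / n
        # Laplace estimator: add 1 to each count and 2 to the denominator
        # (a binary feature has two possible values).
        probs = [(counts[c][i] + 1) / (docs[c] + 2) for i in range(n_words)]
        model[c] = (prior, probs)
    return model

def classify_vector(model, vec):
    # Return the class with the highest log-probability for a 0/1 vector.
    best_class, best_score = None, float("-inf")
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for bit, p in zip(vec, probs):
            score += math.log(p if bit else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```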

The algorithms run through the executables execute.py and classify.py. In execute.py we build txt files that contain the binary vectors, and there are 4 command-line arguments to be given: the first, x, is the percentage of the data used to train the algorithm; the second, z, selects the method used to approximate logarithms (computing them exactly requires a lot of computational power), which speeds up the processing of the data; and the last is the number of keywords the dictionary will contain. In classify.py you use the three arguments above, except that for the last one you can give a dummy value. In classify.py you can do 3 things:

1.) Read a text file containing a review and see whether it is positive or negative
2.) Type the review yourself
3.) Read an existing binary-vector file
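Purely as an illustration (a hypothetical sketch; the scripts themselves define the real argument order, and the remaining argument of execute.py is not described here), the documented arguments could be read like this:

```python
import sys

# Hypothetical argument handling; execute.py / classify.py define the real order.
x = float(sys.argv[1])            # percentage of the data used for training
z = int(sys.argv[2])              # selects how logarithms are approximated
num_keywords = int(sys.argv[-1])  # number of keywords kept in the dictionary
```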

The if statement inside the for loops in execute.py determines what type of test we are going to run. It is there because we want to gather a lot of different metrics, such as the accuracy on the training data while the algorithms are trained on that same training data.

How we approximate the logs

1.) We have two categories, p and q (negative and positive reviews). This is the entropy formula:

H(p, q) = -p·log2(p) - q·log2(q)

2.) Using change of base:

log2(x) = ln(x) / ln(2)

We end up with this:

H(p, q) = -(p·ln(p) + q·ln(q)) / ln(2)

Because we are computing entropies, the values in the above equation are always between 0 and 1. We know:

(equation image)

For values between 0 and 1:

(equation image)

3.) We can conclude:

(equation image)

4.) The entropy becomes:

(equation image)

5.) We can change the fractions to:

(equation image)
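For reference, the exact value that any such approximation should stay close to can be computed directly from the change-of-base form in step 2; a minimal sketch:

```python
import math

def exact_entropy(p):
    # H(p, q) = -(p*ln(p) + q*ln(q)) / ln(2), with q = 1 - p.
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0
    return -(p * math.log(p) + q * math.log(q)) / math.log(2)

print(exact_entropy(0.5))   # 1.0   -> a 50/50 split is maximally uncertain
print(exact_entropy(0.25))  # ~0.811
print(exact_entropy(0.1))   # ~0.469
```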

Below are different metrics, like precision and accuracy, for each algorithm.

We also run different types of tests, like accuracy on the training data, accuracy on the testing data, etc.

(metric plots)

The peaks in the following plots are due to errors in the data:

(metric plots)
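For reference, metrics like these can be computed from the predicted and actual labels; a minimal sketch (illustrative, not the repository's code):

```python
def metrics(predicted, actual, positive=1):
    # Accuracy, precision and recall from parallel lists of predicted/actual labels.
    tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
    fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
    fn = sum(1 for p, a in zip(predicted, actual) if p != positive and a == positive)
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    accuracy = correct / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```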

Below is a comparison of accuracy between the implementations of different teams:

(comparison plot)

The commit history is a bit messy due to the vast number of files we needed to upload.

