
redditcommenttextclassification's People

Contributors

hmarine, jawaialler, vaquierm


redditcommenttextclassification's Issues

Define the functionality of the Dictionary class

Make a dictionary class.
Make it serializable and deserializable (writable to a file path and instantiable from a file path).
It will contain all the possible words that we support as features, in order.

Make it so that we can get the index of a certain word in the feature vector.
We should also keep the word weights in here as well. These can be defaulted to all 1s.

Interface:
WordDictionary(frequencies: bool) # Frequencies or binary presence of word
fromFile(filepath): WordDictionary
toFile(filePath): void
toFeatureVector(comment): np.array
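
A minimal sketch of what this interface could look like, assuming comments arrive as plain strings and pickle is acceptable for serialization (names and details here are illustrative, not final):

    import pickle
    import numpy as np

    class WordDictionary:
        def __init__(self, frequencies: bool):
            # frequencies=True -> word counts, False -> binary presence
            self.frequencies = frequencies
            self.words = []            # all supported words, in feature order
            self.index = {}            # word -> position in the feature vector
            self.weights = np.ones(0)  # per-word weights, defaulted to all 1s

        def add_words(self, words):
            for w in words:
                if w not in self.index:
                    self.index[w] = len(self.words)
                    self.words.append(w)
            self.weights = np.ones(len(self.words))

        @staticmethod
        def from_file(filepath) -> "WordDictionary":
            with open(filepath, "rb") as f:
                return pickle.load(f)

        def to_file(self, filepath) -> None:
            with open(filepath, "wb") as f:
                pickle.dump(self, f)

        def to_feature_vector(self, comment) -> np.ndarray:
            vec = np.zeros(len(self.words))
            for word in comment.split():
                i = self.index.get(word)
                if i is not None:
                    vec[i] = vec[i] + 1 if self.frequencies else 1
            return vec * self.weights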

Obtain some stats about the data

Get some general statistics about the data that could be interesting to talk about in the report.
All of this should be in one Jupyter notebook file.
For example:

  • What are the most common words?
  • What are the most common words for each subreddit that are barely present in other subreddits? (This is somewhat related to #4, since these are essentially the best words: they will probably be strongly correlated with the output.)
  • Average number of words per comment
  • Average number of words per comment for each subreddit
  • Average number of YouTube links per comment in each subreddit category (probably high for subreddits like Overwatch/League of Legends)
  • Check if there are any unknown characters (there could be Japanese characters in comments in the anime subreddit)
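
A rough notebook sketch for a few of these, assuming the raw data loads into a pandas DataFrame with "comments" and "subreddits" columns (the real file name and column names may differ):

    from collections import Counter
    import pandas as pd

    # Assumed path and column names; adjust to match the actual CSV header.
    df = pd.read_csv("data/reddit_train.csv")

    # Most common words overall
    all_words = Counter(w for c in df["comments"] for w in c.lower().split())
    print(all_words.most_common(20))

    # Average number of words per comment, overall and per subreddit
    df["n_words"] = df["comments"].str.split().str.len()
    print(df["n_words"].mean())
    print(df.groupby("subreddits")["n_words"].mean())

    # Rough count of YouTube links per comment in each subreddit
    df["n_yt"] = df["comments"].str.count(r"youtu\.?be")
    print(df.groupby("subreddits")["n_yt"].mean())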

Create a way to convert the whole dataset of Comments to np array

Create a way that, given the file path of the raw dataset, we can convert it fully to an np array based on a specific dictionary.

The dictionary will basically say:

  • OK, I have 1000 words, so my feature vector has 1000 entries.
  • Do I want to do binary occurrence of a word? Or a frequency in a particular comment?
  • Let's convert the content of the comment into a feature vector of size 1000 based on the words it contains
  • Now do this for every single comment

Also write a function so that this can be saved to a file, so we can directly load the processed CSV data for training rather than having to clean it completely every single time.
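
A sketch of both pieces, reusing the WordDictionary interface from the issue above (the "comments" column name is an assumption):

    import numpy as np
    import pandas as pd

    def dataset_to_np(raw_path: str, dictionary) -> np.ndarray:
        """Convert every comment in the raw CSV into one feature-vector row."""
        df = pd.read_csv(raw_path)  # assumes a "comments" column
        return np.vstack([dictionary.to_feature_vector(c) for c in df["comments"]])

    def save_processed(raw_path: str, dictionary, out_path: str) -> None:
        """Cache the processed matrix so cleaning only has to happen once."""
        X = dataset_to_np(raw_path, dictionary)
        np.savetxt(out_path, X, delimiter=",")

    # Later, load the cached CSV directly instead of re-cleaning:
    # X = np.loadtxt("data/processed.csv", delimiter=",")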

Implement Bernoulli Naive Bayes model

You must implement a Bernoulli Naive Bayes model (i.e., the Naive Bayes model from Lecture 5) from scratch (i.e., without using any external libraries such as SciKit learn).

Hint: you may want to use Laplace smoothing with your Bernoulli Naive Bayes model.
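
A minimal from-scratch sketch with Laplace smoothing, assuming X is a binary (n_samples, n_features) matrix and y holds integer class labels:

    import numpy as np

    class BernoulliNaiveBayes:
        def fit(self, X, y):
            self.classes = np.unique(y)
            n, d = X.shape
            self.log_prior = np.zeros(len(self.classes))
            self.theta = np.zeros((len(self.classes), d))  # P(word present | class)
            for i, c in enumerate(self.classes):
                Xc = X[y == c]
                self.log_prior[i] = np.log(len(Xc) / n)
                # Laplace smoothing: +1 on the counts, +2 on the denominator
                self.theta[i] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
            return self

        def predict(self, X):
            # log P(c | x) is proportional to
            # log P(c) + sum_j [ x_j log theta_cj + (1 - x_j) log(1 - theta_cj) ]
            log_likelihood = (X @ np.log(self.theta).T
                              + (1 - X) @ np.log(1 - self.theta).T)
            return self.classes[np.argmax(log_likelihood + self.log_prior, axis=1)]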

Create a model validation pipeline

You must develop a model validation pipeline (e.g., using k-fold cross validation or a held-out validation set) and report on the performance of the above mentioned model variants.

You must run experiments using at least two different classifiers from the SciKit learn package (which are not Bernoulli Naive Bayes). Possible options are:

We will also have to run this on the Naive Bayes model made from scratch.

Basically

We want a script that can take as input
models_to_run = ["NAIVE_BAYES", "SVM", "LR", ...]
dictionary_names = ["names", "of", "dictionaries", "to", "run"]

Run all of the models against all the specified dictionaries. Generate confusion matrices using sklearn for each combination, and for each dictionary generate a bar graph of the performance of each model and a text file containing the accuracies.
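
A sketch of the driver, using a held-out validation split; load_features is a hypothetical loader for a cached, vectorized dictionary, and the dictionary names are illustrative:

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix

    # The from-scratch model comes from the Bernoulli Naive Bayes issue above.
    MODELS = {
        "NAIVE_BAYES": lambda: BernoulliNaiveBayes(),
        "LR": lambda: LogisticRegression(max_iter=1000),
        "SVM": lambda: LinearSVC(),
    }

    models_to_run = ["NAIVE_BAYES", "SVM", "LR"]
    dictionary_names = ["LEMMA_BINARY", "LEMMA_FREQUENCY"]  # illustrative names

    for dict_name in dictionary_names:
        X, y = load_features(dict_name)  # hypothetical cached-feature loader
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
        for model_name in models_to_run:
            model = MODELS[model_name]()
            model.fit(X_train, y_train)
            preds = model.predict(X_val)
            print(f"{dict_name} / {model_name}: {accuracy_score(y_val, preds):.3f}")
            cm = confusion_matrix(y_val, preds)  # saved per run, see results issue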

Feature Selection from the full dictionary

  • Use the correlation criterion to decide which words might not be so useful in the full dictionary.
  • Use term weighting to determine whether a word is useless because its weight is just too small.

If the words are useless, remove them.
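
One possible reading of the correlation criterion, sketched as the Pearson correlation between each binary word column and one-vs-rest class labels; the thresholds are made-up placeholders:

    import numpy as np

    def useless_words(X, y, corr_threshold=0.01, weight_threshold=0.05, weights=None):
        """Return indices of words that are weakly correlated with every class,
        or whose term weight is too small."""
        n, d = X.shape
        Xc = X - X.mean(axis=0)
        useless = np.ones(d, dtype=bool)
        for c in np.unique(y):
            t = (y == c).astype(float)
            tc = t - t.mean()
            # Pearson correlation of each word column with the one-vs-rest label
            denom = np.sqrt((Xc ** 2).sum(axis=0) * (tc ** 2).sum()) + 1e-12
            corr = np.abs(Xc.T @ tc) / denom
            useless &= corr < corr_threshold  # must be weak for every class
        if weights is not None:
            useless |= np.abs(weights) < weight_threshold  # term weight too small
        return np.where(useless)[0]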

Create results from validation pipeline

@jawaialler In the validation pipeline, we want to create 3 types of files. For each run, we want to save a confusion matrix named "LEMMA_BINARY_MODEL_confusion.png".

For each vocab and vectorizer, we want to create a text file named LEMMA_BINARY_accuracies.txt

Containing

Accuracies for vocabulary LEMMA and vectorizer BINARY

    LR: 52%
    NB: 55%
    SVM: 49%
    ...

Maybe also a picture of a bar graph showing these accuracies for each model?
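
A sketch of writing all three artifact types with sklearn's ConfusionMatrixDisplay and matplotlib; the save_results helper and its arguments are illustrative:

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    def save_results(vocab, vectorizer, accuracies, confusions):
        """accuracies: {model_name: acc}, confusions: {model_name: cm array}."""
        prefix = f"{vocab}_{vectorizer}"  # e.g. "LEMMA_BINARY"

        # One confusion matrix image per model: LEMMA_BINARY_MODEL_confusion.png
        for model, cm in confusions.items():
            ConfusionMatrixDisplay(confusion_matrix=cm).plot()
            plt.savefig(f"{prefix}_{model}_confusion.png")
            plt.close()

        # One accuracy text file per vocab/vectorizer pair
        with open(f"{prefix}_accuracies.txt", "w") as f:
            f.write(f"Accuracies for vocabulary {vocab} and vectorizer {vectorizer}\n")
            for model, acc in accuracies.items():
                f.write(f"    {model}: {acc:.0%}\n")

        # Bar graph of all model accuracies for this pair
        plt.bar(list(accuracies.keys()), list(accuracies.values()))
        plt.ylabel("Accuracy")
        plt.savefig(f"{prefix}_accuracies.png")
        plt.close()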

Create a vocabulary

  • Generate a full dictionary with N-grams (probably only up to two)
  • Remove all stop words from the dictionary
  • Lemmatize the whole thing with NLTK
  • Create some custom lemmatization, such as for YouTube links. Make a custom word for YouTube links.
  • Create custom lemmatization for weird smileys like (╹◡╹ or ᕦ(ò_óˇ)ᕤ or ヽ(^ω^)ノ. I feel like this stuff is gonna be all over the anime subreddit. We can map all of these to a single weird-anime-smiley token or something. (See the sketch below.)
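
A rough sketch of the tokenization side of this, assuming NLTK's WordNetLemmatizer and English stop word list (requires nltk.download("wordnet") and nltk.download("stopwords")); the kaomoji regex is a crude heuristic, not a real standard:

    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    STOP = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    YOUTUBE_RE = re.compile(r"https?://(?:www\.)?(?:youtube\.com|youtu\.be)\S*")
    KAOMOJI_RE = re.compile(r"[(\uFF08][^)\uFF09a-zA-Z]{2,}[)\uFF09]")  # rough heuristic

    def tokenize(comment):
        # Custom "lemmas": collapse all YouTube links / kaomoji to single tokens
        comment = YOUTUBE_RE.sub(" _youtube_link_ ", comment)
        comment = KAOMOJI_RE.sub(" _anime_smiley_ ", comment)
        words = [lemmatizer.lemmatize(w)
                 for w in re.findall(r"[a-z_]+", comment.lower())]
        return [w for w in words if w not in STOP]

    def build_vocabulary(comments):
        vocab = set()
        for c in comments:
            tokens = tokenize(c)
            vocab.update(tokens)                   # unigrams
            vocab.update(zip(tokens, tokens[1:]))  # bigrams (N-grams up to two)
        return vocab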

Scoring function

Find which words are the most related to each subreddit (to do in the data_analysis Jupyter notebook).
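
One candidate scoring function, sketched as the ratio of a word's in-subreddit frequency to its frequency everywhere else, with add-one smoothing; the DataFrame columns are assumed as in the stats issue above:

    from collections import Counter

    def top_words_for_subreddit(df, subreddit, k=20):
        """Score words by in-subreddit frequency relative to the rest of the corpus."""
        inside = Counter(w for c in df[df["subreddits"] == subreddit]["comments"]
                         for w in c.lower().split())
        outside = Counter(w for c in df[df["subreddits"] != subreddit]["comments"]
                          for w in c.lower().split())
        n_in, n_out = sum(inside.values()), sum(outside.values())
        # Add-one smoothing so words absent elsewhere don't divide by zero
        score = {w: (inside[w] / n_in) / ((outside[w] + 1) / n_out) for w in inside}
        return sorted(score, key=score.get, reverse=True)[:k]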
