
redditcommenttextclassification's People

Contributors

hmarine, jawaialler, vaquierm


redditcommenttextclassification's Issues

Define the functionality of the Dictionary class

Make a dictionary class.
Make it serializable and deserializable (writable to a file path and instantiable from a file path).
It will contain all the possible words that we support as features, in order.

Make it so that we can get the index of a certain word in the feature vector.
We should also keep the word weights in here as well. These can be defaulted to all 1s.

Interface:
WordDictionary(frequencies: bool) # Frequencies or binary presence of word
fromFile(filepath): WordDictionary
toFile(filePath): void
toFeatureVector(comment): np.array
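
A minimal sketch of what this interface could look like, assuming comments arrive as plain strings and pickle is acceptable for serialization (names and details here are illustrative, not final):

    import pickle
    import numpy as np

    class WordDictionary:
        def __init__(self, frequencies: bool):
            # frequencies=True -> word counts, False -> binary presence
            self.frequencies = frequencies
            self.words = []            # all supported words, in feature order
            self.index = {}            # word -> position in the feature vector
            self.weights = np.ones(0)  # per-word weights, defaulted to all 1s

        def add_words(self, words):
            for w in words:
                if w not in self.index:
                    self.index[w] = len(self.words)
                    self.words.append(w)
            self.weights = np.ones(len(self.words))

        @staticmethod
        def from_file(filepath) -> "WordDictionary":
            with open(filepath, "rb") as f:
                return pickle.load(f)

        def to_file(self, filepath) -> None:
            with open(filepath, "wb") as f:
                pickle.dump(self, f)

        def to_feature_vector(self, comment) -> np.ndarray:
            vec = np.zeros(len(self.words))
            for word in comment.split():
                i = self.index.get(word)
                if i is not None:
                    vec[i] = vec[i] + 1 if self.frequencies else 1
            return vec * self.weights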

Obtain some stats about the data

Get some general statistics about the data that could be interesting to talk about in the report.
All of this should be in one Jupyter notebook file.
For example:

  • What are the most common words?
  • What are the most common words for each subreddit that are barely present in other subreddits? (This is somewhat related to #4, since these are essentially the best words: they will probably be strongly correlated with the output.)
  • Average number of words per comment
  • Average number of words per comment for each subreddit
  • Average number of YouTube links per comment in each subreddit category (probably high for subreddits like Overwatch/League of Legends)
  • Check if there are any unknown characters (there could be Japanese characters in comments in the anime subreddit)
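
A rough notebook sketch for a few of these, assuming the raw data loads into a pandas DataFrame with "comments" and "subreddits" columns (the real file name and column names may differ):

    from collections import Counter
    import pandas as pd

    # Assumed path and column names; adjust to match the actual CSV header.
    df = pd.read_csv("data/reddit_train.csv")

    # Most common words overall
    all_words = Counter(w for c in df["comments"] for w in c.lower().split())
    print(all_words.most_common(20))

    # Average number of words per comment, overall and per subreddit
    df["n_words"] = df["comments"].str.split().str.len()
    print(df["n_words"].mean())
    print(df.groupby("subreddits")["n_words"].mean())

    # Rough count of YouTube links per comment in each subreddit
    df["n_yt"] = df["comments"].str.count(r"youtu\.?be")
    print(df.groupby("subreddits")["n_yt"].mean())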

Create a way to convert the whole dataset of Comments to np array

Create a way that, given the file path of the raw dataset, we can convert it fully to an np array based on a specific dictionary.

The dictionary will basically say:

  • OK, I have 1000 words, so my feature vector has 1000 entries.
  • Do I want to do binary occurrence of a word? Or a frequency in a particular comment?
  • Let's convert the content of the comment into a feature vector of size 1000 based on the words it contains
  • Now do this for every single comment

Also write a function so that this can be saved to a file, so we can directly load the processed CSV data for training rather than having to clean it completely every single time.
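
A sketch of both pieces, reusing the WordDictionary interface from the issue above (the "comments" column name is an assumption):

    import numpy as np
    import pandas as pd

    def dataset_to_np(raw_path: str, dictionary) -> np.ndarray:
        """Convert every comment in the raw CSV into one feature-vector row."""
        df = pd.read_csv(raw_path)  # assumes a "comments" column
        return np.vstack([dictionary.to_feature_vector(c) for c in df["comments"]])

    def save_processed(raw_path: str, dictionary, out_path: str) -> None:
        """Cache the processed matrix so cleaning only has to happen once."""
        X = dataset_to_np(raw_path, dictionary)
        np.savetxt(out_path, X, delimiter=",")

    # Later, load the cached CSV directly instead of re-cleaning:
    # X = np.loadtxt("data/processed.csv", delimiter=",")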

Implement Bernoulli Naive Bayes model

You must implement a Bernoulli Naive Bayes model (i.e., the Naive Bayes model from Lecture 5) from scratch (i.e., without using any external libraries such as SciKit learn).

Hint: you may want to use Laplace smoothing with your Bernoulli Naive Bayes model.
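
A minimal from-scratch sketch with Laplace smoothing, assuming X is a binary (n_samples, n_features) matrix and y holds integer class labels:

    import numpy as np

    class BernoulliNaiveBayes:
        def fit(self, X, y):
            self.classes = np.unique(y)
            n, d = X.shape
            self.log_prior = np.zeros(len(self.classes))
            self.theta = np.zeros((len(self.classes), d))  # P(word present | class)
            for i, c in enumerate(self.classes):
                Xc = X[y == c]
                self.log_prior[i] = np.log(len(Xc) / n)
                # Laplace smoothing: +1 on the counts, +2 on the denominator
                self.theta[i] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)
            return self

        def predict(self, X):
            # log P(c | x) is proportional to
            # log P(c) + sum_j [ x_j log theta_cj + (1 - x_j) log(1 - theta_cj) ]
            log_likelihood = (X @ np.log(self.theta).T
                              + (1 - X) @ np.log(1 - self.theta).T)
            return self.classes[np.argmax(log_likelihood + self.log_prior, axis=1)]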

Create a model validation pipeline

You must develop a model validation pipeline (e.g., using k-fold cross validation or a held-out validation set) and report on the performance of the above mentioned model variants.

You must run experiments using at least two different classifiers from the SciKit learn package (which are not Bernoulli Naive Bayes). Possible options are:

We will also have to run this on the Naive Bayes model made from scratch.

Basically

We want a script that can take as input
models_to_run = ["NAIVE_BAYES", "SVM", "LR", ...]
dictionary_names = ["names", "of", "dictionaries", "to", "run"]

Run all of the models against all the specified dictionaries. Generate confusion matrices using sklearn for each combination, and for each dictionary generate a bar graph of the performance of each model and a text file containing the accuracies.
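
A sketch of the driver, using a held-out validation split; load_features is a hypothetical loader for a cached, vectorized dictionary, and the dictionary names are illustrative:

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix

    # The from-scratch model comes from the Bernoulli Naive Bayes issue above.
    MODELS = {
        "NAIVE_BAYES": lambda: BernoulliNaiveBayes(),
        "LR": lambda: LogisticRegression(max_iter=1000),
        "SVM": lambda: LinearSVC(),
    }

    models_to_run = ["NAIVE_BAYES", "SVM", "LR"]
    dictionary_names = ["LEMMA_BINARY", "LEMMA_FREQUENCY"]  # illustrative names

    for dict_name in dictionary_names:
        X, y = load_features(dict_name)  # hypothetical cached-feature loader
        X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
        for model_name in models_to_run:
            model = MODELS[model_name]()
            model.fit(X_train, y_train)
            preds = model.predict(X_val)
            print(f"{dict_name} / {model_name}: {accuracy_score(y_val, preds):.3f}")
            cm = confusion_matrix(y_val, preds)  # saved per run, see results issue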

Feature Selection from the full dictionary

  • Use the correlation criterion to decide which words might not be so useful in the full dictionary.
  • Use term weighting to determine whether a word is useless because its weight is just too small.

If the words are useless, remove them.
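
One possible reading of the correlation criterion, sketched as the Pearson correlation between each binary word column and one-vs-rest class labels; the thresholds are made-up placeholders:

    import numpy as np

    def useless_words(X, y, corr_threshold=0.01, weight_threshold=0.05, weights=None):
        """Return indices of words that are weakly correlated with every class,
        or whose term weight is too small."""
        n, d = X.shape
        Xc = X - X.mean(axis=0)
        useless = np.ones(d, dtype=bool)
        for c in np.unique(y):
            t = (y == c).astype(float)
            tc = t - t.mean()
            # Pearson correlation of each word column with the one-vs-rest label
            denom = np.sqrt((Xc ** 2).sum(axis=0) * (tc ** 2).sum()) + 1e-12
            corr = np.abs(Xc.T @ tc) / denom
            useless &= corr < corr_threshold  # must be weak for every class
        if weights is not None:
            useless |= np.abs(weights) < weight_threshold  # term weight too small
        return np.where(useless)[0]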

Create results from validation pipeline

@jawaialler In the validation pipeline, we want to create 3 types of files. For each run, we want to save a confusion matrix named "LEMMA_BINARY_MODEL_confusion.png".

For each vocab and vectorizer, we want to create a text file named LEMMA_BINARY_accuracies.txt

Containing

Accuracies for vocabulary LEMMA and vectorizer BINARY

    LR: 52%
    NB: 55%
    SVM: 49%
    ...

Maybe also a picture of a bar graph showing these accuracies for each model?
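
A sketch of writing all three artifact types with sklearn's ConfusionMatrixDisplay and matplotlib; the save_results helper and its arguments are illustrative:

    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    def save_results(vocab, vectorizer, accuracies, confusions):
        """accuracies: {model_name: acc}, confusions: {model_name: cm array}."""
        prefix = f"{vocab}_{vectorizer}"  # e.g. "LEMMA_BINARY"

        # One confusion matrix image per model: LEMMA_BINARY_MODEL_confusion.png
        for model, cm in confusions.items():
            ConfusionMatrixDisplay(confusion_matrix=cm).plot()
            plt.savefig(f"{prefix}_{model}_confusion.png")
            plt.close()

        # One accuracy text file per vocab/vectorizer pair
        with open(f"{prefix}_accuracies.txt", "w") as f:
            f.write(f"Accuracies for vocabulary {vocab} and vectorizer {vectorizer}\n")
            for model, acc in accuracies.items():
                f.write(f"    {model}: {acc:.0%}\n")

        # Bar graph of all model accuracies for this pair
        plt.bar(list(accuracies.keys()), list(accuracies.values()))
        plt.ylabel("Accuracy")
        plt.savefig(f"{prefix}_accuracies.png")
        plt.close()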

Create a vocabulary

  • Generate a full dictionary with N-grams (probably only up to two)
  • Remove all stop words from the dictionary
  • Lemmatize the whole thing with NLTK
  • Create some custom lemmatization, such as for YouTube links. Make a custom word for YouTube links.
  • Create custom lemmatization for weird smileys like (╹◡╹ or ᕦ(ò_óˇ)ᕤ or ヽ(^ω^)ノ. I feel like this stuff is gonna be all over the anime subreddit. We can map all of these to a single weird-anime-smiley token or something. (See the sketch below.)
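
A rough sketch of the tokenization side of this, assuming NLTK's WordNetLemmatizer and English stop word list (requires nltk.download("wordnet") and nltk.download("stopwords")); the kaomoji regex is a crude heuristic, not a real standard:

    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    STOP = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    YOUTUBE_RE = re.compile(r"https?://(?:www\.)?(?:youtube\.com|youtu\.be)\S*")
    KAOMOJI_RE = re.compile(r"[(\uFF08][^)\uFF09a-zA-Z]{2,}[)\uFF09]")  # rough heuristic

    def tokenize(comment):
        # Custom "lemmas": collapse all YouTube links / kaomoji to single tokens
        comment = YOUTUBE_RE.sub(" _youtube_link_ ", comment)
        comment = KAOMOJI_RE.sub(" _anime_smiley_ ", comment)
        words = [lemmatizer.lemmatize(w)
                 for w in re.findall(r"[a-z_]+", comment.lower())]
        return [w for w in words if w not in STOP]

    def build_vocabulary(comments):
        vocab = set()
        for c in comments:
            tokens = tokenize(c)
            vocab.update(tokens)                   # unigrams
            vocab.update(zip(tokens, tokens[1:]))  # bigrams (N-grams up to two)
        return vocab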

Scoring function

Find which words are the most related to each subreddit (to do in the data_analysis Jupyter notebook).
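
One candidate scoring function, sketched as the ratio of a word's in-subreddit frequency to its frequency everywhere else, with add-one smoothing; the DataFrame columns are assumed as in the stats issue above:

    from collections import Counter

    def top_words_for_subreddit(df, subreddit, k=20):
        """Score words by in-subreddit frequency relative to the rest of the corpus."""
        inside = Counter(w for c in df[df["subreddits"] == subreddit]["comments"]
                         for w in c.lower().split())
        outside = Counter(w for c in df[df["subreddits"] != subreddit]["comments"]
                          for w in c.lower().split())
        n_in, n_out = sum(inside.values()), sum(outside.values())
        # Add-one smoothing so words absent elsewhere don't divide by zero
        score = {w: (inside[w] / n_in) / ((outside[w] + 1) / n_out) for w in inside}
        return sorted(score, key=score.get, reverse=True)[:k]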
