vaquierm / redditcommenttextclassification
💬 Classification of Reddit comments to the subreddit they were posted in
Make a dictionary class.
Make it serializable and deserializable (writable to a file path and instantiable from a file path).
It will contain all the possible words that we support as features in order.
Make it so that we can get the index of a certain word in the feature vector.
We should also keep the word weights in here as well. This can be defaulted to all 1s.
Interface:
WordDictionary(frequencies: bool) # Frequencies or binary presence of word
fromFile(filepath): WordDictionary
toFile(filePath): void
toFeatureVector(comment): np.array
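A minimal sketch of how this interface could look in Python. Several details are assumptions not stated in the issue: JSON as the file format, a lowercase whitespace tokenizer, the `words` and `weights` constructor arguments, and the `index_of` helper. The method names mirror the interface above in snake_case.

```python
import json

import numpy as np


class WordDictionary:
    """Ordered vocabulary with per-word weights (sketch; details are assumptions)."""

    def __init__(self, frequencies, words=(), weights=None):
        self.frequencies = frequencies      # True: word counts, False: binary presence
        self.words = list(words)            # defines the feature order
        self.index = {w: i for i, w in enumerate(self.words)}
        # Weights default to all 1s, as suggested in the issue.
        self.weights = list(weights) if weights is not None else [1.0] * len(self.words)

    def index_of(self, word):
        """Return the feature-vector index of a word, or -1 if it is not supported."""
        return self.index.get(word, -1)

    def to_feature_vector(self, comment):
        vec = np.zeros(len(self.words))
        for token in comment.lower().split():   # naive tokenizer (assumption)
            i = self.index_of(token)
            if i >= 0:
                vec[i] = vec[i] + 1 if self.frequencies else 1
        return vec * np.array(self.weights)

    def to_file(self, file_path):
        with open(file_path, "w", encoding="utf-8") as f:
            json.dump({"frequencies": self.frequencies,
                       "words": self.words,
                       "weights": self.weights}, f)

    @classmethod
    def from_file(cls, file_path):
        with open(file_path, encoding="utf-8") as f:
            data = json.load(f)
        return cls(data["frequencies"], data["words"], data["weights"])
```

Keeping the word list ordered and building the word-to-index map from it means a dictionary reloaded from disk produces feature vectors with the same column layout as the one that was saved.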
Create some crazy magic ensemble method that's going to do so so good hopefully
Has some pre implemented features to do so
Get some general statistics about the data that could be interesting to talk about for the report.
All of this should be in one jupyter notebook file
ex:
Make a pretty readme with pictures and all that jazz
For example:
Map all words
no, no, noo, nooo, etc... -> no
There might be some API that can do this for us
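In case no suitable API turns up, one simple heuristic could be sketched as below. It is only an illustration: the `normalize_elongated` name and the `vocabulary` argument (a set of known words) are hypothetical, and the rule of collapsing repeated characters until a known word is found is an assumption, not the project's chosen method.

```python
import re


def normalize_elongated(word, vocabulary):
    """Map stretched spellings (e.g. 'nooo') to a known word when possible."""
    if word in vocabulary:
        return word
    # Collapse every run of a repeated character to 2, then to 1 occurrence,
    # and accept the first candidate that is a known word.
    for keep in (2, 1):
        candidate = re.sub(r"(.)\1+", lambda m: m.group(1) * keep, word)
        if candidate in vocabulary:
            return candidate
    return word  # nothing matched: leave the word unchanged
```

Checking against a vocabulary first avoids damaging legitimate double letters, so `good` stays `good` while `goood` is still mapped back to it.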
Use some API that can get the average sentiment of a word, we want to maybe get the average sentiment of the thread to add a feature.
Create a way that given the path file of the raw dataset, we can convert it fully to a np array based on a specific dictionary.
The dictionary will basically say:
Also write a function so that this could be saved to a file. So we can directly get the csv data of the processed data for training rather than have to clean it completely every single time
Create the Initial Directory Structure for the project.
Create all the template files for urgent tasks
You must implement a Bernoulli Naive Bayes model (i.e., the Naive Bayes model from Lecture 5) from scratch (i.e., without using any external libraries such as SciKit learn).
Hint: you may want to use Laplace smoothing with your Bernoulli Naive Bayes model.
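A rough sketch of what a from-scratch Bernoulli Naive Bayes with Laplace smoothing might look like, using only NumPy. The class name, the `alpha` parameter, and the scikit-learn-style `fit`/`predict` methods are choices made for this illustration; the exact formulation from Lecture 5 is not reproduced here.

```python
import numpy as np


class BernoulliNaiveBayes:
    """Bernoulli Naive Bayes with Laplace smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength; alpha=1 is add-one smoothing

    def fit(self, X, y):
        X = (np.asarray(X) > 0).astype(float)   # binarize: word present or not
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_prior_ = np.zeros(len(self.classes_))
        self.theta_ = np.zeros((len(self.classes_), X.shape[1]))
        for k, c in enumerate(self.classes_):
            Xc = X[y == c]
            self.log_prior_[k] = np.log(Xc.shape[0] / X.shape[0])
            # Smoothed estimate of P(x_j = 1 | y = c); the 2 * alpha in the
            # denominator accounts for the two outcomes of each feature.
            self.theta_[k] = (Xc.sum(axis=0) + self.alpha) / (Xc.shape[0] + 2 * self.alpha)
        return self

    def predict(self, X):
        X = (np.asarray(X) > 0).astype(float)
        # Log-likelihood sums over present features and absent features.
        log_lik = X @ np.log(self.theta_).T + (1 - X) @ np.log(1 - self.theta_).T
        return self.classes_[np.argmax(log_lik + self.log_prior_, axis=1)]
```

With smoothing, every estimated probability stays strictly between 0 and 1, so a word never seen in a subreddit cannot zero out the whole likelihood for that class.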
You must develop a model validation pipeline (e.g., using k-fold cross validation or a held-out validation set) and report on the performance of the above mentioned model variants.
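As an illustration of the k-fold option, the following sketch computes per-fold accuracy for any model exposing `fit` and `predict`. The function name `k_fold_accuracy` and the `make_model` factory argument are hypothetical; shuffling with a fixed seed is an assumption made so that runs are repeatable.

```python
import numpy as np


def k_fold_accuracy(make_model, X, y, k=5, seed=0):
    """Return the validation accuracy of each of the k folds."""
    X, y = np.asarray(X), np.asarray(y)
    indices = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(indices, k)      # k nearly equal index groups
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = make_model()                # fresh model for every fold
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[val_idx])
        scores.append(float(np.mean(predictions == y[val_idx])))
    return scores
```

Passing a factory rather than a fitted model keeps the folds independent, and the same function can be reused for the from-scratch model and for scikit-learn classifiers.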
You must run experiments using at least two different classifiers from the SciKit learn package (which are not Bernoulli Naive Bayes). Possible options are:
Also we will have to run this on the Naive Bayes model made from scratch as well.
We want a script that can take as input
models_to_run = ["NAIVE_BAYES", "SVM", "LR" ...]
dictionary_names = ["name", "of", "dictionaries", "to", "run"]
Run all of the models against all the specified dictionaries. Generate confusion matrices using sklearn for each combination, and
for each dictionary, generate a bar graph of the performance of each model and a text file containing the accuracies.
If the words are useless, remove them
Use the sklearn PCA to get the k most dominant eigenvectors and considerably reduce the dimensionality of the data, making training and prediction faster.
This is quite important since our model will probably be some crazy ensemble method.
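A small sketch of how this could be done with scikit-learn's `PCA`. The helper name `reduce_dimensionality` is hypothetical, and the sketch assumes the feature matrices are dense NumPy arrays.

```python
import numpy as np
from sklearn.decomposition import PCA


def reduce_dimensionality(X_train, X_test, k):
    """Project both sets onto the k leading principal components."""
    pca = PCA(n_components=k)
    X_train_reduced = pca.fit_transform(X_train)   # fit on training data only
    X_test_reduced = pca.transform(X_test)         # reuse the same projection
    return X_train_reduced, X_test_reduced, pca
```

Fitting on the training split alone keeps information from the test split out of the projection. If the feature matrix is stored as a SciPy sparse matrix, `TruncatedSVD` from the same module is the usual alternative.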
@jawaialler In the validation pipeline, we want to create 3 types of files. For each run, we want to save a confusion matrix named “LEMMA_BINARY_MODEL_confusion.png”.
For each vocab and vectorizer we want to create a text file LEMMA_BINARY_accuracies.txt
Containing
Accuracies for vocabulary LEMMA and vectorizer BINARY
LR: 52%
NB: 55%
SVM: 49%
...
Maybe a picture of a bargraph showing these accuracies as well for each model?
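The text file described above could be produced with something like the sketch below. The function name `write_accuracy_report`, its arguments, and the choice to pass accuracies as fractions between 0 and 1 are assumptions for illustration; the file name pattern and the line layout follow the example in this issue.

```python
import os


def write_accuracy_report(directory, vocabulary, vectorizer, accuracies):
    """Write e.g. LEMMA_BINARY_accuracies.txt and return its path."""
    path = os.path.join(directory, f"{vocabulary}_{vectorizer}_accuracies.txt")
    lines = [f"Accuracies for vocabulary {vocabulary} and vectorizer {vectorizer}"]
    # One "MODEL: NN%" line per model, in insertion order of the dict.
    lines += [f"{model}: {acc:.0%}" for model, acc in accuracies.items()]
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return path
```

The same `accuracies` mapping could also feed the bar graph, so that the picture and the text file always report the same numbers.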
Find which words are the most related to a subreddit (todo in the data_analysis jupyter notebook)