Subreddit Differentiator

NLP models trained to differentiate between similar subreddits on post text.

Table of Contents
  1. Executive Summary
  2. About The Project
  3. Process
  4. Contact

Executive Summary

If you are interested in eventually launching targeted recommendations or ads to users on various subreddits, this project provides multiple pre-trained models for early differentiation. The models are offered as a streamlined "launching point" for future data science efforts.

About

Problem Statement

In the competitive and dynamic world of big data, data science teams are eager to leverage the internet's free data for insight.

This project aims to "pre-train" several NLP classification models and then provide an executive summary of the results to an existing data science client. The client's team wants to accurately differentiate between two specific subreddits (AskReddit and AskScience) as a first step in developing targeted ads and recommendations.

Success of these pre-trained models is measured by balanced accuracy, because a "false positive" is no more problematic than a "false negative" in this business context. The scope of the project is limited to the data scraped from these subreddits within 3 weeks. The model choices were limited by local compute power. The executive summary provides "future considerations" for the existing data science client, including notes on the choice of score, model, and scope.
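As a rough illustration of the success metric, balanced accuracy is the mean of the recall on each class, so both error types weigh equally. The sketch below is a minimal pure-Python version on made-up labels; the label encoding (1 = AskScience, 0 = AskReddit) and the function name are assumptions for illustration, not taken from the project code.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Return the mean of the true positive rate and the true negative rate."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Hypothetical labels: 1 = AskScience, 0 = AskReddit
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
score = balanced_accuracy(y_true, y_pred)  # (3/4 + 2/4) / 2 = 0.625
```

For real evaluation, scikit-learn's `balanced_accuracy_score` computes the same quantity.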

Process

Data Collection and Cleaning

Data was collected with the Pushshift.io API from the following subreddits:

  • AskReddit
  • AskScience
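The project's actual collection script is not shown here, so the following is only a sketch of how such a request might be built. The endpoint and the `subreddit`, `size`, and `before` parameters are assumptions based on Pushshift's historical public submission-search API, which may no longer be openly available; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint (historical Pushshift submission search); verify before use.
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_query_url(subreddit, size=100, before=None):
    """Build a search URL for one page of posts from a subreddit.

    `before` is an optional Unix timestamp; passing the creation time of the
    oldest post seen so far is one way to page backwards through history.
    """
    params = {"subreddit": subreddit, "size": size}
    if before is not None:
        params["before"] = before
    return BASE_URL + "?" + urlencode(params)


urls = [build_query_url(name) for name in ("AskReddit", "AskScience")]
```

Each URL could then be fetched with any HTTP client and the JSON response appended to the dataset, repeating with an earlier `before` value until enough posts are collected.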

Each dataset contains around 12.5k posts. Given the nature of the project (an executive summary plus selling to a data science team), the data is included in the repo.

Provided Datasets

Preprocessing included stemming/lemmatizing, removing non-English posts, fixing typos, and removing duplicate posts (reposts).
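As one example of these steps, repost removal could look roughly like the sketch below. The normalization rule (collapse whitespace, lowercase) and the helper names are assumptions for illustration; the project may define duplicates differently.

```python
import re


def normalize(text):
    """Collapse runs of whitespace and lowercase, so trivial variants match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def drop_reposts(posts):
    """Keep the first occurrence of each post; also skip empty posts."""
    seen = set()
    unique = []
    for post in posts:
        key = normalize(post)
        if key and key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


posts = [
    "Why is the sky blue?",
    "why is the  sky blue? ",
    "What is your favorite movie?",
    "",
]
unique_posts = drop_reposts(posts)
```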

Likewise, prior to modeling, I applied CountVectorizer and TfidfVectorizer, plus standardization, to the training corpus.
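A minimal sketch of this step with scikit-learn is shown below. It assumes standardization was applied after both vectorizers and uses `StandardScaler(with_mean=False)` because centering is not supported on sparse matrices; the toy corpus and default vectorizer settings are placeholders, not the project's actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder training corpus (the real one holds the cleaned post text)
train_corpus = [
    "why do stars twinkle at night",
    "what is your favorite movie",
    "how do vaccines train the immune system",
    "what is the best advice you ever got",
]

# Scale each feature to unit variance without centering (keeps data sparse)
cvec = make_pipeline(CountVectorizer(), StandardScaler(with_mean=False))
tfidf = make_pipeline(TfidfVectorizer(), StandardScaler(with_mean=False))

# Fit on the training corpus only; reuse .transform() on held-out text
X_cvec = cvec.fit_transform(train_corpus)
X_tfidf = tfidf.fit_transform(train_corpus)
```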

Modeling / Analysis

I applied logistic regression, random forest, and a stacked model (decision tree as meta-learner) to both vectorized sets (CountVectorizer and TF-IDF), for a total of 6 model comparisons.
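The 3 models x 2 vectorizers grid could be set up roughly as follows. This is a sketch under assumptions: the stack's base learners (logistic regression and random forest), all hyperparameters, and the toy data are illustrative guesses, and standardization is omitted for brevity. Scores here are computed on the training data only to keep the example short; a real comparison would use a held-out split.

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data: 0 = AskReddit, 1 = AskScience (hypothetical encoding)
texts = [
    "what is your favorite movie", "what is the best advice you got",
    "what food do you hate", "who was your worst boss",
    "what song is stuck in your head", "what is your dream job",
    "why do stars twinkle", "how do vaccines train the immune system",
    "why is the ocean salty", "how do black holes form",
    "what causes the northern lights", "how does dna replicate",
]
labels = [0] * 6 + [1] * 6


def make_models():
    """Return fresh, unfitted copies of the three model types."""
    stack = StackingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ],
        final_estimator=DecisionTreeClassifier(random_state=0),  # meta-learner
        cv=3,
    )
    return {
        "logreg": LogisticRegression(max_iter=1000),
        "forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "stack": stack,
    }


scores = {}
for vec_name, vec_cls in [("cvec", CountVectorizer), ("tfidf", TfidfVectorizer)]:
    for model_name, model in make_models().items():
        pipe = make_pipeline(vec_cls(), model)
        pipe.fit(texts, labels)
        preds = pipe.predict(texts)
        scores[(vec_name, model_name)] = balanced_accuracy_score(labels, preds)
```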

Results

Selected Screenshots (EDA)

Conclusion

From the model results, logistic regression is the best model on both the CountVectorizer (cvec) and TF-IDF (tfidf) data.

Random forest was only slightly overfit, but overall it had very weak results when trying to predict the negative class (seen in its near-perfect recall but terrible precision).

Logistic regression was much more overfit, but when comparing the true positive and true negative rates, it performed about equally well in both directions.

Due to the lower performance of the random forest, the stacking model suffered in turn.

The final model recommendations:

  • Logistic regression if you want to prioritize balanced accuracy
  • Random forest if you want to prioritize recall

Contact

If you wish to contact me, Christopher Denq, please reach out via LinkedIn.

If you're curious about more projects, check out my website or GitHub.
