Giter VIP home page Giter VIP logo

subreddits_classification's Introduction

Subreddit Classification - Identifying Harmful Content

Profanity is not censored from data

In this series of notebooks, we explore the idea of potentially creating an automated NLP based model that can learn what constitutes behavior worthy of a subreddit ban and apply this to future posts and communities. There are two goals in this project, though only the first is expected to succeed:

  1. Training a machine learning model to correctly classifying a comment as belonging to either a parent subreddit or its banned child, using as little forum-specific vocabulary as possible.
  2. Possibly extrapolating the model to classify the banned subreddit versus any other subreddit with no further information, possibly gaining some insight as to what generally constitutes a bannable offense.

We attempt to acheive this by limiting the forum specific vocabulary as well as other words that are assumed unhelpful in detecting offensive text. The hopes are not high for the second goal, since it is expected that most other threads are not like either thread. The project is more an exploration into what we can possibly do to reach this goal using certain techniques - here, we use a Naive Bayes model and a Random Forest model. Note that we limit our attention to r/incels and r/foreveralone comments at present with the assumption that if we can get high accuracy predicting comments between these two threads, we expect better accuracy comparing incels to an unrelated subreddit. The simplifying assumptions are that r/incels is only filled with toxic comments, r/foreveralone never crosses the line, and that r/foreveralone shares its inoffensive language with other unbanned subreddits.

The first notebook restates most of this introduction and provides the code needed to pull data from the pushshift API, from which we obtain data. Not all the data is used currently, notably the submissions.

The second notebok uses a count vectorizer to gain an intuitive understanding of the comments in each subreddit. The data is then modelled using a multinomial Naive Bayes classifier.

The third notebook introduces several changes to the second, including rescaling of our numerical representation of words as well as using a different classification method. Though it is therefore difficult to determine if it is the rescaling or the new classifier that is responsible for the change in result, we simply note the increased accuracy in distinguishing the two groups and continue the discussion from there.

subreddits_classification's People

Contributors

romandtse07 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.