reddit-sarcasm's Introduction

Project Overview

Text comprehension, question answering, text generation, and sentiment analysis have all made great strides over the past decade, and progress is only accelerating. Is there, though, a model that can predict sarcasm? The difficulty is that current methods may not account for the type and amount of context that sarcasm requires. There are two key indicators of sarcasm:

  1. Shared knowledge between the speaker/writer and the listener/reader, e.g. it is -30 degrees and windy, and someone says "Beautiful day today, isn't it?"
  2. The speaker/writer's intent

We may or may not have access to this shared knowledge, but NLP techniques may let us infer the writer's intent. This project uses machine learning and neural networks to predict the occurrence of sarcasm in a given text, so that users can tell whether a given input is genuine or misleading.


Dataset

The data was gathered by Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli by scraping Reddit comments posted between 2009 and 2017. It contains ~1 million comments, evenly split between the labels 0 (non-sarcastic) and 1 (sarcastic). The labels were generated by self-annotation: the author of a comment added an /s tag at the end of the post to indicate sarcasm. Besides the comments and labels, the dataset also contains:

  • author: the creator of the comment
  • subreddit: the Reddit forum the comment was posted on
  • score: the net score (upvotes minus downvotes)
  • ups: the number of upvotes (likes)
  • downs: the number of downvotes (dislikes)
  • date: the year and month the comment was created
  • created_utc: the date and time the comment was created
  • parent comment: the comment that preceded it

Data can be found on Kaggle here and Google Drive here.
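
The self-annotation rule described in this section can be illustrated with a minimal sketch. It assumes the tag is a literal "/s" at the very end of a comment, preceded by whitespace; the helper name `label_from_tag` and the example comments are hypothetical, and the dataset authors' actual extraction logic may differ.

```python
import re
from typing import Tuple

# Assumption: the tag is "/s" at the end of the comment, preceded by
# whitespace (or standing alone), optionally followed by trailing whitespace.
_SARCASM_TAG = re.compile(r"(?:^|\s)/s\s*$", re.IGNORECASE)


def label_from_tag(comment: str) -> Tuple[str, int]:
    """Return (comment text without the tag, label).

    The label is 1 (sarcastic) if the comment ended with "/s", else 0.
    """
    match = _SARCASM_TAG.search(comment)
    if match:
        # Cut the tag off and drop any whitespace left in front of it.
        return comment[:match.start()].rstrip(), 1
    return comment, 0


# Made-up example comments, for illustration only.
examples = [
    "Beautiful day today, isn't it? /s",
    "It's -30 degrees and windy.",
]
labelled = [label_from_tag(c) for c in examples]
```

Removing the tag from the text matters: if it were left in, a model could learn to detect the tag itself rather than sarcasm.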

Project Workflow

Data Cleaning

  • searched for and removed the negligible number of duplicates and null values
  • converted data to appropriate datatype
  • eliminated redundancies
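
The cleaning steps above might look roughly like the sketch below. It uses only the standard library to stay self-contained (a dataframe library such as pandas would be a common choice for this); the column names `label` and `comment`, the sample rows, and the helper `clean_rows` are assumptions for illustration, not the project's actual code.

```python
import csv
import io

# Tiny in-memory stand-in for the dataset. The real file layout is not
# shown in this README, so these columns and rows are assumptions.
RAW = """label,comment,score,date
1,Beautiful day today,5,2016-11
1,Beautiful day today,5,2016-11
0,,2,2016-12
0,It is cold outside,3,2016-12
"""


def clean_rows(rows):
    """Drop rows with missing values, drop exact duplicates, convert types."""
    seen = set()
    cleaned = []
    for row in rows:
        # Treat None or an empty/whitespace-only string as a null value.
        if any(value is None or value.strip() == "" for value in row.values()):
            continue
        # Skip rows that are identical to one already kept.
        key = tuple(row.items())
        if key in seen:
            continue
        seen.add(key)
        # Convert numeric columns from text to integers.
        cleaned.append({
            "label": int(row["label"]),
            "comment": row["comment"],
            "score": int(row["score"]),
            "date": row["date"],
        })
    return cleaned


rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned = clean_rows(rows)
```

Of the four sample rows, one is a duplicate and one has an empty comment, so two rows remain after cleaning.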

EDA

  • looked for patterns and relationships in data
  • isolated sections related to our target variable
  • found words associated with sarcasm
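
One simple way to surface words associated with sarcasm is to compare how often each word appears in sarcastic comments versus all comments. The README does not say which method was used, so the sketch below is only an assumption; the function `sarcasm_ratio`, the `min_count` threshold, and the toy comments are all hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def sarcasm_ratio(comments, labels, min_count=2):
    """For each word, return the share of comments containing it that are sarcastic.

    Words that appear in fewer than `min_count` comments are skipped,
    because their ratio would be too noisy to interpret.
    """
    sarcastic = Counter()
    total = Counter()
    for text, label in zip(comments, labels):
        # Count each word once per comment.
        for word in set(tokenize(text)):
            total[word] += 1
            if label == 1:
                sarcastic[word] += 1
    return {word: sarcastic[word] / n for word, n in total.items() if n >= min_count}


# Made-up comments and labels, for illustration only.
comments = [
    "oh great another monday",
    "great game last night",
    "oh sure that will work",
    "the game starts at noon",
]
labels = [1, 0, 1, 0]
ratios = sarcasm_ratio(comments, labels)
```

In this toy data, "oh" appears only in sarcastic comments (ratio 1.0), "great" appears in both classes (0.5), and "game" only in non-sarcastic ones (0.0).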

Preprocessing

  • removed unnecessary features, including all non-text columns

Fitting and transforming text data using:

  • Count Vectorizer
  • TF-IDF
  • Word2Vec self-trained
  • Word2Vec pre-trained
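
To make the first two representations above concrete, here is a minimal standard-library sketch of count vectors and TF-IDF weights. It uses the plain textbook form tf × ln(N / df); libraries such as scikit-learn apply a smoothed and normalized variant, so their numbers will differ. The function names and the two toy documents are assumptions for illustration, and Word2Vec embeddings are not covered here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def count_vectors(docs):
    """Return (vocabulary, one row of word counts per document)."""
    vocab = sorted({word for doc in docs for word in tokenize(doc)})
    vectors = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Weight each count by ln(N / df), where df is the document frequency."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Number of documents in which each vocabulary word occurs at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df_j) for df_j in df]
    weights = [[c * w for c, w in zip(row, idf)] for row in counts]
    return vocab, weights


# Made-up documents, for illustration only.
docs = ["great great day", "bad day"]
vocab, counts = count_vectors(docs)
_, weights = tfidf_vectors(docs)
```

Here "day" occurs in both documents, so its TF-IDF weight is zero, while "great" keeps a positive weight in the first document. This down-weighting of words common to every comment is the main difference from raw counts.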

Models --> Accuracy Score

  • Logistic Regression with Count Vectorizer --> 0.65
  • Logistic Regression with TF-IDF --> 0.65
  • KNN using TF-IDF --> 0.56
  • MLP using self-trained Word2Vec --> 0.71
  • MLP using Glove-Twitter-200 --> 0.68
  • MLP using Word2Vec Google News --> 0.68
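
For reading the scores above: accuracy is the fraction of predictions that match the true labels, and because the dataset is described as evenly split, a model that always predicts the same class would score about 0.5. The sketch below shows both computations; the function names and the toy label lists are hypothetical and are not taken from the project's results.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline(y_train, n_predictions):
    """Always predict the most frequent training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n_predictions


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = accuracy(y_true, majority_baseline(y_true, len(y_true)))
```

On these toy labels the predictions are right 6 times out of 8 (0.75), while the constant baseline gets 0.5. Against that reference, a score such as 0.56 is only slightly better than guessing on balanced data.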

** See Requirements.txt for libraries
