Predicting popularity of a Reddit post

What is Reddit?

Reddit is a social sharing website where you could post links, pictures, text and other users can upvote or downvote aparticular post based on if they like the post or not. If the post gets a high upvote score, then the post moves up so that it is visible to evryone. Reddit is a huge site, but it's divided into thousands of smaller communities called subreddits. In this project, the posts from "Popular" subreddit were used to prepare the dataset.

About the Project:

This project is about predicting the popularity of a Reddit post. The popularity of a Reddit post is determined by the total votes or score it gets. Score is the result of upvotes and downvotes for a particular post. So, it was identified as a regression problem.

About the Dataset:

The dataset for this project was created using web scarpping with the help of praw library. The features exracted are:

Title - title of the post
Gilded - rewarding a Reddit gold to the post
Over_18 - True if the post has adult content else False
Ups - no of upvotes for the post
Downs - no of downvotes for the post
Num_of_comments - no of comments for the post
Upvote_ratio - upvote ratio of the post
Score - total score (upvotes - downvotes)

Google Drive link for dataset (ScrappedPostsData.csv)

Sentiment Analysis:

Extra features were added with the help of Sentiment analysis for the title of the post using vaderSentiment analyzer. We get 4 columns neg, neu, pos and compound. These features tell how negative or positive the statement is. These columns were combined to one column, Predited_value, using the compound score.

positive sentiment: (compound score >= 0.05); neutral sentiment: (compound score > -0.05) and (compound score < 0.05); negative sentiment: (compound score <= -0.05)

Text pre processing:

Text preprocessing was done for the title of the post by removing punctuations, stop words, and performing stemming and lemmatization. To convert the title to numeric form, Glove embedding was used. This gives a numeric vector for all the unique words in the text. These vectors have 100 dimensions. To get one vector representation for each title, weighted average method was used. The mean of all word vectors for a particular title is taken to form one vector so that the title is represented using one vector. One hot encoding for the Predicted_value and Over_18 columns was performed.

Machine Learning:

As it is a regression problem, regression models like Linear regression, Decision tree regressor, Random forest regressor, KNN regressor, Lasso, Ridge, ElasticNet and XGBoost regressor were used. These models were trained with 60% train data and prediction was done using 40% test data. The performance of these models was measured based on test accuracy. Out of all these models, XGBoost regressor and Random forest regressor performed well with around 50% accuracy on test dataset.

Deployment:

The application was deployed on Heroku. The application takes a Redidt post URl as input and the required features are extracted from the URL. The deployed application was tested with different Reddit post URLs. As the accuracy of the model is around 50%, the predictions were a little different from expected. In future work, more data can be used to train the model to get good accuracy.

Link to Deployed Application : https://reddit-post-score.herokuapp.com/

Installing required librarires:

Installing keras and tenserflow:

pip install keras==2.2.4
pip install tensorflow==2.0

Installing nltk:

import nltk
nltk.download('popular')

Installing xgboost:

pip install xgboost

Installing vaderSentiment:

pip install vaderSentiment

Installing praw:

pip install praw

alinasahoo / reddit-post-scores Goto Github PK

reddit-post-scores's Introduction

Predicting popularity of a Reddit post

What is Reddit?

About the Project:

About the Dataset:

Sentiment Analysis:

Text pre processing:

Machine Learning:

Deployment:

Installing required librarires:

reddit-post-scores's People

Contributors

Stargazers

Watchers

Forkers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent