jesselyenriquez / subbreddit_classfication-log_reg-random-forest-

Posts from two subreddits, Conspiracy and AskPolitics, were extracted to build a classification model that checks a political candidate's draft social media post and flags it if it reads as conspiracy discourse, so the communications team knows whether it needs rewriting before posting.

Jupyter Notebook 99.76% Python 0.24%

subbreddit_classfication-log_reg-random-forest-'s Introduction

Social Media Communication Exploration

Target Clientele: Local Maine Political Candidates


Contents

  • Introduction
    • Data Collection
    • Data Cleaning and EDA Overview
  • Modeling Overview
  • Conclusions
  • Recommendations

Introduction

Bit of Background

Online sources of political media have become essential to the exchange of political content, especially on major social media platforms such as Facebook, Twitter, and Reddit. Serra Public Affairs hopes to gain traction with a new client base: local political candidates in Maine.

For this project, the platform selected was Reddit. To begin analyzing how the language of discussion differs between the two subreddits, I decided to build a classification model. This was an effort to begin to understand what kind of communication can be classified as political discourse versus conspiracy discussion. The project aims to meet a potential need of campaign communications groups: a piece of text could be checked and, if it is predicted to be a Conspiracy post, these liaisons would know the text should be altered.

The choice of these subreddits came from my experience as a phone banker in the 2020 US election. Thousands of calls were made, and on a large number of them, people from both parties were concerned about fake news, not only in their news outlets and peer discussions but also in what they took to be candidates' messaging.

So for any modern-day politician, clear and trustworthy communication through their social media platforms will be essential in the era of fake news.

The Conspiracy subreddit was selected for its long-running, active participation and its notorious reputation for contributing to the spread of misinformation (Reference). Another goal of the project is to determine the words that matter most when classifying a text as a political or conspiracy post, to inform communication liaisons.

Data Collection and Cleaning

The first stage of the project was to collect posts from each subreddit. To do this I used the Pushshift API for Reddit. It was a fairly simple process: the requests library retrieves JSON responses, from which the necessary information is extracted. A sleep call was included between requests so as not to overwhelm the server.
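As a rough illustration of that collection loop, here is a minimal sketch. The endpoint URL, the `before`/`size` paging parameters, and the field names (`title`, `selftext`, `created_utc`) are assumptions based on how the public Pushshift service has historically worked, not details taken from the notebooks; `collect_posts` and `http_get_json` are hypothetical helper names.

```python
import time

# Assumed endpoint; access to the public Pushshift service may have changed.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"


def http_get_json(url, params):
    """Default fetcher: one GET via the requests library, parsed as JSON."""
    import requests  # imported lazily so the sketch loads without it

    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()


def collect_posts(subreddit, n_batches=5, batch_size=100, pause=2.0,
                  fetch=http_get_json, sleep=time.sleep):
    """Page backwards through a subreddit, pausing between requests."""
    posts, before = [], None
    for _ in range(n_batches):
        params = {"subreddit": subreddit, "size": batch_size}
        if before is not None:
            params["before"] = before  # only ask for posts older than this
        batch = fetch(PUSHSHIFT_URL, params).get("data", [])
        if not batch:
            break  # nothing older is available
        posts.extend(
            {"subreddit": subreddit,
             "title": p.get("title"),
             "selftext": p.get("selftext"),
             "created_utc": p.get("created_utc")}
            for p in batch
        )
        before = min(p["created_utc"] for p in batch)
        sleep(pause)  # be polite to the server between requests
    return posts
```

Passing `fetch` and `sleep` as arguments is a design choice for this sketch only: it lets the loop be exercised offline with a stand-in fetcher before pointing it at a live server.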

The cleaning process removed nulls, URLs, and punctuation characters. A few features were extracted from the text content to potentially improve model performance; these are discussed further in the Logistic Regression notebook.
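The cleaning steps could look roughly like the sketch below, which uses only the standard library. The function names are hypothetical, and `word_count` is an invented example of an extracted feature; the features actually used are the ones described in the Logistic Regression notebook.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text):
    """Strip URLs and ASCII punctuation, then collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())


def clean_posts(posts):
    """Drop posts with a null or empty title and attach cleaned text."""
    cleaned = []
    for post in posts:
        title = post.get("title")
        if not title:  # null or empty title: skip the post
            continue
        body = post.get("selftext") or ""  # treat a null body as empty
        text = clean_text(f"{title} {body}")
        if text:
            # word_count is a hypothetical feature, for illustration only
            cleaned.append({**post, "text": text,
                            "word_count": len(text.split())})
    return cleaned
```

Note that removing every punctuation character also removes apostrophes, so "don't" becomes "dont"; whether that is acceptable depends on the vectorizer settings used later.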

Preprocessing and Modeling

For this project, two vectorizers were used and two models were compared: Random Forest and Logistic Regression. Ultimately the model of interest was Logistic Regression, because its results let us explore the most impactful words. There are many other vectorizers and models to explore in future steps for this project. These will be included and updated as time permits in the following weeks.

Model Evaluation and Basic Cleaning Results to Note

At the start of the project, I considered which metrics and evaluation methods to use. The most appropriate metric appeared to be accuracy, supplemented by the misclassification percentages read from confusion matrices. Since the purpose of the project was to distinguish between two forms of communication, no other metric seemed as applicable as accuracy. The class balance supports this choice: after data extraction and cleaning, we ended up with fairly balanced classes of approximately 50% per subreddit.
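To make these quantities concrete, the sketch below computes a confusion matrix, accuracy, and misclassification rate by hand. The labels and predictions are invented for illustration (they are not project results), and the helper names are hypothetical. The baseline is the accuracy of always guessing the larger class, which is about 50% when the classes are balanced.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            counts["tp" if pred == positive else "fn"] += 1
        else:
            counts["fp" if pred == positive else "tn"] += 1
    return {key: counts[key] for key in ("tn", "fp", "fn", "tp")}


def summarize(y_true, y_pred):
    """Combine confusion counts with accuracy and a majority-class baseline."""
    c = confusion_counts(y_true, y_pred)
    total = sum(c.values())
    return {
        **c,
        "accuracy": (c["tp"] + c["tn"]) / total,
        "misclassification_rate": (c["fp"] + c["fn"]) / total,
        "baseline_accuracy": max(Counter(y_true).values()) / total,
    }


# Invented example: 1 = Conspiracy, 0 = politics.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
report = summarize(y_true, y_pred)
print(report)
```

With balanced classes, a model's accuracy can be compared directly against the 0.5 baseline, while the `fp` and `fn` counts show which kind of mistake dominates.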


Conclusion

Ultimately, we are able to get a sense of what types of words to avoid. There are various ways to build up this model's capabilities that can be explored further, but we can conclude that there is a difference between political and conspiracy discourse. A more in-depth sentiment analysis will likely be helpful for future politically centered clients.

For the preliminary exploration and purpose of this project, however, our final model can differentiate text between conspiracy and political discourse.

Additional notes can be drawn from the model: there are some topics to avoid in order to deliver a stronger message to the general public. There are also topics to address, shown in our examination of the top 25 words across model iterations; in the political discourse, hot topics were those that contained 'Ukraine'. Similarly, the events occurring in Afghanistan were a frequent topic of discussion that appeared in our final examination process.


Future Recommendations

More work could be done to improve this model. However, with the rise in fake news, any modern-day political candidate needs to consider how to improve their communications and their connection with the general public they hope to represent. Additional resources should also be allocated to explore how we, as a marketing strategy group, could address this rising issue. So my recommendation is to look at the larger social media platforms and attempt to create a predictive model that campaign communications managers could use to gauge what audience their messaging appeals to or follows.
