Giter VIP home page Giter VIP logo

ac209a-twitter-project's Introduction

AC209a Twitter Bot Detection Project

This repository contains all of the notebooks that were developed for the purpose of detecting bots on Twitter using data science and machine learning techniques, for the AC209a fall semester class at Harvard University.

Access the Github Pages website here.

The notebooks outlined the procedure of data acquisition using the tweepy Twitter API library for Python, followed by exploratory data analysis and model development. Natural language processing techniques such as topic modeling and sentiment analysis were also implemented to obtain both tweet-level features user-level features. The machine learning techniques utilized in this study were:

  • Logistic Regression
  • LDA/QDA
  • Random Forest
  • Boosting
  • K Nearest Neighbors
  • Feed Forward Artificial Neural Network
  • Support Vector Machines
  • Stacking (Meta Ensembling)
  • Blended Ensemble

These models are compared and the most informative features to determine whether a particular Twitter account is a bot or a legitimate user are discussed. A discussion is also given which aims to answer the following question:

Can we identify the ratio of bots to legitimate users using a subset of tweets about Trump?

jpg

Summary

In summary, bot detection is not a trivial task and it is difficult to evaluate the performance of our bot detection algorithms due to the lack of certainty about real users scraped from the Twitter API. Even humans cannot achieve perfect accuracy in bot detection. That said, the models we developed over the course of this project were able to detect bots with a high accuracy. They even had comparable performance to well-developed models from industry trained on datasets ten times the size of ours and with more than 1000 features. Adding NLP-related features to our models significantly improved their accuracy on our test set. However, we were most effective at bot detection on an unseen data set using only user-based features. With more time, we would train our NLP models on a larger diversity of bots and users to allow for stronger fingerprinting of bots versus legitimate users.

ac209a-twitter-project's People

Contributors

mpstewart1 avatar cstolz avatar cindy20180901 avatar tianning-zhao avatar

Stargazers

An Dang-Hieu avatar  avatar Niloofar Rezaei avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.