Twitter Feed Analysis using Spark with Hadoop

An academic project for the course "Principles of Big Data Management": developing a system to store, process, analyse, and visualize Twitter data using Apache Spark.

Phase 1: Hadoop and Spark MapReduce word count of URLs and hashtags from tweets collected through the Twitter API using twarc.

Documentation

Tweet Collection & Extraction of URLs and HashTags

1. A Python script performs the tweet collection through the API and then extracts URLs & hashtags
    (a simplified sketch is shown after this list).
2. It prompts the user for keyword(s), searches for the corresponding tweets, and stores them in a JSON file.
3. The twarc 'search' command collects the matching tweets with a timeout of 15 minutes (i.e., the collection
    of tweets is stopped if the search command has not finished within 15 minutes).
4. The collected tweets are stored in a JSON file, 'tweets_keywords'.
5. URLs and hashtags are extracted from 'tweets_keywords' into a text file, 'twitter_out.txt'.
    i.   While reading tweets, empty tweets are ignored; only tweets with at least one URL or one hashtag are extracted.
    ii.  A tweet's URL entity can contain several kinds of URLs; only the main URL is extracted into the text file,
         and the remaining kinds are ignored.
    iii. Similarly, all the hashtags of each tweet are extracted into the output text file.
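
Below is a minimal sketch of this collection-and-extraction flow. It assumes the twarc v1 Python client
(Twarc.search) and the standard Twitter v1.1 tweet JSON entities; the credential placeholders and the exact
filtering logic are illustrative assumptions, not the project's actual twitter_extraction.py.

    # Hypothetical sketch of the collect-and-extract step (not the actual twitter_extraction.py).
    import json
    import time
    from twarc import Twarc  # twarc v1 client

    # Placeholder credentials -- the real script would supply its own keys/tokens.
    t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    keywords = input("Enter keyword(s) to search: ")

    # 1) Collect tweets for the keyword(s), stopping after the 15-minute timeout,
    #    and store the raw JSON, one tweet per line.
    deadline = time.time() + 15 * 60
    with open("tweets_keywords.json", "w") as raw:
        for tweet in t.search(keywords):
            if time.time() > deadline:
                break
            raw.write(json.dumps(tweet) + "\n")

    # 2) Extract the main URL and all hashtags from each non-empty tweet.
    with open("tweets_keywords.json") as raw, open("twitter_out.txt", "w") as out:
        for line in raw:
            if not line.strip():               # skip empty lines/tweets
                continue
            tweet = json.loads(line)
            entities = tweet.get("entities", {})
            urls = [u["expanded_url"] for u in entities.get("urls", []) if u.get("expanded_url")]
            hashtags = [h["text"] for h in entities.get("hashtags", [])]
            if not urls and not hashtags:      # keep only tweets with at least one URL or hashtag
                continue
            if urls:                           # write only the main URL, ignore the other URL kinds
                out.write(urls[0] + "\n")
            for tag in hashtags:               # write every hashtag
                out.write(tag + "\n")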

Deployment (Step-by-Step Execution)

1. Extract URLs & hashtags from the collected tweets for the given user keyword(s).
    python twitter_extraction.py
2. Move the extracted text file 'twitter_out.txt' from the local folder to an HDFS folder.
    $HADOOP_HOME/bin/hdfs dfs -put '/local/path/twitter_out.txt' /your_hdfs_folder 
3. Run the Hadoop MapReduce WordCount on 'twitter_out.txt' in 'your_hdfs_folder' and place the generated output under 'output'.
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount /your_hdfs_folder/twitter_out.txt /your_hdfs_folder/output
4. The generated output can be found in the HDFS folder named 'output'.
5. Run the Spark WordCount on 'twitter_out.txt' and redirect the word-count output into a 'Spark_Output.txt'
    file stored in the local folder (a PySpark equivalent is sketched below).
    $SPARK_HOME/bin/run-example JavaWordCount /your_hdfs_folder/twitter_out.txt | grep -v INFO >> Spark_Output.txt
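
For illustration, the word count performed by the JavaWordCount example in step 5 can be written in a few
lines of PySpark. This is a hedged sketch, assuming a Spark installation with PySpark available; the HDFS
path and output file name mirror the placeholders used above and are not part of the project's scripts.

    # Hypothetical PySpark equivalent of the JavaWordCount step (illustration only).
    from pyspark import SparkContext

    sc = SparkContext(appName="TwitterWordCount")

    # Read the extracted URLs/hashtags from HDFS, split each line on whitespace,
    # and count the occurrences of every token.
    counts = (
        sc.textFile("hdfs:///your_hdfs_folder/twitter_out.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
    )

    # Collect the counts to the driver and write them to a local file,
    # mirroring the Spark_Output.txt produced in step 5.
    with open("Spark_Output.txt", "w") as out:
        for word, count in counts.collect():
            out.write("%s\t%d\n" % (word, count))

    sc.stop()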

Authors

chandrasekhar-syamala
