Twitter Feed Analysis using Spark with Hadoop

An academic project for the course "Principles of Big Data Management": developing a system to store, process, analyse, and visualize Twitter data using Apache Spark.

Phase 1: Hadoop and Spark MapReduce word count of URLs and hashtags from tweets collected through the Twitter API using twarc.

Documentation

Tweet Collection & Extraction of URLs and HashTags

1. A Python script performs the tweet collection through the API and then extracts URLs & hashtags
    (a simplified sketch is shown after this list).
2. It prompts the user for keyword(s), searches for the corresponding tweets, and stores them in a JSON file.
3. The twarc 'search' command collects the matching tweets with a timeout of 15 minutes (i.e., the collection
    of tweets is stopped if the search command has not finished within 15 minutes).
4. The collected tweets are stored in a JSON file, 'tweets_keywords'.
5. URLs and hashtags are extracted from 'tweets_keywords' into a text file, 'twitter_out.txt'.
    i.   While reading tweets, empty tweets are ignored; only tweets with at least one URL or one hashtag are extracted.
    ii.  A tweet's URL entity can contain several kinds of URLs; only the main URL is extracted into the text file,
         and the remaining kinds are ignored.
    iii. Similarly, all the hashtags of each tweet are extracted into the output text file.
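
Below is a minimal sketch of this collection-and-extraction flow. It assumes the twarc v1 Python client
(Twarc.search) and the standard Twitter v1.1 tweet JSON entities; the credential placeholders and the exact
filtering logic are illustrative assumptions, not the project's actual twitter_extraction.py.

    # Hypothetical sketch of the collect-and-extract step (not the actual twitter_extraction.py).
    import json
    import time
    from twarc import Twarc  # twarc v1 client

    # Placeholder credentials -- the real script would supply its own keys/tokens.
    t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    keywords = input("Enter keyword(s) to search: ")

    # 1) Collect tweets for the keyword(s), stopping after the 15-minute timeout,
    #    and store the raw JSON, one tweet per line.
    deadline = time.time() + 15 * 60
    with open("tweets_keywords.json", "w") as raw:
        for tweet in t.search(keywords):
            if time.time() > deadline:
                break
            raw.write(json.dumps(tweet) + "\n")

    # 2) Extract the main URL and all hashtags from each non-empty tweet.
    with open("tweets_keywords.json") as raw, open("twitter_out.txt", "w") as out:
        for line in raw:
            if not line.strip():               # skip empty lines/tweets
                continue
            tweet = json.loads(line)
            entities = tweet.get("entities", {})
            urls = [u["expanded_url"] for u in entities.get("urls", []) if u.get("expanded_url")]
            hashtags = [h["text"] for h in entities.get("hashtags", [])]
            if not urls and not hashtags:      # keep only tweets with at least one URL or hashtag
                continue
            if urls:                           # write only the main URL, ignore the other URL kinds
                out.write(urls[0] + "\n")
            for tag in hashtags:               # write every hashtag
                out.write(tag + "\n")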

Deployment (Step-by-Step Execution)

1. Extract URLs & hashtags from the collected tweets for the given user keyword(s).
    python twitter_extraction.py
2. Move the extracted text file 'twitter_out.txt' from the local folder to an HDFS folder.
    $HADOOP_HOME/bin/hdfs dfs -put '/local/path/twitter_out.txt' /your_hdfs_folder 
3. Run the Hadoop MapReduce WordCount on 'twitter_out.txt' in 'your_hdfs_folder' and place the generated output under 'output'.
    $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount /your_hdfs_folder/twitter_out.txt /your_hdfs_folder/output
4. The generated output can be found in the HDFS folder named 'output'.
5. Run the Spark WordCount on 'twitter_out.txt' and redirect the word-count output into a 'Spark_Output.txt'
    file stored in the local folder (a PySpark equivalent is sketched below).
    $SPARK_HOME/bin/run-example JavaWordCount /your_hdfs_folder/twitter_out.txt | grep -v INFO >> Spark_Output.txt
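
For illustration, the word count performed by the JavaWordCount example in step 5 can be written in a few
lines of PySpark. This is a hedged sketch, assuming a Spark installation with PySpark available; the HDFS
path and output file name mirror the placeholders used above and are not part of the project's scripts.

    # Hypothetical PySpark equivalent of the JavaWordCount step (illustration only).
    from pyspark import SparkContext

    sc = SparkContext(appName="TwitterWordCount")

    # Read the extracted URLs/hashtags from HDFS, split each line on whitespace,
    # and count the occurrences of every token.
    counts = (
        sc.textFile("hdfs:///your_hdfs_folder/twitter_out.txt")
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b)
    )

    # Collect the counts to the driver and write them to a local file,
    # mirroring the Spark_Output.txt produced in step 5.
    with open("Spark_Output.txt", "w") as out:
        for word, count in counts.collect():
            out.write("%s\t%d\n" % (word, count))

    sc.stop()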

Authors

chandrasekhar-syamala
