REDDIT EXPLORE PROJECT

In this project we use Reddit comments to explore several features of Apache Spark, HDFS and Docker. The main objective was to gain hands-on experience setting up containerized Apache Spark and HDFS clusters with Docker. The project was completed as part of the grading requirements for a Data Engineering course.

The collaborators of this project are dininduviduneth, gabriel-arteaga, tayyabhasan and atefeharemian.

The main phases of the project were as follows:

  1. Set up a multi-container Spark-HDFS cluster using docker-compose.
  2. Run multiple analyses on the provided dataset of Reddit comments.
  3. Choose a few analysis pipelines and run experiments to measure their performance with a varying number of worker nodes.

1. Setting up the multi-container Spark-HDFS cluster

The base architecture we set up looks as follows:

Base Architecture

STEP 1: INSTALLATION OF PRE-REQUISITES FOR THE VM

  • Install pip, Jupyter and PySpark

    sudo apt install python3-pip
    pip3 install jupyter
    pip3 install pyspark
    
  • Spark Installation

    Download the tarball for Spark 3.2.3

    wget https://dlcdn.apache.org/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz
    

    Extract the tarball

    tar -xvf spark-3.2.3-bin-hadoop3.2.tgz
    

    Remove the downloaded archive

    rm spark-3.2.3-bin-hadoop3.2.tgz
    

    Move the directory to /opt/ and rename it as spark

    sudo mv spark-3.2.3-bin-hadoop3.2/ /opt/spark
    

    Add SPARK_HOME to the environment variables (.bashrc)

    nano .bashrc
    

    Add the following to the .bashrc file

    export SPARK_HOME=/opt/spark
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
    
  • Install Docker using official documentation: Docker Installation
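
After reloading the shell configuration (source ~/.bashrc), a quick way to confirm that PySpark is usable is to start a throwaway local SparkSession from Python. This is only a sanity check on the VM itself and does not involve the Docker cluster; a minimal sketch:

    from pyspark.sql import SparkSession

    # Start a local SparkSession purely to confirm the installation;
    # this runs on the VM, not on the Docker cluster.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("installation-check")
        .getOrCreate()
    )

    print(spark.version)  # prints the installed PySpark version
    spark.stop()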

STEP 2: SETUP THE PROJECT REPOSITORY

  • Clone the reddit-explore-uu repository

    git clone https://github.com/dininduviduneth/reddit-explore-uu.git
    

STEP 3: BUILD DOCKER IMAGES AND START CLUSTER

  • Build and start the base cluster

    docker-compose up -d
    
  • Verify successful completion

    Check if the required images are created

    docker images
    

    The images bde2020/hadoop-namenode, bde2020/hadoop-datanode, docker-config_spark-master and docker-config_spark-worker should have been created.

    Check if the containers are up and running

    docker ps
    

    There should be four containers running - one each for the Spark master, Spark worker, Hadoop namenode and Hadoop datanode.
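
Beyond docker ps, a quick functional check is to attach a SparkSession from the VM to the containerized Spark master. The master URL below is an assumption (a standalone master typically listens on port 7077, with its web UI on port 8080); use whatever host name and port the docker-compose configuration actually exposes:

    from pyspark.sql import SparkSession

    # Attach a driver to the standalone master running in Docker.
    # "spark://localhost:7077" is an assumption -- replace it with the
    # master URL exposed by this docker-compose setup.
    spark = (
        SparkSession.builder
        .master("spark://localhost:7077")
        .appName("cluster-check")
        .getOrCreate()
    )

    # A trivial distributed job to confirm the workers respond.
    print(spark.sparkContext.parallelize(range(1000)).sum())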

STEP 4: UPLOADING DATA TO HDFS

  • Download the required Reddit dataset to the VM
    wget https://files.pushshift.io/reddit/comments/[REDDIT_DATASET_FILENAME].zst
    
  • Extract the dataset
    unzstd --ultra --memory=2048MB -d [REDDIT_DATASET_FILENAME].zst
    
  • Delete the compressed file
    rm [REDDIT_DATASET_FILENAME].zst
    
  • Push the file to HDFS using the -put command
    hdfs dfs -put [REDDIT_DATASET_FILENAME] /
    
  • Once uploaded, the dataset should be visible in the HDFS web UI at http://[PUBLIC IP OF VM]:9870/
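
The Pushshift comment dumps are newline-delimited JSON, so the uploaded file can be read from HDFS directly with PySpark. A minimal sketch; the namenode address and port are assumptions and should match the fs.defaultFS value of the hadoop-namenode container:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("reddit-explore").getOrCreate()

    # Read the newline-delimited JSON dump straight from HDFS.
    # "hdfs://localhost:9000" is an assumption -- match it to the
    # fs.defaultFS setting of the hadoop-namenode container.
    comments = spark.read.json("hdfs://localhost:9000/[REDDIT_DATASET_FILENAME]")
    comments.printSchema()
    print(comments.count())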

STEP 5: STARTING JUPYTER AND RUNNING ANALYSIS CODE

  • Start a Jupyter server

    python3 -m notebook --ip=* --no-browser

  • Navigate to data-analysis within the reddit-explore-uu directory and run the analysis notebooks.
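
As an illustration of the kind of analysis these notebooks run (the actual notebooks in data-analysis may differ), the sketch below counts comments per subreddit. It assumes the comments DataFrame loaded from HDFS in step 4 and the standard Pushshift comment schema, which includes a subreddit field:

    from pyspark.sql import functions as F

    # Illustrative analysis: comment volume per subreddit.
    # "comments" is the DataFrame read from HDFS in step 4.
    top_subreddits = (
        comments
        .groupBy("subreddit")
        .agg(F.count("*").alias("n_comments"))
        .orderBy(F.desc("n_comments"))
    )
    top_subreddits.show(10, truncate=False)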

STEP 6: SCALING

We used Docker Compose's scale functionality to scale the number of Spark worker nodes. The scaled setup with seven worker nodes looks as follows:

Scaled Architecture

  • To scale the cluster to X workers
    docker-compose scale spark-worker=X
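
For the performance experiments with a varying number of workers, one simple approach is to re-run the same job after each scaling step and compare wall-clock times. A minimal timing sketch, assuming a representative action such as the per-subreddit count from step 5:

    import time

    # Time one representative action; repeat after each change in the
    # number of workers and compare the elapsed times.
    start = time.time()
    top_subreddits.collect()
    print(f"elapsed: {time.time() - start:.1f} s")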
    

STEP 7: SHUTTING DOWN

  • To stop a scaled cluster, first scale down to 1 worker node

    docker-compose scale spark-worker=1
    
  • To stop the containers

    docker-compose stop
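
Before stopping the containers, it can also help to release the cluster resources held by the notebook; an optional housekeeping step:

    # Stop the SparkSession so the application releases its executors
    # before the containers are shut down.
    spark.stop()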
    
