The objective of this project is to implement an application scenario which illustrates the use of the following techniques:
- Docker: it can be used to deploy a cluster of (virtual) machines on a single laptop, but it can also be used in a distributed setting with several laptops.
- Spark: a Spark infrastructure, including HDFS, one master and several slaves, must be deployed on top of the Docker infrastructure.
This project is part of the track Performance in Software, Media, and Scientific Computing of the MSc course Cloud Computing and Big Data given at Toulouse INP-E.N.S.E.E.I.H.T. Eng. School and Paul Sabatier Faculty of Science and Engineering.
First, clone this Git repository and go into its main folder.
You first need to install Docker. On Ubuntu:
wget -qO- https://get.docker.com/ | sh
sudo apt-get install ufw
sudo usermod -aG docker $USER
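The group change made by usermod only takes effect in a new session. After logging out and back in, a quick way to check that Docker works (not part of the course scripts) is:
docker run hello-world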
With Docker installed, execute the following scripts:
cd local/
./reset-ccbd.sh -i
./start-ccbd.sh <n>
where n is the total number of containers to be built (1 master + (n-1) slaves).
This builds all the Docker images, runs all the containers, starts Hadoop, and drops you into the Hadoop master container as the root user.
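For example, to get one master and three slaves on a single laptop, one possible invocation is:
./start-ccbd.sh 4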
To run the Word Count example, all you need to do is execute the following lines:
cd examples/
./start-wordcount.sh
The time it took your configuration to count the words in file-wordcount.txt is displayed at the end of the execution.
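If you want to confirm that the cluster and HDFS are up before or after the run, you can list the HDFS root from inside the master container; this assumes the Hadoop client is on the PATH there:
hdfs dfs -ls /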
Install Docker on every guest host that will be used.
With Docker installed on every guest host, set each host's IP address as static (be sure to keep internet access).
Every worker host communicates its public RSA key to the manager, which saves them in its authorized_keys file.
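One common way to exchange the keys (the user name and address below are placeholders, not values fixed by the project) is to run, on each worker host:
ssh-keygen -t rsa
ssh-copy-id <user>@<manager-ip>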
Modify set-configuration.sh on the manager as follows:
- Put the manager's IP address
Modify set-configuration.sh on the workers as follows:
- Put the manager's host name
- Put the manager's IP address
- Put the worker's own IP address
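The exact variable names depend on the script; purely as an illustration, the worker-side edits might look like the following (all names and addresses are hypothetical):
# hypothetical excerpt of set-configuration.sh on a worker (placeholders only)
MANAGER_HOSTNAME=manager      # the manager's host name
MANAGER_IP=192.168.1.10       # the manager's IP address
WORKER_IP=192.168.1.11        # this worker's IP address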
On every guest host, execute the following script:
cd remote/
sudo ./set-ports.sh
(This script only works on Linux distributions)
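Once the ports have been set, you can check from a worker that the manager is reachable on a given port; the port numbers below are common Spark/Hadoop defaults used only as examples, not necessarily the ports opened by set-ports.sh:
nc -zv <manager-ip> 7077      # Spark master default port
nc -zv <manager-ip> 50070     # HDFS NameNode web UI default port (Hadoop 2.x)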
On the manager guest host, execute the following script:
cd manager/
./reset-ccbd.sh -i
./start-ccbd.sh <n> <m>
where n is the total number of containers on the manager host (master + slaves, default: 3) and m is the total number of remote slaves in the cluster (default: 2). This builds all the Docker images locally, runs all the containers, and drops you into the Hadoop master container as the root user.
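For example, with the default topology (one master and two slaves on the manager host, and two remote slaves expected), the invocation is:
./start-ccbd.sh 3 2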
When the previous script has finished, execute the following script on each worker:
cd worker/
./reset-ccbd.sh -i
./start-ccbd.sh <i> <n>
where i is the starting index (the number of slaves already launched + 1, default: 3) and n is the number of slaves to run on this host (default: 2). If you launch only one worker, all the default values are appropriate. Nothing more needs to be done on the worker guest hosts; just make sure they stay powered on.
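For instance, with the manager defaults above (two slaves already running on the manager host), the first worker would run:
./start-ccbd.sh 3 2
A second worker would then continue the numbering with ./start-ccbd.sh 5 2.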
On the manager host, execute the following commands:
start-hadoop.sh
cd examples/
./start-wordcount.sh
The time it took Spark to count the words in file-wordcount.txt is displayed and saved in /tmp/time-wordcount.log.
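The saved timing can be read back later from the master container, for example with:
cat /tmp/time-wordcount.log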
- Docker - A computer program that performs operating-system-level virtualization
- Spark - A unified analytics engine for large-scale data processing.
- Hadoop - An open-source software for reliable, scalable, distributed computing.
- Guillaume Hugonnard - MSc-PSMSC, INPT-ENSEEIHT Eng. School and Paul Sabatier Faculty of Science and Engineering - GuillaumeHugonnard
- Tom Ragonneau - MSc-PSMSC, INPT-ENSEEIHT Eng. School and Paul Sabatier Faculty of Science and Engineering - TomRagonneau
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE file for details
- Daniel Hagimont - MSc-PSMSC speaker, INPT-ENSEEIHT Eng. School and Paul Sabatier Faculty of Science and Engineering