The objective of this project is to implement an application scenario which illustrates the use of the following techniques:
- Docker: it can be used to deploy a cluster of (virtual) machines on a single laptop, but it can also be used in a distributed setting with several laptops.
- Spark: a Spark infrastructure, including HDFS, one master and several slaves, must be deployed on top of the Docker infrastructure.
This project is part of the track Performance in Software, Media, and Scientific Computing of the MSc course Cloud Computing and Big Data given at Toulouse INP-E.N.S.E.E.I.H.T. Eng. School and Paul Sabatier Faculty of Science and Engineering.
First, clone this Git repository and go into its main folder.
You first need to install Docker. On Ubuntu:
wget -qO- https://get.docker.com/ | sh
sudo apt-get install ufw
sudo usermod -aG docker $USER
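The group change made by usermod only takes effect in a new session. After logging out and back in, a quick way to check that Docker works (not part of the course scripts) is:
docker run hello-world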
With Docker installed, execute the following scripts:
cd local/
./reset-ccbd.sh -i
./start-ccbd.sh <n>
where n is the total number of containers to be built (1 master + (n-1) slaves).
This builds all the Docker images, runs all the containers, starts Hadoop, and drops you into the Hadoop master container as the root user.
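For example, to get one master and three slaves on a single laptop, one possible invocation is:
./start-ccbd.sh 4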
To run the Word Count example, all you need to do is execute the following lines:
cd examples/
./start-wordcount.sh
The time it took your configuration to count the words in file-wordcount.txt is displayed at the end of the execution.
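If you want to confirm that the cluster and HDFS are up before or after the run, you can list the HDFS root from inside the master container; this assumes the Hadoop client is on the PATH there:
hdfs dfs -ls /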
Install Docker on every guest host that will be used.
With Docker installed on every guest host, set each host's IP address as static (be sure to keep internet access).
Every worker host communicates its public RSA key to the manager, which saves them in its authorized_keys file.
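One common way to exchange the keys (the user name and address below are placeholders, not values fixed by the project) is to run, on each worker host:
ssh-keygen -t rsa
ssh-copy-id <user>@<manager-ip>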
Modify set-configuration.sh on the manager as follows:
- Put the manager's IP address
Modify set-configuration.sh on the workers as follows:
- Put the manager's host name
- Put the manager's IP address
- Put the worker's own IP address
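The exact variable names depend on the script; purely as an illustration, the worker-side edits might look like the following (all names and addresses are hypothetical):
# hypothetical excerpt of set-configuration.sh on a worker (placeholders only)
MANAGER_HOSTNAME=manager      # the manager's host name
MANAGER_IP=192.168.1.10       # the manager's IP address
WORKER_IP=192.168.1.11        # this worker's IP address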
On every guest host, execute the following script:
cd remote/
sudo ./set-ports.sh
(This script only works on Linux distributions)
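Once the ports have been set, you can check from a worker that the manager is reachable on a given port; the port numbers below are common Spark/Hadoop defaults used only as examples, not necessarily the ports opened by set-ports.sh:
nc -zv <manager-ip> 7077      # Spark master default port
nc -zv <manager-ip> 50070     # HDFS NameNode web UI default port (Hadoop 2.x)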
On the manager guest host, execute the following script:
cd manager/
./reset-ccbd.sh -i
./start-ccbd.sh <n> <m>
where n is the total number of containers on the manager host (master + slaves, default: 3) and m is the total number of remote slaves in the cluster (default: 2). This builds all the Docker images locally, runs all the containers, and drops you into the Hadoop master container as the root user.
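For example, with the default topology (one master and two slaves on the manager host, and two remote slaves expected), the invocation is:
./start-ccbd.sh 3 2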
When the previous script has finished, execute the following script on each worker:
cd worker/
./reset-ccbd.sh -i
./start-ccbd.sh <i> <n>
where i is the starting index (the number of slaves already launched + 1, default: 3) and n is the number of slaves to run on this host (default: 2). If you launch only one worker, all the default values are appropriate. Nothing more needs to be done on the worker guest hosts; just make sure they stay powered on.
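For instance, with the manager defaults above (two slaves already running on the manager host), the first worker would run:
./start-ccbd.sh 3 2
A second worker would then continue the numbering with ./start-ccbd.sh 5 2.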
On the manager host, execute the following commands:
start-hadoop.sh
cd examples/
./start-wordcount.sh
The time it took Spark to count the words in file-wordcount.txt is displayed and saved in /tmp/time-wordcount.log.
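The saved timing can be read back later from the master container, for example with:
cat /tmp/time-wordcount.log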
- Docker - A computer program that performs operating-system-level virtualization
- Spark - A unified analytics engine for large-scale data processing.
- Hadoop - An open-source software for reliable, scalable, distributed computing.
- Guillaume Hugonnard - MSc-PSMSC, INPT-ENSEEIHT Eng. School and Paul Sabatier Faculty of Science and Engineering - GuillaumeHugonnard
- Tom Ragonneau - MSc-PSMSC, INPT-ENSEEIHT Eng. School and Paul Sabatier Faculty of Science and Engineering - TomRagonneau
See also the list of contributors who participated in this project.
This project is licensed under the MIT License - see the LICENSE file for details
- Daniel Hagimont - MSc-PSMSC speaker, INPT-ENSEEIHT Eng. School and Paul Sabatier Faculty of Science and Engineering