- Clone this repo to your local machine.
- We'll start off by creating an AWS EMR cluster, just as in the first assignment. Head over to AWS EMR and get started.
- Click Create cluster and configure it as shown below:
- The cluster remains in the 'Starting' state for about 10 to 15 minutes. Once the cluster is ready for use, its status changes to 'Waiting', and you can start using it.
- Click on "Learn how to create an EC2 key pair" to create and modify your EC2 key pair.
- In the top-left corner, go to Services->EC2.
- In the left-hand panel, go to Security Groups under Network & Security.
- Select the group named "ElasticMapReduce-master" and click Edit in the Inbound tab below.
- Click Add rule, select SSH as the type and My IP as the source, then save.
- Now head over to Services->S3 and create a bucket named csds (S3 bucket names are globally unique, so pick a different name if csds is already taken)
- In the bucket, create a folder named csds-spark-emr
- Upload the input.txt file from this repo
- In permissions, tick the box that grants read access to everyone. Nothing needs to change in properties.
- Continue through the remaining steps and upload the file.
- Click on the uploaded file and click the Make public button, just to make sure.
- Now go to the page of the cluster you created (Cluster list->your cluster).
- Near the "Master public DNS:" field, click the SSH button.
- Follow the instructions and SSH into the master node.
- In /home/hadoop create wordcount.py (vi wordcount.py)
- Copy over the contents from wordcount.py in this repo
- In wordcount.py, change the input file S3 URL to point to input.txt in the bucket you created above
- Save
- Go through the code in wordcount.py and check out what it does
- Execute the script using "spark-submit wordcount.py | tee output.txt"
- This also saves a copy of the printed output in output.txt
- You can copy the output file to your S3 bucket with the command "aws s3 cp output.txt s3://my_bucket/my_folder/", replacing my_bucket and my_folder with your own bucket and folder names
- You should see the result of your code among the other log lines. It should look like this:

      And: 2
      on: 1
      then: 1
      Aberbrothok: 2
      bell: 1
      that: 1
      of: 2
      knew: 1
      Had: 1
      placed: 1
      Abbot: 2
      they: 1
      worthy: 1
      blest: 1
      Rock: 2
      Inchcape: 1
      the: 3
      The: 1
      perilous: 1

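
Since the results can be mixed in with other log lines, it may help to pull out just the "word: count" pairs. The following is a minimal sketch, not part of this repo: the function name `extract_counts` is hypothetical, and it assumes every result line has exactly the form shown above (a single word, a colon, a space, and an integer), while log lines contain spaces before the colon and are therefore skipped.

```python
import re
from typing import Dict, Iterable

# Matches lines such as "Rock: 2": one token without spaces, ": ", then digits.
RESULT_LINE = re.compile(r"^(\S+): (\d+)$")


def extract_counts(lines: Iterable[str]) -> Dict[str, int]:
    """Return the word counts found in `lines`, ignoring everything else."""
    counts: Dict[str, int] = {}
    for line in lines:
        match = RESULT_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


if __name__ == "__main__":
    # Made-up sample: one log-like line followed by two result lines.
    sample = ["INFO Example: starting job", "Rock: 2", "the: 3"]
    for word, count in extract_counts(sample).items():
        print(f"{word}: {count}")
```

To use it on the real file, open output.txt and pass the file object to `extract_counts`, since iterating over a file yields its lines.
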
- You're encouraged to play around with the code, check out the documentation and try things out
- Don't forget to terminate your cluster after you're done
- You'll need to follow the same steps the next time you create a new cluster, except for creating the private key for SSH: you can reuse the same key for all clusters
- Also make sure to allow inbound SSH traffic on the master again every time your machine's IP changes, which might happen when you switch between WiFi networks
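
If you want to experiment with the counting logic without a running cluster, a plain-Python analogue can serve as a quick sanity check. This is only a sketch and not the Spark code in wordcount.py: the function name `count_words` is hypothetical, and it assumes words are split on whitespace and counted case-sensitively, which the sample output suggests ("the" and "The" appear as separate entries).

```python
from collections import Counter
from typing import Dict, Iterable


def count_words(lines: Iterable[str]) -> Dict[str, int]:
    """Count whitespace-separated words across all lines, case-sensitively."""
    counts: Counter = Counter()
    for line in lines:
        # str.split() with no arguments splits on any run of whitespace.
        counts.update(line.split())
    return dict(counts)


if __name__ == "__main__":
    # Made-up sample lines; pass an open file object to count a real file.
    sample = ["the bell on the Rock", "The Abbot"]
    for word, count in count_words(sample).items():
        print(f"{word}: {count}")
```

Comparing these counts with the cluster's output for the same input.txt is one way to check that a change to the Spark script still behaves as expected, as long as the script splits words the same way.
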