wordcount-on-amazonemr's Introduction

Simple word count on Amazon EMR

The code was largely taken from this repo. However, the original code couldn't handle Unicode input and didn't strip punctuation. I tweaked the code a little to preprocess the input file so that only lower-case words without punctuation are counted.
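The preprocessing described above can be illustrated in plain Python. This is a minimal sketch, not the repo's actual code: the names `clean_words` and `count_words` are hypothetical, and the choice of NFKC normalization and UTF-8 decoding is an assumption about what "handling Unicode input" means here.

```python
import re
import unicodedata
from collections import Counter

# Matches anything that is neither a word character nor whitespace,
# plus the underscore (which \w would otherwise keep).
_PUNCTUATION = re.compile(r"[^\w\s]|_")


def clean_words(line):
    """Return the lower-case words of a line with punctuation removed."""
    # Accept raw bytes as well as str, assuming the input is UTF-8.
    if isinstance(line, bytes):
        line = line.decode("utf-8", errors="replace")
    # NFKC folds compatibility forms (e.g. full-width letters) together.
    line = unicodedata.normalize("NFKC", line).lower()
    return _PUNCTUATION.sub("", line).split()


def count_words(lines):
    """Count the cleaned words across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(clean_words(line))
    return counts
```

Note that punctuation is deleted rather than replaced by a space, so "don't" is counted as "dont". Replacing it with a space instead would split such words in two; which behaviour is preferable depends on the text.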

Steps

Create Cluster

  • Log in to your AWS console and go to EMR to create a Spark cluster with the following configuration:

    config1

  • Don't forget to choose or create an EC2 key pair so that you can SSH into the master node later:

    config2

It will take around 15 minutes for the cluster to spin up. You should see the cluster status change to Waiting ("Cluster ready") when it's fully up and running.

Add inbound rule on master node

  • On your console page go to Services -> EC2
  • From the left panel find Security Group
  • Select the security group named ElasticMapReduce-master and edit its inbound rules from the Inbound Rules tab
  • Add a new rule with SSH as the Type and, as the Source, the IP address of the machine you will use to SSH into the cluster
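The same inbound rule can also be added from a script instead of the console. The following is a rough sketch assuming boto3 is installed and AWS credentials are configured; `ssh_ingress_rule` and `allow_ssh` are hypothetical helper names, and the security group ID must be looked up for your own ElasticMapReduce-master group.

```python
import ipaddress


def ssh_ingress_rule(my_ip):
    """Build an inbound rule allowing SSH (TCP port 22) from one IPv4 address."""
    # Raises ValueError if my_ip is not a well-formed IPv4 address.
    address = ipaddress.IPv4Address(my_ip)
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # /32 restricts the rule to exactly this one address.
        "IpRanges": [{"CidrIp": f"{address}/32", "Description": "SSH from my machine"}],
    }


def allow_ssh(group_id, my_ip):
    """Attach the SSH rule to the given security group (needs AWS credentials)."""
    import boto3  # imported here so the rule builder works without boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[ssh_ingress_rule(my_ip)],
    )
```

Restricting the source to a single /32 address, as the console steps do, keeps port 22 closed to everyone else.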

Upload input file to S3

  • Go to Services -> S3 and create a new bucket; untick "Block all public access" to make the objects inside publicly readable
  • Create a folder in the bucket and upload the RomeoAndJuliet.txt file from this repo, or any other text file whose words you want to count

Create code file on master node

  • Go back to EMR, and select the cluster you just created
  • Next to the "Master public DNS:" field, click the SSH button and SSH into the master node from the platform of your choice
  • In /home/hadoop create wordcount.py (vi wordcount.py)
  • Copy over the contents from wordcount.py in this repo
  • Don't forget to change the input file's S3 URL in the code so that it points to the text file in your bucket
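For orientation, a script of this kind might look roughly like the sketch below. It is not the contents of wordcount.py from this repo: `INPUT_URL`, `tokenize`, and `main` are illustrative names, and the bucket and folder in the URL are placeholders to replace with your own.

```python
import re

# Hypothetical placeholder: replace with the S3 URL of your own file.
INPUT_URL = "s3://my_bucket/my_folder/RomeoAndJuliet.txt"


def tokenize(line):
    """Lower-case a line, drop punctuation, and split it into words."""
    return re.sub(r"[^\w\s]|_", "", line.lower()).split()


def main(input_url=INPUT_URL):
    """Count the words of the file at input_url and print them, most frequent first."""
    # Imported inside main() so that tokenize() can be tried without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()
    lines = spark.sparkContext.textFile(input_url)
    counts = (
        lines.flatMap(tokenize)                  # one record per cleaned word
        .map(lambda word: (word, 1))             # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)         # sum the counts per word
        .sortBy(lambda pair: pair[1], ascending=False)
    )
    for word, count in counts.collect():
        print(word, count)
    spark.stop()
```

The sketch only defines the functions; adding a call to `main()` at the end of the file is what would make the job run when the script is submitted to Spark.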

Run the spark application

  • Still on the master node, execute the script using
    spark-submit wordcount.py | tee output.txt
    

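You can view the printed result of the word count application in output.txt. Spark typically writes its log messages to stderr, so if you want them in the file as well, add 2>&1 before the pipe.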

  • (Optional) You can copy the output file to your S3 bucket using
    aws s3 cp output.txt s3://my_bucket/my_folder/
    

Terminate the cluster

  • Don't forget to terminate the cluster when you are done (an EMR cluster keeps accruing charges even when it is idle)
  • Next time you need a cluster just create a new one and reuse the same key pair
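If you prefer to double-check from a script that nothing is left running, something along these lines could work. It is a sketch that assumes boto3 and configured AWS credentials; `active_clusters` and `terminate_cluster` are hypothetical helper names, and pagination of the cluster list is ignored for brevity.

```python
# Cluster states in which the cluster is still alive and being billed.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def active_clusters(emr_client):
    """Return (id, name, state) for every cluster that has not been terminated."""
    response = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)
    return [
        (cluster["Id"], cluster["Name"], cluster["Status"]["State"])
        for cluster in response["Clusters"]
    ]


def terminate_cluster(emr_client, cluster_id):
    """Request termination of a single cluster by its ID."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
```

With real credentials, the client would be created with `boto3.client("emr")` and passed to both functions. Passing the client in as a parameter also makes the helpers easy to try against a stand-in object first.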
