geoemr

This repo's scripts set up an EMR cluster with spatially enabled Hive. The emrcfn.yaml template sets up the cluster and relies on resources under the steps/ directory for geospatial support. The template outputs cluster attributes to facilitate nesting it within other templates.

Cluster features

In addition to Hive spatial support, the cluster is configured with the following features:

  • Autoscaling for core and task nodes, to keep up with HDFS and memory utilization, respectively;
  • Spot instances for task nodes, to minimize costs without risking data loss;
  • Core Hadoop: Hadoop, Ganglia, Hive, Hue, Pig, Tez and Mahout;
  • Spark: Spark and Zeppelin;
  • Presto;
  • Hadoop debugging and logging to an S3 bucket;
  • AWS Glue for the Hive and Spark metastores.

Spot instances for task nodes can be replaced with on-demand, if desired.

Spatial support

One of the steps run on cluster creation loads a jar containing two sets of spatial functions:

  • the ST_* functions, for queries on geographic objects;
  • the BT_* functions, for working with Bing tiles.

(If you're writing Java MapReduce jobs, ESRI's Java Geometry API is also bundled in the jar as a dependency of both of the above.) The jar file (in this repo under steps/jar/) is copied to HDFS, and the relevant Hive DDL statements (under steps/sql/) set up the UDFs. If the functions are already defined, they are dropped and recreated.
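The drop-and-recreate pattern can be sketched as a small shell function that emits the Hive DDL for one UDF. This is an illustration only: the HDFS path and jar name are hypothetical, and the function-to-class mapping is an assumption rather than the contents of steps/sql/. The class name com.esri.hadoop.hive.ST_Point is the one used by ESRI's Spatial Framework for Hadoop; whether this repo's jar exposes it under that name is not confirmed here.

```shell
#!/usr/bin/env bash
# Sketch only: JAR_URI is a hypothetical HDFS location, not the repo's actual path.
JAR_URI="${JAR_URI:-hdfs:///user/hadoop/spatial/spatial-udfs.jar}"

# Emit DDL that drops a permanent UDF if it exists and recreates it from the jar.
# $1 = Hive function name, $2 = implementing Java class (assumed, see above)
udf_ddl() {
    local name="$1" class="$2"
    printf 'DROP FUNCTION IF EXISTS %s;\n' "$name"
    printf "CREATE FUNCTION %s AS '%s' USING JAR '%s';\n" "$name" "$class" "$JAR_URI"
}

udf_ddl ST_Point com.esri.hadoop.hive.ST_Point
```

On a cluster, the generated statements could be passed to Hive with `hive -e "$(udf_ddl ST_Point com.esri.hadoop.hive.ST_Point)"`.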

The ST_* spatial functions allow queries on geographic objects. As an example, if we have one table of Census geographies, with their WKT representations, and another of lat-long locations reported by IoT devices, we can identify the Census block groups where we've observed devices as follows:

select
    io.*,
    cg.geoid
from default.iot_observations io
    cross join default.census_geo cg
where
    cg.level = 'blockgroup' and
    -- ST_Point takes (x, y), i.e. (longitude, latitude)
    ST_Intersects(ST_Point(io.longitude, io.latitude),
                  ST_GeomFromText(cg.wkt));

By default, however, Hive runs this query very inefficiently: it materializes the Cartesian product of the two tables and then filters it. Because Hive doesn't natively support spatial indexes, the BT_* functions for Bing tiles can be useful for building your own tile-based index and speeding up spatial joins.
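To make the tile idea concrete: if every point is assigned the key of the map tile that contains it, and every geography the keys of the tiles it touches, the join can first match rows on equal tile keys and only then apply ST_Intersects to the much smaller set of candidates. The sketch below computes such a key (a quadkey) from a latitude and longitude using the formulas of the public Bing Maps Tile System. It does not call or reproduce the BT_* UDFs in this repo's jar, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: quadkey computation per the public Bing Maps Tile System formulas.
# Function names are hypothetical and unrelated to the repo's BT_* UDFs.

# Convert tile coordinates at a given zoom level to a quadkey string.
# $1 = tile X, $2 = tile Y, $3 = zoom level
tile_to_quadkey() {
    awk -v x="$1" -v y="$2" -v z="$3" 'BEGIN {
        key = ""
        for (i = z; i > 0; i--) {
            mask = 2 ^ (i - 1)
            digit = 0
            if (int(x / mask) % 2 == 1) digit += 1   # this bit of X is set
            if (int(y / mask) % 2 == 1) digit += 2   # this bit of Y is set
            key = key digit
        }
        print key
    }'
}

# Convert a WGS84 point to the quadkey of the tile containing it.
# $1 = latitude, $2 = longitude, $3 = zoom level
latlon_to_quadkey() {
    local tile
    tile="$(awk -v lat="$1" -v lon="$2" -v z="$3" 'BEGIN {
        pi = atan2(0, -1)
        if (lat > 85.05112878) lat = 85.05112878     # Web Mercator latitude limits
        if (lat < -85.05112878) lat = -85.05112878
        n = 2 ^ z                                    # tiles per axis at this zoom
        s = sin(lat * pi / 180)
        x = int((lon + 180) / 360 * n)
        y = int((0.5 - log((1 + s) / (1 - s)) / (4 * pi)) * n)
        if (x > n - 1) x = n - 1                     # clamp to the valid tile range
        if (y > n - 1) y = n - 1
        if (y < 0) y = 0
        print x, y
    }')"
    # Intentional word splitting: $tile holds "x y".
    tile_to_quadkey $tile "$3"
}
```

For example, `tile_to_quadkey 3 5 3` prints 213, which matches the worked example in Microsoft's tile system documentation. A higher zoom level gives smaller tiles and fewer false candidates per key, at the cost of more keys per geography.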

Deployment

Deployment is as follows:

  1. Copy the steps/ directory as-is to an S3 prefix readable by the IAM user creating the cluster. You should wind up with these files under, e.g., s3://mybucket/prefixname/steps/.
  2. Create a stack from the CloudFormation template, passing the parameters
    • VpcId (the VPC in which to create the cluster),
    • Subnet (the subnet in which to create the cluster),
    • KeyPair (the SSH keypair to use for connections to the master node) and
    • SpatialInstallS3Prefix (the S3 prefix in step 1, including the bucket name but without the leading 's3://' and trailing '/', as in mybucket/prefixname rather than s3://mybucket/prefixname/).
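Step 1 and the prefix format expected by SpatialInstallS3Prefix can be sketched as two shell helpers. The helper names are hypothetical, and the upload assumes the AWS CLI is installed and configured with write access to the bucket.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not part of this repo.

# Strip a leading "s3://" and any trailing "/" so the value matches the form
# the SpatialInstallS3Prefix parameter expects (e.g. mybucket/prefixname).
normalize_prefix() {
    local p="${1#s3://}"
    while [ "${p%/}" != "$p" ]; do
        p="${p%/}"
    done
    printf '%s\n' "$p"
}

# Step 1: copy the steps/ directory as-is under the prefix.
# Requires the AWS CLI and credentials with write access to the bucket.
upload_steps() {
    local prefix
    prefix="$(normalize_prefix "$1")"
    aws s3 cp steps/ "s3://${prefix}/steps/" --recursive
}
```

Run from the repo root, `upload_steps s3://mybucket/prefixname/` would copy the files to s3://mybucket/prefixname/steps/, and `normalize_prefix s3://mybucket/prefixname/` prints the value to pass as the parameter.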

An example AWS CLI command deploying the template is:

aws cloudformation create-stack --stack-name emrcfn \
                                --template-body "$(cat emrcfn.yaml)" \
                                --capabilities CAPABILITY_NAMED_IAM \
                                --parameters ParameterKey=VpcId,ParameterValue=vpc-XXXXXXXX \
                                             ParameterKey=Subnet,ParameterValue=subnet-XXXXXXXX \
                                             ParameterKey=KeyPair,ParameterValue=MyKeyPairName \
                                             ParameterKey=SpatialInstallS3Prefix,ParameterValue=mybucket/prefixname
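Since the template outputs cluster attributes, a natural follow-up is to wait for stack creation to finish and then list those outputs. The sketch below assumes the stack name emrcfn from the command above; the helper names and the DRY_RUN switch are hypothetical conveniences, and no particular output names are assumed.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the stack name "emrcfn" used in the create-stack command.
STACK_NAME="${STACK_NAME:-emrcfn}"

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Block until stack creation finishes, then list the stack outputs.
show_stack_outputs() {
    run aws cloudformation wait stack-create-complete --stack-name "$STACK_NAME" &&
    run aws cloudformation describe-stacks --stack-name "$STACK_NAME" \
        --query 'Stacks[0].Outputs' --output table
}
```

Calling `DRY_RUN=1 show_stack_outputs` prints the two AWS CLI commands without contacting AWS, which is a cheap way to check them before running against a real account.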

Contributors

  • wwbrannon