EMRSkillSession

Hive schema:

create external table nyc_trips_pq 
(
vendor_name         string,                                  
trip_pickup_datetime string,                                  
trip_dropoff_datetime string,                                  
passenger_count     int,                                  
trip_distance       float,                                  
payment_type        string,                                  
fare_amt            float,
surcharge           float,                                   
mta_tax             float,                                   
tip_amt             float,                                   
tolls_amt           float,                                   
total_amt           float
)
PARTITIONED BY (year String, month String)  
STORED AS PARQUET
LOCATION 's3://neilawspublic/dataset2/'
tblproperties ("parquet.compression"="SNAPPY");

msck repair table nyc_trips_pq;
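
`msck repair table` registers partitions by scanning the table location for Hive-style `column=value` directories, one level per partition column. The sketch below shows how a partition spec can be read off such a path. The object key is a hypothetical example, since the actual layout under `s3://neilawspublic/dataset2/` is not listed here, and `partition_spec` is an illustrative helper, not part of Hive.

```python
def partition_spec(key, partition_columns=("year", "month")):
    """Return the partition values encoded in a Hive-style path.

    Illustrative helper (not a Hive API): Hive expects one directory
    level per partition column, named column=value.
    """
    spec = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep and name.lower() in partition_columns:
            spec[name.lower()] = value
    missing = [c for c in partition_columns if c not in spec]
    if missing:
        raise ValueError(f"{key!r} has no directory for partition column(s) {missing}")
    return spec


# Hypothetical object key; the real file names in the bucket may differ.
example_key = "dataset2/year=2016/month=01/part-00000.snappy.parquet"
print(partition_spec(example_key))  # {'year': '2016', 'month': '01'}
```

Files that sit directly under the table location, outside any `year=.../month=...` directory, would not be picked up as partitions, which is why the helper raises for them.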

Zeppelin Code:

%pyspark
from pyspark.sql import SparkSession

hivetablename = 'nyc_trips_pq'
# Top 10 year/month combinations by trip count ("order by 4" sorts on total_trips)
sqltext = 'Select year, month, sum(passenger_count) as total_passengers, count(1) as total_trips from ' + hivetablename + ' group by year, month order by 4 DESC LIMIT 10'

%pyspark

# Zeppelin normally provides `spark` already; getOrCreate() returns the existing session.
spark = SparkSession.builder.appName("Zeppelin-Spark").enableHiveSupport().getOrCreate()
df = spark.sql(sqltext)
df.createOrReplaceTempView("nyc_trips")
# Cache the view through the session catalog (SQLContext expects a SparkContext, not a SparkSession)
spark.catalog.cacheTable("nyc_trips")
df.show()


%sql
Select * from nyc_trips
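
To make the aggregation explicit, here is a minimal pure-Python sketch of what the summary query computes: passengers summed and trips counted per (year, month), sorted by trip count descending, top 10 kept. The helper `summarize_trips` and the sample rows are invented for illustration and are not taken from the dataset. Unlike SQL `sum`, this sketch does not skip NULL passenger counts.

```python
from collections import defaultdict


def summarize_trips(rows, limit=10):
    """Mirror the GROUP BY year, month query on an in-memory list of rows."""
    totals = defaultdict(lambda: [0, 0])  # (year, month) -> [passengers, trips]
    for row in rows:
        bucket = totals[(row["year"], row["month"])]
        bucket[0] += row["passenger_count"]  # sum(passenger_count)
        bucket[1] += 1                       # count(1)
    result = [
        (year, month, passengers, trips)
        for (year, month), (passengers, trips) in totals.items()
    ]
    # "order by 4 DESC" sorts on the fourth select column, total_trips.
    result.sort(key=lambda r: r[3], reverse=True)
    return result[:limit]


# Made-up rows for illustration only.
sample = [
    {"year": "2016", "month": "01", "passenger_count": 2},
    {"year": "2016", "month": "01", "passenger_count": 1},
    {"year": "2016", "month": "02", "passenger_count": 4},
]
print(summarize_trips(sample))  # [('2016', '01', 3, 2), ('2016', '02', 4, 1)]
```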

RStudio:

EMR Launch Bootstrap:

Script Location: s3://aws-bigdata-blog/artifacts/aws-blog-emr-rstudio-sparklyr/rstudio_sparklyr_emr5.sh

Script Arguments: --rstudio --shiny --sparkr --rexamples --plyrmr --rhdfs --sparklyr

R Code

# Load packages
library(sparklyr)
library(DBI)
library(ggplot2)
library(scales)
# Connect to Spark on YARN through sparklyr
sc <- spark_connect(master = "yarn-client", version = "2.1.0")
# Top 10 year/month combinations by trip count
nyc_trips_preview <- dbGetQuery(sc, "Select year, month, sum(passenger_count) as total_passengers, count(1) as total_trips from nyc_trips_pq group by year, month order by 4 DESC LIMIT 10")
# Stacked bar of passengers by year, turned into a pie chart with polar coordinates
bp <- ggplot(nyc_trips_preview, aes(x = "", y = total_passengers, fill = year)) +
  geom_bar(width = 1, stat = "identity")
pie <- bp + coord_polar("y", start = 0)
pie + scale_y_continuous(labels = comma)

More details in the blog post: https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/

Jupyter

EMR Launch Bootstrap:

Script Location: s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh

Script Arguments: --r --julia --toree --torch --ruby --ds-packages --ml-packages --python-packages 'ggplot nilearn' --port 8880 --password jupyter --jupyterhub --jupyterhub-port 8001 --cached-install --notebook-dir s3://neilawstemp/notebooks/ --copy-samples
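
When the cluster is launched programmatically rather than from the console, the script location and arguments above have to be passed as a bootstrap action. The sketch below builds one entry in the shape that boto3's EMR `run_job_flow` accepts for its `BootstrapActions` parameter. The helper `bootstrap_action` and the display name are hypothetical, and only a subset of the arguments is used to keep the example short. `shlex.split` keeps the quoted package list `'ggplot nilearn'` together as a single argument.

```python
import shlex


def bootstrap_action(name, script_path, arguments):
    """Build one entry for the BootstrapActions list of boto3's EMR run_job_flow."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_path,
            # Split like a shell would, so quoted values stay one argument.
            "Args": shlex.split(arguments),
        },
    }


action = bootstrap_action(
    "Install Jupyter",  # hypothetical display name
    "s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",
    "--r --toree --python-packages 'ggplot nilearn' --port 8880 --password jupyter",
)
print(action["ScriptBootstrapAction"]["Args"])
# ['--r', '--toree', '--python-packages', 'ggplot nilearn', '--port', '8880', '--password', 'jupyter']
```

The resulting dictionary would go into a list passed as `BootstrapActions=[action]`; the rest of the cluster configuration (release label, instances, roles) is outside the scope of this sketch.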

Jupyter Pyspark Code

from pyspark.sql import SparkSession

hivetablename = 'nyc_trips_pq'
# Top 10 year/month combinations by trip count ("order by 4" sorts on total_trips)
sqltext = 'Select year, month, sum(passenger_count) as total_passengers, count(1) as total_trips from ' + hivetablename + ' group by year, month order by 4 DESC LIMIT 10'

spark = SparkSession.builder.appName("Jupyter-Spark").enableHiveSupport().getOrCreate()

df = spark.sql(sqltext)
df.createOrReplaceTempView("nyc_trips")
# Cache the view through the session catalog (SQLContext expects a SparkContext, not a SparkSession)
spark.catalog.cacheTable("nyc_trips")
df.show()

More details in the blog post: https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/

Contributors

nmukerje