
galvanizedatascience / building-spark-applications-live-lessons


Supporting content (slides and exercises) for the Addison-Wesley (Pearson) video series covering best practices for developing scalable Spark applications for predictive analytics in the context of a data scientist's standard workflow.

Home Page: http://bit.ly/spark-ds-videos

Python 0.26% Shell 0.10% Jupyter Notebook 99.65%


Building Spark Applications Live Lessons

Join the chat at https://gitter.im/zipfian/building-spark-applications-live-lessons

This repository contains the exercises and data for the Building Spark Applications Live Lessons video series. It provides data scientists and developers with a practical introduction to the Apache Spark framework using Python, R, and SQL. Additionally, it covers best practices for developing scalable Spark applications for predictive analytics in the context of a data scientist's standard workflow.

Materials

The corresponding videos can be found on the following sites for purchase:

In addition to the videos there are many other resources to provide you support in learning this new technology:

You can also reach out to me directly by email at [email protected] or on Twitter at @clearspandex.

If you find any errors in the code or materials, please open a GitHub issue in this repository.

Skill Level

Beginning/Intermediate

What You Will Learn

  • How to install and set up a Spark environment locally and on a cluster
  • The differences between and the strengths of the Python, R, and SQL programming interfaces
  • How to build a machine learning model for text
  • Common data science use cases that Spark is especially well-suited to solve
  • How to tune a Spark application for performance
  • The internals of the Spark framework and its execution model
  • How to use Spark in a data science application workflow
  • The basics of the larger Spark ecosystem

Who Should Take This Course

  • Practicing data scientists who already use Python or R and want to learn how to scale up their analyses with Spark.
  • Data engineers who already use Java/Scala for Spark but want to learn about the Python, R, and SQL APIs and understand how Spark can be used to solve data science problems.

Prerequisites

  • Basic understanding of programming (Python a plus).
  • Familiarity with the data science process and machine learning is a plus.

Setup

SparkR with a Notebook

  1. Install IRkernel:
install.packages(c('rzmq','repr','IRkernel','IRdisplay'), repos = c('http://irkernel.github.io/', getOption('repos')))

IRkernel::installspec()
  2. Set environment variables and load SparkR:
# Example: Set this to where Spark is installed
Sys.setenv(SPARK_HOME="/Users/[username]/spark")

# This line loads SparkR from the installed directory
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# if these two lines work, you are all set
library(SparkR)
sc <- sparkR.init(master="local")
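
For readers working in Python rather than R, the same idea (pointing the interpreter at the libraries bundled under SPARK_HOME) can be sketched as follows. This is a minimal sketch, not part of the course materials: the helper name add_pyspark_to_path is hypothetical, and it assumes the usual Spark distribution layout with a python directory and a py4j source zip under python/lib.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put Spark's bundled Python sources on sys.path; return the paths added."""
    python_dir = os.path.join(spark_home, "python")
    # Assumption: py4j ships as a source zip under python/lib
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    added = []
    for path in [python_dir] + py4j_zips:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added


# Only touch sys.path when SPARK_HOME is actually set
spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    add_pyspark_to_path(spark_home)
```

If SPARK_HOME points at a real Spark installation, a subsequent import pyspark should succeed, which plays the same role as the library(SparkR) check above.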

Data

IPython Console Help

Q: How can I find out all the methods that are available on a DataFrame?

  • In the IPython console type sales.[TAB]

  • Autocomplete will show you all the methods that are available.

  • To find more information about a specific method, say .cov, type help(sales.cov)

  • This will display the API documentation for that method.
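
Outside an interactive console, where tab completion is not available, a similar listing can be produced with the built-in dir(). The sketch below is one way to do it; public_methods is a hypothetical helper name, and sales is assumed to be the DataFrame from the steps above. It inspects the object's class rather than the instance so that no properties are evaluated.

```python
def public_methods(obj):
    """Return the names of the public methods defined on obj's class."""
    cls = type(obj)
    names = []
    for name in dir(cls):
        # Skip private and special names such as __len__
        if name.startswith("_"):
            continue
        # Keep only callables, which drops properties and plain attributes
        if callable(getattr(cls, name, None)):
            names.append(name)
    return names
```

Calling public_methods(sales) would list roughly what sales.[TAB] shows, and help(sales.cov) still gives the documentation for a single method.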

Spark Documentation

Q: How can I find out more about Spark's Python API, MLlib, GraphX, Spark Streaming, and deploying Spark to EC2?

  • Go to https://spark.apache.org/docs/latest

  • Navigate using tabs to the following areas in particular.

  • Programming Guide > Quick Start, Spark Programming Guide, Spark Streaming, DataFrames and SQL, MLlib, GraphX, SparkR.

  • Deploying > Overview, Submitting Applications, Spark Standalone, YARN, Amazon EC2.

  • More > Configuration, Monitoring, Tuning Guide.

Books

Staying Involved

Contributors

jonathandinu

Issues

AttributeError: '_csv.reader' object has no attribute 'next()'

I am at lesson 3.3, and rdd_csv_corrrect = rdd_no_header.map(lambda line: csv.reader([line]).next()) raises an error at .next(). I am using Python 3 here.

The error is AttributeError: '_csv.reader' object has no attribute 'next()'

I think this is more of a Python 2 vs. Python 3 issue, but I need clarification on how to resolve it, as this is becoming a show stopper :)
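
This does look like a Python 2 vs. Python 3 difference: Python 3 iterators no longer have a .next() method, and the built-in next() function is used instead (it also works on Python 2.6 and later). A minimal sketch of a fix follows; parse_csv_line is a hypothetical helper name, and rdd_no_header is assumed to be the RDD of text lines from the lesson.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV-formatted string into a list of fields."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list;
    # the built-in next() pulls the first (and only) parsed row.
    return next(csv.reader([line]))
```

With this helper, the failing line could be written as rdd_no_header.map(parse_csv_line), or the lambda could be kept and changed to lambda line: next(csv.reader([line])).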
