Workshop spark-in-practice

In this workshop, the exercises focus on using the Spark core and Spark Streaming APIs, as well as DataFrames, for data processing. Exercises are available in both Java and Scala on my GitHub account (here in Scala). You just have to clone the project and go! If you need help, take a look at the solution branch.

The original blog-post is right here.

To help you implement each class, unit tests are included.

Frameworks used:

  • Spark 1.4.0
  • Scala 2.10
  • sbt
  • ScalaTest

All exercises run in local mode as standalone programs.

To work on the hands-on, retrieve the code via the following command line:

$ git clone https://github.com/nivdul/spark-in-practice-scala.git

Then you can import the project into IntelliJ or Eclipse (add the SBT and Scala plugins for Scala), or use a text editor such as Sublime Text.

If you want to use the interactive spark-shell (available for Scala and Python only), you need to download a binary Spark distribution.

Go to the Spark directory
$ cd /spark-1.4.0

If you downloaded the Spark sources rather than a pre-built binary distribution, first build the project
$ build/mvn -DskipTests clean package

Launch the spark-shell
$ ./bin/spark-shell
scala>

Part 1: Spark core API

To get familiar with the Spark API, you will start by implementing the word count example (Ex0). After that, you will use a reduced set of tweets in JSON format as the data for data mining (Ex1-Ex3).
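As a point of reference for the word count exercise, here is a minimal sketch using the RDD API in local mode. It is not the workshop's solution: the object name `WordCountSketch`, the helper `wordCount` and the in-memory sample lines are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object WordCountSketch {

  // Split each line on whitespace, then sum a count of 1 per word.
  def wordCount(lines: RDD[String]): RDD[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM, in line with the workshop's local mode.
    val conf = new SparkConf().setAppName("wordcount-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical in-memory input standing in for the exercise's data.
      val lines = sc.parallelize(Seq("to be or not to be", "that is the question"))
      wordCount(lines).collect().foreach { case (word, n) => println(s"$word: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the transformation in a function that takes and returns an RDD makes it easy to call from a unit test with a small parallelized collection.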

In these exercises you will have to:

  • Find all the tweets by user
  • Count how many tweets each user has
  • Find all the people mentioned in tweets
  • Count how many times each person is mentioned
  • Find the 10 most mentioned people
  • Find all the hashtags mentioned in a tweet
  • Count how many times each hashtag is mentioned
  • Find the 10 most popular hashtags
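The tasks above mostly combine the same few RDD operations: group or count by key, extract tokens, and take the top N. The sketch below shows one possible shape, assuming the JSON has already been parsed into a simplified `Tweet(user, text)` case class. That class, the helper names and the sample data are hypothetical and are not taken from the workshop code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical, simplified tweet model; the workshop's own class may differ.
case class Tweet(user: String, text: String)

object TweetMiningSketch {

  // All tweets grouped by their author.
  def tweetsByUser(tweets: RDD[Tweet]): RDD[(String, Iterable[Tweet])] =
    tweets.groupBy(_.user)

  // Number of tweets per user.
  def tweetsPerUser(tweets: RDD[Tweet]): RDD[(String, Int)] =
    tweets.map(t => (t.user, 1)).reduceByKey(_ + _)

  // Tokens starting with the given prefix ("@" for mentions, "#" for hashtags).
  def tokensWithPrefix(tweets: RDD[Tweet], prefix: String): RDD[String] =
    tweets
      .flatMap(_.text.split("\\s+"))
      .filter(tok => tok.startsWith(prefix) && tok.length > 1)

  // The n most frequent tokens, most frequent first.
  def topN(tokens: RDD[String], n: Int): Array[(String, Int)] =
    tokens
      .map(tok => (tok, 1))
      .reduceByKey(_ + _)
      .top(n)(Ordering.by[(String, Int), Int](_._2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("tweet-mining-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data.
      val tweets = sc.parallelize(Seq(
        Tweet("alice", "hello @bob #spark"),
        Tweet("bob", "@alice thanks @bob #spark #scala"),
        Tweet("alice", "more #spark from @bob")
      ))
      tweetsPerUser(tweets).collect().foreach(println)
      topN(tokensWithPrefix(tweets, "@"), 10).foreach(println)
      topN(tokensWithPrefix(tweets, "#"), 10).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Splitting on whitespace is a deliberate simplification: it does not strip trailing punctuation from mentions or hashtags, which real tweet text would likely require.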

The last exercise (Ex4) is considerably more complicated: the goal is to build an inverted index, the data structure used to build search engines. Assuming #spark is a hashtag that appears in tweet1, tweet3 and tweet39, the inverted index will be a Map that contains the (key, value) pair (#spark, List(tweet1, tweet3, tweet39)).
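To make the data structure concrete, here is a minimal sketch of a hashtag inverted index built with the RDD API. It maps each hashtag to tweet ids rather than to full tweets, and the `IndexedTweet` case class, the helper name and the sample data are all assumptions for illustration, not the workshop's solution.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical minimal tweet model: an id plus the raw text.
case class IndexedTweet(id: String, text: String)

object InvertedIndexSketch {

  // Emit one (hashtag, tweet id) pair per distinct hashtag in a tweet,
  // then group the ids by hashtag and collect the result as a Map.
  def invertedIndex(tweets: RDD[IndexedTweet]): Map[String, List[String]] =
    tweets
      .flatMap { t =>
        t.text
          .split("\\s+")
          .filter(w => w.startsWith("#") && w.length > 1)
          .distinct
          .map(tag => (tag, t.id))
      }
      .groupByKey()
      .mapValues(_.toList.sorted) // sort ids so the output is deterministic
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("inverted-index-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data mirroring the example in the text.
      val tweets = sc.parallelize(Seq(
        IndexedTweet("tweet1", "learning #spark"),
        IndexedTweet("tweet2", "just #scala"),
        IndexedTweet("tweet3", "#spark and #scala"),
        IndexedTweet("tweet39", "more #spark")
      ))
      invertedIndex(tweets).foreach { case (tag, ids) => println(s"$tag -> $ids") }
    } finally {
      sc.stop()
    }
  }
}
```

Collecting to a local Map is only reasonable for small data such as the workshop's reduced tweet set; for a large corpus the index would normally stay distributed as a pair RDD.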

Part 2: streaming analytics with Spark Streaming

Spark Streaming is a Spark component for processing live data streams in a scalable, high-throughput and fault-tolerant way.

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results, also in batches. The abstraction that represents a continuous stream of data is the DStream (discretized stream).

In the workshop, Spark Streaming is used to process a live stream of tweets using twitter4j, a library for the Twitter API. To be able to read the firehose, you will need to create a Twitter application at http://apps.twitter.com, get your credentials, and add them to the StreamUtils class.

In this exercise you will have to:

  • Print the status of each tweet
  • Find the 10 most popular hashtags
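The batch-by-batch model can be tried without Twitter credentials. The sketch below replaces the live Twitter stream with `queueStream`, where each queued RDD becomes one batch of the DStream, and prints the most frequent hashtags of every batch. The object name, the helper `extractHashtags`, the 1-second batch interval and the sample text are assumptions for illustration. The workshop itself reads from Twitter through its StreamUtils class.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingHashtagsSketch {

  // Whitespace-separated tokens that look like hashtags.
  def extractHashtags(text: String): List[String] =
    text.split("\\s+").filter(w => w.startsWith("#") && w.length > 1).toList

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one batch per second

    // Hypothetical stand-in for the live stream: each queued RDD is one batch.
    val queue = mutable.Queue[RDD[String]](
      ssc.sparkContext.parallelize(Seq("hello #spark", "#spark streaming #scala")),
      ssc.sparkContext.parallelize(Seq("more #scala", "and #spark again"))
    )
    val tweets = ssc.queueStream(queue) // DStream[String]

    // Count hashtags within each batch.
    val hashtagCounts = tweets
      .flatMap(text => extractHashtags(text))
      .map(tag => (tag, 1))
      .reduceByKey(_ + _)

    // For every batch, print its 10 most frequent hashtags.
    hashtagCounts.foreachRDD { rdd =>
      val top10 = rdd.top(10)(Ordering.by[(String, Int), Int](_._2))
      if (top10.nonEmpty) println(top10.mkString(", "))
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let the sketch run for about 5 seconds
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Note that the counts here are per batch. Ranking hashtags over a longer period would need a windowed or stateful operation, which this sketch leaves out.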

Part 3: structured data with the DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from different sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

In the exercise you will have to:

  • Print the DataFrame
  • Print the schema of the DataFrame
  • Find people who are located in Paris
  • Find the user who tweets the most
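The four tasks above map onto a handful of DataFrame calls. Here is a minimal sketch against the Spark 1.4 `SQLContext` API, assuming a flat schema with `user`, `text` and `place` columns. The `TweetRow` case class, the column names, the helper names and the sample rows are hypothetical; the workshop's JSON tweets may be structured differently.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Hypothetical flat schema; the real tweet JSON may use other field names.
case class TweetRow(user: String, text: String, place: String)

object DataFrameSketch {

  // Rows whose (assumed) "place" column equals the given city.
  def locatedIn(df: DataFrame, city: String): DataFrame =
    df.filter(df("place") === city)

  // Name of the user with the highest number of rows.
  def mostActiveUser(df: DataFrame): String = {
    val counts = df.groupBy("user").count() // columns: user, count
    counts.orderBy(counts("count").desc).first().getString(0)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dataframe-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables rdd.toDF()
    try {
      // Hypothetical sample rows.
      val df = sc.parallelize(Seq(
        TweetRow("alice", "bonjour #spark", "Paris"),
        TweetRow("bob", "hello #scala", "London"),
        TweetRow("alice", "encore #spark", "Paris")
      )).toDF()

      df.show()        // print the DataFrame
      df.printSchema() // print its schema
      locatedIn(df, "Paris").show()
      println(mostActiveUser(df))
    } finally {
      sc.stop()
    }
  }
}
```

If two users are tied for the highest count, `mostActiveUser` returns whichever row the sort happens to place first.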

Conclusion

If you find a better approach or implementation, do not hesitate to send a pull request or open an issue.
