spark-scala-tutorial's Introduction

Apache Spark Scala Tutorial - README

Join the chat at https://gitter.im/deanwampler/spark-scala-tutorial

Dean Wampler
[email protected]
@deanwampler

This tutorial demonstrates how to write and run Apache Spark applications using Scala with some SQL. I also teach a little Scala as we go, but if you already know Spark and you are more interested in learning just enough Scala for Spark programming, see my other tutorial Just Enough Scala for Spark.

You can run the examples and exercises several ways:

  1. Notebooks, like Jupyter - The easiest way, especially for data scientists accustomed to notebooks.
  2. In an IDE, like IntelliJ - Familiar for developers.
  3. At the terminal prompt using the build tool SBT.

Notes:

  1. The current version of Spark used is 2.3.X, which is a bit old. (TODO!)
  2. While the notebook approach is the easiest way to use this tutorial to learn Spark, the IDE and SBT options show details for creating Spark applications, i.e., writing executable programs you build and run, as well as examples that use the interactive Spark Shell.

Acknowledgments

I'm grateful that several people have provided feedback, issue reports, and pull requests. In particular:

Getting Help

Before describing the different ways to work with the tutorial, if you're having problems, use the Gitter chat room to ask for help. You can also use the new Discussions feature for the GitHub repo. If you're reasonably certain you've found a bug, post an issue to the GitHub repo. Pull requests are welcome, too!!

Setup Instructions

Let's get started...

Download the Tutorial

Begin by cloning or downloading the tutorial GitHub project github.com/deanwampler/spark-scala-tutorial.

Now pick the way you want to work through the tutorial:

  1. Notebooks - Go here
  2. In an IDE, like IntelliJ - Go here
  3. At the terminal prompt using SBT - Go here

Using Notebooks

The easiest way to work with this tutorial is to use a Docker image that combines the popular Jupyter notebook environment with all the tools you need to run Spark, including the Scala language. It's called the all-spark-notebook. It bundles Apache Toree to provide Spark and Scala access. The web page for this Docker image covers useful topics such as using Python as well as Scala, user authentication, and running your Spark jobs on clusters rather than in local mode.

There are other notebook options you might investigate for your needs:

Open source:

  • Polynote - A cross-language notebook environment with built-in Scala support. Developed by Netflix.
  • Jupyter + BeakerX - a powerful set of extensions for Jupyter.
  • Zeppelin - a popular tool in big data environments

Commercial:

  • Databricks - a feature-rich, commercial, cloud-based service

Installing Docker and the Jupyter Image

If you need to install Docker, follow the installation instructions at docker.com (the community edition is sufficient).

Now we'll run the Docker image. It's important to follow the next steps carefully. We're going to mount two local directories inside the running container, one for the data we want to use and one for the notebooks.

  • Open a terminal or command window
  • Change to the directory where you expanded the tutorial project or cloned the repo
  • To download and run the Docker image, run the following command: run.sh (MacOS and Linux) or run.bat (Windows)

The MacOS and Linux run.sh script executes this command:

docker run -it --rm \
  -p 8888:8888 -p 4040:4040 \
  --cpus=2.0 --memory=2000M \
  -v "$PWD/data":/home/jovyan/data \
  -v "$PWD/notebooks":/home/jovyan/notebooks \
  "$@" \
  jupyter/all-spark-notebook

The Windows run.bat command is similar, but uses Windows conventions.

The --cpus=... --memory=... arguments were added because the notebook "kernel" is prone to crashing with the default values. Edit to taste. Also, it will help to keep only one notebook (other than the Introduction) open at a time.

The -v PATH:/home/jovyan/dir tells Docker to mount the dir directory under your current working directory, so it's available as /home/jovyan/dir inside the container. This is essential to provide access to the tutorial data and notebooks. When you open the notebook UI (discussed shortly), you'll see these folders listed.

Note: On Windows, you may get the following error: "C:\Program Files\Docker\Docker\Resources\bin\docker.exe: Error response from daemon: D: drive is not shared. Please share it in Docker for Windows Settings." If so, do the following: in your tray, next to your clock, right-click on Docker, then click on Settings. You'll see the Shared Drives settings. Mark your drive and hit Apply. See this Docker forum thread for more tips.

The -p 8888:8888 -p 4040:4040 arguments tell Docker to "tunnel" ports 8888 and 4040 out of the container to your local environment, so you can get to the Jupyter UI at port 8888 and the Spark driver UI at 4040.

You should see output similar to the following:

Unable to find image 'jupyter/all-spark-notebook:latest' locally
latest: Pulling from jupyter/all-spark-notebook
e0a742c2abfd: Pull complete
...
ed25ef62a9dd: Pull complete
Digest: sha256:...
Status: Downloaded newer image for jupyter/all-spark-notebook:latest
Execute the command: jupyter notebook
...
[I 19:08:15.017 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:08:15.019 NotebookApp]

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=...

Now copy the URL shown and paste it into a browser window. (On MacOS, you can command+click the URL in your terminal window.)

Warning: When you quit the Docker container at the end of the tutorial, all your changes will be lost unless they are in the data and notebooks directories that we mounted! To save notebooks you created in other locations, export them using the File > Download as > Notebook menu item.

Running the Tutorial

In the Jupyter UI, you should see three folders, data, notebooks, and work. The first two are the folders we mounted. The data we'll use is in the data folder. The notebooks we'll use are... you get the idea.

Open the notebooks folder and click the link for 00_Intro.ipynb.

It opens in a new browser tab. It may take several seconds to load.

Tip: If the new tab fails to open or the notebook fails to load as shown, check the terminal window where you started Jupyter. Are there any error messages?

If you're new to Jupyter, try Help > User Interface Tour to learn how to use Jupyter. At a minimum, you need to know that the content is organized into cells. You can navigate with the up and down arrow keys or by clicking. When you come to a cell with code, either click the run button in the toolbar or use shift+return to execute the code.

Read through the Introduction notebook, then navigate to the examples using the table near the bottom. I've set up the table so that clicking each link opens a new browser tab.

Use an IDE

The tutorial is also set up as a project for the build tool SBT. The popular IDEs, like IntelliJ with the Scala plugin (required) and Eclipse with Scala, can import an SBT project and automatically create an IDE project from it.

Once imported, you can run the Spark job examples as regular applications. There are some examples implemented as scripts that need to be run using the Spark Shell or the SBT console. The tutorial goes into the details.
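If it helps to picture what running an example "as a regular application" means, here is a minimal sketch in the spirit of the tutorial's word-count examples, assuming the Spark 2.x SparkSession API that matches the Spark 2.3 version noted above. The object name and the exact transformations are illustrative, not copied from the tutorial sources; the data path and regex are borrowed from snippets quoted elsewhere on this page.

// Hypothetical sketch of a self-contained Spark application you could run from the IDE.
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")                 // run inside the IDE/JVM process
      .appName("Word Count Sketch")
      .getOrCreate()
    import spark.implicits._

    try {
      // Load the text, split into words, and count occurrences per word.
      val lines = spark.read.textFile("data/kjvdat.txt")
      val counts = lines
        .flatMap(_.toLowerCase.split("""[^\p{IsAlphabetic}]+"""))
        .filter(_.nonEmpty)
        .groupByKey(identity)
        .count()
      counts.show(20)
    } finally {
      spark.stop()
    }
  }
}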

You are now ready to go through the tutorial.

Use SBT in a Terminal

Using SBT in a terminal is a good approach if you prefer to use a code editor like Emacs, Vim, or SublimeText. You'll need to install SBT, but not Scala or Spark. Those dependencies will be resolved when you build the software.

Start the sbt console, then build the code. The sbt:spark-scala-tutorial> shown below is the prompt I've configured for the project. Running test compiles the code and runs the tests, while package creates a jar file of the compiled code and configuration files:

$ sbt
...
sbt:spark-scala-tutorial> test
...
sbt:spark-scala-tutorial> package
...
sbt:spark-scala-tutorial>
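
If you want to try one of the example programs directly from the same prompt, sbt's runMain task should work, assuming the build keeps the Spark dependencies on the regular classpath (which the note above about not needing to install Spark suggests). WordCount3 is one example class mentioned in the issues below; check the tutorial itself for the arguments each program accepts:

sbt:spark-scala-tutorial> runMain WordCount3
...
sbt:spark-scala-tutorial>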

You are now ready to go through the tutorial.

Going Forward from Here

To learn more, see the following resources:

Final Thoughts

Thank you for working through this tutorial. Feedback and pull requests are welcome.

Dean Wampler

spark-scala-tutorial's Issues

Does not work properly in Activator

When I download the project with Activator, the tutorial doesn't show up. I think something is broken there.

Tested with Activator 1.2.7

Problem with filtering in Intro1.sc

I've got a problem running the code from Intro1.sc in my REPL. Stack trace below. Any ideas how to deal with that?

scala> import org.apache.spark.SparkContext
import org.apache.spark.SparkContext

scala> import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext._

scala> // The val keyword declares a read-only variable: "value".

scala> val sc = new SparkContext("local", "Intro (1)")
14/05/29 13:45:27 WARN Utils: Your hostname, work resolves to a loopback address: 127.0.1.1; using 192.168.122.1 instead (on interface virbr0)
14/05/29 13:45:27 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
sc: org.apache.spark.SparkContext = org.apache.spark.SparkContext@61a6dde4

scala>

scala> // Load the King James Version of the Bible, then convert

scala> // each line to lower case, creating an RDD.

scala> val input = sc.textFile("data/kjvdat.txt").map(line => line.toLowerCase)
input: org.apache.spark.rdd.RDD[String] = MappedRDD[2] at map at <console>:12

scala>

scala> // Cache the data in memory for faster, repeated retrieval.

scala> input.cache
res1: org.apache.spark.rdd.RDD[String] = MappedRDD[2] at map at <console>:12

scala>

scala> // Find all verses that mention "sin".

scala> val count = sins.count()         // How many sins?
14/05/29 13:46:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/05/29 13:46:01 WARN LoadSnappy: Snappy native library not loaded
14/05/29 13:46:01 ERROR Executor: Exception in task ID 0
java.lang.ClassNotFoundException: $line10.$read$$iw$$iw$$iw$$iw$$anonfun$1
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
        at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
        at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
        at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
14/05/29 13:46:01 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
14/05/29 13:46:01 WARN TaskSetManager: Loss was due to java.lang.ClassNotFoundException
java.lang.ClassNotFoundException: $line10.$read$$iw$$iw$$iw$$iw$$anonfun$1
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
        at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
        at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
        at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
14/05/29 13:46:01 ERROR TaskSetManager: Task 0.0:0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 1 times (most recent failure: Exception failure: java.lang.ClassNotFoundException: $anonfun$1)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

WordCount2 - doesn't work with non-ASCII characters

// Exercise: Use other versions of the Bible:
//   The data directory contains similar files for the Tanach (t3utf.dat - in Hebrew),

It doesn't actually count any Hebrew words; the same goes for Cyrillic. Only numbers and markup tokens like sup/font get counted.
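
For anyone hitting the same symptom, the usual culprit is an ASCII-centric tokenizing regex. I haven't checked which regex the version in question used, but the sketch below (runnable in the Spark Shell or the plain Scala REPL, with made-up sample text) shows the difference between an ASCII-only split and a Unicode-aware one:

// Hypothetical REPL check, not the tutorial's code.
val line = "בראשית ברא אלהים -- Бытие 1:1"

// Java's \w (and therefore \W) is ASCII-only by default, so Hebrew and Cyrillic
// characters all count as separators; only the digits survive, which matches the
// "only counted numbers" symptom above:
line.toLowerCase.split("""\W+""").filter(_.nonEmpty).foreach(println)

// The Unicode-aware \p{IsAlphabetic} class (Java 7+) keeps the Hebrew and Cyrillic
// words; digits are not alphabetic, so they drop out:
line.toLowerCase.split("""[^\p{IsAlphabetic}]+""").filter(_.nonEmpty).foreach(println)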

Files do not get processed in SparkStreaming11Main when file is dropped manually.

Hi,

I am testing out the code on a Windows 7 machine and the SparkStreaming11Main example seems to work fine. However, when I commented out line 93:

startDirectoryDataThread(in, data)

and dropped the files manually into the streaming-input folder, nothing was processed. Do we have to set up anything to have it work manually? Thanks for your help.

Type inference doesn't work with IDE

Normally when I clone an sbt project from GitHub into IntelliJ, type inference works.

I did File > New > Project from Version Control > GitHub, cloned it, and type inference does not work.

Compilation confused about which Vector to use

When building the workshop code in local mode, either via Activator (just downloaded for the first time) or sbt (version 0.13.7, installed via Homebrew), I get the below errors about Vector. If I replace these erroring instances of Vector with scala.collection.immutable.Vector (prompted by the comment next to the first error), then everything seems happy and the tests build just fine.

I don't know the Scala ecosystem well at all, so I suspect I'm missing a build step/configuration. I'm running HotSpot 1.7.0_45-b18 - perhaps my slowness to move to Java 8 could also be an issue?

Happy to submit a patch if this isn't just a PEBKAC.

[error] /Users/colin/Practice/spark-workshop/src/main/scala/sparkworkshop/Matrix4.scala:65: object Vector is not a value
[error]       val outputLines = Vector(          // Scala's Vector, not MLlib's version!
[error]                         ^
[error] /Users/colin/Practice/spark-workshop/src/main/scala/sparkworkshop/NGrams6.scala:87: object Vector is not a value
[error]       val outputLines = Vector(
[error]                         ^
[error] /Users/colin/Practice/spark-workshop/src/main/scala/sparkworkshop/SparkStreaming8Main.scala:76: Vector does not take type parameters
[error]     def mkStreamArgs(argsSeq: Seq[String], newSeq: Vector[String]): Vector[String] =
[error]                                                                     ^
[error] /Users/colin/Practice/spark-workshop/src/main/scala/sparkworkshop/SparkStreaming8Main.scala:76: Vector does not take type parameters
[error]     def mkStreamArgs(argsSeq: Seq[String], newSeq: Vector[String]): Vector[String] =
[error]                                                    ^
[error] /Users/colin/Practice/spark-workshop/src/main/scala/sparkworkshop/SparkStreaming8Main.scala:83: value empty is not a member of object Vector
[error]     val streamArgs = mkStreamArgs(args, Vector.empty[String]).toArray
[error]                                                ^
[error] 6 errors found
[error] (compile:compile) Compilation failed
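
For context, here is the reporter's workaround in isolation (a sketch, not the tutorial's source). When some other Vector ends up in scope, for example from an MLlib import as the comment in the first error suggests, the Scala collection can still be named unambiguously:

// Fully qualifying the collection side-steps whatever else is named Vector in scope:
val outputLines = scala.collection.immutable.Vector(
  "line 1",
  "line 2")

// A rename on import achieves the same thing without repeating the full path each time:
import scala.collection.immutable.{Vector => ScalaVector}
val moreLines = ScalaVector("line 3", "line 4")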

Data Download

When I ran the run.sh command, many files were downloaded somewhere and the last 10 GB of free space on my computer were used. What exactly was done by that command, and how can I undo whatever happened so I can regain storage on my computer?

Tests failed for WordCount3/WordCount2/InvertedIndex5b

I'm following the instructions up to [building-and-testing](https://github.com/deanwampler/spark-workshop#building-and-testing), and I got failed-test messages. I tried googling but with no luck. I guess it's perhaps just something wrong with my installation, since no one else seems to have the same issue, but I still want to ask here to make sure. I'm using the local setup with the browser-based IDE, and I added ACTIVATOR_HOME = activator installation directory to my ~/.zshrc on my Mac.

Thanks in advance,

Liwei

The following are the error messages:

WordCount3 computes the word count of the input corpus with options
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.util.regex.PatternSyntaxException: Unknown character property name {Alphabetic} near index 17 [^\p{IsAlphabetic}]+ ^ at java.util.regex.Pattern.error(Pattern.java:1713) at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2437) at java.util.regex.Pattern.family(Pattern.java:2412) at java.util.regex.Pattern.range(Pattern.java:2335) at java.util.regex.Pattern.clazz(Pattern.java:2268) at java.util.regex.Pattern.sequence(Pattern.java:1818) at java.util.regex.Pattern.expr(Pattern.java:1752) at java.util.regex.Pattern.compile(Pattern.java:1460) at java.util.regex.Pattern.(Pattern.java:1133) at java.util.regex.Pattern.compile(Pattern.java:823) at java.lang.String.split(String.java:2292) at java.lang.String.split(String.java:2334) at WordCount3$$anonfun$2.apply(WordCount3.scala:60) at WordCount3$$anonfun$2.apply(WordCount3.scala:60) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:695) Driver stacktrace:

WordCount2 computes the word count of the input corpus
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.util.regex.PatternSyntaxException: Unknown character property name {Alphabetic} near index 17 [^\p{IsAlphabetic}]+ ^ at java.util.regex.Pattern.error(Pattern.java:1713) at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2437) at java.util.regex.Pattern.family(Pattern.java:2412) at java.util.regex.Pattern.range(Pattern.java:2335) at java.util.regex.Pattern.clazz(Pattern.java:2268) at java.util.regex.Pattern.sequence(Pattern.java:1818) at java.util.regex.Pattern.expr(Pattern.java:1752) at java.util.regex.Pattern.compile(Pattern.java:1460) at java.util.regex.Pattern.(Pattern.java:1133) at java.util.regex.Pattern.compile(Pattern.java:823) at java.lang.String.split(String.java:2292) at java.lang.String.split(String.java:2334) at WordCount2$$anonfun$3.apply(WordCount2.scala:57) at WordCount2$$anonfun$3.apply(WordCount2.scala:57) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:695) Driver stacktrace:

InvertedIndex5b computes the famous 'inverted index' from web crawl data
Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.util.regex.PatternSyntaxException: Unknown character property name {Alphabetic} near index 17 [^\p{IsAlphabetic}]+ ^ at java.util.regex.Pattern.error(Pattern.java:1713) at java.util.regex.Pattern.charPropertyNodeFor(Pattern.java:2437) at java.util.regex.Pattern.family(Pattern.java:2412) at java.util.regex.Pattern.range(Pattern.java:2335) at java.util.regex.Pattern.clazz(Pattern.java:2268) at java.util.regex.Pattern.sequence(Pattern.java:1818) at java.util.regex.Pattern.expr(Pattern.java:1752) at java.util.regex.Pattern.compile(Pattern.java:1460) at java.util.regex.Pattern.(Pattern.java:1133) at java.util.regex.Pattern.compile(Pattern.java:823) at java.lang.String.split(String.java:2292) at java.lang.String.split(String.java:2334) at InvertedIndex5b$$anonfun$main$2.apply(InvertedIndex5b.scala:56) at InvertedIndex5b$$anonfun$main$2.apply(InvertedIndex5b.scala:50) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:199) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:695) Driver stacktrace:
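
A hedged aside for anyone else seeing this: "Unknown character property name {Alphabetic}" is what java.util.regex reports when it doesn't recognize \p{IsAlphabetic}, which to my knowledge was only added in Java 7, so this looks like the tests ran on a Java 6 JVM. A quick REPL check (illustrative only, not part of the tutorial):

import java.util.regex.{Pattern, PatternSyntaxException}

// Print the JVM version and probe whether its regex engine accepts \p{IsAlphabetic}.
println(System.getProperty("java.version"))
try {
  Pattern.compile("""[^\p{IsAlphabetic}]+""")
  println("OK: this JVM supports \\p{IsAlphabetic}")
} catch {
  case e: PatternSyntaxException =>
    println(s"Not supported here: ${e.getMessage}")
}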

Error running WordCount3

[error] 16/01/31 23:07:55 WARN TaskSetManager: Stage 2 contains a task of very large size (118 KB). The maximum recommended task size is 100 KB.

Doesn't work in activator

I ran
./activator new spark-workshop spark-workshop

and all I got was:

Fetching the latest list of templates...

OK, application "spark-workshop" is being created using the "spark-workshop" template.

akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://default/user/template-cache#-69972476]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)

Link to Spark tutorial does not work

in tutorial/Readme.md, line 1397 (the last one in this snippet):

Going Forward from Here

This template is not a complete Apache Spark tutorial. To learn more, see the following:

The link to the tutorial leads to a 404 not found.

Upgrade to Spark 1.3

The exercises and tutorial should be upgraded to Spark 1.3, including use of the new DataFrame API.
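
For readers who haven't seen it, here is a small sketch of what the DataFrame API introduced in Spark 1.3 looks like. The case class and column names are made up for illustration, and sc is assumed to be an existing SparkContext, as in the Spark Shell:

import org.apache.spark.sql.SQLContext

// Illustrative schema; not the tutorial's data model.
case class Verse(book: String, chapter: Int, verse: Int, text: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Build a DataFrame from an RDD of case classes and query it with the DataFrame API.
val verses = sc.parallelize(Seq(Verse("Gen", 1, 1, "In the beginning...")))
val df = verses.toDF()
df.filter(df("book") === "Gen").select("chapter", "verse").show()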

sbt test issue

Hello, I'm having a bit of an issue with the sbt test section of the tutorial.

From the spark-scala-tutorial root directory:

{<@_@>}~/spark-scala-tutorial sbt
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0
[info] Loading project definition from /Users/srynobio1/spark-scala-tutorial/project
[info] Updating {file:/Users/srynobio1/spark-scala-tutorial/project/}spark-scala-tutorial-build...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Compiling 1 Scala source to /Users/srynobio1/spark-scala-tutorial/project/target/scala-2.10/sbt-0.13/classes...
[info] Set current project to spark-scala-tutorial (in build file:/Users/srynobio1/spark-scala-tutorial/)

> test
[info] Updating {file:/Users/srynobio1/spark-scala-tutorial/}SparkWorkshop...
[info] Resolving jline#jline;2.12.1 ...
[info] downloading https://repo1.maven.org/maven2/org/apache/avro/avro/1.7.7/avro-1.7.7.jar ...
[info] 	[SUCCESSFUL ] org.apache.avro#avro;1.7.7!avro.jar (933ms)
[info] Done updating.
[info] Compiling 38 Scala sources to /Users/srynobio1/spark-scala-tutorial/target/scala-2.11/classes...
[error] /Users/srynobio1/spark-scala-tutorial/src/main/scala/sparktutorial/NGrams6.scala:98: object Vector is not a value
[error]       val outputLines = Vector(
[error]                         ^
[error] /Users/srynobio1/spark-scala-tutorial/src/main/scala/sparktutorial/SparkStreaming11Main.scala:76: Vector does not take type parameters
[error]     def mkStreamArgs(argsSeq: Seq[String], newSeq: Vector[String]): Vector[String] =
[error]                                                                     ^
[error] /Users/srynobio1/spark-scala-tutorial/src/main/scala/sparktutorial/SparkStreaming11Main.scala:76: Vector does not take type parameters
[error]     def mkStreamArgs(argsSeq: Seq[String], newSeq: Vector[String]): Vector[String] =
[error]                                                    ^
[error] /Users/srynobio1/spark-scala-tutorial/src/main/scala/sparktutorial/SparkStreaming11Main.scala:83: value empty is not a member of object Vector
[error]     val streamArgs = mkStreamArgs(args, Vector.empty[String]).toArray
[error]                                                ^
[error] /Users/srynobio1/spark-scala-tutorial/src/main/scala/sparktutorial/solns/Matrix4StdDev.scala:69: object Vector is not a value
[error]       val outputLines = Vector(          // Scala's Vector, not MLlib's version!
[error]                         ^
[error] 5 errors found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 8 s, completed Feb 25, 2017 4:22:40 PM

My scala version:

{<@_@>}~/spark-scala-tutorial scala -version
Scala code runner version 2.12.1 -- Copyright 2002-2016, LAMP/EPFL and Lightbend, Inc.

my sbt version:

[info] This is sbt 0.13.11

my java version:

{<@_@>}~ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)

Unable to run WordCount3 on my hadoop cluster

Hi Owner!!

My WordCount3 runs successfully locally; I can see the output folder with the output files in it. However, when I run it on the Hadoop cluster using hadoop.HWordCount3, it displays an error.

[info] running: spark-submit --class WordCount3 ./target/scala-2.11/spark-scala-tutorial_2.11-5.0.0.jar --out /user/root/output/kjv-wc3
[info]
[error] Exception in thread "main" java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
[error]         at util.CommandLineOptions$$anonfun$3.apply(CommandLineOptions.scala:34)
[error]         at util.CommandLineOptions$$anonfun$3.apply(CommandLineOptions.scala:34)
[error]         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[error]         at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[error]         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
[error]         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
[error]         at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[error]         at scala.collection.AbstractTraversable.map(Traversable.scala:105)
[error]         at util.CommandLineOptions.apply(CommandLineOptions.scala:34)
[error]         at WordCount3$.main(WordCount3.scala:26)
[error]         at WordCount3.main(WordCount3.scala)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]         at java.lang.reflect.Method.invoke(Method.java:498)
[error]         at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
[error]         at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
[error]         at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
[error]         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
[error]         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[info]
[info] Contents of the output directories:
[error] ls: `/user/root/output': No such file or directory
[info]
[info]  **** To see the contents, open the following URL(s):
[info]
[info]
[success] Total time: 17 s, completed Dec 30, 2016 7:16:53 PM

Do I have to create a directory such as /user/root/output in my Hadoop cluster? When I look into the Hadoop Scala code for WordCount3, I feel like it's incomplete, but I am not sure. Please suggest!!
