- A Scala tool that helps you deploy an Apache Spark stand-alone cluster and submit jobs.
- Currently supports Amazon EC2 and OpenStack with Spark 1.4.1+.
- There are two modes when using spark-deployer: SBT plugin mode and embedded mode.
- Set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` if you are deploying Spark on AWS EC2 (or if you want to access AWS S3 from your sbt or Spark cluster), as in the sketch below.
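For example, on a Unix-like shell you might export the credentials before starting sbt (a minimal sketch; the values are placeholders):

```
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
sbt
```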
- Prepare a project with a structure like below:

```
project-root
├── build.sbt
├── project
│   └── plugins.sbt
├── spark-deployer.conf
└── src
    └── main
        └── scala
            └── mypackage
                └── Main.scala
```
- Write one line in `project/plugins.sbt`:

```scala
addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "2.8.2")
```
- Write your cluster configuration in `spark-deployer.conf` (see the examples for more details). If you want to use another configuration file name, set the environment variable `SPARK_DEPLOYER_CONF` when starting sbt (e.g. `$ SPARK_DEPLOYER_CONF=./my-spark-deployer.conf sbt`).
- Write your Spark project's `build.sbt` (here is a simple example):
```scala
lazy val root = (project in file("."))
  .settings(
    name := "my-project-name",
    version := "0.1",
    scalaVersion := "2.10.6",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
    )
  )
```
- Write your job's algorithm in `src/main/scala/mypackage/Main.scala`:
```scala
package mypackage

import org.apache.spark._

object Main {
  def main(args: Array[String]): Unit = {
    // set up Spark
    val sc = new SparkContext(new SparkConf())

    // your algorithm (here: a Monte Carlo estimate of Pi)
    val n = 10000000
    val count = sc.parallelize(1 to n).map { i =>
      val x = scala.math.random
      val y = scala.math.random
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
  }
}
```
- Create the cluster with `sbt "sparkCreateCluster <number-of-workers>"`. You can also execute `sbt` first and type `sparkCreateCluster <number-of-workers>` in the sbt console. You may first type `spark` and hit TAB to see all the available commands.
- Once the cluster is created, submit your job with `sparkSubmitJob <arg0> <arg1> ...`.
- When your job is done, destroy your cluster with `sparkDestroyCluster`. A typical end-to-end session is sketched below.
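For example, a complete create/submit/destroy cycle run from the command line might look like this (a sketch; the worker count and job arguments are placeholders):

```
$ sbt "sparkCreateCluster 2"
$ sbt "sparkSubmitJob arg0 arg1"
$ sbt sparkDestroyCluster
```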
- `sparkCreateMaster` creates only the master node.
- `sparkAddWorkers <number-of-workers>` dynamically adds more workers to an existing cluster.
- `sparkCreateCluster <number-of-workers>` is a shortcut for the above two commands.
- `sparkRemoveWorkers <number-of-workers>` dynamically removes workers to scale down the cluster.
- `sparkDestroyCluster` terminates all the nodes in the cluster.
- `sparkRestartCluster` restarts the cluster with new settings from `spark-env` without recreating the machines.
- `sparkShowMachines` shows the machine addresses with commands to log in to the master and execute spark-shell on it.
- `sparkUploadFile <local-path> <remote-path>` uploads a file to the master; for example, `sparkUploadFile ./settings.conf ~/` will upload `settings.conf` to the master's home folder.
- `sparkUploadJar` uploads the job's jar file to the master node.
- `sparkSubmitJob <arg0> <arg1> ...` uploads and runs the job.
- `sparkSubmitJobWithMain` uploads and runs the job with the main class provided, e.g. `sparkSubmitJobWithMain mypackage.Main <args>`.
- `sparkRemoveS3Dir <dir-name>` removes the S3 directory along with the `_$folder$` folder file (e.g. `sparkRemoveS3Dir s3://bucket_name/middle_folder/target_folder`).
- `sparkChangeConfig <config-key>`: see the configuration page for more details.
- `sparkRunCommand <command>` runs a command on the master, e.g. `sparkRunCommand df -h` will show the disk usage on the master.
- `sparkRunCommands <config-key>`: see the configuration page for more details.
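For instance, you can scale a running cluster up and down from the sbt console using the commands above (a sketch; the worker counts are placeholders):

```
> sparkAddWorkers 3
> sparkShowMachines
> sparkRemoveWorkers 2
```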
Besides running your job with `sparkCreateCluster` and `sparkSubmitJob`, you can also test your job locally (e.g. on your laptop) without creating any machine. To enable this feature, add the following line to your `build.sbt` (add it to the `.settings(...)` block in the example above):

```scala
SparkDeployerPlugin.localModeSettings
```
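With the earlier `build.sbt` example, the result might look like this (a sketch):

```scala
lazy val root = (project in file("."))
  .settings(
    name := "my-project-name",
    version := "0.1",
    scalaVersion := "2.10.6",
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
    ),
    // local-mode settings provided by the plugin
    SparkDeployerPlugin.localModeSettings
  )
```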
Then, just type `run` or `runMain` in sbt, and spark-deployer will set the master URL and app name for your `SparkContext`. You don't have to download the Spark tarball or assemble the project into a jar file.
If you don't want to use sbt, or if you would like to trigger the cluster creation from within your Scala application, you can include the spark-deployer library directly:

```scala
libraryDependencies += "net.pishen" %% "spark-deployer-core" % "2.8.2"
```
Then, from your Scala code, you can do something like this:

```scala
import java.io.File
import sparkdeployer._

val sparkDeployer = SparkDeployer.fromFile("path/to/spark-deployer.conf")

val numOfWorkers = 2
val jobJar = new File("path/to/job.jar")
val args = Seq("arg0", "arg1")

sparkDeployer.createCluster(numOfWorkers)
sparkDeployer.submitJob(jobJar, args)
sparkDeployer.destroyCluster()
```
- Environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` should also be set if you need them.
- You may prepare the `job.jar` with sbt-assembly from another sbt project that uses Spark (see the sketch after this list).
- For other available functions, check `SparkDeployer.scala` in our source code.
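As a rough sketch of that sbt-assembly setup (the plugin version below is an assumption; pick the one matching your sbt version), the job project would add the plugin to its `project/plugins.sbt`:

```scala
// project/plugins.sbt of the job project (version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```

Running `sbt assembly` in that project then produces the fat jar you can pass to `submitJob`.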
spark-deployer uses slf4j; remember to add your own logging backend to see the logs. For example, to print the logs on screen, add:

```scala
libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.14"
```