Created on 07-07-2016 06:17 PM
To submit jobs to Spark, so-called "fat jars" (containing all dependencies) are quite useful. If you develop your code in Scala, "sbt" (http://www.scala-sbt.org) is a great choice to build your project. The following relies on the current version, sbt 0.13.
For fat jars you first need "sbt-assembly" (https://github.com/sbt/sbt-assembly). Assuming you have the standard sbt folder structure, the easiest way is to add a file "assembly.sbt" to the "project" folder, containing a single line:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
The project structure now looks like this (most probably without the "target" folder, which will be created when you build the project):
MyProject
+-- build.sbt
+-- project
|   +-- assembly.sbt
+-- src
|   +-- main
|       +-- scala
|           +-- MyProject.scala
+-- target
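For completeness, here is a minimal sketch of what "src/main/scala/MyProject.scala" could contain: a hypothetical word count over a Kafka topic using the Spark 1.6 direct stream API. The broker address and topic name are placeholders, not part of the original article.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical example job: counts words arriving on a Kafka topic.
// Broker list and topic name are placeholders -- adjust them to your cluster.
object MyProject {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyProject")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:6667")
    val topics = Set("test-topic")

    // Direct (receiver-less) stream, the recommended approach for Spark 1.6
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2)                 // drop the Kafka key, keep the message
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```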
For building Spark Kafka Streaming jobs on HDP 2.4.2, this is the build file "build.sbt":
name := "MyProject"

version := "0.1"

scalaVersion := "2.10.6"

resolvers += "Hortonworks Repository" at "http://repo.hortonworks.com/content/repositories/releases/"

resolvers += "Hortonworks Jetty Maven Repository" at "http://repo.hortonworks.com/content/repositories/jetty-hadoop/"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "1.6.1.2.4.2.0-258" % "provided",
  "org.apache.spark" % "spark-streaming-kafka-assembly_2.10" % "1.6.1.2.4.2.0-258"
)

assemblyMergeStrategy in assembly := {
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "squareup", xs @ _*) => MergeStrategy.last
  case PathList("com", "sun", xs @ _*) => MergeStrategy.last
  case PathList("com", "thoughtworks", xs @ _*) => MergeStrategy.last
  case PathList("commons-beanutils", xs @ _*) => MergeStrategy.last
  case PathList("commons-cli", xs @ _*) => MergeStrategy.last
  case PathList("commons-collections", xs @ _*) => MergeStrategy.last
  case PathList("commons-io", xs @ _*) => MergeStrategy.last
  case PathList("io", "netty", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("javax", "xml", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("org", "codehaus", xs @ _*) => MergeStrategy.last
  case PathList("org", "fusesource", xs @ _*) => MergeStrategy.last
  case PathList("org", "mortbay", xs @ _*) => MergeStrategy.last
  case PathList("org", "tukaani", xs @ _*) => MergeStrategy.last
  case PathList("xerces", xs @ _*) => MergeStrategy.last
  case PathList("xmlenc", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
1) The "resolvers" section adds the Hortonworks repositories.
2) In "libraryDependencies" you add the Spark-Streaming (which also pulls in Spark-Core) and Spark-Kafka-Streaming jars. To avoid problems with Kafka dependencies, it is best to use the "spark-streaming-kafka-assembly" fat jar.
Note that Spark-Streaming can be tagged as "provided" (it is omitted from the fat jar), since it is automatically available on the cluster when you submit a job.
3) Unfortunately, a lot of libraries are pulled in twice via transitive dependencies, which leads to assembly errors. To overcome this, the "assemblyMergeStrategy" section tells sbt-assembly to always use the last copy (which comes from the Spark jars). This list is handcrafted and might change with a new version of HDP; however, the idea should be clear.
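To see why the PathList cases work: assemblyMergeStrategy is essentially a Scala pattern match over jar entry paths split into components. Here is a simplified, standalone sketch of that dispatch logic in plain Scala; the Strategy type is a stand-in for illustration, not sbt-assembly's real MergeStrategy API.

```scala
// Simplified illustration of how assemblyMergeStrategy dispatches on paths.
// Strategy is an enum-like stand-in, not the actual sbt-assembly type.
object MergeDemo {
  sealed trait Strategy
  case object Last extends Strategy
  case object Rename extends Strategy
  case object Deduplicate extends Strategy

  // sbt-assembly splits a jar entry path into components; PathList matches on them.
  def strategyFor(path: String): Strategy = path.split('/').toList match {
    case "org" :: "apache" :: _ => Last        // e.g. duplicate Hadoop/Spark classes
    case "about.html" :: Nil    => Rename
    case _                      => Deduplicate // default: fail on differing duplicates
  }

  def main(args: Array[String]): Unit = {
    println(strategyFor("org/apache/commons/io/IOUtils.class")) // Last
    println(strategyFor("about.html"))                          // Rename
    println(strategyFor("META-INF/MANIFEST.MF"))                // Deduplicate
  }
}
```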
4) Assemble the project (the first run will "download the internet", just like Maven):
sbt assembly
This creates "target/scala-2.10/myproject-assembly-0.1.jar".
5) You can now submit it to Spark:
spark-submit --master yarn --deploy-mode client \
  --class my.package.MyProject \
  target/scala-2.10/myproject-assembly-0.1.jar