Created on 07-07-2016 06:17 PM
To submit jobs to Spark, so-called "fat jars" (containing all dependencies) are quite useful. If you develop your code in Scala, "sbt" (http://www.scala-sbt.org) is a great choice to build your project. The following relies on the current version, sbt 0.13.
For fat jars you first need "sbt-assembly" (https://github.com/sbt/sbt-assembly). Assuming you have the standard sbt folder structure, the easiest way is to add a file "assembly.sbt" to the "project" folder, containing a single line:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
The project structure now looks like this (most probably without the "target" folder, which will be created when you build the project):
MyProject
+-- build.sbt
+-- project
|   +-- assembly.sbt
+-- src
|   +-- main
|       +-- scala
|           +-- MyProject.scala
+-- target
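For completeness, here is a minimal sketch of what "src/main/scala/MyProject.scala" could contain: a hypothetical word count over a Kafka topic using the Spark 1.6 direct stream API. The broker address and topic name are placeholders, not part of the original article.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical example job: counts words arriving on a Kafka topic.
// Broker list and topic name are placeholders -- adjust them to your cluster.
object MyProject {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyProject")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:6667")
    val topics = Set("test-topic")

    // Direct (receiver-less) stream, the recommended approach for Spark 1.6
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2)                 // drop the Kafka key, keep the message
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```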
For building Spark Kafka Streaming jobs on HDP 2.4.2, this is the build file "build.sbt":
name := "MyProject"

version := "0.1"

scalaVersion := "2.10.6"

resolvers += "Hortonworks Repository" at "http://repo.hortonworks.com/content/repositories/releases/"

resolvers += "Hortonworks Jetty Maven Repository" at "http://repo.hortonworks.com/content/repositories/jetty-hadoop/"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "1.6.1.2.4.2.0-258" % "provided",
  "org.apache.spark" % "spark-streaming-kafka-assembly_2.10" % "1.6.1.2.4.2.0-258"
)

assemblyMergeStrategy in assembly := {
  case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
  case PathList("com", "squareup", xs @ _*) => MergeStrategy.last
  case PathList("com", "sun", xs @ _*) => MergeStrategy.last
  case PathList("com", "thoughtworks", xs @ _*) => MergeStrategy.last
  case PathList("commons-beanutils", xs @ _*) => MergeStrategy.last
  case PathList("commons-cli", xs @ _*) => MergeStrategy.last
  case PathList("commons-collections", xs @ _*) => MergeStrategy.last
  case PathList("commons-io", xs @ _*) => MergeStrategy.last
  case PathList("io", "netty", xs @ _*) => MergeStrategy.last
  case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
  case PathList("javax", "xml", xs @ _*) => MergeStrategy.last
  case PathList("org", "apache", xs @ _*) => MergeStrategy.last
  case PathList("org", "codehaus", xs @ _*) => MergeStrategy.last
  case PathList("org", "fusesource", xs @ _*) => MergeStrategy.last
  case PathList("org", "mortbay", xs @ _*) => MergeStrategy.last
  case PathList("org", "tukaani", xs @ _*) => MergeStrategy.last
  case PathList("xerces", xs @ _*) => MergeStrategy.last
  case PathList("xmlenc", xs @ _*) => MergeStrategy.last
  case "about.html" => MergeStrategy.rename
  case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
  case "META-INF/mailcap" => MergeStrategy.last
  case "META-INF/mimetypes.default" => MergeStrategy.last
  case "plugin.properties" => MergeStrategy.last
  case "log4j.properties" => MergeStrategy.last
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}
1) The "resolvers" section adds the Hortonworks repositories.
2) In "libraryDependencies" you add the Spark-Streaming (which also pulls in Spark-Core) and Spark-Kafka-Streaming jars. To avoid problems with Kafka dependencies, it is best to use the "spark-streaming-kafka-assembly" fat jar.
Note that Spark-Streaming can be tagged as "provided" (it is omitted from the fat jar), since it is automatically available on the cluster when you submit a job.
3) Unfortunately, a lot of libraries are pulled in twice via transitive dependencies, which leads to assembly errors. To overcome this, the "assemblyMergeStrategy" section tells sbt-assembly to always use the last copy (which comes from the Spark jars). This list is handcrafted and might change with a new version of HDP; however, the idea should be clear.
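To see why the PathList cases work: assemblyMergeStrategy is essentially a Scala pattern match over jar entry paths split into components. Here is a simplified, standalone sketch of that dispatch logic in plain Scala; the Strategy type is a stand-in for illustration, not sbt-assembly's real MergeStrategy API.

```scala
// Simplified illustration of how assemblyMergeStrategy dispatches on paths.
// Strategy is an enum-like stand-in, not the actual sbt-assembly type.
object MergeDemo {
  sealed trait Strategy
  case object Last extends Strategy
  case object Rename extends Strategy
  case object Deduplicate extends Strategy

  // sbt-assembly splits a jar entry path into components; PathList matches on them.
  def strategyFor(path: String): Strategy = path.split('/').toList match {
    case "org" :: "apache" :: _ => Last        // e.g. duplicate Hadoop/Spark classes
    case "about.html" :: Nil    => Rename
    case _                      => Deduplicate // default: fail on differing duplicates
  }

  def main(args: Array[String]): Unit = {
    println(strategyFor("org/apache/commons/io/IOUtils.class")) // Last
    println(strategyFor("about.html"))                          // Rename
    println(strategyFor("META-INF/MANIFEST.MF"))                // Deduplicate
  }
}
```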
4) Assemble the project (the first run will "download the internet", just like Maven):
sbt assembly
This creates "target/scala-2.10/myproject-assembly-0.1.jar".
5) You can now submit it to Spark:
spark-submit --master yarn --deploy-mode client \
  --class my.package.MyProject \
  target/scala-2.10/myproject-assembly-0.1.jar