Member since: 10-07-2015
Posts: 107
Kudos Received: 73
Solutions: 23
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2521 | 02-23-2017 04:57 PM
 | 1972 | 12-08-2016 09:55 AM
 | 8857 | 11-24-2016 07:24 PM
 | 3957 | 11-24-2016 02:17 PM
 | 9316 | 11-24-2016 09:50 AM
07-13-2016
02:32 PM
The official 0.6.0 release is only 16 days old (https://github.com/apache/zeppelin/releases), so I expect the version in the Sandbox is a "0.6.0-preview", or maybe rather a 0.5.6 (I actually don't know). Did you try adding
static {
    Interpreter.register("dummy", DummyInterpreter.class.getName());
}
to your DummyInterpreter? As far as I understand, this was necessary before 0.6.0 (it is deprecated now, but still supported in 0.6.0).
07-13-2016
09:25 AM
Did you install the 2.4 Sandbox, or HDP 2.4.2 from scratch via Ambari? In my HDP 2.4.2 installation (via Ambari, not the Sandbox) I don't see Zeppelin. However, I had the same version 0.0.5 in another stack; it came from a very early manual installation of Zeppelin via https://github.com/hortonworks-gallery/ambari-zeppelin-service, and Ambari still remembered it after the upgrade. On my box I again installed Zeppelin manually. If I list <ZEPPELIN_ROOT>/zeppelin-server/target/*.jar, I get
zeppelin-server/target/zeppelin-server-0.6.0-SNAPSHOT.jar
Maybe you can try this on your machine to determine the actual Zeppelin version?
07-07-2016
06:17 PM
4 Kudos
In order to submit jobs to Spark, so-called "fat jars" (containing all dependencies) are quite useful. If you develop your code in Scala, "sbt" (http://www.scala-sbt.org) is a great choice to build your project. The following relies on the newest version, sbt 0.13.
For fat jars you first need "sbt-assembly" (https://github.com/sbt/sbt-assembly). Assuming you use the standard sbt folder structure, the easiest way is to add a file "assembly.sbt" to the "project" folder, containing the single line
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
The project structure now looks like this (most probably without the "target" folder, which only gets created when the project is built):
MyProject
+-- build.sbt
+-- project
|   +-- assembly.sbt
+-- src
|   +-- main
|       +-- scala
|           +-- MyProject.scala
+-- target
For building Spark Kafka Streaming jobs on HDP 2.4.2, this is the build file "build.sbt":
name := "MyProject"
version := "0.1"
scalaVersion := "2.10.6"
resolvers += "Hortonworks Repository" at "http://repo.hortonworks.com/content/repositories/releases/"
resolvers += "Hortonworks Jetty Maven Repository" at "http://repo.hortonworks.com/content/repositories/jetty-hadoop/"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1.2.4.2.0-258" % "provided",
"org.apache.spark" % "spark-streaming-kafka-assembly_2.10" % "1.6.1.2.4.2.0-258"
)
assemblyMergeStrategy in assembly := {
case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
case PathList("com", "squareup", xs @ _*) => MergeStrategy.last
case PathList("com", "sun", xs @ _*) => MergeStrategy.last
case PathList("com", "thoughtworks", xs @ _*) => MergeStrategy.last
case PathList("commons-beanutils", xs @ _*) => MergeStrategy.last
case PathList("commons-cli", xs @ _*) => MergeStrategy.last
case PathList("commons-collections", xs @ _*) => MergeStrategy.last
case PathList("commons-io", xs @ _*) => MergeStrategy.last
case PathList("io", "netty", xs @ _*) => MergeStrategy.last
case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
case PathList("javax", "xml", xs @ _*) => MergeStrategy.last
case PathList("org", "apache", xs @ _*) => MergeStrategy.last
case PathList("org", "codehaus", xs @ _*) => MergeStrategy.last
case PathList("org", "fusesource", xs @ _*) => MergeStrategy.last
case PathList("org", "mortbay", xs @ _*) => MergeStrategy.last
case PathList("org", "tukaani", xs @ _*) => MergeStrategy.last
case PathList("xerces", xs @ _*) => MergeStrategy.last
case PathList("xmlenc", xs @ _*) => MergeStrategy.last
case "about.html" => MergeStrategy.rename
case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
case "META-INF/mailcap" => MergeStrategy.last
case "META-INF/mimetypes.default" => MergeStrategy.last
case "plugin.properties" => MergeStrategy.last
case "log4j.properties" => MergeStrategy.last
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
1) The "resolvers" section adds the Hortonworks repositories.
2) In "libraryDependencies" you add Spark Streaming (which will also pull in Spark Core) and the Spark Kafka Streaming jars. To avoid problems with Kafka dependencies, it is best to use the "spark-streaming-kafka-assembly" fat jar. Note that Spark Streaming can be tagged as "provided" (it is omitted from the fat jar), since it is automatically available when you submit a job.
3) Unfortunately, a lot of libraries get pulled in twice via transitive dependencies, which leads to assembly errors. To overcome this, the "assemblyMergeStrategy" section tells sbt-assembly to always use the last one (which comes from the Spark jars). This list is handcrafted and might change with a new version of HDP; however, the idea should be clear.
4) Assemble the project (the first time you run it, it will "download the internet", just like Maven):
sbt assembly
This creates "target/scala-2.10/myproject-assembly-0.1.jar".
5) You can now submit it to Spark:
spark-submit --master yarn --deploy-mode client \
  --class my.package.MyProject target/scala-2.10/myproject-assembly-0.1.jar
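Just as an illustration, "MyProject.scala" could contain something like the following minimal direct Kafka word count (Spark 1.6 API; broker address, topic name and batch interval are placeholders, and the package declaration is omitted):
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MyProject {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyProject")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder broker list and topic -- adjust to your cluster (HDP Kafka usually listens on port 6667)
    val kafkaParams = Map("metadata.broker.list" -> "localhost:6667")
    val topics = Set("test")

    // Receiver-less ("direct") Kafka stream of (key, value) string pairs
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    // Count words per batch and print the result to the driver log
    stream.map(_._2)
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}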
07-01-2016
02:43 PM
Having described all that, I still think the proper Spark way is to use
df.write.format("csv").save("/tmp/df.csv")
or
df.repartition(1).write.format("csv").save("/tmp/df.csv")
07-01-2016
02:40 PM
I am not sure that this is what you want. If you have more than one Spark executor, then every executor will independently write its parts of the data (one file per RDD partition). For example, with two executors it looks like this:
hdfs dfs -ls /tmp/df.txt
Found 3 items
-rw-r--r-- 3 root hdfs 0 2016-07-01 14:07 /tmp/df.txt/_SUCCESS
-rw-r--r-- 3 root hdfs 83327 2016-07-01 14:07 /tmp/df.txt/part-00000
-rw-r--r-- 3 root hdfs 83126 2016-07-01 14:07 /tmp/df.txt/part-00001
This is why the given filename becomes a folder. When you use this folder name as input in other Hadoop tools, they will read all files below it (as if it were one file). It is all about supporting distributed computation and distributed writes.
However, if you want to force a single "part" file, you need to repartition the data into a single partition, so that only one task writes:
bank.rdd.repartition(1).saveAsTextFile("/tmp/df2.txt")
It then looks like this:
hdfs dfs -ls /tmp/df2.txt
Found 2 items
-rw-r--r-- 3 root hdfs 0 2016-07-01 16:34 /tmp/df2.txt/_SUCCESS
-rw-r--r-- 3 root hdfs 166453 2016-07-01 16:34 /tmp/df2.txt/part-00000
Note the size and compare with the listing above. You can then copy it to whatever file you want:
hdfs dfs -cp /tmp/df2.txt/part-00000 /tmp/df3.txt
07-01-2016
12:10 PM
You can also use
df.rdd.saveAsTextFile("/tmp/df.txt")
Again, this will be a folder containing a file part-00000 that holds lines like [abc,42].
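If the bracketed Row format is not what you want, one option is to flatten each Row into a delimited string first (just a sketch, with a comma as an example separator):
df.rdd.map(row => row.mkString(",")).saveAsTextFile("/tmp/df.txt")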
07-01-2016
12:03 PM
Since it is a DataFrame (a columnar representation), CSV is the best option for storing it as a text file. Do you have any issues with CSV?
07-01-2016
09:08 AM
Try
df.write.format("csv").save("/tmp/df.csv")
It will create a folder /tmp/df.csv in HDFS with part-00000 as the CSV file.
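Note that on Spark 1.6 (the version shipped with HDP 2.4.x), the "csv" data source usually comes from the external spark-csv package; if the short name is not recognized, you can reference the data source by its full name (the package version here is only an example, e.g. com.databricks:spark-csv_2.10:1.4.0 added via spark-submit --packages):
df.write.format("com.databricks.spark.csv").option("header", "true").save("/tmp/df.csv")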
05-10-2016
09:09 AM
For quoting identifiers see, e.g., https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/hive-013-feature-quoted-identifiers.html
05-10-2016
09:07 AM
Use backticks (`), not single quotes (').
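For example (the table and column names are made up), an identifier containing a space or a reserved word is quoted with backticks:
sqlContext.sql("SELECT `my column`, `select` FROM my_table").show()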