Member since: 10-07-2015
Posts: 107
Kudos Received: 73
Solutions: 23
My Accepted Solutions
Views | Posted
---|---
2427 | 02-23-2017 04:57 PM
1909 | 12-08-2016 09:55 AM
8749 | 11-24-2016 07:24 PM
3886 | 11-24-2016 02:17 PM
9171 | 11-24-2016 09:50 AM
07-13-2016
02:32 PM
The official 0.6.0 release is 16 days old (https://github.com/apache/zeppelin/releases), so I expect that the version in the Sandbox is a "0.6.0-preview", maybe more like a 0.5.6 (I actually don't know). Did you try to add
static {
    Interpreter.register("dummy", DummyInterpreter.class.getName());
}
to your DummyInterpreter? I understand this was necessary before 0.6.0 (it is now deprecated but still supported in 0.6.0).
07-13-2016
09:25 AM
Did you install the 2.4 Sandbox, or HDP 2.4.2 from scratch via Ambari? In my HDP 2.4.2 installation (via Ambari, not the Sandbox) I don't see Zeppelin. However, I had the same version 0.0.5 in another stack; it came from a very early manual installation of Zeppelin via https://github.com/hortonworks-gallery/ambari-zeppelin-service, and Ambari still remembered it after the upgrade. On my box I again installed Zeppelin manually. If I list <ZEPPELIN_ROOT>/zeppelin-server/target/*.jar, I get
zeppelin-server/target/zeppelin-server-0.6.0-SNAPSHOT.jar
Maybe you can try this on your machine to determine the actual Zeppelin version?
07-07-2016
06:17 PM
4 Kudos
In order to submit jobs to Spark, so-called "fat jars" (containing all dependencies) are quite useful. If you develop your code in Scala, sbt (http://www.scala-sbt.org) is a great choice to build your project. The following relies on a recent version, sbt 0.13.
For fat jars you first need the sbt-assembly plugin (https://github.com/sbt/sbt-assembly). Assuming you have the standard sbt folder structure, the easiest way is to add a file "assembly.sbt" to the "project" folder, containing this one line:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
The project structure now looks like this (most probably without the "target" folder, which will only be created when the project is built):
MyProject
+-- build.sbt
+-- project
| +-- assembly.sbt
+-- src
| +-- main
| +-- scala
| +-- MyProject.scala
+-- target
For building Spark Kafka Streaming jobs on HDP 2.4.2, this is the build file build.sbt:
name := "MyProject"
version := "0.1"
scalaVersion := "2.10.6"
resolvers += "Hortonworks Repository" at "http://repo.hortonworks.com/content/repositories/releases/"
resolvers += "Hortonworks Jetty Maven Repository" at "http://repo.hortonworks.com/content/repositories/jetty-hadoop/"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1.2.4.2.0-258" % "provided",
"org.apache.spark" % "spark-streaming-kafka-assembly_2.10" % "1.6.1.2.4.2.0-258"
)
assemblyMergeStrategy in assembly := {
case PathList("com", "esotericsoftware", xs @ _*) => MergeStrategy.last
case PathList("com", "squareup", xs @ _*) => MergeStrategy.last
case PathList("com", "sun", xs @ _*) => MergeStrategy.last
case PathList("com", "thoughtworks", xs @ _*) => MergeStrategy.last
case PathList("commons-beanutils", xs @ _*) => MergeStrategy.last
case PathList("commons-cli", xs @ _*) => MergeStrategy.last
case PathList("commons-collections", xs @ _*) => MergeStrategy.last
case PathList("commons-io", xs @ _*) => MergeStrategy.last
case PathList("io", "netty", xs @ _*) => MergeStrategy.last
case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
case PathList("javax", "xml", xs @ _*) => MergeStrategy.last
case PathList("org", "apache", xs @ _*) => MergeStrategy.last
case PathList("org", "codehaus", xs @ _*) => MergeStrategy.last
case PathList("org", "fusesource", xs @ _*) => MergeStrategy.last
case PathList("org", "mortbay", xs @ _*) => MergeStrategy.last
case PathList("org", "tukaani", xs @ _*) => MergeStrategy.last
case PathList("xerces", xs @ _*) => MergeStrategy.last
case PathList("xmlenc", xs @ _*) => MergeStrategy.last
case "about.html" => MergeStrategy.rename
case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
case "META-INF/mailcap" => MergeStrategy.last
case "META-INF/mimetypes.default" => MergeStrategy.last
case "plugin.properties" => MergeStrategy.last
case "log4j.properties" => MergeStrategy.last
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
1) The "resolvers" section adds the Hortonworks repositories.
2) In "libraryDependencies" you add the Spark-Streaming (which will also pull in Spark-Core) and Spark-Kafka-Streaming jars. To avoid problems with Kafka dependencies, it is best to use the "spark-streaming-kafka-assembly" fat jar. Note that Spark-Streaming can be tagged as "provided" (it is omitted from the fat jar), since it is automatically available when you submit a job.
3) Unfortunately, a lot of libraries are pulled in twice via transitive dependencies, which leads to assembly errors. To overcome this, the "assemblyMergeStrategy" section tells sbt-assembly to always use the last one (which is from the Spark jars). This list is handcrafted and might change with a new version of HDP; however, the idea should be clear.
4) Assemble the project (the first time you call it, it will "download the internet" like Maven):
sbt assembly
This will create "target/scala-2.10/myproject-assembly-0.1.jar".
5) You can now submit it to Spark:
spark-submit --master yarn --deploy-mode client \
    --class my.package.MyProject target/scala-2.10/myproject-assembly-0.1.jar
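For completeness, here is a minimal sketch of what src/main/scala/MyProject.scala could look like with the build file above. It is only an illustration: the broker address (using HDP's default Kafka port 6667), the topic name and the per-batch count are placeholders, and the sketch omits a package declaration, so it would be submitted with --class MyProject rather than my.package.MyProject.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MyProject {
  def main(args: Array[String]): Unit = {
    // Spark itself is "provided"; master and deploy mode come from spark-submit
    val conf = new SparkConf().setAppName("MyProject")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder Kafka settings -- replace broker list and topic with your own
    val kafkaParams = Map("metadata.broker.list" -> "broker1.example.com:6667")
    val topics = Set("my-topic")

    // Direct stream of (key, value) String pairs from Kafka
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Do something minimal per batch: count the messages and print the count
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
The "spark-streaming-kafka-assembly" dependency from build.sbt should bring in both KafkaUtils and the Kafka StringDecoder, so no additional Kafka dependency is needed.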
07-01-2016
02:43 PM
Having described all that, I still think the proper Spark way is to use
df.write.format("csv").save("/tmp/df.csv")
or
df.repartition(1).write.format("csv").save("/tmp/df.csv")
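One caveat, hedged since it depends on your setup: HDP 2.4.2 ships Spark 1.6.1, where the csv data source is not built in (it only became built-in with Spark 2.0) but comes from the external spark-csv package. A minimal sketch, assuming the shell or job was started with --packages com.databricks:spark-csv_2.10:1.4.0:
// "csv" resolves to the spark-csv data source once the package is on the classpath;
// the fully qualified name "com.databricks.spark.csv" works as well
df.write
  .format("csv")
  .option("header", "true")   // optional: write a header line
  .save("/tmp/df.csv")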
07-01-2016
02:40 PM
I am not sure that this is what you want. If you have more than one Spark executor, then every executor independently writes its parts of the data (one file per RDD partition). For example, with two executors it looks like this:
hdfs dfs -ls /tmp/df.txt
Found 3 items
-rw-r--r-- 3 root hdfs 0 2016-07-01 14:07 /tmp/df.txt/_SUCCESS
-rw-r--r-- 3 root hdfs 83327 2016-07-01 14:07 /tmp/df.txt/part-00000
-rw-r--r-- 3 root hdfs 83126 2016-07-01 14:07 /tmp/df.txt/part-00001
This is why the given filename becomes a folder. When you use this folder name as input in other Hadoop tools, they read all files below it (as if it were one file). It is all about supporting distributed computation and writes.
However, if you want to force a single "part" file, you need to force Spark to write with a single partition:
bank.rdd.repartition(1).saveAsTextFile("/tmp/df2.txt")
It then looks like this:
hdfs dfs -ls /tmp/df2.txt
Found 2 items
-rw-r--r-- 3 root hdfs 0 2016-07-01 16:34 /tmp/df2.txt/_SUCCESS
-rw-r--r-- 3 root hdfs 166453 2016-07-01 16:34 /tmp/df2.txt/part-00000
Note the size and compare it with the listing above. You can then copy the part file to whatever target file you want:
hdfs dfs -cp /tmp/df2.txt/part-00000 /tmp/df3.txt
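To illustrate the "read all files below" behaviour: reading the folder back, for example from the Spark shell, treats the part files as a single dataset (the empty _SUCCESS marker is skipped):
// both part-00000 and part-00001 are read as one RDD
val lines = sc.textFile("/tmp/df.txt")
lines.count()   // same total number of lines as in the single part-00000 under /tmp/df2.txt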
07-01-2016
12:10 PM
You can also use
df.rdd.saveAsTextFile("/tmp/df.txt")
Again, this will be a folder with a file part-00000 holding lines like [abc,42].
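If you prefer plain delimited lines instead of the [abc,42] rendering of each Row, one option (a sketch; the separator is your choice) is to format the rows yourself before saving:
// turn each Row into a comma-separated line before writing
df.rdd.map(_.mkString(",")).saveAsTextFile("/tmp/df.txt")
Row.mkString is provided by Spark itself, so no extra imports are needed.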
07-01-2016
12:03 PM
Since it is a DataFrame (a columnar representation), csv is the best option for getting it into a text file. Any issues with csv?
07-01-2016
09:08 AM
Try
df.write.format("csv").save("/tmp/df.csv")
It will create a folder /tmp/df.csv in HDFS with part-00000 as the csv.
05-10-2016
09:09 AM
For quoting identifiers see, e.g. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/hive-013-feature-quoted-identifiers.html
05-10-2016
09:07 AM
Use backticks (`), not single quotes (').
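A quick illustration with hypothetical table and column names (issued through Spark's sqlContext here only to stay consistent with the other snippets; the backtick quoting itself is plain HiveQL):
// `select` is a reserved word and `column name` contains a space,
// so both identifiers are quoted with backticks
sqlContext.sql("SELECT `select`, `column name` FROM `my_table`").show()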