Member since 10-07-2015
107 Posts
73 Kudos Received
23 Solutions

        My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3222 | 02-23-2017 04:57 PM |
| | 2563 | 12-08-2016 09:55 AM |
| | 10035 | 11-24-2016 07:24 PM |
| | 4849 | 11-24-2016 02:17 PM |
| | 10310 | 11-24-2016 09:50 AM |
07-13-2016 02:32 PM
The official 0.6.0 release is 16 days old (https://github.com/apache/zeppelin/releases), so I expect that the version in the Sandbox is a "0.6.0-preview", maybe closer to a 0.5.6 (I actually don't know).

Did you try to add

static {
    Interpreter.register("dummy", DummyInterpreter.class.getName());
}

to your DummyInterpreter? I understand this was necessary before 0.6.0 (it is now deprecated but still supported in 0.6.0).
07-13-2016 09:25 AM
Did you install the 2.4 Sandbox, or HDP 2.4.2 from scratch via Ambari? In my HDP 2.4.2 installation (via Ambari, not the Sandbox) I don't see Zeppelin. However, I had the same version 0.0.5 in another stack, but it came from a very early manual installation of Zeppelin via https://github.com/hortonworks-gallery/ambari-zeppelin-service - and Ambari still remembered it after the upgrade.

On my box I again manually installed Zeppelin. If I list <ZEPPELIN_ROOT>/zeppelin-server/target/*.jar, I get

zeppelin-server/target/zeppelin-server-0.6.0-SNAPSHOT.jar

Maybe you can try this on your machine to determine the actual Zeppelin version?
						
					
07-07-2016 06:17 PM

4 Kudos

In order to submit jobs to Spark, so-called "fat jars" (containing all dependencies) are quite useful. If you develop your code in Scala, "sbt" (http://www.scala-sbt.org) is a great choice to build your project. The following relies on the newest version, sbt 0.13.

For fat jars you first need "sbt-assembly" (https://github.com/sbt/sbt-assembly). Assuming you have the standard sbt folder structure, the easiest way is to add a file "assembly.sbt" to the "project" folder containing one line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

The project structure now looks like this (most probably without the "target" folder, which will be created when the project is built):

MyProject
+-- build.sbt
+-- project
|   +-- assembly.sbt
+-- src
|   +-- main
|       +-- scala
|           +-- MyProject.scala
+-- target

For building Spark Kafka Streaming jobs on HDP 2.4.2, this is the build file "build.sbt":

name := "MyProject"
version := "0.1"
scalaVersion := "2.10.6"
resolvers += "Hortonworks Repository" at "http://repo.hortonworks.com/content/repositories/releases/"
resolvers += "Hortonworks Jetty Maven Repository" at "http://repo.hortonworks.com/content/repositories/jetty-hadoop/"
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.10" % "1.6.1.2.4.2.0-258" % "provided",
  "org.apache.spark" % "spark-streaming-kafka-assembly_2.10" % "1.6.1.2.4.2.0-258"
)
assemblyMergeStrategy in assembly := {
    case PathList("com",   "esotericsoftware", xs @ _*) => MergeStrategy.last
    case PathList("com",   "squareup", xs @ _*) => MergeStrategy.last
    case PathList("com",   "sun", xs @ _*) => MergeStrategy.last
    case PathList("com",   "thoughtworks", xs @ _*) => MergeStrategy.last
    case PathList("commons-beanutils", xs @ _*) => MergeStrategy.last
    case PathList("commons-cli", xs @ _*) => MergeStrategy.last
    case PathList("commons-collections", xs @ _*) => MergeStrategy.last
    case PathList("commons-io", xs @ _*) => MergeStrategy.last
    case PathList("io",    "netty", xs @ _*) => MergeStrategy.last
    case PathList("javax", "activation", xs @ _*) => MergeStrategy.last
    case PathList("javax", "xml", xs @ _*) => MergeStrategy.last
    case PathList("org",   "apache", xs @ _*) => MergeStrategy.last
    case PathList("org",   "codehaus", xs @ _*) => MergeStrategy.last
    case PathList("org",   "fusesource", xs @ _*) => MergeStrategy.last
    case PathList("org",   "mortbay", xs @ _*) => MergeStrategy.last
    case PathList("org",   "tukaani", xs @ _*) => MergeStrategy.last
    case PathList("xerces", xs @ _*) => MergeStrategy.last
    case PathList("xmlenc", xs @ _*) => MergeStrategy.last
    case "about.html" => MergeStrategy.rename
    case "META-INF/ECLIPSEF.RSA" => MergeStrategy.last
    case "META-INF/mailcap" => MergeStrategy.last
    case "META-INF/mimetypes.default" => MergeStrategy.last
    case "plugin.properties" => MergeStrategy.last
    case "log4j.properties" => MergeStrategy.last
    case x =>
        val oldStrategy = (assemblyMergeStrategy in assembly).value
        oldStrategy(x)
}

1) The "resolvers" section adds the Hortonworks repositories.

2) In "libraryDependencies" you add the Spark Streaming jar (which also pulls in Spark Core) and the Spark Kafka Streaming jar. To avoid problems with Kafka dependencies it is best to use the "spark-streaming-kafka-assembly" fat jar. Note that Spark Streaming can be tagged as "provided" (it is omitted from the fat jar), since it is automatically available when you submit a job.

3) Unfortunately, a lot of libraries are imported twice due to transitive dependencies, which leads to assembly errors. To overcome this, the "assemblyMergeStrategy" section tells sbt-assembly to always use the last one (which is the one from the Spark jars). This list is handcrafted and might change with a new version of HDP; however, the idea should be clear.

4) Assemble the project (the first time you call it, it will "download the internet", like Maven):

sbt assembly

This will create "target/scala-2.10/myproject-assembly-0.1.jar".

5) You can now submit it to Spark:

spark-submit --master yarn --deploy-mode client \
             --class my.package.MyProject target/scala-2.10/myproject-assembly-0.1.jar
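
For reference, the MyProject.scala shown in the structure above could be a minimal Kafka word count along the following lines (just a sketch; the broker address, topic name, and batch interval are placeholders, and you would add your own package declaration so it matches the --class argument of spark-submit):

// Minimal Kafka word-count sketch for Spark 1.6 / spark-streaming-kafka (0.8 API)
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MyProject {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyProject")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // placeholder broker and topic - adjust to your cluster
    val kafkaParams = Map("metadata.broker.list" -> "sandbox.hortonworks.com:6667")
    val topics      = Set("test")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // stream elements are (key, message) tuples; count the words per batch
    stream.map(_._2)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}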
						
					
07-01-2016 02:43 PM
Having described all that, I still think the proper Spark way is to use

df.write.format("csv").save("/tmp/df.csv")

or

df.repartition(1).write.format("csv").save("/tmp/df.csv")
						
					
07-01-2016 02:40 PM
I am not sure that this is what you want. If you have more than one Spark executor, then every executor will independently write parts of the data (one per RDD partition). For example, with two executors it looks like:

hdfs dfs -ls /tmp/df.txt
Found 3 items
-rw-r--r--   3 root hdfs          0 2016-07-01 14:07 /tmp/df.txt/_SUCCESS
-rw-r--r--   3 root hdfs      83327 2016-07-01 14:07 /tmp/df.txt/part-00000
-rw-r--r--   3 root hdfs      83126 2016-07-01 14:07 /tmp/df.txt/part-00001

This is why the filename becomes a folder. When you use this folder name as input in other Hadoop tools, they will read all files below it (as if it were one file). It is all about supporting distributed computation and writes.

However, if you want to force a single "part" file, you need to force Spark to write with only one partition:

bank.rdd.repartition(1).saveAsTextFile("/tmp/df2.txt")

It then looks like:

hdfs dfs -ls /tmp/df2.txt
Found 2 items
-rw-r--r--   3 root hdfs          0 2016-07-01 16:34 /tmp/df2.txt/_SUCCESS
-rw-r--r--   3 root hdfs     166453 2016-07-01 16:34 /tmp/df2.txt/part-00000

Note the size and compare with the above. You can then copy it to whatever file name you want:

hdfs dfs -cp /tmp/df2.txt/part-00000 /tmp/df3.txt
						
					
07-01-2016 12:10 PM
You can also use

df.rdd.saveAsTextFile("/tmp/df.txt")

Again, this will be a folder with a file part-00000 holding lines like [abc,42].
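
If the bracketed Row format is not what you want, a small variation (just a sketch) is to turn each Row into a delimited string before saving:

// turn each Row into a comma-separated line, e.g. "abc,42" instead of "[abc,42]"
df.rdd.map(_.mkString(",")).saveAsTextFile("/tmp/df.txt")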
						
					
07-01-2016 12:03 PM
Since it is a DataFrame (a columnar representation), CSV is the best option to have it as a text file. Any issues with CSV?
						
					
07-01-2016 09:08 AM
Try

df.write.format("csv").save("/tmp/df.csv")

It will create a folder /tmp/df.csv in HDFS with part-00000 as the CSV.
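
Depending on the CSV data source you have available (e.g. the spark-csv package on Spark 1.6), you can also pass writer options, for example a header row (just a sketch):

// add a header line with the column names to the CSV output
df.write.format("csv").option("header", "true").save("/tmp/df.csv")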
						
					
05-10-2016 09:09 AM
							 For quoting identifiers see, e.g. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/hive-013-feature-quoted-identifiers.html 
						
					
05-10-2016 09:07 AM
Use backticks (`), not single quotes (').
						
					