What is the best way to develop Spark applications on your local computer? I'm using IntelliJ and trying to set the master, just for debugging purposes, to my remote HDP cluster so I can test code against Hive and other resources on my cluster. I'm using HDP 2.5.3 and I've added the spark libraries for scala 2.10 and spark 1.6.2 from the maven repository. I've set my build.sbt scalaVersion to 2.10.5 and added the library dependencies. As far as I can tell, I have the exact same versions that are running in HDP 2.5.3 in my project, but when I try to run the application pointing the SparkConf to my remote spark master I get the following error for an incompatible class:
java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 5009924811397974881, local class serialVersionUID = 7185378471520864965
Is there something I'm missing, or is there a better way to develop and test against the remote cluster?
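For context, the build.sbt described above would look roughly like this (a sketch; the project name is a placeholder, and whether Spark should be `provided` scope depends on how you plan to deploy):

```scala
// build.sbt - minimal sketch of the setup described above
name := "spark-debug-test" // hypothetical project name
version := "0.1"
scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // Apache coordinates for Spark 1.6.2 built against Scala 2.10
  "org.apache.spark" %% "spark-core" % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.2" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.6.2" % "provided"
)
```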
@Eric Hanson Which repository are you getting the Spark libraries from? Use the Hortonworks repo. Check out this documentation on how to build Spark streaming apps. It can be adapted to SBT and non-streaming apps.
Also, HDP 2.5 includes two different versions of Spark. Check the settings for SPARK_HOME. For 1.6 use:
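For example (assuming the standard HDP install layout; verify the actual paths on your cluster nodes):

```shell
# Spark 1.6 client on HDP 2.5 - standard location under /usr/hdp
export SPARK_HOME=/usr/hdp/current/spark-client

# The Spark 2.0 technical preview, if installed, lives alongside it:
# export SPARK_HOME=/usr/hdp/current/spark2-client
```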
Thank you @cduby. I'm using sbt, but maybe I should use a Maven project instead. I tried to translate the examples in the link you shared to sbt dependencies and it seemed to work for all of the Apache dependencies, but then I get an error for an unresolved dependency on "org.mortbay.jetty#jetty;6.1.26.hwx" and "org.mortbay.jetty#jetty-util;6.1.26.hwx", which I didn't have as a dependency in my project. I tried adding library dependencies in my build.sbt for them, but still get the error. I looked on the repo site and all I found was /org/mortbay/jetty/project/6.1.26.hwx as a directory. I don't understand why I have a dependency on this or how to resolve it. Do you know how to resolve this error? I may try creating a Maven project and see if I still get the error message. Thanks again for your help.
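The ".hwx" suffix on those jetty artifacts marks Hortonworks' patched builds, which only exist in the Hortonworks repository, so sbt can't resolve them from Maven Central. Adding the Hortonworks resolver to build.sbt should let them resolve (a sketch; these are the public Hortonworks repo URLs, verify they match your HDP docs):

```scala
// build.sbt - add the Hortonworks repository so the *.hwx artifacts resolve
resolvers += "Hortonworks Releases" at "http://repo.hortonworks.com/content/repositories/releases/"

// Some HDP builds pull transitive dependencies from the public group repo as well
resolvers += "Hortonworks Public" at "http://repo.hortonworks.com/content/groups/public/"
```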
I created a Maven project instead and didn't get the dependency resolution errors, but I'm getting class not found errors when I try to run the program. Is the only way to run your code on a cluster to do a spark-submit? I was really wanting to just debug my code as I add to it, to make sure it behaves as expected, before moving it to the cluster and submitting it as a job.
If you have any questions on testing Spark let me know. Holden's spark-testing-base is a very good way to do this.
Also, going line by line in Zeppelin is a great way to debug Spark code.
Thank you @Timothy Spann. I've had problems testing in Zeppelin before when I need to import Kafka streaming, but I'll give it another go. I've been able to test using the local master for the most part, but now I'm trying to test out some interaction with Hive and possibly HBase, and that's why I have been trying to run the code against my cluster where the Hive and HBase services are running.
Make the Spark version 1.6.2 and not all the sub-version numbers.
Make Kafka lowercase ("kafka") in the reference, just in case.
What JDK are you using to compile?
Is this Linux, Windows or OSX you are compiling from?
Make sure you are building with 1.6.2 and pushing to 1.6.2, having both versions may confuse things.
Make sure you have no firewall issues.
If you run from the command line on your PC (spark-submit), does that work? Is it only an issue in IntelliJ?
@Timothy Spann I'm using the Java 1.8 JDK, compiling on Windows. I'm pretty sure I'm building and pushing to 1.6.2. I have a couple of versions of Spark on my Windows machine, but my SPARK_HOME is set to the 1.6.2 version. I tried submitting it from the command line on my local machine with the --master option set to my remote cluster and received a class not found error, but this time it was for KafkaUtils and not the org.apache.spark.rdd.RDD class that I get when I try to run it from IntelliJ.
A full stack trace would help understand which interaction is resulting in this.
If IDE based code is being used then you could try to not use the spark-assembly jar that is present on HDFS and instead use the local spark-assembly jar from the Spark build being compiled against. This could be done by overriding spark.yarn.jar config. Could be the that compile dependency of Spark in your IDE is different from the runtime dependency on HDFS.
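The spark.yarn.jar override described above could look something like this (a sketch; the app name and assembly jar path are assumptions — point it at the assembly from the Spark build your project compiles against):

```scala
import org.apache.spark.SparkConf

// A sketch, assuming a local Spark 1.6.2 build; the path below is a placeholder
val conf = new SparkConf()
  .setAppName("debug-test") // hypothetical app name
  .setMaster("yarn-client")
  // Override the spark-assembly jar on HDFS with the locally built one,
  // so compile-time and runtime Spark versions match
  .set("spark.yarn.jar",
       "local:/path/to/spark-1.6.2/assembly/target/scala-2.10/spark-assembly-1.6.2-hadoop2.7.1.jar")
```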
Another thing could be scala version mismatch.
I have been able to fix the issue for the invalid class exception on the spark.rdd class, but now I'm getting errors for other classes and anonymous functions in my class. I created a new project with Maven dependencies instead of sbt and followed the instructions for pointing to the Hortonworks repo as @cduby suggested. I used the full HDP Spark version in my dependencies (1.6.2.2.5.3.0-37). I then removed the Spark I had downloaded locally and instead downloaded the 1.6.2 source code and built it with Scala 2.10.5. Now when I run my Kafka streaming Spark program from the IDE using SparkConf().setMaster("spark://myRemotecluster.domain.com:7077"), I get a class not found exception for org.apache.spark.streaming.kafka.KafkaReceiver instead. I tried running the SparkPi example class from the Spark examples the same way as I ran my streaming project, so I could test whether the core Spark libraries would run, and I got a class not found exception for com.mycompany.scala.SparkPi$$anonfun$1, so I'm thinking this is a problem with my classes needing to be available on the remote cluster. Is there something I'm missing, or am I just going about this the wrong way? If I could run in local mode and still interact with Hive, HBase, and Hadoop I wouldn't care so much about running my program from the IDE before submitting it to the server.
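For reference, the HDP-versioned Maven dependency described above would look something like this (a sketch of one dependency; the version string is the HDP 2.5.3 build of Spark 1.6.2, and it requires the Hortonworks repo in the pom's `<repositories>` section):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.2.2.5.3.0-37</version>
  <scope>provided</scope>
</dependency>
```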
@Eric Hanson I think the problem is that the Maven pom is not creating an uber jar. When a Spark job runs remotely, all the jars for the job need to be sent to the worker nodes. The class not found is because some of the jars are missing. The build goes OK because the build system has the jars, but when the jar is packaged, it doesn't contain all the dependent jars it needs. This Stack Overflow article has a good description:
Also see the optional instruction 3 on building an uber jar (one that contains all the dependencies)
Below instruction 3 are the command lines to use whether an uber jar is created or not. If an uber jar is created you just specify the jar in the command line. If it is not an uber jar, you need to pass in all the dependent jars as well.
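A minimal maven-shade-plugin configuration for building the uber jar might look like this (a sketch; the plugin version is an assumption, and Spark itself should stay `provided` scope so it isn't bundled into the jar):

```xml
<!-- In <build><plugins> of the pom: repackage the jar with all
     non-provided dependencies included -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```

After `mvn package`, the shaded jar in `target/` contains the project classes plus the dependent jars, so the worker nodes can find classes like the `$$anonfun$1` ones mentioned above.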
@cduby All of the examples I see are for doing a spark-submit. I'm trying to run the code within the IntelliJ IDE by right clicking on the class and selecting run. I was doing this by setting the master in the code to local and it worked fine up until the point where I wanted to use hive and then I started getting errors, so I tried to set the master to my remote cluster instead. Is there a better way to do that, so you can actually run your code before you package it and submit it and check the logs to see if it worked?
@Eric Hanson I have not used IntelliJ so I can't advise on options there. However try building the uber jar, scp it to your edge node and use spark-submit. This will verify you have the correct jar building.
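For example, from the edge node (the class name and jar name are placeholders for your own):

```shell
# Submit the uber jar built by mvn package; adjust the class and jar names
spark-submit \
  --class com.mycompany.scala.MyStreamingApp \
  --master yarn \
  --deploy-mode client \
  my-app-1.0-SNAPSHOT.jar
```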
As for local testing the spark-testing-base mentioned in Tim's article looks like it will work for unit tests but at some point you are going to need to run it on a remote cluster.