I am looking for the Cloudera build of this dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
How do I find the Cloudera version while keeping the Spark version the same for Spark Streaming?
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
I would also like to know how to do this for other versions of Spark.
I found this but I am still a bit lost.
In your pom.xml, all the Spark-related artifacts have to be at the same version, whether or not you're using the Cloudera-provided artifacts. In the pom.xml you posted, spark-streaming-kafka is at 1.6.1 while the other artifacts are at 1.6.0. The best way to keep them in sync is to split the version out into a property, call it spark.version, and reference it as ${spark.version} in your dependency versions. If you're using the Cloudera versions, use the version indicated in the URL from your post for the Spark artifacts (1.6.0-cdh5.7.6), or the one matching the version of CDH you're running, and use the Cloudera Maven repo per the link below.
Finally, if you're building a shaded jar, be aware that anything you include as a dependency could conflict with a dependency that Spark itself brings in at runtime via spark-submit. At the very least, set the "provided" scope on the Spark artifacts so they aren't included in any assembly or shaded jar. You may also need to set exclusions on dependencies and in the shade plugin configuration. Running "mvn dependency:tree" will help find those conflicts, but it can still be painful to get everything configured correctly.
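As a rough illustration of the exclusion side of this, here is a minimal sketch of a shade plugin configuration. The plugin version and the excluded artifact (com.example:conflicting-library) are placeholders I made up for the example, not something taken from your build; substitute whatever "mvn dependency:tree" shows as conflicting.

```xml
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <!-- Assumed version; use whichever release your build standardizes on -->
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <excludes>
                                <!-- Hypothetical artifact that clashes with what Spark ships -->
                                <exclude>com.example:conflicting-library</exclude>
                            </excludes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <!-- Drop signature files that become invalid once jars are merged -->
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```

Artifacts declared with "provided" scope are left out of the shaded jar by default, so the artifactSet excludes are only needed for things that would otherwise be bundled.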
Examples are below; the Cloudera docs on developing Spark apps have good examples too.
<properties>
    <spark.version>1.6.0-cdh5.7.6</spark.version>
</properties>
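To show how that property might be used, here is a minimal sketch of the matching repository and dependency sections. The repository id and URL are my assumption of the public Cloudera repository, so check them against the Cloudera docs. The artifact list simply mirrors the one in the question. Leaving the Kafka connector at the default scope is also an assumption, since it is typically not part of the Spark assembly; verify that for your cluster.

```xml
<repositories>
    <repository>
        <!-- Assumed id and URL for the Cloudera Maven repository -->
        <id>cloudera</id>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.10</artifactId>
        <version>${spark.version}</version>
        <!-- Supplied by spark-submit at runtime, so keep it out of the shaded jar -->
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.10</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka_2.10</artifactId>
        <!-- Same property, so this can no longer drift to 1.6.1 -->
        <version>${spark.version}</version>
    </dependency>
</dependencies>
```

To target a different Spark or CDH release, only the spark.version property needs to change, provided that version string exists in the repository for every artifact listed.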