New Contributor
Posts: 4
Registered: ‎09-16-2015

Re: Spark distributed classpath

Hi zlpmichelle,

 

This problem came from the fact that I wasn't using a CDH artifact in my Maven dependencies. If you package guava 16.0.1 into your jar, do you still get this problem?

 

It is still a bit obscure to me how exactly classpath.txt works and why it mixes several versions of the same API :(

Contributor
Posts: 41
Registered: ‎02-23-2016

Re: Spark distributed classpath

Agreed, andreF. Thanks for your response.

 

I packaged guava 16.0.1 into my jar and still get the same problem. I guess there is a guava version conflict, since there are several different guava versions in CDH's classpath.txt.

 

I am really confused about this.

Cloudera Employee
Posts: 277
Registered: ‎01-16-2014

Re: Spark distributed classpath

If you need a specific version of guava, you cannot just add it to the classpath. If you do, you rely entirely on the randomness of the class loaders, and there is no guarantee that the proper version of guava will be loaded.

 

The first thing you need to do is make sure that the proper version of guava is loaded at all times. The proper way to do this is to shade (Maven) or shadow (Gradle) your guava. Check the web for how to do this; it is really the only way to make sure you get the correct version without breaking the rest of Hadoop at the same time.
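
As a rough illustration only (this is not the build used in this thread, and the plugin version, repository, and relocated package name are assumptions you would adapt to your own project), shadowing guava with the Gradle Shadow plugin looks roughly like this:

// build.gradle sketch: bundle guava 16.0.1 and relocate it so the
// application never resolves guava classes from the CDH/Hadoop classpath.
buildscript {
    repositories { jcenter() }
    dependencies {
        classpath 'com.github.jengelman.gradle.plugins:shadow:1.2.3'
    }
}

apply plugin: 'java'
apply plugin: 'com.github.johnrengelman.shadow'

dependencies {
    compile 'com.google.guava:guava:16.0.1'
}

shadowJar {
    // Rewrites every reference to com.google.common in your classes so it
    // points at the copy bundled inside the shadow jar; "myshaded" is a
    // placeholder package prefix.
    relocate 'com.google.common', 'myshaded.com.google.common'
}

Running the shadowJar task then produces a jar whose guava cannot collide with the guava versions already present on the cluster, because the package names no longer match.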

 

After that is done, you need to use the classpath addition discussed earlier and make sure that you add your shaded version.

 

This is the only way to do this without being vulnerable to changes in the Hadoop dependencies.

 

Wilfred

Contributor
Posts: 41
Registered: ‎02-23-2016

Re: Spark distributed classpath

Thanks very much, Wilfred, for your helpful suggestion! We tried shadow in Gradle before; at that time it still had the guava NoSuchMethodError issue. We'll try shadow again, adding guava-16.0.1.jar to the Oozie sharelib, to see whether it works.

Contributor
Posts: 41
Registered: ‎02-23-2016

Re: Spark distributed classpath

We used the Gradle shadow plugin to shadow guava-16.0.1 into sparktest.jar, put the shadowed sparktest.jar into Oozie's sharelib classpath, and then ran it in CDH Oozie. It throws the following exception:


Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, org.jets3t.service.ServiceException: Request Error: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.ServiceException: Request Error: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:478)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at org.apache.hadoop.fs.s3native.$Proxy48.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:468)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:272)
at org.apache.hadoop.fs.Globber.glob(Globber.java:151)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1653)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
at org.apache.spark.rdd.RDD.count(RDD.scala:1121)
at com.gridx.spark.MeterReadingLoader$.load(MeterReadingLoader.scala:120)
at com.gridx.spark.MeterReadingLoader$.main(MeterReadingLoader.scala:101)
at com.gridx.spark.MeterReadingLoader.main(MeterReadingLoader.scala)

 

 

According to http://quabr.com/20613854/exception-when-getting-a-list-of-buckets-on-s3-using-jets3t, we changed httpclient from the 4.2.5 jar to the 4.2 jar in the HDFS Oozie shared lib, and it then throws the following exception:


JA008: File does not exist: hdfs://ip-10-0-4-248.us-west-1.compute.internal:8020/user/oozie/share/lib/lib_20151201085935/spark/httpclient-4.2.5.jar

Cloudera Employee
Posts: 277
Registered: ‎01-16-2014

Re: Spark distributed classpath

You cannot just replace a file in HDFS and expect it to be picked up. The files are localised during the run, and there is a check to make sure that the files are where they should be. See the blog on how the sharelib works.

 

The OOTB version of Spark that we deliver with CDH does not throw the error you show. It runs with the provided HTTP client, so I doubt that replacing the jar is the proper solution. Most likely a mismatch in one of the other jars is causing this error.

 

Wilfred

Contributor
Posts: 41
Registered: ‎02-23-2016

Re: Spark distributed classpath

Thanks Wilfred.

 

This issue is from http://community.cloudera.com/t5/Batch-Processing-and-Workflow/how-to-add-external-guava-16-0-1-jar-...

 

Does that mean there is no way to make CDH 5.5.0 Hue/Oozie support a Spark action (Spark 1.5.0) that writes data into Cassandra 2.1.11?

Cloudera Employee
Posts: 277
Registered: ‎01-16-2014

Re: Spark distributed classpath

You have most likely pulled in too many dependencies when building your application. The Gradle documentation shows that it behaves differently from Maven: when you package an application, Gradle includes far more dependencies than Maven does. This could have pulled in dependencies that you don't want or need.

Make sure that the application only contains what you really need and what is not already provided by Hadoop. Search for Gradle dependency management; you need some way to define a "provided" scope in Gradle.
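
As a minimal sketch of what a "provided"-style setup could look like (assuming Gradle 2.12 or later, which added the compileOnly configuration; the artifact coordinates and versions below are assumptions you would match to your cluster):

// build.gradle sketch: compile against what the cluster already provides,
// but keep it out of the packaged jar; bundle only what the cluster lacks.
apply plugin: 'java'

dependencies {
    // Provided by CDH at runtime: do not package these.
    compileOnly 'org.apache.spark:spark-core_2.10:1.5.0'
    compileOnly 'org.apache.hadoop:hadoop-client:2.6.0'

    // Not provided by the cluster: must ship with the application jar.
    compile 'com.datastax.spark:spark-cassandra-connector_2.10:1.5.0'
}

On older Gradle versions the same effect is usually achieved with a third-party "provided" plugin, but the idea is the same: anything Hadoop or Spark already ships should not end up inside your application jar.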

 

Wilfred

Explorer
Posts: 11
Registered: ‎07-26-2016

Re: Spark distributed classpath

I understand this is an older post, but I am getting the same problem. Can you please share the solution if it was resolved for you?

Thanks