
Spark distributed classpath

Explorer

We have Spark installed via Cloudera Manager on a YARN cluster. There is a classpath.txt file in /etc/spark/conf that contains the list of jars that should be available on Spark's distributed classpath, and spark-env.sh seems to be the one exporting this configuration.

 

It is my understanding that Cloudera Manager creates the classpath.txt file. I would like to know how Cloudera Manager determines the list of jars that go into this file, and whether that is something that can be controlled through Cloudera Manager.

 

Thank you!

1 ACCEPTED SOLUTION

Super Collaborator

Yes, CM generates this as part of the gateway (client config). The classpath.txt file is generated by CM based on the dependencies that are defined in the deployment.

This is not something you can change.

 

As you can see in the upstream docs, we use a form of the Hadoop-free distribution, but we still only test this with CDH and its specific dependencies.

 

Does that explain what you are looking for?

 

Wilfred


18 REPLIES

Rising Star

Hi Wilfred,

I have a similar issue to andreF's: we have several different guava jars in /etc/spark/conf/classpath.txt. Do you know how to fix this?

 

/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-11.0.2.jar

/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-11.0.jar

/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-14.0.1.jar

/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-16.0.1.jar

 

Our app needs guava-16.0.1.jar, so I added guava-16.0.1.jar to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/ and added "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-16.0.1.jar" to /etc/spark/conf/classpath.txt.

 

However, it doesn't work; the Spark action in Oozie still cannot find guava-16.0.1.jar. How does classpath.txt work? Do you know how to manage or modify classpath.txt manually? Thanks!

 

 

 

Explorer

Hi zlpmichelle,

 

In my case the problem came from the fact that I wasn't using a CDH artifact in my Maven dependencies. If you package guava 16.0.1 into your jar, do you still get this problem?
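(For reference, below is a rough Gradle sketch of depending on a CDH-built Spark artifact instead of the stock Apache one; the repository URL, coordinates and version string are illustrative and should be verified against Cloudera's repository for your CDH release.)

// build.gradle (sketch only -- coordinates and versions are illustrative)
repositories {
    maven { url 'https://repository.cloudera.com/artifactory/cloudera-repos/' }
}

dependencies {
    // CDH 5.5.0 ships Spark 1.5.0; CDH artifacts carry a -cdh version suffix
    compile 'org.apache.spark:spark-core_2.10:1.5.0-cdh5.5.0'
}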

 

It is still a bit unclear to me how exactly classpath.txt works and why it mixes several versions of the same API 😞

Rising Star

Agreed, andreF. Thanks for your response.

 

I packaged guava 16.0.1 into my jar and still get the same problem. I guess there is a guava version conflict, since there are several different guava jars in CDH's classpath.txt.

 

I am really confused about this.

Super Collaborator

If you need a specific version of guava, you cannot just add it to the classpath. If you do, you rely entirely on the essentially random order in which the class loaders pick up classes; there is no guarantee that the proper version of guava will be loaded.

 

The first thing you need to do is make sure that the proper version of guava is loaded at all times. The proper way to do that is to shade (Maven) or shadow (Gradle) your guava dependency; check the web for the details. It is really the only way to make sure you get the correct version without breaking the rest of Hadoop at the same time.
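For illustration, relocating guava with the Gradle Shadow plugin could look roughly like the sketch below; the plugin version and the relocated package prefix (com.mycompany.shaded) are placeholders, not values taken from this thread.

// build.gradle (sketch)
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '1.2.3'
}

repositories {
    mavenCentral()
}

dependencies {
    compile 'com.google.guava:guava:16.0.1'
}

shadowJar {
    // rewrite guava's packages so that the guava 11/14 jars already on the
    // cluster classpath can never be loaded in place of 16.0.1
    relocate 'com.google.common', 'com.mycompany.shaded.com.google.common'
}

The jar produced by the shadowJar task is then the artifact you submit: your code resolves guava through the relocated package, while Hadoop keeps using its own copy.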

 

Once that is done, you need to use the classpath addition as discussed earlier and make sure that you add your shaded version.

 

This is the only way to do it without being vulnerable to changes in the Hadoop dependencies.

 

Wilfred

Rising Star

Thanks very much, Wilfred, for your helpful suggestion! We tried shadow in Gradle before, and at that time it still had the guava NoSuchMethod issue. Let's try shadow again, adding guava-16.0.1.jar to the Oozie sharelib, to see whether that works.

Rising Star

We used shadow (Gradle) to shadow guava-16.0.1 into sparktest.jar, put the shadowed sparktest.jar onto Oozie's sharelib classpath, and then ran it in CDH Oozie. It throws the following exception:


Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, org.jets3t.service.ServiceException: Request Error: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.ServiceException: Request Error: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:478)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at org.apache.hadoop.fs.s3native.$Proxy48.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:468)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:272)
at org.apache.hadoop.fs.Globber.glob(Globber.java:151)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1653)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
at org.apache.spark.rdd.RDD.count(RDD.scala:1121)
at com.gridx.spark.MeterReadingLoader$.load(MeterReadingLoader.scala:120)
at com.gridx.spark.MeterReadingLoader$.main(MeterReadingLoader.scala:101)
at com.gridx.spark.MeterReadingLoader.main(MeterReadingLoader.scala)

 

 

According to http://quabr.com/20613854/exception-when-getting-a-list-of-buckets-on-s3-using-jets3t, after changing httpclient from the 4.2.5 jar to the 4.2 jar in the HDFS Oozie sharelib, it throws the following exception:


JA008: File does not exist: hdfs://ip-10-0-4-248.us-west-1.compute.internal:8020/user/oozie/share/lib/lib_20151201085935/spark/httpclient-4.2.5.jar

Super Collaborator

You cannot just replace a file in HDFS and expect it to be picked up. The files are localised during the run, and there is a check to make sure that the files are where they should be. See the blog on how the sharelib works.

 

The OOTB version of Spark that we deliver with CDH does not throw the error that you show. It runs with the provided HTTP client, so I doubt that replacing the jar is the proper solution. Most likely a mismatch in one of the other jars is causing this error.

 

Wilfred

Rising Star

Thanks Wilfred.

 

This issue is from http://community.cloudera.com/t5/Batch-Processing-and-Workflow/how-to-add-external-guava-16-0-1-jar-...

 

Does that mean there is no way to have CDH 5.5.0 Hue/Oozie support a Spark action (Spark 1.5.0) that writes data into Cassandra 2.1.11?

Super Collaborator

You have most likely pulled in too many dependencies when you built your application. The Gradle documentation shows that it behaves differently from Maven: when you package up an application, Gradle includes far more dependencies than Maven does. This could have pulled in dependencies that you don't want or need.

Make sure that the application only contains what you really need and what is not already provided by Hadoop. Search for Gradle and dependency management; you need some way to define a "provided" scope in Gradle (see the sketch below).
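A rough sketch of that idea in a build.gradle follows; the configuration names depend on the Gradle version (compileOnly exists from Gradle 2.12 on, older builds need a "provided" configuration from a plugin), and the versions shown are placeholders.

// build.gradle (sketch)
dependencies {
    // compile against the cluster-provided APIs, but keep them out of the application jar
    compileOnly 'org.apache.spark:spark-core_2.10:1.5.0'
    compileOnly 'org.apache.hadoop:hadoop-client:2.6.0'

    // only what the cluster does not already ship gets packaged
    compile 'com.google.guava:guava:16.0.1'
}

That keeps the Spark and Hadoop jars out of the assembled application, so only your own code and its genuinely private dependencies end up in what Oozie submits.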

 

Wilfred