<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark distributed classpath in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31515#M34409</link>
    <description>&lt;P&gt;It should not pose a problem. If it does let us know but we have not seen an issue with this.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Wilfred&lt;/P&gt;</description>
    <pubDate>Thu, 03 Sep 2015 18:48:42 GMT</pubDate>
    <dc:creator>Wilfred</dc:creator>
    <dc:date>2015-09-03T18:48:42Z</dc:date>
    <item>
      <title>Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31320#M34404</link>
      <description>&lt;P&gt;We have Spark installed via Cloudera Manager on a YARN cluster. It appears there is a &lt;EM&gt;classpath.txt&lt;/EM&gt; file in &lt;EM&gt;/etc/spark/conf&lt;/EM&gt; that includes a list of jars that should be available on Spark's distributed classpath, and&amp;nbsp;&lt;EM&gt;spark-env.sh&amp;nbsp;&lt;/EM&gt;seems to be the one that exports this configuration.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It is my understanding that Cloudera Manager creates the&amp;nbsp;&lt;EM&gt;classpath.txt&lt;/EM&gt; file. I would like to know how Cloudera Manager determines the list of jars that go into this file, and whether it is something that can be controlled through Cloudera Manager.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:39:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31320#M34404</guid>
      <dc:creator>NT</dc:creator>
      <dc:date>2022-09-16T09:39:21Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31420#M34405</link>
      <description>&lt;P&gt;For adding custom classes to the classpath you should use one of the following two options:&lt;BR /&gt;- add them via the command line options&lt;BR /&gt;- add them via the config&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;For the driver you have the option to use: --driver-class-path /path/to/file&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Or for the executor use&lt;/P&gt;&lt;P&gt;--conf "spark.executor.extraClassPath=/path/to/jar"&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;In spark-defaults.conf set the two values (or one if you only need it for one side):&lt;BR /&gt;&amp;nbsp; spark.driver.extraClassPath&lt;BR /&gt;&amp;nbsp; spark.executor.extraClassPath&lt;/P&gt;&lt;P&gt;This can be done through the CM UI.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Depending on exactly what you are doing you might see limitations on which option you can use.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Wilfred&lt;/P&gt;</description>
      <pubDate>Tue, 01 Sep 2015 12:55:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31420#M34405</guid>
      <dc:creator>Wilfred</dc:creator>
      <dc:date>2015-09-01T12:55:50Z</dc:date>
    </item>
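The two approaches above can be sketched in one place. This is a minimal illustration, not a tested recipe: `/path/to/extra.jar`, `com.example.MyApp`, and `myapp.jar` are placeholders you would replace with your own paths and classes.

```shell
# 1) Per-job, on the spark-submit command line:
spark-submit \
  --driver-class-path /path/to/extra.jar \
  --conf "spark.executor.extraClassPath=/path/to/extra.jar" \
  --class com.example.MyApp myapp.jar

# 2) Cluster-wide, via spark-defaults.conf (editable through the CM UI):
#      spark.driver.extraClassPath   /path/to/extra.jar
#      spark.executor.extraClassPath /path/to/extra.jar
```

Note that the command-line `--conf` and the spark-defaults.conf entries set the same properties; the conf file just makes them the default for every job.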
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31422#M34406</link>
      <description>&lt;P&gt;Thank you for your response Wilfred. It certainly helps me. However, my question was more about understanding how the &lt;EM&gt;classpath.txt&amp;nbsp;&lt;/EM&gt;file shown below is created. Does CM create this file on all nodes, and is it something we can configure through CM?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;08:42:43 $ ll /etc/spark/conf/&lt;BR /&gt;total 60&lt;BR /&gt;drwxr-xr-x 3 root root 4096 &amp;nbsp; Aug 25 12:28 ./&lt;BR /&gt;drwxr-xr-x 3 root root 4096 &amp;nbsp; Aug 25 12:28 ../&lt;BR /&gt;-rw-r--r-- 1 root root 29228 Aug 25 12:28 classpath.txt&lt;BR /&gt;-rw-r--r-- 1 root root 21 &amp;nbsp; &amp;nbsp; &amp;nbsp; Aug 25 12:28 __cloudera_generation__&lt;BR /&gt;-rw-r--r-- 1 root root 550 &amp;nbsp; &amp;nbsp; Aug 25 12:28 log4j.properties&lt;BR /&gt;-rw-r--r-- 1 root root 800 &amp;nbsp; &amp;nbsp; Aug 25 12:28 spark-defaults.conf&lt;BR /&gt;-rw-r--r-- 1 root root 1122 &amp;nbsp; Aug 25 12:28 spark-env.sh&lt;BR /&gt;drwxr-xr-x 2 root root 4096 &amp;nbsp; Aug 25 12:28 yarn-conf/&lt;/P&gt;</description>
      <pubDate>Tue, 01 Sep 2015 13:44:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31422#M34406</guid>
      <dc:creator>NT</dc:creator>
      <dc:date>2015-09-01T13:44:51Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31424#M34407</link>
      <description>&lt;P&gt;Yes, CM generates this as part of the gateway (client config). The classpath text file is generated by CM based on the dependencies that are defined in the deployment.&lt;/P&gt;&lt;P&gt;This is not something you can change.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As you can see in the &lt;A href="http://spark.apache.org/docs/latest/hadoop-provided.html" target="_blank"&gt;upstream docs&lt;/A&gt;, we use a form of the Hadoop-free distribution, but we still only test this with CDH and its specific dependencies.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Does that explain what you are looking for?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Wilfred&lt;/P&gt;</description>
      <pubDate>Tue, 01 Sep 2015 14:35:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31424#M34407</guid>
      <dc:creator>Wilfred</dc:creator>
      <dc:date>2015-09-01T14:35:59Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31426#M34408</link>
      <description>&lt;P&gt;Thank you for the quick response; I really appreciate your help in clearing up my questions.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The answer was exactly what I was looking for. It is automated, and users cannot control the contents of the classpath.txt file.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Pardon my naive question, but can it pose a problem to have different versions of the same dependency on the classpath?&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;09:39:34 $ cat /etc/spark/conf/classpath.txt | grep jersey-server&lt;BR /&gt;/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p886.563/jars/jersey-server-1.9.jar&lt;BR /&gt;/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p886.563/jars/jersey-server-1.14.jar&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 01 Sep 2015 14:43:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31426#M34408</guid>
      <dc:creator>NT</dc:creator>
      <dc:date>2015-09-01T14:43:11Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31515#M34409</link>
      <description>&lt;P&gt;It should not pose a problem. If it does let us know but we have not seen an issue with this.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Wilfred&lt;/P&gt;</description>
      <pubDate>Thu, 03 Sep 2015 18:48:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31515#M34409</guid>
      <dc:creator>Wilfred</dc:creator>
      <dc:date>2015-09-03T18:48:42Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31516#M34410</link>
      <description>Thank you! That definitely helps.</description>
      <pubDate>Thu, 03 Sep 2015 18:51:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31516#M34410</guid>
      <dc:creator>NT</dc:creator>
      <dc:date>2015-09-03T18:51:47Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31906#M34411</link>
      <description>&lt;P&gt;Actually, I think I hit an issue related to the fact that classpath.txt contains multiple versions of the same jar.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The issue is related to this JIRA:&amp;nbsp;&lt;A href="https://issues.apache.org/jira/browse/SPARK-8332" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-8332&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;And in /etc/spark/conf/classpath.txt:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;-----------------------------&lt;/P&gt;&lt;P&gt;cat /etc/spark/conf/classpath.txt | grep jackson&lt;/P&gt;&lt;P&gt;/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-annotations-2.2.3.jar&lt;BR /&gt;/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-core-2.2.3.jar&lt;BR /&gt;/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-databind-2.2.3.jar&lt;BR /&gt;/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-annotations-2.3.0.jar&lt;BR /&gt;/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-core-2.3.1.jar&lt;BR /&gt;/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-databind-2.3.1.jar&lt;/P&gt;&lt;P&gt;-----------------------------&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Somehow the class loader is picking up version 2.2.3 of Jackson, in which the method handledType() of the class BigDecimalDeserializer does not exist.&lt;/P&gt;&lt;P&gt;Similar errors may appear for Jersey as well, since the API changed a bit between those versions.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there a way to solve this kind of issue properly?&lt;/P&gt;</description>
      <pubDate>Wed, 16 Sep 2015 13:21:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/31906#M34411</guid>
      <dc:creator>andreF</dc:creator>
      <dc:date>2015-09-16T13:21:25Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37875#M34412</link>
      <description>&lt;P&gt;Hi andreF,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a similar issue. Did you manage to fix it?&lt;/P&gt;</description>
      <pubDate>Thu, 25 Feb 2016 17:22:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37875#M34412</guid>
      <dc:creator>zlpmichelle</dc:creator>
      <dc:date>2016-02-25T17:22:40Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37876#M34413</link>
      <description>&lt;P&gt;Hi Wilfred,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a similar issue to andreF's: we have several different Guava jars in /etc/spark/conf/classpath.txt. Do you know how to fix this?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-11.0.2.jar&lt;/P&gt;&lt;P&gt;/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-11.0.jar&lt;/P&gt;&lt;P&gt;/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-14.0.1.jar&lt;/P&gt;&lt;P&gt;/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-16.0.1.jar&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Our app needs guava-16.0.1.jar, so I added guava-16.0.1.jar into /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/ and added "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-16.0.1.jar" to /etc/spark/conf/classpath.txt.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, it doesn't work; the Spark action in Oozie still cannot find guava-16.0.1.jar. How does classpath.txt work? Do you know how to manage or modify classpath.txt manually? Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 25 Feb 2016 17:30:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37876#M34413</guid>
      <dc:creator>zlpmichelle</dc:creator>
      <dc:date>2016-02-25T17:30:26Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37884#M34414</link>
      <description>&lt;P&gt;Hi zlpmichelle,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In my case the problem came from the fact that I wasn't using a CDH artifact in my Maven dependencies. If you package Guava 16.0.1 into your jar, do you still get this problem?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;It is still a bit obscure to me how exactly classpath.txt works and why it mixes several versions of the same API &lt;span class="lia-unicode-emoji" title=":disappointed_face:"&gt;😞&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 25 Feb 2016 18:07:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37884#M34414</guid>
      <dc:creator>andreF</dc:creator>
      <dc:date>2016-02-25T18:07:06Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37887#M34415</link>
      <description>&lt;P&gt;Agreed, andreF. Thanks for your response.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I packaged Guava 16.0.1 into my jar and it still hits the same problem. I guess there is a Guava version conflict, since there are several different Guava jars in CDH's classpath.txt.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm really confused about this.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 25 Feb 2016 18:43:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37887#M34415</guid>
      <dc:creator>zlpmichelle</dc:creator>
      <dc:date>2016-02-25T18:43:11Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37942#M34416</link>
      <description>&lt;P&gt;If you need a specific version of Guava, you cannot just add it to the classpath. If you do, you rely entirely on the randomness of the class loaders; there is no guarantee that the proper version of Guava gets loaded.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The first thing you need to do is make sure that the proper version of Guava is loaded at all times. The proper way to do this is to shade (Maven) or shadow (Gradle) your Guava. Check the web on how to do this. It is really the only way to make sure you get the correct version without breaking the rest of Hadoop at the same time.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;After that is done, use the classpath additions discussed earlier and make sure that you add your shaded version.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is the only way to do this without being vulnerable to changes in the Hadoop dependencies.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Wilfred&lt;/P&gt;</description>
      <pubDate>Fri, 26 Feb 2016 04:56:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37942#M34416</guid>
      <dc:creator>Wilfred</dc:creator>
      <dc:date>2016-02-26T04:56:40Z</dc:date>
    </item>
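The shading Wilfred describes is typically done with the maven-shade-plugin's relocation feature. A minimal sketch of a pom.xml fragment follows; the `myapp.shaded` prefix is an arbitrary example namespace, and the plugin version is only illustrative of that era:

```xml
<!-- Hypothetical pom.xml fragment: relocate Guava classes into a private
     package so the application's Guava 16 cannot clash with the older
     copies on the cluster classpath. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>myapp.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Relocation rewrites both the bundled Guava classes and the application's bytecode references to them, so at runtime the app resolves `myapp.shaded.com.google.common.*` while Hadoop keeps loading its own `com.google.common.*` untouched.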
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37945#M34417</link>
      <description>&lt;P&gt;Thanks very much, Wilfred, for your helpful suggestion! We tried shadowing in Gradle before; at that time it still had the Guava NoSuchMethodError issue. We'll try shadowing again, adding guava-16.0.1.jar to the Oozie sharelib, to see whether it works.&lt;/P&gt;</description>
      <pubDate>Fri, 26 Feb 2016 09:35:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/37945#M34417</guid>
      <dc:creator>zlpmichelle</dc:creator>
      <dc:date>2016-02-26T09:35:30Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/38122#M34418</link>
      <description>&lt;P&gt;We used the Gradle shadow plugin to shade guava-16.0.1 into sparktest.jar and put the shaded sparktest.jar into Oozie's sharelib classpath. When we run the shaded sparktest.jar in CDH Oozie, it throws the following exception:&lt;/P&gt;&lt;HR /&gt;&lt;P&gt;Failing Oozie Launcher, Main class &lt;SPAN class="error"&gt;[org.apache.oozie.action.hadoop.SparkMain]&lt;/SPAN&gt;, main() threw exception, org.jets3t.service.ServiceException:&lt;STRONG&gt; Request Error: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer&lt;/STRONG&gt;&lt;BR /&gt;org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.ServiceException: Request Error: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer&lt;BR /&gt;at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:478)&lt;BR /&gt;at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)&lt;BR /&gt;at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)&lt;BR /&gt;at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)&lt;BR /&gt;at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)&lt;BR /&gt;at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)&lt;BR /&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)&lt;BR /&gt;at java.lang.reflect.Method.invoke(Method.java:606)&lt;BR /&gt;at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)&lt;BR /&gt;at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)&lt;BR /&gt;at 
org.apache.hadoop.fs.s3native.$Proxy48.retrieveMetadata(Unknown Source)&lt;BR /&gt;at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:468)&lt;BR /&gt;at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)&lt;BR /&gt;at org.apache.hadoop.fs.Globber.doGlob(Globber.java:272)&lt;BR /&gt;at org.apache.hadoop.fs.Globber.glob(Globber.java:151)&lt;BR /&gt;at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1653)&lt;BR /&gt;at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)&lt;BR /&gt;at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)&lt;BR /&gt;at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)&lt;BR /&gt;at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)&lt;/P&gt;&lt;P&gt;at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)&lt;BR /&gt;at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)&lt;BR /&gt;at scala.Option.getOrElse(Option.scala:120)&lt;BR /&gt;at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)&lt;BR /&gt;at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)&lt;BR /&gt;at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)&lt;BR /&gt;at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)&lt;BR /&gt;at scala.Option.getOrElse(Option.scala:120)&lt;BR /&gt;at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)&lt;BR /&gt;at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)&lt;BR /&gt;at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)&lt;BR /&gt;at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)&lt;BR /&gt;at scala.Option.getOrElse(Option.scala:120)&lt;BR /&gt;at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)&lt;BR /&gt;at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)&lt;BR /&gt;at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)&lt;BR /&gt;at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)&lt;BR /&gt;at scala.Option.getOrElse(Option.scala:120)&lt;BR /&gt;at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)&lt;BR /&gt;at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)&lt;BR /&gt;at org.apache.spark.rdd.RDD.count(RDD.scala:1121)&lt;BR /&gt;at com.gridx.spark.MeterReadingLoader$.load(MeterReadingLoader.scala:120)&lt;BR /&gt;at com.gridx.spark.MeterReadingLoader$.main(MeterReadingLoader.scala:101)&lt;BR /&gt;at com.gridx.spark.MeterReadingLoader.main(MeterReadingLoader.scala)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;According to &lt;A href="http://quabr.com/20613854/exception-when-getting-a-list-of-buckets-on-s3-using-jets3t" target="_blank" rel="nofollow"&gt;http://quabr.com/20613854/exception-when-getting-a-list-of-buckets-on-s3-using-jets3t&lt;/A&gt;, after changing httpclient from the 4.2.5 jar to the 4.2 jar in the HDFS Oozie shared lib, it throws the following exception:&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;JA008: File does not exist: hdfs://ip-10-0-4-248.us-west-1.compute.internal:8020/user/oozie/share/lib/lib_20151201085935/spark/httpclient-4.2.5.jar&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 01 Mar 2016 05:07:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/38122#M34418</guid>
      <dc:creator>zlpmichelle</dc:creator>
      <dc:date>2016-03-01T05:07:37Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/38130#M34419</link>
      <description>&lt;P&gt;You cannot just replace a file in HDFS and expect it to be picked up. The files will be localised during the run, and there is a check to make sure that the files are where they should be. See the &lt;A href="http://blog.cloudera.com/blog/2014/05/how-to-use-the-sharelib-in-apache-oozie-cdh-5/" target="_blank"&gt;blog&lt;/A&gt; on how the sharelib works.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The OOTB version of Spark that we deliver with CDH does not throw the error that you show. It runs with the provided httpclient, so I doubt that replacing the jar is the proper solution. Most likely a mismatch in one of the other jars results in this error.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Wilfred&lt;/P&gt;</description>
      <pubDate>Tue, 01 Mar 2016 09:03:53 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/38130#M34419</guid>
      <dc:creator>Wilfred</dc:creator>
      <dc:date>2016-03-01T09:03:53Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/38238#M34420</link>
      <description>&lt;P&gt;Thanks Wilfred.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This issue is from &lt;A href="https://community.cloudera.com/t5/Batch-Processing-and-Workflow/how-to-add-external-guava-16-0-1-jar-in-CDH-oozie-classpath/m-p/37803#U37803" target="_self"&gt;http://community.cloudera.com/t5/Batch-Processing-and-Workflow/how-to-add-external-guava-16-0-1-jar-in-CDH-oozie-classpath/m-p/37803#U37803&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Does that mean there is no way to have CDH 5.5.0 Hue/Oozie support a Spark action (Spark 1.5.0) that writes data into Cassandra 2.1.11?&lt;/P&gt;</description>
      <pubDate>Thu, 03 Mar 2016 03:29:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/38238#M34420</guid>
      <dc:creator>zlpmichelle</dc:creator>
      <dc:date>2016-03-03T03:29:01Z</dc:date>
    </item>
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/38371#M34421</link>
      <description>&lt;P&gt;You have most likely pulled in too many dependencies when building your application. The Gradle documentation shows that it behaves differently from Maven: when packaging an application, Gradle includes far more dependencies than Maven does. This could have pulled in dependencies you don't want or need.&lt;/P&gt;&lt;P&gt;Make sure that the application contains only what you really need and what is not already provided by Hadoop. Search for Gradle dependency management; you need some way to define a "provided" scope in Gradle.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Wilfred&lt;/P&gt;</description>
      <pubDate>Mon, 07 Mar 2016 00:12:28 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/38371#M34421</guid>
      <dc:creator>Wilfred</dc:creator>
      <dc:date>2016-03-07T00:12:28Z</dc:date>
    </item>
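Combining the shadowing and "provided" advice from this thread, a Gradle build might look roughly like the sketch below. This is a hypothetical build.gradle for the era discussed; the plugin version, Spark coordinates, and the `myapp.shaded` relocation prefix are illustrative, and on Gradle versions before 2.12 (which introduced `compileOnly`) people typically emulated a provided scope with a custom configuration instead:

```groovy
// Hypothetical build.gradle sketch using the Shadow plugin.
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '1.2.3'
}

dependencies {
    // Bundled into the fat jar and relocated below:
    compile 'com.google.guava:guava:16.0.1'
    // Provided by the cluster at runtime -- keep it out of the fat jar:
    compileOnly 'org.apache.spark:spark-core_2.10:1.5.0'
}

shadowJar {
    // Relocate Guava so the shaded copy cannot clash with CDH's versions.
    relocate 'com.google.common', 'myapp.shaded.com.google.common'
}
```

The key point is that only the application's own dependencies end up in the shaded jar, while anything Hadoop or Spark already ships stays out of it.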
    <item>
      <title>Re: Spark distributed classpath</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/80077#M34422</link>
      <description>I understand this is an older post, but I am getting the same problem. Can you please share the solution if it was resolved for you?&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;BR /&gt;</description>
      <pubDate>Thu, 20 Sep 2018 04:16:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Spark-distributed-classpath/m-p/80077#M34422</guid>
      <dc:creator>SandeepP</dc:creator>
      <dc:date>2018-09-20T04:16:41Z</dc:date>
    </item>
  </channel>
</rss>

