
Spark distributed classpath

New Contributor

We have Spark installed via Cloudera Manager on a YARN cluster. It appears there is a classpath.txt file in /etc/spark/conf that includes the list of jars that should be available on Spark's distributed classpath, and spark-env.sh seems to be the one that exports this configuration.
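For reference, the export in spark-env.sh appears to look roughly like this (a paraphrased sketch, not the verbatim file):

# spark-env.sh (sketch): join classpath.txt into a colon-separated list
export SPARK_DIST_CLASSPATH=$(paste -sd: /etc/spark/conf/classpath.txt)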

 

It is my understanding that Cloudera Manager creates the classpath.txt file. I would like to know how Cloudera Manager determines the list of jars that go into this file, and whether this is something that can be controlled through Cloudera Manager.

 

Thank you!


18 REPLIES

Super Collaborator

To add custom classes to the classpath you should use one of the following two options:
- add them via the command-line options
- add them via the config

 

For the driver you have the option to use: --driver-class-path /path/to/file

 

Or for the executor use:

--conf "spark.executor.extraClassPath=/path/to/jar"


In spark-defaults.conf, set the two values (or just one if you only need it on one side):
  spark.driver.extraClassPath
  spark.executor.extraClassPath

This can be done through the CM UI.
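For example, a full submit command using both options might look like this (a sketch: com.example.MyApp, /path/to/mylib.jar, and myapp.jar are placeholders, not names from this thread):

# pass the extra jar to both the driver and the executors
spark-submit --class com.example.MyApp \
  --driver-class-path /path/to/mylib.jar \
  --conf "spark.executor.extraClassPath=/path/to/mylib.jar" \
  myapp.jar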

 

Depending on exactly what you are doing, you may run into limitations on which option you can use.

 

Wilfred

New Contributor

Thank you for your response, Wilfred. It certainly helps. However, my question was more about understanding how the classpath.txt file shown below is created. Does CM create this file on all nodes, and is it something we can configure through CM?

 

08:42:43 $ ll /etc/spark/conf/
total 60
drwxr-xr-x 3 root root  4096 Aug 25 12:28 ./
drwxr-xr-x 3 root root  4096 Aug 25 12:28 ../
-rw-r--r-- 1 root root 29228 Aug 25 12:28 classpath.txt
-rw-r--r-- 1 root root    21 Aug 25 12:28 __cloudera_generation__
-rw-r--r-- 1 root root   550 Aug 25 12:28 log4j.properties
-rw-r--r-- 1 root root   800 Aug 25 12:28 spark-defaults.conf
-rw-r--r-- 1 root root  1122 Aug 25 12:28 spark-env.sh
drwxr-xr-x 2 root root  4096 Aug 25 12:28 yarn-conf/

 

 

 

Super Collaborator

Yes, CM generates this as part of the gateway (client config). The classpath.txt file is generated by CM based on the dependencies that are defined in the deployment.

This is not something you can change.

 

As you can see in the upstream docs, we use a form of the Hadoop-free distribution, but we still only test it with CDH and its specific dependencies.

 

Does that explain what you are looking for?

 

Wilfred

New Contributor

Thank you for the quick response; I really appreciate your help in clearing up my questions.

 

The answer was exactly what I was looking for. It is automated, and users cannot control the contents of the classpath.txt file.

 

Pardon my naive question, but can having different versions of the same dependency on the classpath pose a problem?

 

Example:

 

09:39:34 $ cat /etc/spark/conf/classpath.txt | grep jersey-server
/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p886.563/jars/jersey-server-1.9.jar
/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p886.563/jars/jersey-server-1.14.jar

 

Super Collaborator

It should not pose a problem. If it does, let us know, but we have not seen an issue with this.

 

Wilfred

New Contributor
Thank you! That definitely helps.

New Contributor

Actually, I think I hit an issue related to the fact that classpath.txt contains multiple versions of the same jar:

 

The issue is related to this JIRA: https://issues.apache.org/jira/browse/SPARK-8332

 

And in /etc/spark/conf/classpath.txt:

 

-----------------------------

cat /etc/spark/conf/classpath.txt | grep jackson

/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-annotations-2.2.3.jar
/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-core-2.2.3.jar
/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-databind-2.2.3.jar
/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-annotations-2.3.0.jar
/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-core-2.3.1.jar
/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/jackson-databind-2.3.1.jar

-----------------------------

 

Somehow the classloader is picking up version 2.2.3 of Jackson, in which the method handledType() of the class BigDecimalDeserializer does not exist.

Similar errors may appear for Jersey as well, since the API changed a bit between those versions.
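As a quick diagnostic (a generic sketch, not specific to this setup), you can scan the jars listed in classpath.txt to see which of them bundle the class in question:

# print every jar on the distributed classpath that contains the class
for j in $(cat /etc/spark/conf/classpath.txt); do
  unzip -l "$j" 2>/dev/null | grep -q 'BigDecimalDeserializer.class' && echo "$j"
done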

 

Is there a way to solve this kind of issue properly?

Explorer

Hi andreF,

 

I have a similar issue. Did you manage to fix it?

Explorer
I understand this is an older post, but I am getting the same problem. Can you please share the solution if it was resolved for you?

Thanks

Explorer

Hi Wilfred,

 

 

I have a similar issue to andreF's. We have several different Guava versions in /etc/spark/conf/classpath.txt. Do you know how to fix this?

 

/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-11.0.2.jar

/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-11.0.jar

/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-14.0.1.jar

/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-16.0.1.jar

 

Our app needs to use guava-16.0.1.jar, so I added guava-16.0.1.jar to /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/ and added "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/jars/guava-16.0.1.jar" to /etc/spark/conf/classpath.txt.

 

However, it doesn't work; the Spark action in Oozie still cannot find guava-16.0.1.jar. How does classpath.txt work? Do you know how to manage or modify classpath.txt manually? Thanks!

 

 

 

New Contributor

Hi zlpmichelle,

 

This problem came from the fact that I wasn't using a CDH artifact in my Maven dependencies. If you package Guava 16.0.1 into your jar, do you still get this problem?

 

It is still a bit unclear to me how exactly classpath.txt works and why it mixes several versions of the same API 😞

Explorer

Agreed, andreF. Thanks for your response.

 

I packaged Guava 16.0.1 into my jar and still get the same problem. I guess there is a Guava version conflict, since there are several different Guava versions in CDH's classpath.txt.

 

I am really confused about this.

Super Collaborator

If you need a specific version of Guava, you cannot just add it to the classpath. If you do, you are relying entirely on the randomness of the class loaders; there is no guarantee that the proper version of Guava will be loaded.

 

The first thing you need to do is make sure that the proper version of Guava is loaded at all times. The proper way to do this is to shade (Maven) or shadow (Gradle) your Guava dependency; check the web for how to do this. It is really the only way to make sure you get the correct version without breaking the rest of Hadoop at the same time.
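For illustration, a relocation with the Gradle Shadow plugin might look roughly like this (a sketch assuming the com.github.johnrengelman.shadow plugin is on the buildscript classpath; the shaded package prefix is just an illustrative choice):

// build.gradle (sketch)
apply plugin: 'com.github.johnrengelman.shadow'

shadowJar {
    // rewrite Guava's packages so they cannot collide with the
    // Guava versions already on the cluster classpath
    relocate 'com.google.common', 'shaded.com.google.common'
}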

 

After that is done, you need to use the classpath additions discussed earlier and make sure that you add your shaded version.

 

This is the only way to do it without being vulnerable to changes in the Hadoop dependencies.

 

Wilfred

Explorer

Thanks very much, Wilfred, for your helpful suggestion! We tried Shadow in Gradle before, and at that time we still had the Guava NoSuchMethodError issue. Let's try Shadow again, adding guava-16.0.1.jar to the Oozie sharelib, to see whether it works.

Explorer

We used Shadow (Gradle) to shadow guava-16.0.1 in sparktest.jar, put the shadowed sparktest.jar onto Oozie's sharelib classpath, and then ran it in CDH Oozie. It throws the following exception:


Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, org.jets3t.service.ServiceException: Request Error: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.ServiceException: Request Error: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:478)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at org.apache.hadoop.fs.s3native.$Proxy48.retrieveMetadata(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:468)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:272)
at org.apache.hadoop.fs.Globber.glob(Globber.java:151)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1653)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)

at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
at org.apache.spark.rdd.RDD.count(RDD.scala:1121)
at com.gridx.spark.MeterReadingLoader$.load(MeterReadingLoader.scala:120)
at com.gridx.spark.MeterReadingLoader$.main(MeterReadingLoader.scala:101)
at com.gridx.spark.MeterReadingLoader.main(MeterReadingLoader.scala)

 

 

According to http://quabr.com/20613854/exception-when-getting-a-list-of-buckets-on-s3-using-jets3t, after changing httpclient from the 4.2.5 jar to the 4.2 jar in the HDFS Oozie sharelib, it throws the following exception:


JA008: File does not exist: hdfs://ip-10-0-4-248.us-west-1.compute.internal:8020/user/oozie/share/lib/lib_20151201085935/spark/httpclient-4.2.5.jar

Super Collaborator

You cannot just replace a file in HDFS and expect it to be picked up. The files are localised during the run, and there is a check to make sure that the files are where they should be. See the blog on how the sharelib works.
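For example, after changing sharelib contents you would normally refresh the sharelib metadata via the Oozie CLI (a sketch; the Oozie URL is a placeholder):

# tell Oozie to pick up the updated sharelib
oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate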

 

The OOTB version of Spark that we deliver with CDH does not throw the error that you show. It runs with the provided HTTP client, so I doubt that replacing the jar is the proper solution. The error is most likely due to a mismatch in one of the other jars.

 

Wilfred

Explorer

Thanks Wilfred.

 

This issue is from http://community.cloudera.com/t5/Batch-Processing-and-Workflow/how-to-add-external-guava-16-0-1-jar-...

 

Does that mean there is no way to have CDH 5.5.0 Hue/Oozie support a Spark action (Spark 1.5.0) that writes data into Cassandra 2.1.11?

Super Collaborator

You have most likely pulled in too many dependencies when building your application. The Gradle documentation shows that it behaves differently from Maven: when you package an application, Gradle includes far more dependencies than Maven does. This could have pulled in dependencies that you don't want or need.

Make sure that the application contains only what you really need and what is not already provided by Hadoop. Search for Gradle dependency management; you need some way to define a "provided" scope in Gradle, as sketched below.
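For illustration, one way to approximate a "provided" scope (a sketch; newer Gradle versions ship a built-in compileOnly configuration, while older ones needed a plugin or a custom configuration, and the artifact coordinates here are placeholders):

// build.gradle (sketch)
dependencies {
    // cluster-provided jars: needed to compile, but not packaged
    compileOnly 'org.apache.spark:spark-core_2.10:1.5.0'
    // real application dependencies do get packaged (and shaded if needed)
    compile 'com.google.guava:guava:16.0.1'
}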

 

Wilfred
