Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to provide a different dependency for RDD in spark?

Solved Go to solution
Highlighted

How to provide a different dependency for RDD in spark?

New Contributor

I have an application jar with the dependency okhttp 3.10.0, which has a dependency on okio 1.14.0.

Spark2 client provides Okhttp 2.14.0 and okio 1.4.0 in the /usr/hdp/current/spark2-client/jars directory. When running the spark application, I receive the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.lang.NoSuchMethodError: okio.BufferedSource.readUtf8LineStrict(J)Ljava/lang/String; at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215) at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189) at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) at okhttp3.RealCall.execute(RealCall.java:77)

....
at org.apache.spark.rdd.RDD$anonfun$foreachPartition$1$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$anonfun$foreachPartition$1$anonfun$apply$29.apply(RDD.scala:926) at org.apache.spark.SparkContext$anonfun$runJob$5.apply(SparkContext.scala:2069) at org.apache.spark.SparkContext$anonfun$runJob$5.apply(SparkContext.scala:2069) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

And:
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1517) at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1505) at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1504) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504) at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676) at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094) at org.apache.spark.rdd.RDD$anonfun$foreachPartition$1.apply(RDD.scala:926) at org.apache.spark.rdd.RDD$anonfun$foreachPartition$1.apply(RDD.scala:924) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:362) at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)

This occurs even when I bring in the two dependencies during spark-submit with the following flag:

--packages com.squareup.okhttp3:okhttp:3.10.0

Any suggestions? I would like to be able to resolve these dependency conflicts for not just OkHttp but other classes that I bring into the spark application.

1 ACCEPTED SOLUTION

Accepted Solutions

Re: How to provide a different dependency for RDD in spark?

New Contributor

The solution was to use Maven Shade - as seen originally - and use the relocate classes option.

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <relocations>
            <relocation>
                <pattern>okio</pattern>
                <shadedPattern>com.shaded.okio</shadedPattern>
            </relocation>
        </relocations>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
    </configuration>
</plugin>

Make sure to use the relocation option on the package name of the classes which conflict with the ones present in the /jars directory in Spark. This will create a 'private copy' of this dependency for your application, with no potential of interference with the underlying spark dependencies. Only watch out for making your spark application too large if you add too many dependencies like this.

7 REPLIES 7

Re: How to provide a different dependency for RDD in spark?

New Contributor

This is the maven shade plugin that is being used for the project:

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration> </plugin>

Re: How to provide a different dependency for RDD in spark?

Super Guru

I recommend you build a uber jar. You can leave out spark packages from that jar as those libs will be provided during run time.

<dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.binary.version}</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
      <exclusions>
...

      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.2.1</version>
        <configuration>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id> 
                <phase>package</phase>
                <goals>
                    <goal>single</goal>
                </goals>
            </execution>
        </executions>
      </plugin>   

Re: How to provide a different dependency for RDD in spark?

New Contributor

Hello Sunile Manjee (@sunile.manjee),

Thank you for reaching out!

We are already building an uber jar using the shade plugin for maven. That is what I posted in a comment above, sorry for the confusion!

We resolved the error. However, the method in which we resolved the error does not seem to be advisable. In the /usr/hdp/current/spark2-client/jars directory exists two jars, okhttp-2.4.0.jar and okio.1.4.0.jar. Our uber jar includes the dependencies okhttp3-3.10.0.jar and okio.1.14.0.jar. Looking at the stack trace above, you will notice that the error is thrown when accessing okio classes. What we realized is that okio.1.14.0.jar conlicts with the jar in /usr/hdp/current/spark2-client/jars, which is okio.1.4.0.jar (okhttp-3.10.0.jar does not because the dependency name is changed). Therefore, we concluded that the jars in /usr/hdp/current/spark2-client/jars are being preferred over the classes provided in our uber jar.

The solution that we used, which again is likely not a good one, was to move the okio-1.4.0.jar out of the /usr/hdp/current/spark2-client/jars directory. After this, spark-submit was using our dependency, okio-1.14.0.jar, and the application ran successfully as intended.

So my follow-up question is: What is the intended way to override classes found in /usr/hdp/current/spark2-client/jars? Is there a way in which we can prioritize our own dependencies for use over the ones provided in this directory?

Thank you very much for your time,

-Alex

Re: How to provide a different dependency for RDD in spark?

Super Guru

Take a look at

spark.driver.userClassPathFirst

and spark.executor.userClassPathFirst

https://spark.apache.org/docs/latest/configuration.html

Re: How to provide a different dependency for RDD in spark?

New Contributor

I attempted setting those configuration settings to true, but now I receive the following error:
Exception in thread "main" org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "hdfs" at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3266) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3286) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3337) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3305) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:467) at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1859) at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:68) at org.apache.spark.SparkContext.<init>(SparkContext.scala:529) at

...

sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$runMain(SparkSubmit.scala:782) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

What I found was suggested was here
https://stackoverflow.com/questions/17265002/hadoop-no-filesystem-for-scheme-file

But this solution did not work.

Re: How to provide a different dependency for RDD in spark?

New Contributor

Additionally, that feature appears to be for cluster mode, but we are intending to run in YARN mode. If this feature is only on cluster mode then it seems like it will be at a loss if we decided to run using this fix, as the changes would not be used.

Re: How to provide a different dependency for RDD in spark?

New Contributor

The solution was to use Maven Shade - as seen originally - and use the relocate classes option.

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <relocations>
            <relocation>
                <pattern>okio</pattern>
                <shadedPattern>com.shaded.okio</shadedPattern>
            </relocation>
        </relocations>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
    </configuration>
</plugin>

Make sure to use the relocation option on the package name of the classes which conflict with the ones present in the /jars directory in Spark. This will create a 'private copy' of this dependency for your application, with no potential of interference with the underlying spark dependencies. Only watch out for making your spark application too large if you add too many dependencies like this.

Don't have an account?
Coming from Hortonworks? Activate your account here