
Spark 2.0 App not working on cluster

Explorer

Hi all

 

We have Spark 2.0 (*) installed from the Cloudera parcel on our cluster (CDH 5.9.0).

When running a fairly simple app that just reads in some CSV files and does a groupBy, I always receive errors.

The App is submitted with:

spark2-submit --class my_class myapp-1.0-SNAPSHOT.jar

And I receive the following error message:

java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateFormat; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 1

I figured out that there are multiple versions of commons-lang3 installed with the Cloudera release, so I modified the spark2-submit call to:

spark2-submit --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true --jars /var/opt/teradata/cloudera/parcels/CDH/jars/commons-lang3-3.3.2.jar --class my_class myapp-1.0-SNAPSHOT.jar

This way I could get rid of the first error message, but now I get:

java.lang.ClassCastException: cannot assign instance of org.apache.commons.lang3.time.FastDateFormat to field org.apache.spark.sql.execution.datasources.csv.CSVOptions.dateFormat of type org.apache.commons.lang3.time.FastDateFormat in instance of org.apache.spark.sql.execution.datasources.csv.CSVOptions
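
If I understand this correctly, the second error appears because userClassPathFirst loads FastDateFormat through a child-first classloader while Spark's own CSVOptions class still references the copy loaded by Spark's classloader, and the same class loaded by two different classloaders is never assignment compatible.

For reference, the duplicate copies of commons-lang3 can be listed straight from the parcel; a minimal check, assuming the parcel root from the spark2-submit call above (adjust the path for your installation):

# List every commons-lang3 jar shipped under the parcel root; more than one
# version here is what produces the serialVersionUID mismatch above.
find /var/opt/teradata/cloudera/parcels -name 'commons-lang3-*.jar'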

The app was written in Scala and compiled with Maven. The source code (**) and the Maven POM file (***) are attached at the bottom of this post.

Does anybody have an idea on solving this issue?

Any help is highly appreciated!

 

Thanks a lot in advance!

Kind Regards

 

(*)

$spark2-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0.cloudera1
      /_/

Branch HEAD
Compiled by user jenkins on 2016-12-06T18:34:13Z
Revision 2389f44e0185f33969d782ed09b41ae45fe30324

(**)

import org.apache.spark.sql.SparkSession

object my_class {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("myapp")
      .getOrCreate()

    // Read all CSV files in the folder; no header row, so columns are auto-named _c0, _c1, ...
    val csv = spark.read.option("header", value = false).csv("/path/to/folder/with/some/csv/files/")

    // Group by the first column and count the rows per key
    val pivot = csv.groupBy("_c0").count()

    csv.take(10).foreach(println)
    pivot.take(10).foreach(println)
    spark.stop()
  }
}

(***)

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>de.lht.datalab.ingestion</groupId>
    <artifactId>myapp</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <scala.version.base>2.11</scala.version.base>
        <scala.version>${scala.version.base}.8</scala.version>
        <spark.version>2.0.0.cloudera1</spark.version>
    </properties>

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version.base}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>


    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>
ACCEPTED SOLUTION

Master Collaborator

This is due to a mismatch between the version of commons-lang3 your application uses and the one Spark itself uses. See https://issues.apache.org/jira/browse/ZEPPELIN-1977 for an example of the same problem.

I believe you'll find that it's resolved in the latest Spark 2 release for CDH.

http://community.cloudera.com/t5/Community-News-Release/ANNOUNCE-Spark-2-0-Release-2/m-p/51464#M161


REPLIES

Explorer

Thanks a lot.

With the workaround given at the end of the Zeppelin issue, it works for me now.

 

New Contributor

What is the solution? (I do not have an enterprise account, and we may not be able to upgrade the cluster soon enough.)

New Contributor

I am using Spark 2.4.0 on CDH 6.3.4 and hit the same ClassCastException, this time in org.apache.spark.sql.catalyst.csv.CSVOptions:

 

Caused by: java.lang.ClassCastException: cannot assign instance of org.apache.commons.lang3.time.FastDateFormat to field org.apache.spark.sql.catalyst.csv.CSVOptions.dateFormat of type org.apache.commons.lang3.time.FastDateFormat in instance of org.apache.spark.sql.catalyst.csv.CSVOptions
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2371)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2365)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2365)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2365)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2365)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:482)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:440)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Finally I was able to resolve the issue. I was using org.apache.spark:spark-core_2.11:jar:2.4.0-cdh6.3.4:provided. Even though it is declared as provided, it pulls in some of its transitive dependencies at compile scope; org.apache.commons:commons-lang3:jar:3.7 is one of them. Because of that, commons-lang3 gets packaged inside your fat jar and conflicts with the copy provided by the platform.
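
Which transitive dependencies leak in at compile scope can be checked with the Maven dependency plugin; an illustrative invocation (the include filter is just an example):

# Show where commons-lang3 comes from and at which scope it is resolved.
mvn dependency:tree -Dincludes=org.apache.commons:commons-lang3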

 

Therefore I explicitly forced the scope of a few of the jars to provided, as listed below:

  1. org.apache.commons:commons-lang3:3.7
  2. org.apache.zookeeper:zookeeper:3.4.5-cdh6.3.4
  3. io.dropwizard.metrics:metrics-core:3.1.5
  4. com.fasterxml.jackson.core:jackson-databind:2.9.10.6
  5. org.apache.commons:commons-crypto:1.0.0

By doing this, the application is forced to use the commons-lang3 jar provided by the platform.
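
After rebuilding, it is easy to verify that the class is no longer bundled; a quick check, with the fat jar name as a placeholder:

# The grep should print nothing once commons-lang3 is no longer packaged.
jar tf target/your-fat-jar.jar | grep 'org/apache/commons/lang3'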

 

POM snippet to solve the issue:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.core.version}</version>
    <scope>provided</scope>
</dependency>
<!-- Declaring the following dependencies explicitly as provided, since spark-core does not declare them as provided -->
<!-- Start -->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.7</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.zookeeper</groupId>
    <artifactId>zookeeper</artifactId>
    <version>3.4.5-cdh6.3.4</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>io.dropwizard.metrics</groupId>
    <artifactId>metrics-core</artifactId>
    <version>3.1.5</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.9.10.6</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-crypto</artifactId>
    <version>1.0.0</version>
    <scope>provided</scope>
</dependency>
<!-- End -->