Support Questions


Spark 2.x + Tika: java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect

New Contributor

Hi,

 

I am trying to resolve a runtime classpath issue for a spark-submit job that parses documents with Apache Tika (> v1.14). The problem appears to be a conflict between the classpath spark-submit provides and the one bundled in my uber-jar.

 

Platforms: CDH 5.15 (with Spark 2.3 added per the CDH docs) and CDH 6 (with the Spark 2.2 bundled in CDH 6)

 

I've tried / reviewed:

(Cloudera) Where does spark-submit look for Jar files?

(stackoverflow) resolving-dependency-problems-in-apache-spark

 

Highlights:

  • Java 8 / Scala 2.11
  • I'm building an uber-jar and calling that uber-jar via spark-submit
  • I've tried adding --jars option to spark-submit call (see further down in this post)
  • I've tried adding --conf spark.driver.userClassPathFirst=true && --conf spark.executor.userClassPathFirst=true to spark-submit call (see further down in this post):
$ spark-submit --master local[*] --class com.example.App --conf spark.executor.userClassPathFirst=true ./target/uber-tikaTest-1.19.jar

18/09/25 13:35:55 ERROR util.Utils: Exception encountered
java.lang.NullPointerException
	at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:72)
	at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:70)
	at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:70)
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1307)
	at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2136)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
	at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
(the same NullPointerException and stack trace are logged a second time at 18/09/25 13:35:55)

 

Below the runtime error message are files for:

  • build-and-run.sh script (calls spark-submit; notes about the options are included)
  • sample app
  • pom.xml
  • mvn dependency tree output (which shows the "missing" commons-compress library is in fact included in the uber-jar)
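(A sanity-check sketch worth doing before digging into the files: confirm the shaded jar really contains the commons-compress classes. On the real project the check is just `unzip -l target/uber-tikaTest-1.19.jar | grep ArchiveStreamFactory`; the snippet below builds a stand-in "jar" with the same class path so the pattern is reproducible end to end.)

```shell
# Build a stand-in uber-jar (a zip containing the class path we care about);
# the class file here is empty -- only the listing pattern matters.
mkdir -p demo/org/apache/commons/compress/archivers
: > demo/org/apache/commons/compress/archivers/ArchiveStreamFactory.class
(cd demo && python3 -m zipfile -c ../uber-demo.jar org)
# List the jar and grep for the class named in the NoSuchMethodError:
python3 -m zipfile -l uber-demo.jar | grep ArchiveStreamFactory
```

If the class is present in the uber-jar but the error still occurs, the JVM is resolving the class from somewhere else on the classpath.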

 

The error at runtime:

18/09/25 11:47:39 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
	at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:160)
	at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:104)
	at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
	at com.example.App$.tikaAutoDetectParser(App.scala:55)
	at com.example.App$$anonfun$1.apply(App.scala:69)
	at com.example.App$$anonfun$1.apply(App.scala:69)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1799)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
18/09/25 11:47:39 ERROR executor.Executor: Exception in task 5.0 in stage 0.0 (TID 5)
java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
	(identical stack trace to task 1.0 above)
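(For reference, a variant of the spark-submit call I could try using extraClassPath instead of userClassPathFirst, which prepends only the one conflicting jar rather than inverting the whole classpath -- this is a sketch, not something I've run; the jar path is a placeholder:)

```shell
# Sketch: prepend just commons-compress 1.18 to the driver and executor
# classpaths. /path/to/commons-compress-1.18.jar is a placeholder; in client
# mode the driver path must be a local path on the submitting machine, while
# the executor path can name the file shipped via --jars.
spark-submit --class com.example.App \
  --master yarn \
  --conf spark.driver.extraClassPath=/path/to/commons-compress-1.18.jar \
  --conf spark.executor.extraClassPath=commons-compress-1.18.jar \
  --jars /path/to/commons-compress-1.18.jar \
  ./target/uber-tikaTest-1.19.jar
```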

 

 

build-and-run.sh:

Notes:

  • I've tried adding the --conf flags for userClassPathFirst to both the local-master and yarn spark-submit calls below,
  • using the --jars flag to specify the uber-jar built from the pom.xml (provided further down in the post)

 

#!/bin/bash
# NOTE: the shade plugin is bound to the package phase, so "mvn compile" alone
# will not (re)build the uber-jar -- use "mvn package".
mvn package

if true
then
spark-submit --master local[*] --class com.example.App ./target/uber-tikaTest-1.19.jar
fi

# tried using the userClassPathFirst flags for driver and executor in both
# spark-submit calls (above and below):
# --conf spark.driver.userClassPathFirst=true \
# --conf spark.executor.userClassPathFirst=true \

if false 
then
spark-submit --class com.example.App \
 --master yarn \
 --packages org.apache.commons:commons-compress:1.18 \
 --jars ./target/uber-tikaTest-1.19.jar \
 --num-executors 2 \
 --executor-memory 1024m \
 --executor-cores 2 \
 --driver-memory 2048m \
 --driver-cores 1 \
 ./target/uber-tikaTest-1.19.jar
fi

 

Sample App:

 

package com.example
////////// Tika imports
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.sax.BodyContentHandler
////////// Java HTTP imports
import java.net.{HttpURLConnection, URL}
import scala.collection.JavaConverters._
import scala.collection.mutable._
////////// Spark imports
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession


object App {
  case class InputStreamData(sourceURL: String, headerFields: Map[String, List[String]], inputStream: java.io.InputStream)

  def openUrlStream(sourceURL: String, apiKey: String): InputStreamData = {
    try {
      val url = new URL(sourceURL)
      val urlConnection = url.openConnection().asInstanceOf[HttpURLConnection]
      urlConnection.setInstanceFollowRedirects(true)
      val headerFields = urlConnection.getHeaderFields()
      val input = urlConnection.getInputStream()
      InputStreamData(sourceURL, headerFields.asScala.map(x => (x._1, x._2.asScala.toList)), input)
    } catch {
      case e: Exception =>
        println("**********************************************************************************************")
        println("PARSEURL: INVALID URL: " + sourceURL)
        println(e.toString())
        println("**********************************************************************************************")
        InputStreamData(sourceURL, Map("ERROR" -> List("ERROR")), null)
    }
  }

  def tikaAutoDetectParser(inputStream: java.io.InputStream): String = {
    val parser = new AutoDetectParser()
    val handler = new BodyContentHandler(-1)
    val metadata = new Metadata()
    parser.parse(inputStream, handler, metadata)
    handler.toString
  }

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("tika-1.19-test")
    val sc = new SparkContext(sparkConf)
    val spark = SparkSession.builder.config(sparkConf).getOrCreate()
    println("HELLO!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
    val urls = List("http://www.pdf995.com/samples/pdf.pdf", "https://www.amd.com/en", "http://jeroen.github.io/images/testocr.png")

    val rdd = sc.parallelize(urls)
    val parsed = rdd.map(x => tikaAutoDetectParser(openUrlStream(x, "").inputStream))
    println(parsed.count)
  }
}

 

 

pom.xml (builds uber-jar):

 

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>tikaTest</artifactId>
  <version>1.19</version>
  <name>${project.artifactId}</name>
  <description>Testing tika 1.19 with CDH 6 and 5.x, Spark 2.x, Scala 2.11.x</description>
  <inceptionYear>2018</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>


 <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

<profiles>
	<profile>
		<id>scala-2.11.12</id>
		<activation>
			<activeByDefault>true</activeByDefault>
		</activation>
		<properties>
			<scalaVersion>2.11.12</scalaVersion>
			<scalaBinaryVersion>2.11</scalaBinaryVersion>
		</properties>
		<dependencies>
			<!-- ************************************************************************** -->
			<!-- GOOD DEPENDENCIES +++++++++++++++++++++++++++++++++++++ -->
			<!-- ************************************************************************** -->

			<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-compress -->
			<dependency>
   				<groupId>org.apache.commons</groupId>
   				<artifactId>commons-compress</artifactId>
   				<version>1.18</version>
			</dependency>
						
			<!-- *************** CDH flavored dependencies ***********************************************-->
			<!-- https://www.cloudera.com/documentation/spark2/latest/topics/spark2_packaging.html#versions -->
			<dependency>
   				<groupId>org.apache.spark</groupId>
   				<artifactId>spark-core_2.11</artifactId>
   				<version>2.2.0.cloudera3</version>
  				<!-- have tried scope provided / compile -->
   				<!--<scope>provided</scope>-->
			</dependency>
			<dependency>
    				<groupId>org.apache.spark</groupId>
    				<artifactId>spark-sql_2.11</artifactId>
    				<version>2.2.0.cloudera3</version>
    				<!-- have tried scope provided / compile -->
    				<!--<scope>provided</scope>-->
				</dependency>
												
				<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
				<dependency>
				    <groupId>org.apache.tika</groupId>
    				<artifactId>tika-core</artifactId>
    				<version>1.19</version>
				</dependency>
				
				<!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
				<dependency>
					<groupId>org.apache.tika</groupId>
					<artifactId>tika-parsers</artifactId>
					<version>1.19</version>
				</dependency>
				
				<!-- https://mvnrepository.com/artifact/javax.ws.rs/javax.ws.rs-api -->
				<dependency>
    				<groupId>javax.ws.rs</groupId>
    				<artifactId>javax.ws.rs-api</artifactId>
    				<version>2.1.1</version>
    			</dependency>
        
			<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
				<dependency>
    				<groupId>org.scala-lang</groupId>
    				<artifactId>scala-library</artifactId>
    				<version>2.11.12</version>
				</dependency>
				
			
			<!-- **************************************************************************************************************************
			**************************** alternative dependencies that have been tried and yield same Tika error***************************
			*******************************************************************************************************************************-->
			<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
				<!--
				<dependency>
    				<groupId>org.apache.spark</groupId>
    				<artifactId>spark-core_2.11</artifactId>
    				<version>2.2.0</version>
				</dependency>
				-->
				
			<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
				<!--
				<dependency>
    				<groupId>org.apache.spark</groupId>
    				<artifactId>spark-sql_2.11</artifactId>
    				<version>2.2.0</version>
				</dependency>
				-->
			
			</dependencies>
		</profile>
	</profiles>

  
	<build>
		<sourceDirectory>src/main/scala</sourceDirectory>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.5.1</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
				</configuration>
			</plugin>
			<plugin>
				<groupId>net.alchim31.maven</groupId>
				<artifactId>scala-maven-plugin</artifactId>
				<version>3.2.2</version>
				<executions>
					<execution>
						<goals>
							<goal>compile</goal>
							<goal>testCompile</goal>
						</goals>
					</execution>
				</executions>
				<configuration>
					<args>
						<!-- work-around for https://issues.scala-lang.org/browse/SI-8358 -->
						<arg>-nobootcp</arg>
					</args>
				</configuration>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-shade-plugin</artifactId>
				<version>3.1.1</version>
				<executions>
					<execution>
						<phase>package</phase>
						<goals>
							<goal>shade</goal>
						</goals>
					</execution>
				</executions>
				<configuration>
					<filters>
						<filter>
							<artifact>*:*</artifact>
							<excludes>
								<exclude>META-INF/*.SF</exclude>
								<exclude>META-INF/*.DSA</exclude>
								<exclude>META-INF/*.RSA</exclude>
							</excludes>
						</filter>
					</filters>
					<finalName>uber-${project.artifactId}-${project.version}</finalName>
				</configuration>
			</plugin>
		</plugins>
	</build>
</project>
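(A sketch of a pom change I have not tried yet: the shade plugin can relocate commons-compress inside the uber-jar, so Tika's calls bind to the bundled 1.18 classes no matter which commons-compress Spark puts first on the classpath. This would go inside the shade plugin's <configuration>, alongside <filters>; the shaded package prefix is arbitrary:)

```xml
<relocations>
  <relocation>
    <!-- rewrite commons-compress class references inside the shaded jar -->
    <pattern>org.apache.commons.compress</pattern>
    <shadedPattern>shaded.org.apache.commons.compress</shadedPattern>
  </relocation>
</relocations>
```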

 

mvn dependency tree:

Notes:

  • $ mvn dependency:tree -Ddetail=true | grep compress

 

[INFO] +- org.apache.commons:commons-compress:jar:1.18:compile
[INFO] | +- com.ning:compress-lzf:jar:1.0.3:compile

 

 

  • $ mvn dependency:tree -Ddetail=true | grep commons

 

[INFO] +- org.apache.commons:commons-compress:jar:1.18:compile
[INFO] | | | \- commons-collections:commons-collections:jar:3.2.2:compile
[INFO] | | | +- commons-cli:commons-cli:jar:1.2:compile
[INFO] | | | +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] | | | +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] | | | | +- commons-digester:commons-digester:jar:1.8:compile
[INFO] | | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] | | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] | +- org.apache.commons:commons-lang3:jar:3.5:compile
[INFO] | +- org.apache.commons:commons-math3:jar:3.4.1:compile
[INFO] | +- commons-net:commons-net:jar:2.2:compile
[INFO] | +- org.apache.commons:commons-crypto:jar:1.0.0:compile
[INFO] | | +- org.codehaus.janino:commons-compiler:jar:3.0.8:compile
[INFO] | | \- commons-lang:commons-lang:jar:2.6:compile
[INFO] | +- commons-codec:commons-codec:jar:1.11:compile
[INFO] | | \- org.apache.commons:commons-collections4:jar:4.2:compile
[INFO] | +- org.apache.commons:commons-exec:jar:1.3:compile
[INFO] | +- commons-io:commons-io:jar:2.6:compile
[INFO] | +- org.apache.commons:commons-csv:jar:1.5:compile
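(Incidentally, dependency:tree can filter for the artifact directly instead of grep, and with verbose output it also shows any copies omitted for a version conflict -- a command sketch against this pom; -Dverbose support varies by plugin version:)

```shell
mvn dependency:tree -Dincludes=org.apache.commons:commons-compress -Dverbose
```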

 

 

  • Full listing:

 

$ mvn dependency:tree -Ddetail=true

[INFO] com.example:tikaTest:jar:1.19
[INFO] +- org.apache.commons:commons-compress:jar:1.18:compile
[INFO] +- org.apache.spark:spark-core_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.avro:avro:jar:1.7.6-cdh5.13.3:compile
[INFO] | | +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
[INFO] | | +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO] | | \- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
[INFO] | +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6-cdh5.13.3:compile
[INFO] | | +- org.apache.avro:avro-ipc:jar:1.7.6-cdh5.13.3:compile
[INFO] | | | \- commons-collections:commons-collections:jar:3.2.2:compile
[INFO] | | \- org.apache.avro:avro-ipc:jar:tests:1.7.6-cdh5.13.3:compile
[INFO] | +- com.twitter:chill_2.11:jar:0.8.0:compile
[INFO] | | \- com.esotericsoftware:kryo-shaded:jar:3.0.3:compile
[INFO] | | +- com.esotericsoftware:minlog:jar:1.3.0:compile
[INFO] | | \- org.objenesis:objenesis:jar:2.1:compile
[INFO] | +- com.twitter:chill-java:jar:0.8.0:compile
[INFO] | +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:compile
[INFO] | +- org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.13.3:compile
[INFO] | | +- org.apache.hadoop:hadoop-common:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | +- commons-cli:commons-cli:jar:1.2:compile
[INFO] | | | +- xmlenc:xmlenc:jar:0.52:compile
[INFO] | | | +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] | | | +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] | | | | +- commons-digester:commons-digester:jar:1.8:compile
[INFO] | | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] | | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] | | | +- org.apache.hadoop:hadoop-auth:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | | \- org.apache.directory.server:apacheds-kerberos-codec:jar:2.0.0-M15:compile
[INFO] | | | | +- org.apache.directory.server:apacheds-i18n:jar:2.0.0-M15:compile
[INFO] | | | | +- org.apache.directory.api:api-asn1-api:jar:1.0.0-M20:compile
[INFO] | | | | \- org.apache.directory.api:api-util:jar:1.0.0-M20:compile
[INFO] | | | +- org.apache.curator:curator-client:jar:2.7.1:compile
[INFO] | | | \- org.apache.htrace:htrace-core4:jar:4.0.1-incubating:compile
[INFO] | | +- org.apache.hadoop:hadoop-hdfs:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | +- org.mortbay.jetty:jetty-util:jar:6.1.26.cloudera.4:compile
[INFO] | | | \- xerces:xercesImpl:jar:2.9.1:compile
[INFO] | | | \- xml-apis:xml-apis:jar:1.3.04:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.6.0-cdh5.13.3:compile
[INFO] | | +- org.apache.hadoop:hadoop-yarn-api:jar:2.6.0-cdh5.13.3:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.8:compile
[INFO] | | | \- org.codehaus.jackson:jackson-xc:jar:1.8.8:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.6.0-cdh5.13.3:compile
[INFO] | | +- org.apache.hadoop:hadoop-aws:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | \- com.amazonaws:aws-java-sdk-bundle:jar:1.11.134:compile
[INFO] | | \- org.apache.hadoop:hadoop-annotations:jar:2.6.0-cdh5.13.3:compile
[INFO] | +- org.apache.spark:spark-launcher_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.spark:spark-network-common_2.11:jar:2.2.0.cloudera3:compile
[INFO] | | \- org.fusesource.leveldbjni:leveldbjni-all:jar:1.8:compile
[INFO] | +- org.apache.spark:spark-network-shuffle_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.spark:spark-unsafe_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- net.java.dev.jets3t:jets3t:jar:0.9.0:compile
[INFO] | | +- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
[INFO] | | \- com.jamesmurty.utils:java-xmlbuilder:jar:0.4:compile
[INFO] | +- org.apache.curator:curator-recipes:jar:2.7.1:compile
[INFO] | | +- org.apache.curator:curator-framework:jar:2.7.1:compile
[INFO] | | +- org.apache.zookeeper:zookeeper:jar:3.4.6:compile
[INFO] | | \- com.google.guava:guava:jar:16.0.1:compile
[INFO] | +- javax.servlet:javax.servlet-api:jar:3.1.0:compile
[INFO] | +- org.apache.commons:commons-lang3:jar:3.5:compile
[INFO] | +- org.apache.commons:commons-math3:jar:3.4.1:compile
[INFO] | +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO] | +- org.slf4j:slf4j-api:jar:1.7.5:compile
[INFO] | +- org.slf4j:jul-to-slf4j:jar:1.7.5:compile
[INFO] | +- org.slf4j:jcl-over-slf4j:jar:1.7.5:compile
[INFO] | +- log4j:log4j:jar:1.2.17:compile
[INFO] | +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
[INFO] | +- com.ning:compress-lzf:jar:1.0.3:compile
[INFO] | +- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
[INFO] | +- net.jpountz.lz4:lz4:jar:1.3.0:compile
[INFO] | +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:compile
[INFO] | +- commons-net:commons-net:jar:2.2:compile
[INFO] | +- org.json4s:json4s-jackson_2.11:jar:3.2.11:compile
[INFO] | | \- org.json4s:json4s-core_2.11:jar:3.2.11:compile
[INFO] | | +- org.json4s:json4s-ast_2.11:jar:3.2.11:compile
[INFO] | | \- org.scala-lang:scalap:jar:2.11.0:compile
[INFO] | | \- org.scala-lang:scala-compiler:jar:2.11.0:compile
[INFO] | | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.1:compile
[INFO] | | \- org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.1:compile
[INFO] | +- org.glassfish.jersey.core:jersey-client:jar:2.22.2:compile
[INFO] | | +- org.glassfish.hk2:hk2-api:jar:2.4.0-b34:compile
[INFO] | | | +- org.glassfish.hk2:hk2-utils:jar:2.4.0-b34:compile
[INFO] | | | \- org.glassfish.hk2.external:aopalliance-repackaged:jar:2.4.0-b34:compile
[INFO] | | +- org.glassfish.hk2.external:javax.inject:jar:2.4.0-b34:compile
[INFO] | | \- org.glassfish.hk2:hk2-locator:jar:2.4.0-b34:compile
[INFO] | | \- org.javassist:javassist:jar:3.18.1-GA:compile
[INFO] | +- org.glassfish.jersey.core:jersey-common:jar:2.22.2:compile
[INFO] | | +- javax.annotation:javax.annotation-api:jar:1.2:compile
[INFO] | | +- org.glassfish.jersey.bundles.repackaged:jersey-guava:jar:2.22.2:compile
[INFO] | | \- org.glassfish.hk2:osgi-resource-locator:jar:1.0.1:compile
[INFO] | +- org.glassfish.jersey.core:jersey-server:jar:2.22.2:compile
[INFO] | | +- org.glassfish.jersey.media:jersey-media-jaxb:jar:2.22.2:compile
[INFO] | | \- javax.validation:validation-api:jar:1.1.0.Final:compile
[INFO] | +- org.glassfish.jersey.containers:jersey-container-servlet:jar:2.22.2:compile
[INFO] | +- org.glassfish.jersey.containers:jersey-container-servlet-core:jar:2.22.2:compile
[INFO] | +- io.netty:netty-all:jar:4.0.43.Final:compile
[INFO] | +- io.netty:netty:jar:3.9.9.Final:compile
[INFO] | +- com.clearspring.analytics:stream:jar:2.7.0:compile
[INFO] | +- io.dropwizard.metrics:metrics-core:jar:3.1.2:compile
[INFO] | +- io.dropwizard.metrics:metrics-jvm:jar:3.1.2:compile
[INFO] | +- io.dropwizard.metrics:metrics-json:jar:3.1.2:compile
[INFO] | +- io.dropwizard.metrics:metrics-graphite:jar:3.1.2:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-databind:jar:2.6.5:compile
[INFO] | +- com.fasterxml.jackson.module:jackson-module-scala_2.11:jar:2.6.5:compile
[INFO] | | +- org.scala-lang:scala-reflect:jar:2.11.7:compile
[INFO] | | \- com.fasterxml.jackson.module:jackson-module-paranamer:jar:2.6.5:compile
[INFO] | +- org.apache.ivy:ivy:jar:2.4.0:compile
[INFO] | +- oro:oro:jar:2.0.8:compile
[INFO] | +- net.razorvine:pyrolite:jar:4.13:compile
[INFO] | +- net.sf.py4j:py4j:jar:0.10.7:compile
[INFO] | +- org.apache.spark:spark-tags_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.commons:commons-crypto:jar:1.0.0:compile
[INFO] | \- org.spark-project.spark:unused:jar:1.0.0:compile
[INFO] +- org.apache.spark:spark-sql_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- com.univocity:univocity-parsers:jar:2.2.1:compile
[INFO] | +- org.apache.spark:spark-sketch_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.spark:spark-catalyst_2.11:jar:2.2.0.cloudera3:compile
[INFO] | | +- org.codehaus.janino:janino:jar:3.0.8:compile
[INFO] | | +- org.codehaus.janino:commons-compiler:jar:3.0.8:compile
[INFO] | | \- org.antlr:antlr4-runtime:jar:4.5.3:compile
[INFO] | +- com.twitter:parquet-column:jar:1.5.0-cdh5.13.3:compile
[INFO] | | +- com.twitter:parquet-common:jar:1.5.0-cdh5.13.3:compile
[INFO] | | \- com.twitter:parquet-encoding:jar:1.5.0-cdh5.13.3:compile
[INFO] | +- com.twitter:parquet-hadoop:jar:1.5.0-cdh5.13.3:compile
[INFO] | | +- com.twitter:parquet-format:jar:2.1.0-cdh5.13.3:compile
[INFO] | | \- com.twitter:parquet-jackson:jar:1.5.0-cdh5.13.3:compile
[INFO] | \- com.twitter:parquet-avro:jar:1.5.0-cdh5.13.3:compile
[INFO] | \- it.unimi.dsi:fastutil:jar:7.2.1:compile
[INFO] +- org.apache.tika:tika-core:jar:1.19:compile
[INFO] +- org.apache.tika:tika-parsers:jar:1.19:compile
[INFO] | +- javax.xml.bind:jaxb-api:jar:2.3.0:compile
[INFO] | +- com.sun.xml.bind:jaxb-core:jar:2.3.0:compile
[INFO] | +- com.sun.xml.bind:jaxb-impl:jar:2.3.0:compile
[INFO] | +- javax.activation:activation:jar:1.1.1:compile
[INFO] | +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
[INFO] | +- com.healthmarketscience.jackcess:jackcess:jar:2.1.12:compile
[INFO] | | \- commons-lang:commons-lang:jar:2.6:compile
[INFO] | +- com.healthmarketscience.jackcess:jackcess-encrypt:jar:2.1.4:compile
[INFO] | +- org.tallison:jmatio:jar:1.4:compile
[INFO] | +- org.apache.james:apache-mime4j-core:jar:0.8.2:compile
[INFO] | +- org.apache.james:apache-mime4j-dom:jar:0.8.2:compile
[INFO] | +- org.tukaani:xz:jar:1.8:compile
[INFO] | +- com.epam:parso:jar:2.0.9:compile
[INFO] | +- org.brotli:dec:jar:0.1.2:compile
[INFO] | +- commons-codec:commons-codec:jar:1.11:compile
[INFO] | +- org.apache.pdfbox:pdfbox:jar:2.0.11:compile
[INFO] | | \- org.apache.pdfbox:fontbox:jar:2.0.11:compile
[INFO] | +- org.apache.pdfbox:pdfbox-tools:jar:2.0.11:compile
[INFO] | +- org.apache.pdfbox:jempbox:jar:1.8.15:compile
[INFO] | +- org.bouncycastle:bcmail-jdk15on:jar:1.60:compile
[INFO] | | \- org.bouncycastle:bcpkix-jdk15on:jar:1.60:compile
[INFO] | +- org.bouncycastle:bcprov-jdk15on:jar:1.60:compile
[INFO] | +- org.apache.poi:poi:jar:4.0.0:compile
[INFO] | | \- org.apache.commons:commons-collections4:jar:4.2:compile
[INFO] | +- org.apache.poi:poi-scratchpad:jar:4.0.0:compile
[INFO] | +- org.apache.poi:poi-ooxml:jar:4.0.0:compile
[INFO] | | +- org.apache.poi:poi-ooxml-schemas:jar:4.0.0:compile
[INFO] | | | \- org.apache.xmlbeans:xmlbeans:jar:3.0.1:compile
[INFO] | | \- com.github.virtuald:curvesapi:jar:1.04:compile
[INFO] | +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] | +- org.ow2.asm:asm:jar:6.2:compile
[INFO] | +- com.googlecode.mp4parser:isoparser:jar:1.1.22:compile
[INFO] | +- com.drewnoakes:metadata-extractor:jar:2.11.0:compile
[INFO] | | \- com.adobe.xmp:xmpcore:jar:5.1.3:compile
[INFO] | +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] | +- com.rometools:rome:jar:1.5.1:compile
[INFO] | | \- com.rometools:rome-utils:jar:1.5.1:compile
[INFO] | +- org.gagravarr:vorbis-java-core:jar:0.8:compile
[INFO] | +- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] | +- org.codelibs:jhighlight:jar:1.0.3:compile
[INFO] | +- com.pff:java-libpst:jar:0.8.1:compile
[INFO] | +- com.github.junrar:junrar:jar:2.0.0:compile
[INFO] | +- org.apache.cxf:cxf-rt-rs-client:jar:3.2.6:compile
[INFO] | | +- org.apache.cxf:cxf-rt-transports-http:jar:3.2.6:compile
[INFO] | | +- org.apache.cxf:cxf-core:jar:3.2.6:compile
[INFO] | | | +- com.fasterxml.woodstox:woodstox-core:jar:5.1.0:compile
[INFO] | | | | \- org.codehaus.woodstox:stax2-api:jar:4.1:compile
[INFO] | | | \- org.apache.ws.xmlschema:xmlschema-core:jar:2.2.3:compile
[INFO] | | \- org.apache.cxf:cxf-rt-frontend-jaxrs:jar:3.2.6:compile
[INFO] | +- org.apache.commons:commons-exec:jar:1.3:compile
[INFO] | +- org.apache.opennlp:opennlp-tools:jar:1.9.0:compile
[INFO] | +- commons-io:commons-io:jar:2.6:compile
[INFO] | +- com.googlecode.json-simple:json-simple:jar:1.1.1:compile
[INFO] | +- com.github.openjson:openjson:jar:1.0.10:compile
[INFO] | +- com.google.code.gson:gson:jar:2.8.5:compile
[INFO] | +- edu.ucar:netcdf4:jar:4.5.5:compile
[INFO] | | \- net.jcip:jcip-annotations:jar:1.0:compile
[INFO] | +- edu.ucar:grib:jar:4.5.5:compile
[INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] | | \- org.itadaki:bzip2:jar:0.9.1:compile
[INFO] | +- net.java.dev.jna:jna:jar:4.3.0:compile
[INFO] | +- org.jsoup:jsoup:jar:1.11.3:compile
[INFO] | +- edu.ucar:cdm:jar:4.5.5:compile
[INFO] | | +- edu.ucar:udunits:jar:4.5.5:compile
[INFO] | | +- joda-time:joda-time:jar:2.2:compile
[INFO] | | +- org.quartz-scheduler:quartz:jar:2.2.0:compile
[INFO] | | | \- c3p0:c3p0:jar:0.9.1.1:compile
[INFO] | | +- net.sf.ehcache:ehcache-core:jar:2.6.2:compile
[INFO] | | \- com.beust:jcommander:jar:1.35:compile
[INFO] | +- edu.ucar:httpservices:jar:4.5.5:compile
[INFO] | +- org.apache.httpcomponents:httpclient:jar:4.5.6:compile
[INFO] | +- org.apache.httpcomponents:httpmime:jar:4.5.6:compile
[INFO] | +- org.apache.commons:commons-csv:jar:1.5:compile
[INFO] | +- org.apache.sis.core:sis-utility:jar:0.8:compile
[INFO] | | \- javax.measure:unit-api:jar:1.0:compile
[INFO] | +- org.apache.sis.storage:sis-netcdf:jar:0.8:compile
[INFO] | | +- org.apache.sis.storage:sis-storage:jar:0.8:compile
[INFO] | | | \- org.apache.sis.core:sis-feature:jar:0.8:compile
[INFO] | | \- org.apache.sis.core:sis-referencing:jar:0.8:compile
[INFO] | +- org.apache.sis.core:sis-metadata:jar:0.8:compile
[INFO] | +- org.opengis:geoapi:jar:3.0.1:compile
[INFO] | +- edu.usc.ir:sentiment-analysis-parser:jar:0.1:compile
[INFO] | +- org.apache.uima:uimafit-core:jar:2.2.0:compile
[INFO] | +- org.apache.uima:uimaj-core:jar:2.9.0:compile
[INFO] | +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-core:jar:2.9.6:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-annotations:jar:2.9.6:compile
[INFO] | +- org.apache.pdfbox:jbig2-imageio:jar:3.0.1:compile
[INFO] | \- com.github.jai-imageio:jai-imageio-core:jar:1.4.0:compile
[INFO] +- javax.ws.rs:javax.ws.rs-api:jar:2.1.1:compile
[INFO] \- org.scala-lang:scala-library:jar:2.11.12:compile


8 REPLIES

avatar
Contributor

If it occurs when you run spark-submit, you could add the jar's absolute path with "--jars", e.g. "--jars /a/b/c.jar", in your shell, then try to submit it again.
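To make that concrete, here is a sketch of what the suggested call might look like for the job in the original post. The commons-compress jar name and path are illustrative assumptions, not values from the thread:

```shell
# Pass the dependency jar explicitly with --jars so it is distributed
# alongside the uber-jar; the path to commons-compress is hypothetical.
spark-submit \
  --master local[*] \
  --class com.example.App \
  --jars /a/b/commons-compress-1.18.jar \
  ./target/uber-tikaTest-1.19.jar
```

Note that --jars adds jars to the classpath of driver and executors, but by default Spark's own copies of a library still take precedence over user-supplied versions.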

avatar
New Contributor

Hi,

 

Thanks for the reply. Unfortunately, that is one of the things I've already tried (see original post): I've tried --jars with both HDFS and local locations. From my testing, I suspect this is a bigger issue -- a classpath conflict in which the classes Spark bundles overrule mine. I'm building an uber-jar and have verified that the dependency in question (commons-compress) is up to date inside it, yet the NoSuchMethodError still occurs. So I tried the --conf userClassPathFirst flags, and that causes Spark to crash on startup (see error at top of original post), since I'm then overriding the classpath for Spark's own required classes.

 

I should've titled this thread something like "How to resolve a commons-compress dependency conflict for Spark."

 

I think this snippet from another post is telling:

 

avatar
Contributor

If you use CDH, you could check the Spark version on the Web UI. From my side, I think this is a version-mismatch issue.

avatar
New Contributor

I'm dealing with the same issue. Did you ever find a solution?

avatar
New Contributor

I found a solution, but I don't understand why it works.

 

In our project we were previously using Tika 1.12; I encountered the NoSuchMethodError when we upgraded to Tika 1.19.1. When I compared the dependency trees for builds with these two versions of Tika to see how commons-compress was being included, the only structural difference I found was that the new version of Tika introduced a transitive dependency on org.apache.poi.ooxml:

 

[INFO] |  +- org.apache.poi:poi-ooxml:jar:4.0.0:compile
[INFO] |  |  +- (org.apache.poi:poi:jar:4.0.0:compile - omitted for duplicate)
[INFO] |  |  +- org.apache.poi:poi-ooxml-schemas:jar:4.0.0:compile
[INFO] |  |  |  \- org.apache.xmlbeans:xmlbeans:jar:3.0.1:compile
[INFO] |  |  +- (org.apache.commons:commons-compress:jar:1.18:compile - omitted for conflict with 1.4.1)
[INFO] |  |  \- com.github.virtuald:curvesapi:jar:1.04:compile

(Our pom.xml specifies commons-compress 1.18; the Hadoop 2.6.5 libraries depend on commons-compress 1.4.1.) I don't see why poi-ooxml's dependency on commons-compress would prevent the inclusion of commons-compress 1.18, but that is what seems to be happening. When I exclude poi-ooxml from tika-parsers, calls to the Tika parser in spark-shell work as expected.
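For anyone comparing trees the same way: Maven can show every place commons-compress enters the graph, including the versions it drops, which helps confirm whether 1.18 or 1.4.1 wins resolution in your build. This is a standard Maven Dependency Plugin invocation; only the project it runs against is assumed:

```shell
# -Dverbose keeps the "(... omitted for conflict with ...)" entries visible,
# and -Dincludes filters the tree down to commons-compress occurrences only.
mvn dependency:tree -Dverbose -Dincludes=org.apache.commons:commons-compress
```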

avatar
New Contributor

Hi jeremyw, I am using the same versions of Tika and getting the same dependency conflicts.

 

There are a few observations:

1. A dependency tree comparison shows many differences, not only org.apache.poi:poi-ooxml as you mentioned.

2. Excluding the org.apache.poi:poi-ooxml dependency from tika-parsers returns a blank string after extraction.

 

Can you please confirm whether it actually worked in your case, or whether you faced any further issues?

avatar
New Contributor

@DeepikaPant, I wrote up a more detailed analysis of the issue and a workaround here: https://github.com/archivesunleashed/aut/issues/308. The solution is to include a JAR containing an appropriate version of commons-compress with the --driver-class-path argument to spark-shell or spark-submit.
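In concrete terms, that workaround looks roughly like the following. The location of the commons-compress jar is an assumption here; point it at wherever a 1.18 jar actually lives on your submitting machine (e.g. your local Maven repository):

```shell
# Prepend a compatible commons-compress to the driver's class path so it
# wins over the older 1.4.1 copy that ships with Hadoop/Spark.
spark-submit \
  --master local[*] \
  --class com.example.App \
  --driver-class-path /path/to/commons-compress-1.18.jar \
  ./target/uber-tikaTest-1.19.jar
```

The same --driver-class-path argument works with spark-shell for interactive testing.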

avatar
New Contributor

We need to set the driver as well as the executor class path in cluster mode.

 

Anyways your analysis is quite helpful and we have also used the same for now. Thanks. Keep sharing!
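For cluster mode, a sketch that covers both sides might look like this. The jar path is an assumption, and the pattern relies on YARN localizing jars passed via --jars into each container's working directory, which is why the executor-side entries use the bare file name:

```shell
# Ship the jar with --jars, then put it on both the driver and executor
# class paths; in YARN cluster mode the localized jar sits in the
# container working directory, so the bare file name resolves there.
spark-submit \
  --master yarn --deploy-mode cluster \
  --class com.example.App \
  --jars /local/path/commons-compress-1.18.jar \
  --conf spark.driver.extraClassPath=commons-compress-1.18.jar \
  --conf spark.executor.extraClassPath=commons-compress-1.18.jar \
  ./target/uber-tikaTest-1.19.jar
```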