Created on 09-25-2018 10:50 AM - edited 09-16-2022 06:44 AM
Hi,
I am trying to resolve a spark-submit runtime classpath issue for an Apache Tika (> v1.14) parsing job. The problem appears to be a conflict between the spark-submit classpath and my uber-jar.
Platforms: CDH 5.15 (Spark 2.3 added via CDH docs) and CDH 6 (Spark 2.2 bundled in CDH 6)
I've tried / reviewed:
(Cloudera) Where does spark-submit look for Jar files?
(stackoverflow) resolving-dependency-problems-in-apache-spark
Highlights:
$ spark-submit --master local[*] --class com.example.App --conf spark.executor.userClassPathFirst=true ./target/uber-tikaTest-1.19.jar
18/09/25 13:35:55 ERROR util.Utils: Exception encountered
java.lang.NullPointerException
    at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply$mcV$sp(ParallelCollectionRDD.scala:72)
    at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:70)
    at org.apache.spark.rdd.ParallelCollectionPartition$$anonfun$readObject$1.apply(ParallelCollectionRDD.scala:70)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1307)
    at org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:70)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2136)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/09/25 13:35:55 ERROR util.Utils: Exception encountered
java.lang.NullPointerException
    (identical stack trace)
Below the following error message are the relevant project files (build script, sample app, pom.xml, and Maven dependency tree).
The error at runtime:
18/09/25 11:47:39 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
    at org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:160)
    at org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:104)
    at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:116)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
    at com.example.App$.tikaAutoDetectParser(App.scala:55)
    at com.example.App$$anonfun$1.apply(App.scala:69)
    at com.example.App$$anonfun$1.apply(App.scala:69)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1799)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1158)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2071)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/09/25 11:47:39 ERROR executor.Executor: Exception in task 5.0 in stage 0.0 (TID 5)
java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamFactory.detect(Ljava/io/InputStream;)Ljava/lang/String;
    (identical stack trace)
build-and-run.sh:
Notes:
#!/bin/bash
mvn package   # the shade plugin builds the uber-jar in the package phase

if true
then
  spark-submit --master local[*] --class com.example.App ./target/uber-tikaTest-1.19.jar
fi

# tried using the userClassPathFirst flags for driver and executor for both calls to spark-submit:
# --conf spark.driver.userClassPathFirst=true \
# --conf spark.executor.userClassPathFirst=true \

if false
then
  spark-submit --class com.example.App \
    --master yarn \
    --packages org.apache.commons:commons-compress:1.18 \
    --jars ./target/uber-tikaTest-1.19.jar \
    --num-executors 2 \
    --executor-memory 1024m \
    --executor-cores 2 \
    --driver-memory 2048m \
    --driver-cores 1 \
    ./target/uber-tikaTest-1.19.jar
fi
Sample App:
package com.example

////////// Tika imports
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.AutoDetectParser
import org.apache.tika.sax.BodyContentHandler

////////// Java HTTP imports
import java.net.URL
import java.net.HttpURLConnection

import scala.collection.JavaConverters._
import scala.collection.mutable._

////////// Spark imports
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.{Row, SparkSession}

object App {

  case class InputStreamData(sourceURL: String,
                             headerFields: Map[String, List[String]],
                             inputStream: java.io.InputStream)

  def openUrlStream(sourceURL: String, apiKey: String): InputStreamData = {
    try {
      val url = new URL(sourceURL)
      val urlConnection = url.openConnection().asInstanceOf[HttpURLConnection]
      urlConnection.setInstanceFollowRedirects(true)
      val headerFields = urlConnection.getHeaderFields()
      val input = urlConnection.getInputStream()
      InputStreamData(sourceURL, headerFields.asScala.map(x => (x._1, x._2.asScala.toList)), input)
    } catch {
      case e: Exception =>
        println("**********************************************************************************************")
        println("PARSEURL: INVALID URL: " + sourceURL)
        println(e.toString())
        println("**********************************************************************************************")
        InputStreamData(sourceURL, Map("ERROR" -> List("ERROR")), null)
    }
  }

  def tikaAutoDetectParser(inputStream: java.io.InputStream): String = {
    val parser = new AutoDetectParser()
    val handler = new BodyContentHandler(-1)
    val metadata = new Metadata()
    parser.parse(inputStream, handler, metadata)
    handler.toString()
  }

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("tika-1.19-test")
    val sc = new SparkContext(sparkConf)
    val spark = SparkSession.builder.config(sparkConf).getOrCreate()
    println("HELLO!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
    val urls = List("http://www.pdf995.com/samples/pdf.pdf",
                    "https://www.amd.com/en",
                    "http://jeroen.github.io/images/testocr.png")
    val rdd = sc.parallelize(urls)
    val parsed = rdd.map(x => tikaAutoDetectParser(openUrlStream(x, "").inputStream))
    println(parsed.count)
  }
}
pom.xml (builds uber-jar):
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>tikaTest</artifactId>
  <version>1.19</version>
  <name>${project.artifactId}</name>
  <description>Testing tika 1.19 with CDH 6 and 5.x, Spark 2.x, Scala 2.11.x</description>
  <inceptionYear>2018</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>
  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
  <profiles>
    <profile>
      <id>scala-2.11.12</id>
      <activation>
        <activeByDefault>true</activeByDefault>
      </activation>
      <properties>
        <scalaVersion>2.11.12</scalaVersion>
        <scalaBinaryVersion>2.11.12</scalaBinaryVersion>
      </properties>
      <dependencies>
        <!-- GOOD DEPENDENCIES -->
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-compress -->
        <dependency>
          <groupId>org.apache.commons</groupId>
          <artifactId>commons-compress</artifactId>
          <version>1.18</version>
        </dependency>
        <!-- CDH flavored dependencies -->
        <!-- https://www.cloudera.com/documentation/spark2/latest/topics/spark2_packaging.html#versions -->
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>2.2.0.cloudera3</version>
          <!-- have tried scope provided / compile -->
          <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_2.11</artifactId>
          <version>2.2.0.cloudera3</version>
          <!-- have tried scope provided / compile -->
          <!--<scope>provided</scope>-->
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-core</artifactId>
          <version>1.19</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
        <dependency>
          <groupId>org.apache.tika</groupId>
          <artifactId>tika-parsers</artifactId>
          <version>1.19</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/javax.ws.rs/javax.ws.rs-api -->
        <dependency>
          <groupId>javax.ws.rs</groupId>
          <artifactId>javax.ws.rs-api</artifactId>
          <version>2.1.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
        <dependency>
          <groupId>org.scala-lang</groupId>
          <artifactId>scala-library</artifactId>
          <version>2.11.12</version>
        </dependency>
        <!-- alternative dependencies that have been tried and yield the same Tika error -->
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <!--
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>2.2.0</version>
        </dependency>
        -->
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <!--
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_2.11</artifactId>
          <version>2.2.0</version>
        </dependency>
        -->
      </dependencies>
    </profile>
  </profiles>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <args>
            <!-- work-around for https://issues.scala-lang.org/browse/SI-8358 -->
            <arg>-nobootcp</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.1.1</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <filters>
            <filter>
              <artifact>*:*</artifact>
              <excludes>
                <exclude>META-INF/*.SF</exclude>
                <exclude>META-INF/*.DSA</exclude>
                <exclude>META-INF/*.RSA</exclude>
              </excludes>
            </filter>
          </filters>
          <finalName>uber-${project.artifactId}-${project.version}</finalName>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
mvn dependency tree:
Notes:
[INFO] +- org.apache.commons:commons-compress:jar:1.18:compile
[INFO] | +- com.ning:compress-lzf:jar:1.0.3:compile
[INFO] +- org.apache.commons:commons-compress:jar:1.18:compile
[INFO] | | | \- commons-collections:commons-collections:jar:3.2.2:compile
[INFO] | | | +- commons-cli:commons-cli:jar:1.2:compile
[INFO] | | | +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] | | | +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] | | | | +- commons-digester:commons-digester:jar:1.8:compile
[INFO] | | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] | | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] | +- org.apache.commons:commons-lang3:jar:3.5:compile
[INFO] | +- org.apache.commons:commons-math3:jar:3.4.1:compile
[INFO] | +- commons-net:commons-net:jar:2.2:compile
[INFO] | +- org.apache.commons:commons-crypto:jar:1.0.0:compile
[INFO] | | +- org.codehaus.janino:commons-compiler:jar:3.0.8:compile
[INFO] | | \- commons-lang:commons-lang:jar:2.6:compile
[INFO] | +- commons-codec:commons-codec:jar:1.11:compile
[INFO] | | \- org.apache.commons:commons-collections4:jar:4.2:compile
[INFO] | +- org.apache.commons:commons-exec:jar:1.3:compile
[INFO] | +- commons-io:commons-io:jar:2.6:compile
[INFO] | +- org.apache.commons:commons-csv:jar:1.5:compile
$ mvn dependency:tree -Ddetail=true
[INFO] com.example:tikaTest:jar:1.19
[INFO] +- org.apache.commons:commons-compress:jar:1.18:compile
[INFO] +- org.apache.spark:spark-core_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.avro:avro:jar:1.7.6-cdh5.13.3:compile
[INFO] | | +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
[INFO] | | +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO] | | \- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
[INFO] | +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6-cdh5.13.3:compile
[INFO] | | +- org.apache.avro:avro-ipc:jar:1.7.6-cdh5.13.3:compile
[INFO] | | | \- commons-collections:commons-collections:jar:3.2.2:compile
[INFO] | | \- org.apache.avro:avro-ipc:jar:tests:1.7.6-cdh5.13.3:compile
[INFO] | +- com.twitter:chill_2.11:jar:0.8.0:compile
[INFO] | | \- com.esotericsoftware:kryo-shaded:jar:3.0.3:compile
[INFO] | | +- com.esotericsoftware:minlog:jar:1.3.0:compile
[INFO] | | \- org.objenesis:objenesis:jar:2.1:compile
[INFO] | +- com.twitter:chill-java:jar:0.8.0:compile
[INFO] | +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:compile
[INFO] | +- org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.13.3:compile
[INFO] | | +- org.apache.hadoop:hadoop-common:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | +- commons-cli:commons-cli:jar:1.2:compile
[INFO] | | | +- xmlenc:xmlenc:jar:0.52:compile
[INFO] | | | +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] | | | +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] | | | | +- commons-digester:commons-digester:jar:1.8:compile
[INFO] | | | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] | | | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] | | | +- org.apache.hadoop:hadoop-auth:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | | \- org.apache.directory.server:apacheds-kerberos-codec:jar:2.0.0-M15:compile
[INFO] | | | | +- org.apache.directory.server:apacheds-i18n:jar:2.0.0-M15:compile
[INFO] | | | | +- org.apache.directory.api:api-asn1-api:jar:1.0.0-M20:compile
[INFO] | | | | \- org.apache.directory.api:api-util:jar:1.0.0-M20:compile
[INFO] | | | +- org.apache.curator:curator-client:jar:2.7.1:compile
[INFO] | | | \- org.apache.htrace:htrace-core4:jar:4.0.1-incubating:compile
[INFO] | | +- org.apache.hadoop:hadoop-hdfs:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | +- org.mortbay.jetty:jetty-util:jar:6.1.26.cloudera.4:compile
[INFO] | | | \- xerces:xercesImpl:jar:2.9.1:compile
[INFO] | | | \- xml-apis:xml-apis:jar:1.3.04:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.6.0-cdh5.13.3:compile
[INFO] | | +- org.apache.hadoop:hadoop-yarn-api:jar:2.6.0-cdh5.13.3:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.8:compile
[INFO] | | | \- org.codehaus.jackson:jackson-xc:jar:1.8.8:compile
[INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.6.0-cdh5.13.3:compile
[INFO] | | +- org.apache.hadoop:hadoop-aws:jar:2.6.0-cdh5.13.3:compile
[INFO] | | | \- com.amazonaws:aws-java-sdk-bundle:jar:1.11.134:compile
[INFO] | | \- org.apache.hadoop:hadoop-annotations:jar:2.6.0-cdh5.13.3:compile
[INFO] | +- org.apache.spark:spark-launcher_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.spark:spark-network-common_2.11:jar:2.2.0.cloudera3:compile
[INFO] | | \- org.fusesource.leveldbjni:leveldbjni-all:jar:1.8:compile
[INFO] | +- org.apache.spark:spark-network-shuffle_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.spark:spark-unsafe_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- net.java.dev.jets3t:jets3t:jar:0.9.0:compile
[INFO] | | +- org.apache.httpcomponents:httpcore:jar:4.1.2:compile
[INFO] | | \- com.jamesmurty.utils:java-xmlbuilder:jar:0.4:compile
[INFO] | +- org.apache.curator:curator-recipes:jar:2.7.1:compile
[INFO] | | +- org.apache.curator:curator-framework:jar:2.7.1:compile
[INFO] | | +- org.apache.zookeeper:zookeeper:jar:3.4.6:compile
[INFO] | | \- com.google.guava:guava:jar:16.0.1:compile
[INFO] | +- javax.servlet:javax.servlet-api:jar:3.1.0:compile
[INFO] | +- org.apache.commons:commons-lang3:jar:3.5:compile
[INFO] | +- org.apache.commons:commons-math3:jar:3.4.1:compile
[INFO] | +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO] | +- org.slf4j:slf4j-api:jar:1.7.5:compile
[INFO] | +- org.slf4j:jul-to-slf4j:jar:1.7.5:compile
[INFO] | +- org.slf4j:jcl-over-slf4j:jar:1.7.5:compile
[INFO] | +- log4j:log4j:jar:1.2.17:compile
[INFO] | +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
[INFO] | +- com.ning:compress-lzf:jar:1.0.3:compile
[INFO] | +- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
[INFO] | +- net.jpountz.lz4:lz4:jar:1.3.0:compile
[INFO] | +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:compile
[INFO] | +- commons-net:commons-net:jar:2.2:compile
[INFO] | +- org.json4s:json4s-jackson_2.11:jar:3.2.11:compile
[INFO] | | \- org.json4s:json4s-core_2.11:jar:3.2.11:compile
[INFO] | | +- org.json4s:json4s-ast_2.11:jar:3.2.11:compile
[INFO] | | \- org.scala-lang:scalap:jar:2.11.0:compile
[INFO] | | \- org.scala-lang:scala-compiler:jar:2.11.0:compile
[INFO] | | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.1:compile
[INFO] | | \- org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.1:compile
[INFO] | +- org.glassfish.jersey.core:jersey-client:jar:2.22.2:compile
[INFO] | | +- org.glassfish.hk2:hk2-api:jar:2.4.0-b34:compile
[INFO] | | | +- org.glassfish.hk2:hk2-utils:jar:2.4.0-b34:compile
[INFO] | | | \- org.glassfish.hk2.external:aopalliance-repackaged:jar:2.4.0-b34:compile
[INFO] | | +- org.glassfish.hk2.external:javax.inject:jar:2.4.0-b34:compile
[INFO] | | \- org.glassfish.hk2:hk2-locator:jar:2.4.0-b34:compile
[INFO] | | \- org.javassist:javassist:jar:3.18.1-GA:compile
[INFO] | +- org.glassfish.jersey.core:jersey-common:jar:2.22.2:compile
[INFO] | | +- javax.annotation:javax.annotation-api:jar:1.2:compile
[INFO] | | +- org.glassfish.jersey.bundles.repackaged:jersey-guava:jar:2.22.2:compile
[INFO] | | \- org.glassfish.hk2:osgi-resource-locator:jar:1.0.1:compile
[INFO] | +- org.glassfish.jersey.core:jersey-server:jar:2.22.2:compile
[INFO] | | +- org.glassfish.jersey.media:jersey-media-jaxb:jar:2.22.2:compile
[INFO] | | \- javax.validation:validation-api:jar:1.1.0.Final:compile
[INFO] | +- org.glassfish.jersey.containers:jersey-container-servlet:jar:2.22.2:compile
[INFO] | +- org.glassfish.jersey.containers:jersey-container-servlet-core:jar:2.22.2:compile
[INFO] | +- io.netty:netty-all:jar:4.0.43.Final:compile
[INFO] | +- io.netty:netty:jar:3.9.9.Final:compile
[INFO] | +- com.clearspring.analytics:stream:jar:2.7.0:compile
[INFO] | +- io.dropwizard.metrics:metrics-core:jar:3.1.2:compile
[INFO] | +- io.dropwizard.metrics:metrics-jvm:jar:3.1.2:compile
[INFO] | +- io.dropwizard.metrics:metrics-json:jar:3.1.2:compile
[INFO] | +- io.dropwizard.metrics:metrics-graphite:jar:3.1.2:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-databind:jar:2.6.5:compile
[INFO] | +- com.fasterxml.jackson.module:jackson-module-scala_2.11:jar:2.6.5:compile
[INFO] | | +- org.scala-lang:scala-reflect:jar:2.11.7:compile
[INFO] | | \- com.fasterxml.jackson.module:jackson-module-paranamer:jar:2.6.5:compile
[INFO] | +- org.apache.ivy:ivy:jar:2.4.0:compile
[INFO] | +- oro:oro:jar:2.0.8:compile
[INFO] | +- net.razorvine:pyrolite:jar:4.13:compile
[INFO] | +- net.sf.py4j:py4j:jar:0.10.7:compile
[INFO] | +- org.apache.spark:spark-tags_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.commons:commons-crypto:jar:1.0.0:compile
[INFO] | \- org.spark-project.spark:unused:jar:1.0.0:compile
[INFO] +- org.apache.spark:spark-sql_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- com.univocity:univocity-parsers:jar:2.2.1:compile
[INFO] | +- org.apache.spark:spark-sketch_2.11:jar:2.2.0.cloudera3:compile
[INFO] | +- org.apache.spark:spark-catalyst_2.11:jar:2.2.0.cloudera3:compile
[INFO] | | +- org.codehaus.janino:janino:jar:3.0.8:compile
[INFO] | | +- org.codehaus.janino:commons-compiler:jar:3.0.8:compile
[INFO] | | \- org.antlr:antlr4-runtime:jar:4.5.3:compile
[INFO] | +- com.twitter:parquet-column:jar:1.5.0-cdh5.13.3:compile
[INFO] | | +- com.twitter:parquet-common:jar:1.5.0-cdh5.13.3:compile
[INFO] | | \- com.twitter:parquet-encoding:jar:1.5.0-cdh5.13.3:compile
[INFO] | +- com.twitter:parquet-hadoop:jar:1.5.0-cdh5.13.3:compile
[INFO] | | +- com.twitter:parquet-format:jar:2.1.0-cdh5.13.3:compile
[INFO] | | \- com.twitter:parquet-jackson:jar:1.5.0-cdh5.13.3:compile
[INFO] | \- com.twitter:parquet-avro:jar:1.5.0-cdh5.13.3:compile
[INFO] | \- it.unimi.dsi:fastutil:jar:7.2.1:compile
[INFO] +- org.apache.tika:tika-core:jar:1.19:compile
[INFO] +- org.apache.tika:tika-parsers:jar:1.19:compile
[INFO] | +- javax.xml.bind:jaxb-api:jar:2.3.0:compile
[INFO] | +- com.sun.xml.bind:jaxb-core:jar:2.3.0:compile
[INFO] | +- com.sun.xml.bind:jaxb-impl:jar:2.3.0:compile
[INFO] | +- javax.activation:activation:jar:1.1.1:compile
[INFO] | +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
[INFO] | +- com.healthmarketscience.jackcess:jackcess:jar:2.1.12:compile
[INFO] | | \- commons-lang:commons-lang:jar:2.6:compile
[INFO] | +- com.healthmarketscience.jackcess:jackcess-encrypt:jar:2.1.4:compile
[INFO] | +- org.tallison:jmatio:jar:1.4:compile
[INFO] | +- org.apache.james:apache-mime4j-core:jar:0.8.2:compile
[INFO] | +- org.apache.james:apache-mime4j-dom:jar:0.8.2:compile
[INFO] | +- org.tukaani:xz:jar:1.8:compile
[INFO] | +- com.epam:parso:jar:2.0.9:compile
[INFO] | +- org.brotli:dec:jar:0.1.2:compile
[INFO] | +- commons-codec:commons-codec:jar:1.11:compile
[INFO] | +- org.apache.pdfbox:pdfbox:jar:2.0.11:compile
[INFO] | | \- org.apache.pdfbox:fontbox:jar:2.0.11:compile
[INFO] | +- org.apache.pdfbox:pdfbox-tools:jar:2.0.11:compile
[INFO] | +- org.apache.pdfbox:jempbox:jar:1.8.15:compile
[INFO] | +- org.bouncycastle:bcmail-jdk15on:jar:1.60:compile
[INFO] | | \- org.bouncycastle:bcpkix-jdk15on:jar:1.60:compile
[INFO] | +- org.bouncycastle:bcprov-jdk15on:jar:1.60:compile
[INFO] | +- org.apache.poi:poi:jar:4.0.0:compile
[INFO] | | \- org.apache.commons:commons-collections4:jar:4.2:compile
[INFO] | +- org.apache.poi:poi-scratchpad:jar:4.0.0:compile
[INFO] | +- org.apache.poi:poi-ooxml:jar:4.0.0:compile
[INFO] | | +- org.apache.poi:poi-ooxml-schemas:jar:4.0.0:compile
[INFO] | | | \- org.apache.xmlbeans:xmlbeans:jar:3.0.1:compile
[INFO] | | \- com.github.virtuald:curvesapi:jar:1.04:compile
[INFO] | +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] | +- org.ow2.asm:asm:jar:6.2:compile
[INFO] | +- com.googlecode.mp4parser:isoparser:jar:1.1.22:compile
[INFO] | +- com.drewnoakes:metadata-extractor:jar:2.11.0:compile
[INFO] | | \- com.adobe.xmp:xmpcore:jar:5.1.3:compile
[INFO] | +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] | +- com.rometools:rome:jar:1.5.1:compile
[INFO] | | \- com.rometools:rome-utils:jar:1.5.1:compile
[INFO] | +- org.gagravarr:vorbis-java-core:jar:0.8:compile
[INFO] | +- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] | +- org.codelibs:jhighlight:jar:1.0.3:compile
[INFO] | +- com.pff:java-libpst:jar:0.8.1:compile
[INFO] | +- com.github.junrar:junrar:jar:2.0.0:compile
[INFO] | +- org.apache.cxf:cxf-rt-rs-client:jar:3.2.6:compile
[INFO] | | +- org.apache.cxf:cxf-rt-transports-http:jar:3.2.6:compile
[INFO] | | +- org.apache.cxf:cxf-core:jar:3.2.6:compile
[INFO] | | | +- com.fasterxml.woodstox:woodstox-core:jar:5.1.0:compile
[INFO] | | | | \- org.codehaus.woodstox:stax2-api:jar:4.1:compile
[INFO] | | | \- org.apache.ws.xmlschema:xmlschema-core:jar:2.2.3:compile
[INFO] | | \- org.apache.cxf:cxf-rt-frontend-jaxrs:jar:3.2.6:compile
[INFO] | +- org.apache.commons:commons-exec:jar:1.3:compile
[INFO] | +- org.apache.opennlp:opennlp-tools:jar:1.9.0:compile
[INFO] | +- commons-io:commons-io:jar:2.6:compile
[INFO] | +- com.googlecode.json-simple:json-simple:jar:1.1.1:compile
[INFO] | +- com.github.openjson:openjson:jar:1.0.10:compile
[INFO] | +- com.google.code.gson:gson:jar:2.8.5:compile
[INFO] | +- edu.ucar:netcdf4:jar:4.5.5:compile
[INFO] | | \- net.jcip:jcip-annotations:jar:1.0:compile
[INFO] | +- edu.ucar:grib:jar:4.5.5:compile
[INFO] | | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] | | \- org.itadaki:bzip2:jar:0.9.1:compile
[INFO] | +- net.java.dev.jna:jna:jar:4.3.0:compile
[INFO] | +- org.jsoup:jsoup:jar:1.11.3:compile
[INFO] | +- edu.ucar:cdm:jar:4.5.5:compile
[INFO] | | +- edu.ucar:udunits:jar:4.5.5:compile
[INFO] | | +- joda-time:joda-time:jar:2.2:compile
[INFO] | | +- org.quartz-scheduler:quartz:jar:2.2.0:compile
[INFO] | | | \- c3p0:c3p0:jar:0.9.1.1:compile
[INFO] | | +- net.sf.ehcache:ehcache-core:jar:2.6.2:compile
[INFO] | | \- com.beust:jcommander:jar:1.35:compile
[INFO] | +- edu.ucar:httpservices:jar:4.5.5:compile
[INFO] | +- org.apache.httpcomponents:httpclient:jar:4.5.6:compile
[INFO] | +- org.apache.httpcomponents:httpmime:jar:4.5.6:compile
[INFO] | +- org.apache.commons:commons-csv:jar:1.5:compile
[INFO] | +- org.apache.sis.core:sis-utility:jar:0.8:compile
[INFO] | | \- javax.measure:unit-api:jar:1.0:compile
[INFO] | +- org.apache.sis.storage:sis-netcdf:jar:0.8:compile
[INFO] | | +- org.apache.sis.storage:sis-storage:jar:0.8:compile
[INFO] | | | \- org.apache.sis.core:sis-feature:jar:0.8:compile
[INFO] | | \- org.apache.sis.core:sis-referencing:jar:0.8:compile
[INFO] | +- org.apache.sis.core:sis-metadata:jar:0.8:compile
[INFO] | +- org.opengis:geoapi:jar:3.0.1:compile
[INFO] | +- edu.usc.ir:sentiment-analysis-parser:jar:0.1:compile
[INFO] | +- org.apache.uima:uimafit-core:jar:2.2.0:compile
[INFO] | +- org.apache.uima:uimaj-core:jar:2.9.0:compile
[INFO] | +- org.jdom:jdom2:jar:2.0.6:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-core:jar:2.9.6:compile
[INFO] | +- com.fasterxml.jackson.core:jackson-annotations:jar:2.9.6:compile
[INFO] | +- org.apache.pdfbox:jbig2-imageio:jar:3.0.1:compile
[INFO] | \- com.github.jai-imageio:jai-imageio-core:jar:1.4.0:compile
[INFO] +- javax.ws.rs:javax.ws.rs-api:jar:2.1.1:compile
[INFO] \- org.scala-lang:scala-library:jar:2.11.12:compile
Created on 09-30-2018 01:39 AM - edited 09-30-2018 01:40 AM
If it occurred when you do the spark-submit, I think you could add the jar's absolute path with "--jars", e.g. --jars /a/b/c.jar, in your shell, and then try to submit it again.
Created 10-01-2018 06:21 AM
Hi,
Thanks for the reply. Unfortunately that is one of the things I've already tried (see the original post): I have passed --jars with both HDFS and local paths. From my testing I suspect this is a bigger issue, namely a classpath conflict in which my classes are overruled by Spark's own required classes and libraries. In the original post I mention trying the --conf flags and posted a portion of the resulting errors. In short: I submit an uber-jar that I have verified contains the up-to-date dependency in question (commons-compress 1.18), yet the NoSuchMethodError still occurs. When I then add the userClassPathFirst --conf flags, Spark crashes on startup (see the error at the top of the original post), presumably because I am then overriding the classpath for classes Spark itself requires.
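One sanity check that helps with this kind of conflict is to ask the classloader where it actually resolved a class from, since the jar that wins is not always the one you shaded. This is a sketch (the WhichJar helper is illustrative, not from the thread); inside the Spark job you would pass "org.apache.commons.compress.archivers.ArchiveStreamFactory" instead of scala.Option, which is used here only so the snippet compiles and runs standalone:

```scala
// Sketch: print the jar (or directory) a class was actually loaded from.
object WhichJar {
  def locate(className: String): Option[String] = {
    // ClassLoader.getResource finds the .class file the loader would use.
    val resourcePath = className.replace('.', '/') + ".class"
    Option(getClass.getClassLoader.getResource(resourcePath)).map(_.toString)
  }

  def main(args: Array[String]): Unit = {
    // In the Spark job, use the commons-compress class name here instead.
    println(WhichJar.locate("scala.Option").getOrElse("not found"))
  }
}
```

If the printed URL points at a Hadoop/Spark lib directory rather than the uber-jar, the cluster's older commons-compress is shadowing the shaded one.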
I should have titled this something like: how to resolve a commons-compress dependency conflict for Spark.
I think this snippet from another post is telling:
Created 10-07-2018 05:55 PM
If you use CDH, you could check the Spark version in the Web UI. On my side, I think it is a version-mismatch issue.
Created 02-12-2019 09:43 AM
I'm dealing with the same issue. Did you ever find a solution?
Created 02-12-2019 01:01 PM
I found a solution, but I don't understand why it works.
In our project we were previously using Tika 1.12; I encountered the NoSuchMethodError when we upgraded to Tika 1.19.1. When I compared the dependency trees for builds with these two versions of Tika to see how commons-compress was being included, the only structural difference I found was that the new version of Tika introduced a transitive dependency on org.apache.poi.ooxml:
[INFO] | +- org.apache.poi:poi-ooxml:jar:4.0.0:compile
[INFO] | | +- (org.apache.poi:poi:jar:4.0.0:compile - omitted for duplicate)
[INFO] | | +- org.apache.poi:poi-ooxml-schemas:jar:4.0.0:compile
[INFO] | | | \- org.apache.xmlbeans:xmlbeans:jar:3.0.1:compile
[INFO] | | +- (org.apache.commons:commons-compress:jar:1.18:compile - omitted for conflict with 1.4.1)
[INFO] | | \- com.github.virtuald:curvesapi:jar:1.04:compile
(Our pom.xml specifies the dependency on commons-compress 1.18; Hadoop 2.6.5 libraries have the dependency on commons-compress 1.4.1) I don't see why poi-ooxml's dependency on commons-compress would prevent the inclusion of commons-compress 1.18, but that is what seems to be happening. When I exclude poi-ooxml from tika-parsers, calls to the Tika parser in spark-shell work as expected.
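For reference, the exclusion described above would look roughly like this in the pom.xml (a sketch; note a later reply in this thread reports that excluding poi-ooxml can leave some extractions returning blank strings):

```xml
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.19</version>
  <exclusions>
    <exclusion>
      <groupId>org.apache.poi</groupId>
      <artifactId>poi-ooxml</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```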
Created 03-05-2019 10:51 PM
Hi jeremyw, I am using the same versions of Tika and getting the same dependency conflicts.
A few observations:
1. A dependency tree comparison shows many differences, not only org.apache.poi:poi-ooxml as you mentioned.
2. Excluding the org.apache.poi:poi-ooxml dependency from tika-parsers makes extraction return a blank string.
Can you please confirm whether it actually worked in your case, or whether you ran into any further issues?
Created 03-06-2019 11:56 AM
@DeepikaPant, I wrote up a more detailed analysis of the issue and a workaround here: https://github.com/archivesunleashed/aut/issues/308. The solution is to include a JAR containing an appropriate version of commons-compress with the --driver-class-path argument to spark-shell or spark-submit.
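Concretely, the workaround looks roughly like this (a sketch; the paths are illustrative and must point at where the jars actually live on your machine):

```shell
# Put the newer commons-compress ahead of the cluster's copy on the driver classpath.
# Paths below are illustrative -- adjust to your environment.
spark-shell --driver-class-path /path/to/commons-compress-1.18.jar \
  --jars /path/to/uber-tikaTest-1.19.jar
```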
Created on 03-07-2019 08:23 AM - edited 03-07-2019 08:26 AM
In cluster mode we need to set the executor class path as well as the driver class path.
Anyway, your analysis is quite helpful and we have used the same approach for now. Thanks, and keep sharing!
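A cluster-mode sketch of that (illustrative paths; the jar must exist at the same path on every node, or be distributed some other way):

```shell
# Sketch for yarn cluster mode: prepend the newer commons-compress on both
# the driver and executor classpaths. /opt/jars/... is an illustrative path
# that must be present on every node in the cluster.
spark-submit --master yarn --deploy-mode cluster \
  --class com.example.App \
  --conf spark.driver.extraClassPath=/opt/jars/commons-compress-1.18.jar \
  --conf spark.executor.extraClassPath=/opt/jars/commons-compress-1.18.jar \
  ./target/uber-tikaTest-1.19.jar
```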