
Protobuf JAR dependency and CoreNLP


Hi

 

We are trying to run a Spark 2.2.0 application that uses Stanford CoreNLP 3.9.1. This CoreNLP version requires protobuf-java 3.5.1, but protobuf-java 2.5.0 is on the CDH classpath, so the application fails (similar to https://github.com/stanfordnlp/CoreNLP/issues/556). As a workaround, I set `spark.executor.userClassPathFirst=true` and supply the required protobuf-java version as a user JAR via the --jars option of spark-submit (the full submit command is sketched at the end of this post). This gets past the protobuf issue, but the Parquet file write then fails with:

Caused by: java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.uncompressedLength(Ljava/nio/ByteBuffer;II)I
	at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
	at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:561)
	at parquet.hadoop.codec.SnappyDecompressor.decompress(SnappyDecompressor.java:62)
	at parquet.hadoop.codec.NonBlockedDecompressorStream.read(NonBlockedDecompressorStream.java:51)
	at java.io.DataInputStream.readFully(DataInputStream.java:195)
	at java.io.DataInputStream.readFully(DataInputStream.java:169)
	at parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:204)
	at parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary.<init>(PlainValuesDictionary.java:89)
	at parquet.column.Encoding$1.initDictionary(Encoding.java:94)
	at parquet.column.Encoding$4.initDictionary(Encoding.java:149)


(This looks similar to http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Override-libraries-for-spark/td-p/3...) Is there a workaround or solution for this problem? In summary:

 

1. Is there any other way to override the protobuf version for a Spark 2 application submitted via spark-submit on YARN?

2. Is there a workaround or solution for the Snappy native link error? For what it's worth, the Parquet write works fine on the same cluster when `spark.executor.userClassPathFirst=true` is not set (in a separate Spark application).

We are running CDH 5.13.1.
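
For reference, the spark-submit invocation looks roughly like this (the class name and application JAR are placeholders, not our actual artifacts):

spark-submit \
  --master yarn \
  --conf spark.executor.userClassPathFirst=true \
  --jars protobuf-java-3.5.1.jar \
  --class com.example.CoreNlpJob \
  my-nlp-app.jar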

 

Thanks