Created on 12-16-2014 11:03 AM - edited 09-16-2022 02:15 AM
Hi all,
I am running a CDH 5.2 cluster with Spark on YARN. When I run jobs through spark-shell with a local driver I am able to read and process Snappy-compressed files; however, as soon as I try to run the same script (a word count, for testing purposes) on YARN, I get an UnsatisfiedLinkError (see below):
java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z
    org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy(Native Method)
    org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:63)
    org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:190)
    org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:176)
    org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:110)
    org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
    org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:198)
    org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:189)
    org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:98)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)
I have tried pointing the library path at libsnappy.so.1 through a number of variables, including LD_LIBRARY_PATH, JAVA_LIBRARY_PATH, and SPARK_LIBRARY_PATH in spark-env.sh and hadoop-env.sh, as well as spark.executor.extraLibraryPath and spark.executor.extraClassPath in spark-defaults.conf.
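For reference, the attempts looked roughly like the following (the path is the default CDH 5.2 package location for the Hadoop native libraries and may differ on your cluster):

# spark-env.sh / hadoop-env.sh
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/hadoop/lib/native
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/usr/lib/hadoop/lib/native

# spark-defaults.conf
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native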
I am at a loss as to what could be causing this problem since running locally works perfectly.
Any pointers/ideas would be really helpful.
Created 12-16-2014 03:25 PM
The solution I found was to add the following environment variables to spark-env.sh. The first two lines let spark-shell read Snappy files when run in local mode, and the third makes it possible for spark-shell to read Snappy files when run in YARN mode.
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/usr/lib/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/hadoop/lib/native
export SPARK_YARN_USER_ENV="JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH,LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
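A quick way to verify the fix is something like the following (the HDFS path is only a placeholder; substitute one of your own Snappy-compressed files):

echo 'sc.textFile("hdfs:///tmp/sample.snappy").count()' | spark-shell --master yarn-client

If the native Snappy library is picked up correctly on the executors, the count completes without the UnsatisfiedLinkError.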
Created 12-23-2014 02:54 AM
You can include the following in spark-defaults.conf:
spark.driver.extraLibraryPath <path to the Hadoop native libraries, e.g. $HADOOP_HOME/lib/native/>
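For example, assuming the usual CDH location of the native libraries (spark-defaults.conf is a plain properties file, so it is safer to spell the path out rather than rely on $HADOOP_HOME being expanded):

spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native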
Created 12-23-2014 06:21 AM
I tried that. It didn't work.
Created 05-05-2015 01:05 PM
I have seen Hadoop load native libraries from the running user's home directory onto its classpath. Maybe the same thing is happening to you with Spark. Check your home directory for:
ls ~/lib*
libhadoop.a  libhadoop.so  libhadooputils.a  libsnappy.so  libsnappy.so.1.1.3
libhadooppipes.a  libhadoop.so.1.0.0  libhdfs.a  libsnappy.so.1
and delete them if found. I could be totally off, but this was the culprit in our case.
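If you would rather not delete them outright, a sketch of a safer check (the backup directory name is arbitrary):

ls ~/libhadoop* ~/libsnappy* ~/libhdfs* 2>/dev/null
mkdir -p ~/native-lib-backup && mv ~/libhadoop* ~/libsnappy* ~/libhdfs* ~/native-lib-backup/

Then re-run the YARN job; if the error goes away, those stray copies were shadowing the cluster's native libraries.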