Member since: 08-13-2015
Posts: 9
Kudos Received: 0
Solutions: 0
09-22-2015 05:54 AM
I did not modify classpath.txt, since I don't want to touch files generated by Cloudera. Without the userClassPathFirst switch, it works correctly. Unfortunately, I need the switch to replace a few other libraries that are already in classpath.txt.
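For context, this is roughly the submit command I need to end up with: the newer libraries shipped via --jars and forced ahead of the cluster classpath. The jar name below is only a placeholder:

spark-submit \
  --conf "spark.driver.userClassPathFirst=true" \
  --conf "spark.executor.userClassPathFirst=true" \
  --jars hdfs:///libs/my-newer-library.jar \
  --class Main --master yarn-cluster test2.jar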
09-22-2015 05:38 AM
I've just tried a simple application with nothing but a Main class in the jar file. This is the code:

import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD

object Main {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf())
    val conf = HBaseConfiguration.create()
    conf.set(HConstants.ZOOKEEPER_QUORUM, "my.server")
    // conf.set("hbase.zookeeper.property.clientPort", "2181")
    val explicitNamespace: String = "BankData"
    val qualifiedTableName: String = explicitNamespace + ':' + "bank"
    conf.set(TableInputFormat.INPUT_TABLE, qualifiedTableName)
    conf.set(TableOutputFormat.OUTPUT_TABLE, qualifiedTableName)
    conf.set(TableInputFormat.INPUT_TABLE, "tmp") // overrides the qualified name above for this test
    val hBaseRDD = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
  }
}

I run this with:

spark-submit --conf "spark.driver.userClassPathFirst=true" --conf "spark.executor.userClassPathFirst=true" --class Main --master yarn-cluster test2.jar

and I get the following exception:

15/09/22 14:34:28 ERROR yarn.ApplicationMaster: User class threw exception: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:157)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:199)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:199)
at scala.Option.map(Option.scala:145)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:199)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:101)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
at org.apache.spark.rdd.NewHadoopRDD.<init>(NewHadoopRDD.scala:77)
at org.apache.spark.SparkContext.newAPIHadoopRDD(SparkContext.scala:878)
at Main$.main(Main.scala:29)
at Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)

I'm running CDH 5.4.3.
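One thing I still want to try is taking Snappy out of the picture for Spark's internal compression, since the failure is in the broadcast codec. spark.io.compression.codec is a standard Spark setting; lz4 here is just my guess at an acceptable alternative:

# Same submit as above, but with Spark's internal compression switched
# from the default snappy codec to lz4, so the clashing snappy-java
# native binding should never be loaded.
spark-submit \
  --conf "spark.driver.userClassPathFirst=true" \
  --conf "spark.executor.userClassPathFirst=true" \
  --conf "spark.io.compression.codec=lz4" \
  --class Main --master yarn-cluster test2.jar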
09-22-2015 04:23 AM
We have already tried setting the two "userClassPathFirst" switches, but unfortunately we ended up with the same strange exception:

ERROR yarn.ApplicationMaster: User class threw exception: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:316)
at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:157)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:199)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$4.apply(TorrentBroadcast.scala:199)
at scala.Option.map(Option.scala:145)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:199)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:101)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
at org.apache.spark.rdd.NewHadoopRDD.<init>(NewHadoopRDD.scala:77)
at org.apache.spark.SparkContext.newAPIHadoopRDD(SparkContext.scala:878)
at org.apache.spark.api.java.JavaSparkContext.newAPIHadoopRDD(JavaSparkContext.scala:516)
...

What I don't understand is why SPARK_DIST_CLASSPATH is being set at all. A vanilla Spark installation has everything in the single assembly jar, and any additional dependency must be specified explicitly, correct? Would it be possible to completely replace classpath.txt with user-provided dependencies? Thanks!
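One idea I'd like to sanity-check: since the generated spark-env.sh only sets SPARK_CONF_DIR when it is empty, I could in principle point spark-submit at a private copy of the conf dir with a trimmed classpath.txt. A rough sketch; the grep pattern is only a placeholder:

# Private conf dir with a filtered classpath.txt; the copied spark-env.sh
# reads classpath.txt from its own directory ($SELF), so the trimmed copy
# should be picked up instead of the generated one.
mkdir -p ~/my-spark-conf
cp -r /etc/spark/conf/. ~/my-spark-conf/
grep -v 'library-to-replace' /etc/spark/conf/classpath.txt > ~/my-spark-conf/classpath.txt
export SPARK_CONF_DIR=~/my-spark-conf
spark-submit --class Main --master yarn-cluster test2.jar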
09-22-2015 12:36 AM
Hello, I would like to use newer versions of some of the libraries listed in /etc/spark/conf/classpath.txt. What is the recommended way to do that? I add other libraries using spark-submit's --jars (I have the jars on HDFS), but this does not work for newer versions of libraries that are already in classpath.txt. Alternatively, is there a way to disable construction of classpath.txt and rely solely on the libraries provided to spark-submit (except possibly Spark and Hadoop)? I'm running Spark on YARN (cluster mode). Thank you!
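For reference, this is the shape of the submit command that works fine for additional, non-conflicting libraries; the HDFS paths are placeholders:

spark-submit \
  --jars hdfs:///user/me/libs/extra-lib-1.jar,hdfs:///user/me/libs/extra-lib-2.jar \
  --class Main --master yarn-cluster myapp.jar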
Labels:
- Apache Hadoop
- Apache Spark
- Apache YARN
- HDFS
08-19-2015 12:41 AM
... or at least I would file an issue, but I could not find any issue tracker or submission form on the Cloudera sites.
08-19-2015 12:36 AM
We are using CDH 5.4.3, and what you described seems to work only when creating the cluster (though I have to verify this). Adding the Spark Gateway role to another node afterwards does not seem to work. Only the original Spark Gateway works, and only until the role is taken away from it; after that, I found no way to restore the configuration and get back at least one functional node with the Spark/YARN configuration. Anyway, this looks like a bug in CM, so I guess I should file an issue.
08-13-2015 02:14 AM
Hello, I'm trying to understand how spark-submit is configured on Cloudera clusters. I'm using a parcel-based installation and the Spark on YARN service. On one machine (with the Spark Gateway role), where spark-submit works, I see the following configuration in /etc/spark/conf/spark-env.sh:

#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##

SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
  export SPARK_CONF_DIR="$SELF"
fi

export SPARK_HOME=/opt/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/spark
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/hadoop

### Path of Spark assembly jar in HDFS
export SPARK_JAR_HDFS_PATH=${SPARK_JAR_HDFS_PATH:-''}

export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}
if [ -n "$HADOOP_HOME" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi

SPARK_EXTRA_LIB_PATH=""
if [ -n "$SPARK_EXTRA_LIB_PATH" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SPARK_EXTRA_LIB_PATH
fi
export LD_LIBRARY_PATH

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}

# This is needed to support old CDH versions that use a forked version
# of compute-classpath.sh.
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib

# Set distribution classpath. This is only used in CDH 5.3 and later.
export SPARK_DIST_CLASSPATH=$(paste -sd: "$SELF/classpath.txt")
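Incidentally, to see the classpath that spark-submit actually ends up with on this working node, I dump the final launch command; SPARK_PRINT_LAUNCH_COMMAND is a stock Spark launcher switch, if I understand it correctly:

# Prints the full java command, including the classpath assembled from
# SPARK_DIST_CLASSPATH, before Spark starts.
SPARK_PRINT_LAUNCH_COMMAND=1 spark-submit --version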
I also noticed that some nodes have a spark-submit that just throws a "java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream" exception. Here's the content of spark-env.sh on those nodes (reproduced as-is, including the empty for loop):

###
### === IMPORTANT ===
### Change the following to specify a real cluster's Master host
###
export STANDALONE_SPARK_MASTER_HOST=`hostname`
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

### Let's run everything with JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR='/var/run/spark/'

if [ -n "$HADOOP_HOME" ]; then
  export LD_LIBRARY_PATH=:/usr/lib/hadoop/lib/native
fi

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}

if [[ -d $SPARK_HOME/python ]]
then
  for i in
  do
    SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:$i
  done
fi

SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$SPARK_LIBRARY_PATH/spark-assembly.jar"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hive/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/flume-ng/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/paquet/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/avro/lib/*"
However, I'm not able to configure another node to have a functional spark-submit. Adding a "Spark Gateway" role to another node does not work: the command is not added to the machine, nor is the configuration fixed on the broken ones. I also tried removing the Spark service and adding it back; after that, all nodes have the broken configuration. Thanks for any help!
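For reference, here is the quick check I use to tell the two variants apart on a node; the working file carries the "Generated by Cloudera Manager" banner shown above:

# The CM-generated spark-env.sh starts with a Cloudera Manager banner;
# the packaged default (the broken variant on my nodes) does not.
if grep -q 'Generated by Cloudera Manager' /etc/spark/conf/spark-env.sh; then
  echo "CM-generated config: spark-submit should work"
else
  echo "packaged default config: spark-submit likely broken for YARN"
fi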
Labels:
- Apache Spark