Created on 08-13-2015 02:14 AM - edited 09-16-2022 02:37 AM
Hello,
I'm trying to understand how spark-submit is configured on Cloudera clusters. I'm using a parcel-based installation and the Spark on YARN service.
On one machine (with the Spark Gateway role), where spark-submit works, I see the following configuration (/etc/spark/conf/spark-env.sh):
#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##

SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
  export SPARK_CONF_DIR="$SELF"
fi

export SPARK_HOME=/opt/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/spark
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/hadoop

### Path of Spark assembly jar in HDFS
export SPARK_JAR_HDFS_PATH=${SPARK_JAR_HDFS_PATH:-''}

export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}

if [ -n "$HADOOP_HOME" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi

SPARK_EXTRA_LIB_PATH=""
if [ -n "$SPARK_EXTRA_LIB_PATH" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SPARK_EXTRA_LIB_PATH
fi

export LD_LIBRARY_PATH

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}

# This is needed to support old CDH versions that use a forked version
# of compute-classpath.sh.
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib

# Set distribution classpath. This is only used in CDH 5.3 and later.
export SPARK_DIST_CLASSPATH=$(paste -sd: "$SELF/classpath.txt")
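With that configuration in place, a submission from the working node looks roughly like the sketch below. This is only an illustrative smoke test: the SparkPi examples jar is the one the CDH parcel normally ships, but verify the path on your own installation.

# Hypothetical smoke test from the working gateway node.
spark-submit \
  --master yarn-client \
  --class org.apache.spark.examples.SparkPi \
  /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples.jar 10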
I also noticed that some nodes have a spark-submit that just throws a "java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream" exception. Here is the content of spark-env.sh on those nodes:
###
### === IMPORTANT ===
### Change the following to specify a real cluster's Master host
###
export STANDALONE_SPARK_MASTER_HOST=`hostname`
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

### Let's run everything with JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR='/var/run/spark/'

if [ -n "$HADOOP_HOME" ]; then
  export LD_LIBRARY_PATH=:/usr/lib/hadoop/lib/native
fi

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}

if [[ -d $SPARK_HOME/python ]]
then
    for i in
    do
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:$i
    done
fi

SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$SPARK_LIBRARY_PATH/spark-assembly.jar"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hive/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/flume-ng/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/parquet/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/avro/lib/*"
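A quick way to see which configuration a node actually picks up (and whether the Hadoop jars ever reach the classpath, which is what the NoClassDefFoundError points at) is something like the sketch below. It assumes the usual CDH layout where /etc/spark/conf is managed through the alternatives system under the name spark-conf; check the name on your own nodes.

# Check where /etc/spark/conf points; a CM-managed gateway should resolve to a
# Cloudera-Manager-generated directory, not the packaged default.
ls -l /etc/spark/conf
alternatives --display spark-conf    # update-alternatives on Debian/Ubuntu

# The gateway-generated spark-env.sh builds SPARK_DIST_CLASSPATH from
# classpath.txt; on the broken nodes that file is typically absent.
ls /etc/spark/conf/classpath.txt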
However, I'm not able to configure another node to have a functional spark-submit. Adding a "Spark Gateway" role to another node does not work: the command is not added to the machine, nor is the configuration fixed on the broken nodes.
I also tried to remove the Spark service and add it back; after that, all nodes have the broken configuration.
Thanks for any help!
Created 08-18-2015 09:56 PM
Any node that you want to use to submit Spark jobs to the cluster should be made a Spark gateway.
That pushes all the required configuration out to the node. There is no change to the submit scripts or installed code when you create a gateway.
When you make a node a Spark gateway, all jars and config should be pushed out correctly. If it is not a Spark gateway, the default config sits on the node, which, as you noticed, does not work without some changes.
In CM before 5.4 you also need to make sure that the correct YARN configuration is available on the node; CM 5.4 does that for you.
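If you want to confirm what was deployed, a check along these lines should show the CM-generated client config, including the YARN config Spark needs (paths assume the default CDH layout):

# Hypothetical sanity check after deploying the client configuration from CM.
# A Spark gateway should carry a yarn-conf directory next to spark-env.sh.
ls /etc/spark/conf/
ls /etc/spark/conf/yarn-conf/        # core-site.xml, yarn-site.xml, ...
grep HADOOP_CONF_DIR /etc/spark/conf/spark-env.sh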
Wilfred
Created 08-19-2015 12:36 AM
We are using CDH 5.4.3, and what you described seems to work only when creating the cluster (though I have to verify this). Adding the Spark Gateway role to another node does not seem to work; only the original Spark Gateway works. That is, unless the role is taken away from it; after that, I found no way to restore the configuration and get at least one functional node with the Spark/YARN configuration.
Anyway, this seems like a bug in CM, so I guess I should file an issue.
Created 08-19-2015 12:41 AM
... or at least I would file an issue, but I could not find any issue tracker or submission form on the Cloudera sites.
Created 08-19-2015 12:45 AM
If you believe it's a problem and you have a support contract, you would need to contact Cloudera Support.
I have successfully added Spark Gateway nodes after a cluster was live without issues, though, so I suspect something else is at work here.
Created 09-16-2015 06:34 AM
Given that the classpath obtained from classpath.txt exists only on the gateway machine, and the worker nodes use the default configuration (so, no classpath.txt),
how will the classpath be synchronized across the whole application?
Should I set up all worker nodes as gateways then? o_O
André
Created 09-21-2015 11:36 PM
Adding a Spark gateway role to a new machine managed by CM is something I do on a regular basis, across different versions of CM, and I have never had a problem submitting an application from such a node.
The classpath for the application is part of the submitted application context and is not based on the executor path. How else would you add application-specific classes to the classpath?
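In practice that means application-specific jars travel with the submission rather than living in a node-local classpath. A sketch, with placeholder class and jar names:

# Hypothetical submit: extra dependencies are shipped with the job via --jars
# and distributed to the executors by YARN, independent of classpath.txt.
spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --jars /local/path/dep1.jar,/local/path/dep2.jar \
  /local/path/myapp.jar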
Wilfred
Created 09-22-2015 03:45 AM
Hi,
I'm creating an uber jar containing all application-specific libs and dependencies. However, my problem comes from a conflict of libs in classpath.txt (it contains conflicting versions of Jackson: 2.2.3 and 2.3.1). My application uses Jackson 2.3.1 as a dependency, but somehow the wrong version of Jackson on the classpath is used.
My idea was to modify classpath.txt and deploy it on all worker nodes, but this file is used only by gateway nodes. I'm a bit puzzled by this lib conflict since, in general, the classpath shouldn't contain several versions of the same lib. Shouldn't this be handled better in classpath.txt?
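For what it's worth, on a gateway node you can at least confirm the duplicate entries, and Spark 1.3 (shipped with CDH 5.4) has experimental settings to prefer the application's classes over the distribution classpath. The setting names below are the ones I believe apply, so double-check them for your version; the class and jar names are placeholders.

# Hypothetical check for the conflicting Jackson versions on a gateway node.
grep -i jackson /etc/spark/conf/classpath.txt

# Experimental Spark 1.3+ settings that make the driver and executors prefer
# the classes bundled in the uber jar over those on the node's classpath.
spark-submit \
  --master yarn-cluster \
  --class com.example.MyApp \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  /local/path/myapp-uber.jar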
Created 09-22-2015 04:05 AM
To be completely in control, I often recommend using a shading tool for libraries like this.
Using Maven Shade or Gradle Shadow to make sure that your code references your version is a sure-fire way to get this working.
When you build your project you "shade" (relocate) the references in your code, which means it always uses your version.
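A quick way to confirm the relocation took effect is to inspect the resulting jar; the jar name and the relocated package are placeholders here.

# Hypothetical check after building with Maven Shade or Gradle Shadow with a
# relocation rule for com.fasterxml.jackson: the relocated package should be
# present and the original package names should be gone.
jar tf target/myapp-1.0-shaded.jar | grep -i jackson | head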
Wilfred