Support Questions

martins · ‎08-13-2015

Hello,

I'm trying to understand how spark-submit is configured on Cloudera clusters. I'm using a parcel-based installation and the Spark on YARN service.

On one machine (with the Spark Gateway role), where spark-submit works, I see the following configuration (/etc/spark/conf/spark-env.sh):

#!/usr/bin/env bash
##
# Generated by Cloudera Manager and should not be modified directly
##

SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
  export SPARK_CONF_DIR="$SELF"
fi

export SPARK_HOME=/opt/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/spark
export DEFAULT_HADOOP_HOME=/opt/cloudera/parcels/CDH-5.4.3-1.cdh5.4.3.p0.6/lib/hadoop

### Path of Spark assembly jar in HDFS
export SPARK_JAR_HDFS_PATH=${SPARK_JAR_HDFS_PATH:-''}

export HADOOP_HOME=${HADOOP_HOME:-$DEFAULT_HADOOP_HOME}

if [ -n "$HADOOP_HOME" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi

SPARK_EXTRA_LIB_PATH=""
if [ -n "$SPARK_EXTRA_LIB_PATH" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SPARK_EXTRA_LIB_PATH
fi

export LD_LIBRARY_PATH
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}

# This is needed to support old CDH versions that use a forked version
# of compute-classpath.sh.
export SCALA_LIBRARY_PATH=${SPARK_HOME}/lib

# Set distribution classpath. This is only used in CDH 5.3 and later.
export SPARK_DIST_CLASSPATH=$(paste -sd: "$SELF/classpath.txt")

I also noticed that on some nodes spark-submit just throws a "java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream" exception. Here is the content of spark-env.sh on those nodes:

###
### === IMPORTANT ===
### Change the following to specify a real cluster's Master host
###
export STANDALONE_SPARK_MASTER_HOST=`hostname`

export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

### Let's run everything with JVM runtime, instead of Scala
export SPARK_LAUNCH_WITH_SCALA=0
export SPARK_LIBRARY_PATH=${SPARK_HOME}/lib
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=18081
export SPARK_WORKER_DIR=/var/run/spark/work
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR='/var/run/spark/'

if [ -n "$HADOOP_HOME" ]; then
  export LD_LIBRARY_PATH=:/usr/lib/hadoop/lib/native
fi

export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}

if [[ -d $SPARK_HOME/python ]]
then
    for i in 
    do
        SPARK_DIST_CLASSPATH=${SPARK_DIST_CLASSPATH}:$i
    done
fi

SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$SPARK_LIBRARY_PATH/spark-assembly.jar"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-hdfs/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-mapreduce/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hadoop-yarn/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/hive/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/flume-ng/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/paquet/lib/*"
SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/usr/lib/avro/lib/*"

However, I'm not able to configure another node to have a functional spark-submit. Adding a "Spark Gateway" role to another node does not work. The command is not added to the machine, nor is the configuration fixed on the broken ones.

I also tried removing the Spark service and adding it back. After that, all nodes have the broken configuration.

Thanks for any help!

Wilfred · ‎08-18-2015

Any node that you want to use to submit Spark jobs to the cluster should be made a Spark gateway.

That should push all the required configuration out to the node. There is no change to the submit scripts or installed code when you create a gateway.

When you make it a Spark gateway, all jars and config should be pushed out correctly. If it is not a Spark gateway, the node has the default config, which, as you noticed, does not work without some changes.

In CM before 5.4 you also need to make sure that the correct YARN configuration is available on the node. CM 5.4 does that for you.
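A quick way to tell which of the two configurations a node has is to look for the pieces that the working spark-env.sh quoted earlier depends on. The following is only a sketch, not a Cloudera tool: the function name check_spark_gateway_conf is made up for illustration, and the three markers it checks (the "Generated by Cloudera Manager" header, classpath.txt, and the yarn-conf directory) are taken from the file posted in this thread and may differ on other CM versions.

```shell
#!/usr/bin/env bash
# Sketch: check whether a Spark conf directory looks like the one Cloudera
# Manager deploys to a gateway. The markers come from the working
# spark-env.sh quoted in this thread; the function name is hypothetical.
check_spark_gateway_conf() {
  local conf_dir="${1:-/etc/spark/conf}"
  local status=0

  # The CM-generated spark-env.sh carries a "Generated by Cloudera Manager" header.
  if grep -q "Generated by Cloudera Manager" "$conf_dir/spark-env.sh" 2>/dev/null; then
    echo "OK       spark-env.sh is generated by Cloudera Manager"
  else
    echo "MISSING  spark-env.sh is not the Cloudera Manager version"
    status=1
  fi

  # SPARK_DIST_CLASSPATH is built from classpath.txt, so it must be non-empty.
  if [ -s "$conf_dir/classpath.txt" ]; then
    echo "OK       classpath.txt is present"
  else
    echo "MISSING  classpath.txt"
    status=1
  fi

  # HADOOP_CONF_DIR defaults to <conf dir>/yarn-conf in the generated file.
  if [ -d "$conf_dir/yarn-conf" ]; then
    echo "OK       yarn-conf directory is present"
  else
    echo "MISSING  yarn-conf directory"
    status=1
  fi

  return "$status"
}
```

Calling check_spark_gateway_conf with no argument inspects /etc/spark/conf; a non-zero return status means at least one marker is missing and the node probably still has the default configuration.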

Wilfred

martins · ‎08-19-2015

We are using CDH 5.4.3, and what you described seems to work only when creating the cluster (though I have to verify this). Adding the Spark Gateway role to another node does not seem to work; only the original Spark Gateway works. That is, unless the role is taken away from it. After that, I found no way to restore the configuration and get at least one functional node with the Spark/YARN configuration.

Anyway, this seems like a bug in CM, so I guess I should file an issue.

martins · ‎08-19-2015

... or at least I would file an issue, but I could not find any issue tracker or submission form on the Cloudera sites.

srowen · ‎08-19-2015

If you believe it's a problem and you have a support contract, you would need to contact Cloudera Support.

I have successfully added Spark Gateway nodes to a live cluster without issues, though, so I suspect something else is at work here.

andreF · ‎09-16-2015

If the classpath obtained from classpath.txt is available only on the gateway machine, and the worker nodes use the default configuration (so, no classpath.txt), how will the classpath be synchronized across the whole application?

Should I set all worker nodes as gateways then? o_O

André

Wilfred · ‎09-21-2015

I add the Spark gateway role to new CM-managed machines on a regular basis, across different versions of CM, and have never had a problem submitting an application from such a node.

The classpath for the application is part of the submitted application context and not based on the executor path. How would you otherwise add classes to the classpath that are application specific?
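To illustrate the point about application-specific classes, here is a small sketch that builds (but does not run) a spark-submit command line which ships extra jars with the application through the --jars option. The helper name build_submit_command, the class com.example.MyApp, and the jar names are all hypothetical placeholders; only the spark-submit options themselves are real.

```shell
#!/usr/bin/env bash
# Sketch: build a spark-submit command line that ships application-specific
# jars with the job. The function name, class name and jar names are
# hypothetical; nothing is executed, the command is only printed.
build_submit_command() {
  local app_jar extra_jars cmd
  app_jar="$1"
  shift

  # --jars expects a comma-separated list, so join the remaining arguments.
  extra_jars=$(IFS=,; printf '%s' "$*")

  cmd="spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp"
  if [ -n "$extra_jars" ]; then
    cmd="$cmd --jars $extra_jars"
  fi

  printf '%s %s\n' "$cmd" "$app_jar"
}
```

For example, build_submit_command app.jar a.jar b.jar prints a command that passes a.jar,b.jar to --jars. The jars listed there travel with the submitted application, so they do not have to be present in the configuration of the worker nodes.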

Wilfred

andreF · ‎09-22-2015

Hi,

I'm creating an uber jar containing all application-specific libs and dependencies. However, my problem comes from a conflict of libs in classpath.txt (it contains conflicting versions of Jackson: 2.2.3 and 2.3.1). My application depends on Jackson 2.3.1, but somehow the wrong version of Jackson on the classpath is used.

My idea was to modify classpath.txt and deploy it on all worker nodes, though this file is used only by gateway nodes. I'm a bit puzzled by this lib conflict since, in general, the classpath shouldn't contain several versions of the same lib. Shouldn't this be better handled in classpath.txt?
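A conflict like this can be spotted by listing every library that appears in classpath.txt with more than one version. The sketch below assumes that the file holds one jar path per line (consistent with the paste -sd: call in the generated spark-env.sh) and that jars are named name-version.jar; the function name find_duplicate_libs is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list libraries that occur with more than one version in a
# classpath file. Assumes one jar path per line and "name-version.jar"
# file names; jars that do not match that pattern are ignored.
find_duplicate_libs() {
  local classpath_file="$1"

  # 1. Strip the directory and the .jar suffix.
  # 2. Split "name-version" at the last hyphen that is followed by a digit.
  # 3. Drop identical name/version pairs found in different directories.
  # 4. Print every name that is left with more than one version.
  sed -e 's|.*/||' -e 's|\.jar$||' "$classpath_file" \
    | sed -n -E 's/^(.*)-([0-9].*)$/\1 \2/p' \
    | sort -u \
    | awk '{ versions[$1] = versions[$1] " " $2; count[$1]++ }
           END { for (lib in count) if (count[lib] > 1) print lib ":" versions[lib] }' \
    | sort
}
```

Running find_duplicate_libs /etc/spark/conf/classpath.txt on a gateway node prints one line per library that is present in several versions, which shows where a clash such as the Jackson one comes from.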

srowen · ‎09-22-2015

You may be using a different version of Jackson, yes. The point is to
put your version in your app's classloader, which is not the same as
Spark's classloader. This can still be problematic, but in theory, the
isolation means the versions used are isolated and don't interfere.

Wilfred · ‎09-22-2015

To be completely in control, I often recommend using a shading tool for libraries like this.

Using Maven Shade or Gradle Shadow to make sure that your code references your version is a sure-fire way to get this working.

When you build your project, you "shade" the references in your code, which means it always uses your version.

Wilfred
