Support Questions

Spark Yarn integration on Hortonworks Sandbox 2.4


New Contributor

Hi, I'm running Hortonworks Sandbox 2.4, which comes with Spark 1.6.0. I can run a sample Spark program and submit it successfully with spark-submit. Since I need to use this Spark jar file from an external application, I'm trying to launch it with "java -jar" and get the same result as with spark-submit.

I'm building with Maven and used the Maven Shade plugin to create a fat jar, because I was previously hitting ClassNotFoundException for the "spark-core_2.10" and "spark-yarn_2.10" dependencies. Those issues are now resolved.
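For reference, a hedged sketch of the shade-plugin setup described above (the plugin version shown is illustrative, not necessarily what I have):

```xml
<!-- pom.xml fragment: maven-shade-plugin bundles all dependencies into
     one fat jar at the package phase. Version number is a placeholder. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.4.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```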

However, the jar is picking up the yarn-default.xml bundled with the dependencies inside the fat jar instead of the yarn-site.xml present on the Hortonworks sandbox. As a result, it fails to copy the Spark configuration archive into HDFS:

java.io.FileNotFoundException: File file:/tmp/spark-1b318406-c7a1-4a94-9605-d6a46f0170d4/__spark_conf__5199620712586647591.zip does not exist

How can I make the jar pick up the Hortonworks sandbox settings instead of these bundled defaults? If I don't build a fat jar, it throws the following exception instead:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/api/java/function/Function

3 Replies

Re: Spark Yarn integration on Hortonworks Sandbox 2.4

Hello @Pradeep K

Have you tried setting this environment variable in the environment where you run the Spark job? The "yarn-site.xml" file typically lives in the Hadoop client conf directory.

We recommend that you set HADOOP_CONF_DIR to the appropriate directory; for example:

export HADOOP_CONF_DIR=/etc/hadoop/conf
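Note that the spark-submit scripts pick up HADOOP_CONF_DIR, but a plain "java -jar" launch does not add it to the classpath automatically. A minimal sketch (app.jar and com.example.SparkApp are placeholder names; /etc/hadoop/conf is the usual HDP location, adjust for your install):

```shell
# Assumed HDP client conf location; adjust for your environment.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Instead of `java -jar`, name the main class explicitly and put the conf
# directory FIRST on the classpath, so the sandbox's yarn-site.xml is found
# before the yarn-default.xml bundled inside the fat jar.
# (app.jar and com.example.SparkApp are placeholders; `-cp` is ignored when
# `-jar` is used, which is why the main class is named here.)
if [ -f app.jar ]; then
    java -cp "$HADOOP_CONF_DIR:app.jar" com.example.SparkApp
fi
```

Because the JVM resolves classpath resources in order, the first yarn-site.xml it finds wins; that is the whole trick here.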

In addition, make sure "spark-defaults.conf" is configured via Ambari under the Spark service's "Config" tab (or directly in $SPARK_HOME/conf if you are not running Ambari). More instructions here:

spark-defaults.conf

Edit the spark-defaults.conf file in the Spark client /conf directory. Make sure the following values are specified, including hostname and port. (Note: if you installed the tech preview, these will already be in the file.) For example:

spark.yarn.historyServer.address c6401.ambari.apache.org:18080
spark.history.ui.port 18080
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.driver.extraJavaOptions -Dhdp.version=2.3.0.0-2800
spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
spark.yarn.am.extraJavaOptions -Dhdp.version=2.3.0.0-2800

Re: Spark Yarn integration on Hortonworks Sandbox 2.4

New Contributor

Hello @Paul Hargis

I added the properties above and tried again; the same error persists. The problem is that the jar refers to the yarn-default.xml that Maven embedded in the fat jar at build time, when it should be reading yarn-site.xml and the other files on the sandbox for things to work. The goal is to deploy and launch this jar through Spring Cloud Data Flow, so I'm also exploring whether there is an option there to override these properties.


Re: Spark Yarn integration on Hortonworks Sandbox 2.4

Okay, then have you tried copying the target "yarn-site.xml" into the "src/main/resources" directory and rebuilding the jar? That is the directory where Maven looks for config files to package into the fat jar. Granted, it is not an ideal solution, because each jar ends up "targeted" at a particular system (or cluster), but sometimes that is what's required.
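The workaround above can be sketched as follows. This runs in a scratch directory with a stand-in config file; on the sandbox the source would be /etc/hadoop/conf/yarn-site.xml and the destination your real project root:

```shell
# Sketch: set up a stand-in project layout (real paths would be
# /etc/hadoop/conf and your actual Maven project directory).
proj=$(mktemp -d) && cd "$proj"
mkdir -p conf src/main/resources
echo '<configuration/>' > conf/yarn-site.xml   # stand-in for the sandbox's yarn-site.xml

# Ship the site config inside the jar: copy it into Maven's resources
# directory, then rebuild the shaded jar so it carries a yarn-site.xml
# that overrides the bundled yarn-default.xml at runtime.
cp conf/yarn-site.xml src/main/resources/
# mvn clean package    # rebuild step, commented out for the sketch
```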

Please note: Hadoop YARN reads both yarn-default.xml and yarn-site.xml; yarn-default.xml supplies the defaults, and yarn-site.xml holds the site-specific values that override them.
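For illustration, a hedged sketch of how a single yarn-site.xml entry overrides its yarn-default.xml counterpart (the hostname is a placeholder; any property not listed falls back to yarn-default.xml):

```xml
<!-- yarn-site.xml fragment: site-specific overrides only.
     Hostname below is a placeholder for your ResourceManager host. -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>sandbox.hortonworks.com:8050</value>
  </property>
</configuration>
```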