
Connecting to remote spark cluster fails

Rising Star

Hello team,

 

We have a CDH 6.2 cluster in the cloud and an on-prem cluster on CDH 5.16. Kindly check the steps below and suggest a fix.

 

We are able to list HDFS content from the cloud gateway node, but running pyspark from the cloud VM fails.
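For reference, the working and failing invocations look roughly like this (a sketch; the exact options are assumptions, not taken from the original commands):

# Listing HDFS content from the cloud gateway node works:
hdfs dfs -ls /

# Launching pyspark against the on-prem YARN cluster fails:
pyspark --master yarn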

 

I copied the Spark, HDFS, and YARN configs from the on-prem cluster to the cloud gateway node under a different path and exported that path as shown below.

 

Step 1:

export SPARK_CONF_DIR=/app/localstorage/evl_prod/etc/spark2/conf.cloudera.spark2_on_yarn
export SPARK_DIST_CLASSPATH=$(hadoop --config /app/localstorage/evl_prod/etc/hadoop/ classpath)
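
As a quick sanity check (a minimal sketch; the paths are the ones from the exports above), it is worth confirming the copied configs are actually being picked up:

# Verify the shell sees the overridden Spark config location
echo $SPARK_CONF_DIR

# Confirm the Hadoop client reaches the on-prem cluster with the copied config
hdfs --config /app/localstorage/evl_prod/etc/hadoop/ dfs -ls /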

 

Step 2:

Updated spark-defaults.conf so that spark.yarn.jars points to the Spark jar locations:

 

spark.yarn.jars=local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/jars/*,local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/hive/*,local:/app/bds/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/spark/lib/*

 

Step 3: Ran pyspark from the cloud gateway node. It throws the error below in the container stderr, visible in the job logs on the on-prem ResourceManager UI.

 

Log Type: stderr

Log Upload Time: Fri Aug 30 06:46:04 -0400 2019

Log Length: 1082

Picked up JAVA_TOOL_OPTIONS: -Doracle.jdbc.thinLogonCapability=o3 -Djava.security.krb5.conf=/etc/krb5_bds.conf
19/08/30 06:46:03 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
Unknown/unsupported param List(--dist-cache-conf, /app/bds/data/yarn/nm/01/usercache/t617351/appcache/application_1567160664350_0006/container_e22_1567160664350_0006_02_000001/__spark_conf__/__spark_dist_cache__.properties)

Usage: org.apache.spark.deploy.yarn.ApplicationMaster [options]
Options:
  --jar JAR_PATH       Path to your application's JAR file
  --class CLASS_NAME   Name of your application's main class
  --primary-py-file    A main Python file
  --primary-r-file     A main R file
  --py-files PY_FILES  Comma-separated list of .zip, .egg, or .py files to
                       place on the PYTHONPATH for Python apps.
  --args ARGS          Arguments to be passed to your application's main class.
                       Multiple invocations are possible, each will be passed in order.
  --properties-file FILE Path to a custom Spark properties file.
      
 

Log Type: stdout

Log Upload Time: Fri Aug 30 06:46:04 -0400 2019

Log Length: 0

3 Replies

Contributor

Hello @VijayM 

 

This error is due to the two clusters running different Spark major versions: the newer Spark client passes a --dist-cache-conf argument that the older ApplicationMaster classes on your classpath do not recognize, which is exactly the "Unknown/unsupported param" line in the stderr above.

 

CDH5 ships Spark 1.x and CDH6 ships Spark 2.x. There are major differences between the two, and code written for one may not run on the other.

 

To resolve this, ensure the Spark versions on the two clusters match.
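
A quick way to compare the two sides (a sketch; assumes the standard parcel binaries are on the PATH):

# Run on the cloud gateway node (CDH 6.2) and on an on-prem node (CDH 5.16):
spark-submit --version

The version banners should match before jobs are submitted across the clusters.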

New Contributor

You should follow this thread; the suggestion above may work, so check it out...

 

thanks

Super Guru
@VijayM ,

Couple of questions:

1. Are CDH6 and CDH5 managed by the same Cloudera Manager, or do you manage them yourself?

2. From the setting you applied below:

spark.yarn.jars=local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/jars/*,local:/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/hive/*,local:/app/bds/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/lib/spark/lib/*

It looks like you have both CDH 6.2 and CDH 5.16 parcels on the same host, is that right? Any reason you want to do so? As @JosiahGoodson mentioned, Spark 2 and Spark 1 are not compatible. You should have either the Spark 1 jars or the Spark 2 jars on the classpath, not both; otherwise they will conflict. One version-consistent option is sketched below.
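
For example (a sketch only, not a tested fix; the HDFS path /user/spark/spark2-jars is an assumption): since local: paths must exist on the on-prem NodeManager hosts, one option is to upload only the Spark 2 jars to HDFS and point spark.yarn.jars there:

hdfs dfs -mkdir -p /user/spark/spark2-jars
hdfs dfs -put /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/spark/jars/* /user/spark/spark2-jars/

# then in spark-defaults.conf on the gateway node:
spark.yarn.jars=hdfs:///user/spark/spark2-jars/*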

Cheers
Eric