Created on 03-11-2019 09:35 AM - edited 09-16-2022 07:13 AM
After upgrading (fresh installation) to Cloudera CDH 6.1, all our ETLs (pyspark scripts) fail. Within the scripts we use subprocess.call([]) to work with HDFS directories, which worked on CDH 5.13 but fails to execute on the current release. It throws the following error: RuntimeException: core-site.xml not found
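For context, here is a simplified sketch of the kind of call our scripts make (the HDFS paths are only illustrative, not our real layout):

import subprocess

# Prepare an archive directory on HDFS and move the current partition into it.
# /data/etl/... is a hypothetical path used only for illustration.
ret = subprocess.call(["hadoop", "fs", "-mkdir", "-p", "/data/etl/archive"])
if ret == 0:
    ret = subprocess.call(["hadoop", "fs", "-mv", "/data/etl/current", "/data/etl/archive"])
if ret != 0:
    raise RuntimeError("HDFS shell command failed with exit code %d" % ret)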
See the details below
$ sudo -u spark pyspark --master yarn --deploy-mode client
Python 2.7.5 (default, Oct 30 2018, 23:45:53)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/03/11 20:24:42 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
19/03/11 20:24:43 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.1.0
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>> import subprocess
>>> subprocess.call(["hadoop", "fs", "-ls"])
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
Exception in thread "main" java.lang.RuntimeException: core-site.xml not found
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2891)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2839)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2716)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1353)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1325)
    at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1666)
    at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:339)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:569)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:174)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:156)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:389)
Created 03-12-2019 03:21 AM
Hello @paramount2u,
It isn't able to find the Hadoop configuration files. You can set the configuration directory path as follows:
export HADOOP_CONF_DIR=<put your configuration directory path>
Hope that helps.
Created 03-12-2019 05:03 AM
Hi,
Thank you for your response! I've tried to set HADOOP_CONF_DIR in bash before calling pyspark, but it did not help. pyspark itself calls the spark-env.sh script, which overrides the HADOOP_CONF_DIR variable (see below).
HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}
HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
if [ -d "$HIVE_CONF_DIR" ]; then
  HADOOP_CONF_DIR="$HADOOP_CONF_DIR:$HIVE_CONF_DIR"
fi
export HADOOP_CONF_DIR
As a result, HADOOP_CONF_DIR is assigned a string that joins the two directories with a colon:
>>> import os
>>> os.getenv('HADOOP_CONF_DIR')
'/opt/cloudera/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/conf/yarn-conf:/etc/hive/conf'
But when I set the value manually to point to a single directory (either of the two above), the subprocess call starts working.
>>> os.environ['HADOOP_CONF_DIR'] = "/opt/cloudera/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/conf/yarn-conf"
>>> subprocess.call(["hadoop", "fs", "-ls"])
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
Found 2 items
drwxr-x---   - spark spark          0 2019-03-12 15:46 .sparkStaging
drwxrwxrwt   - spark spark          0 2019-03-12 15:46 applicationHistory
So, I assume that the problem comes from the code where HIVE_CONF_DIR is appended to HADOOP_CONF_DIR.
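As a temporary workaround in our scripts, we can override the variable only for the child process instead of changing it for the whole driver (the parcel path below is the yarn-conf directory from our installation, shown above):

import os
import subprocess

# Copy the driver's environment and point HADOOP_CONF_DIR at a single
# directory (the yarn-conf dir of the CDH parcel) for this child process only.
child_env = dict(os.environ)
child_env['HADOOP_CONF_DIR'] = ("/opt/cloudera/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702"
                                "/lib/spark/conf/yarn-conf")
subprocess.call(["hadoop", "fs", "-ls"], env=child_env)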
Can you please check whether your deployment has these lines in its spark-env.sh script?
Created 04-01-2019 12:18 PM
Is there any resolution to this? I'm seeing it as well.
Created 06-28-2019 02:49 AM
Please follow the steps below to change the spark-env.sh advanced configuration snippet:
1. Log in to Cloudera Manager
2. Choose "SPARK2_ON_YARN-1" on the cluster
3. Choose the "Configuration" tab on the displayed page
4. Search for "Spark 2 Client Advanced Configuration Snippet (Safety Valve) for spark2-conf/spark-env.sh" in the search box
5. In "Gateway Default Group", change the value to
export HADOOP_CONF_DIR=/etc/spark/conf/yarn-conf/*:/etc/hive/conf:/etc/hive/conf
Save the configuration and restart the service for the changes to take effect.
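To verify the change took effect on a gateway host, you can open a new pyspark session and repeat the commands from earlier in this thread:

import os
import subprocess

# The value set in the safety valve should now show up in the session environment.
print(os.environ.get('HADOOP_CONF_DIR'))

# The HDFS shell call that previously failed with "core-site.xml not found"
# should now return exit code 0.
print(subprocess.call(["hadoop", "fs", "-ls"]))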
Ferenc Erdelyi, Technical Solutions Manager
Created 09-03-2019 08:02 AM
We encountered the same problem after upgrading to CDH 6.3 from 5.15. The steps outlined by Bender helped us resolve the issue, with the following small differences:
Created 09-09-2019 03:03 AM
Corrected:
export HADOOP_CONF_DIR=/etc/spark/conf/yarn-conf/*:/etc/hive/conf:$HADOOP_CONF_DIR
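If you want to check which entries of a colon-separated HADOOP_CONF_DIR actually contain core-site.xml, here is a quick diagnostic from the pyspark shell (just a sketch, not part of the fix):

import os

# Walk each colon-separated entry and report whether core-site.xml is present.
# Entries ending in "/*" are classpath-style wildcards, so check their parent directory.
for entry in os.environ.get('HADOOP_CONF_DIR', '').split(':'):
    directory = entry[:-2] if entry.endswith('/*') else entry
    found = os.path.isfile(os.path.join(directory, 'core-site.xml'))
    print(directory, '-> core-site.xml found' if found else '-> core-site.xml missing')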
Created on 07-04-2020 10:43 PM - edited 07-04-2020 10:44 PM
HADOOP_CONF_DIR=$SPARK_CONF_DIR/yarn-conf/*
HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
if [ -d "$HIVE_CONF_DIR" ]; then
HADOOP_CONF_DIR="$HADOOP_CONF_DIR:$HIVE_CONF_DIR"
fi
I'm currently testing the above setting. It is essentially the same as what was already in the original spark-env.sh; I just modified the first line so it no longer uses the ${HADOOP_CONF_DIR:-...} default.
This approach is an alternative to using static values.
Created 07-06-2020 10:17 AM
After testing this, we are now seeing these messages for any type of Spark/Scala session in CDSW
Spark history logs show
java.io.FileNotFoundException: /tmp/spark-driver.log (Permission denied)
Removed the overrides for spark-env.sh and now those failing sessions are working again. More testing needed.
Created 04-23-2021 10:58 AM
Add this to the Spark service in CM > Spark > Configuration > Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh:
export HADOOP_CONF_DIR=/etc/hadoop/conf