"RuntimeException: core-site.xml not found" while calling subprocess.call([])

New Contributor

After upgrading (fresh installation) to Cloudera CDH 6.1, all of our ETLs (PySpark scripts) are failing. Within the scripts we use subprocess.call([]) to work with HDFS directories, which worked on CDH 5.13 but fails to execute on the current release. It throws the following error: RuntimeException: core-site.xml not found
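
For context, here is a minimal sketch of the pattern our scripts follow (the HDFS path below is only a placeholder, not one of our real directories):

import subprocess

# List an HDFS directory and fail the job if the shell command does not succeed.
ret = subprocess.call(["hadoop", "fs", "-ls", "/tmp"])
if ret != 0:
    raise RuntimeError("hadoop fs -ls failed with exit code %d" % ret)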

 

See the details below

 

$ sudo -u spark pyspark --master yarn --deploy-mode client 
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/03/11 20:24:42 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
19/03/11 20:24:43 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.1.0
      /_/

Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>> import subprocess
>>> subprocess.call(["hadoop", "fs", "-ls"])
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
Exception in thread "main" java.lang.RuntimeException: core-site.xml not found
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2891)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2839)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2716)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1353)
    at org.apache.hadoop.conf.Configuration.set(Configuration.java:1325)
    at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1666)
    at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:339)
    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:569)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:174)
    at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:156)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:389)

 

9 Replies

Expert Contributor

Hello @paramount2u,

 

It isn't able to find the Hadoop configuration files. You can set the configuration directory path as follows:

export HADOOP_CONF_DIR=<put your configuration directory path>
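
If you don't want to change the environment of your whole session, you can also pass the directory to the subprocess call only; a minimal sketch, assuming a yarn-conf location like the one below (adjust it to your deployment):

import os
import subprocess

# Copy the current environment and point HADOOP_CONF_DIR at a single
# directory that contains core-site.xml (the path is an example, adjust
# it to your deployment).
env = dict(os.environ)
env["HADOOP_CONF_DIR"] = "/etc/spark/conf/yarn-conf"
subprocess.call(["hadoop", "fs", "-ls"], env=env)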

Hope that helps.

New Contributor

Hi, 

 

Thank you for your response! I've tried setting HADOOP_CONF_DIR in bash before calling pyspark, but it did not help. pyspark itself sources the spark-env.sh script, which overrides the HADOOP_CONF_DIR variable (see below).

 

 

HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}
HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
if [ -d "$HIVE_CONF_DIR" ]; then
  HADOOP_CONF_DIR="$HADOOP_CONF_DIR:$HIVE_CONF_DIR"
fi
export HADOOP_CONF_DIR

 

As a result, HADOOP_CONF_DIR is assigned a string which is the combination of two directories:

 

 

>>> import os
>>> os.getenv('HADOOP_CONF_DIR')
'/opt/cloudera/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/conf/yarn-conf:/etc/hive/conf'

But when I set the value manually to point to a single directory (either of the two above), the subprocess call starts working.

 

>>> os.environ['HADOOP_CONF_DIR'] = "/opt/cloudera/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/conf/yarn-conf"
>>> subprocess.call(["hadoop", "fs", "-ls"])
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
Found 2 items
drwxr-x---   - spark spark          0 2019-03-12 15:46 .sparkStaging
drwxrwxrwt   - spark spark          0 2019-03-12 15:46 applicationHistory

So, I assume that the problem comes from the code where HIVE_CONF_DIR is appended to HADOOP_CONF_DIR. 
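
As a stopgap in our scripts, we now keep only an entry that actually contains core-site.xml before calling the subprocess. Just a sketch of the workaround, not a proper fix:

import os
import subprocess

# After spark-env.sh runs, HADOOP_CONF_DIR may be a colon-separated list;
# keep only the first entry that actually contains core-site.xml.
for d in os.environ.get("HADOOP_CONF_DIR", "").split(":"):
    if os.path.isfile(os.path.join(d, "core-site.xml")):
        os.environ["HADOOP_CONF_DIR"] = d
        break
subprocess.call(["hadoop", "fs", "-ls"])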

Can you please check whether your deployment has these lines in its spark-env.sh script?

 

New Contributor

Is there any resolution to this? I'm seeing it as well.

Moderator

Please follow the steps below to change the spark-env.sh advanced configuration snippet:

1. Log in to Cloudera Manager.
2. Choose "SPARK2_ON_YARN-1" on the cluster.
3. Choose the "Configuration" tab on the displayed page.
4. Search for "Spark 2 Client Advanced Configuration Snippet (Safety Valve) for spark2-conf/spark-env.sh" in the search box.
5. In "Gateway Default Group", change the value to
export HADOOP_CONF_DIR=/etc/spark/conf/yarn-conf/*:/etc/hive/conf:/etc/hive/conf

Save the configuration and restart the service for the changes to take effect.
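
A rough way to sanity-check the resulting value from a pyspark session (just a sketch; glob entries such as /etc/spark/conf/yarn-conf/* will not match this simple file check):

import os

# Print each entry of HADOOP_CONF_DIR and whether core-site.xml is
# visible there as a plain file.
for d in os.environ.get("HADOOP_CONF_DIR", "").split(":"):
    print("%s -> core-site.xml present: %s"
          % (d, os.path.isfile(os.path.join(d, "core-site.xml"))))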


Ferenc Erdelyi, Technical Solutions Manager


Rising Star

We encountered the same problem after upgrading to CDH 6.3 from 5.15. The steps outlined by Bender helped us resolve the issue, with the following small differences:

  • We modified the advanced configuration of the Spark 2 service:
    Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh
  • The following line was added:
    export HADOOP_CONF_DIR=/etc/spark/conf/yarn-conf/*:/etc/hive/conf
  • No cluster or service restart was necessary; simply re-deploying the client configs did the trick.

Rising Star

Corrected:
export HADOOP_CONF_DIR=/etc/spark/conf/yarn-conf/*:/etc/hive/conf:$HADOOP_CONF_DIR

Contributor
HADOOP_CONF_DIR=$SPARK_CONF_DIR/yarn-conf/*
HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
if [ -d "$HIVE_CONF_DIR" ]; then
  HADOOP_CONF_DIR="$HADOOP_CONF_DIR:$HIVE_CONF_DIR"
fi

I'm currently testing the above setting. It is essentially the same as what was already in the original spark-env.sh; I just modified the first line so that it no longer uses the ${HADOOP_CONF_DIR:-...} default.

This approach is an alternative to using static values.

Contributor

After testing this, we are now seeing these messages for any type of Spark/Scala session in CDSW

 

Spark history logs show
java.io.FileNotFoundException: /tmp/spark-driver.log (Permission denied)
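
For reference, a quick diagnostic sketch to see who owns that driver log and with what mode (the path is taken from the error above):

import os
import pwd
import stat

path = "/tmp/spark-driver.log"
if os.path.exists(path):
    st = os.stat(path)
    print("owner=%s mode=%s"
          % (pwd.getpwuid(st.st_uid).pw_name, oct(stat.S_IMODE(st.st_mode))))
else:
    print("%s does not exist on this host" % path)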

 

Removed the overrides for spark-env.sh and now those failing sessions are working again. More testing needed.

New Contributor

Add this in the Spark service in CM > Spark > Configuration > Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh:

export HADOOP_CONF_DIR=/etc/hadoop/conf