03-12-2019 05:03 AM
Hi, thank you for your response! I've tried to set HADOOP_CONF_DIR in bash before calling pyspark, but it did not help. pyspark itself sources the spark-env.sh script, which overrides the HADOOP_CONF_DIR variable (see below):

HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}
HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
if [ -d "$HIVE_CONF_DIR" ]; then
  HADOOP_CONF_DIR="$HADOOP_CONF_DIR:$HIVE_CONF_DIR"
fi
export HADOOP_CONF_DIR

As a result, HADOOP_CONF_DIR is assigned a colon-separated string combining the two directories:

>>> import os
>>> os.getenv('HADOOP_CONF_DIR')
'/opt/cloudera/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/conf/yarn-conf:/etc/hive/conf'

But when I set the value manually to point to a single directory (either of the two above), the subprocess routine starts working:

>>> os.environ['HADOOP_CONF_DIR'] = "/opt/cloudera/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/conf/yarn-conf"
>>> subprocess.call(["hadoop", "fs", "-ls"])
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
Found 2 items
drwxr-x---   - spark spark          0 2019-03-12 15:46 .sparkStaging
drwxrwxrwt   - spark spark          0 2019-03-12 15:46 applicationHistory

So I assume the problem comes from the lines where HIVE_CONF_DIR is appended to HADOOP_CONF_DIR, presumably because the hadoop CLI expects HADOOP_CONF_DIR to hold a single directory rather than a colon-separated list. Can you please check whether your deployment has these lines in its spark-env.sh script?
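In the meantime, I'm working around it inside the script by giving subprocess an environment where HADOOP_CONF_DIR is trimmed back to a single directory. A minimal sketch based on the behaviour above (the split-on-':' heuristic is my own assumption, not an official fix):

import os
import subprocess

# spark-env.sh joined two directories with ':'; the hadoop CLI seems to
# expect a single directory, so keep only the first entry.
env = dict(os.environ)
env["HADOOP_CONF_DIR"] = env.get("HADOOP_CONF_DIR", "").split(":")[0]

# Run the HDFS command with the corrected environment; the original
# process environment is left untouched.
subprocess.call(["hadoop", "fs", "-ls"], env=env)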
03-11-2019 09:35 AM
After upgrading (a fresh installation) to Cloudera CDH 6.1, all our ETLs (pyspark scripts) fail. Within the scripts we use subprocess.call([]) to work with HDFS directories; this worked on CDH 5.13 but fails to execute on the current release, throwing the following error:

RuntimeException: core-site.xml not found

See the details below:

$ sudo -u spark pyspark --master yarn --deploy-mode client
Python 2.7.5 (default, Oct 30 2018, 23:45:53)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/03/11 20:24:42 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
19/03/11 20:24:43 WARN lineage.LineageWriter: Lineage directory /var/log/spark/lineage doesn't exist or is not writable. Lineage for this application will be disabled.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.1.0
      /_/
Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>> import subprocess
>>> subprocess.call(["hadoop", "fs", "-ls"])
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
Exception in thread "main" java.lang.RuntimeException: core-site.xml not found
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2891)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2839)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2716)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1353)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1325)
at org.apache.hadoop.conf.Configuration.setBoolean(Configuration.java:1666)
at org.apache.hadoop.util.GenericOptionsParser.processGeneralOptions(GenericOptionsParser.java:339)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:569)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:174)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:156)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:389)
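For context, the failing calls in our ETLs look roughly like this (an illustrative sketch; hdfs_ls is a made-up name, the real helpers just wrap subprocess.call the same way):

import subprocess

def hdfs_ls(path):
    # Shell out to the hadoop CLI, as the ETLs have done since CDH 5.13;
    # subprocess.call returns the command's exit status.
    return subprocess.call(["hadoop", "fs", "-ls", path])

# Example usage; the path is illustrative.
hdfs_ls("/user/spark")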
Labels:
- Apache Spark
- HDFS
- Manual Installation