New Contributor
Posts: 4
Registered: 03-06-2017

PySpark - Error initializing SparkContext

 

We are running into issues when we launch PySpark (with or without YARN).

It seems to be looking for the hive-site.xml file, which we have already copied to the Spark configuration path, but I am not sure whether there are any specific parameters that need to be part of it.
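For reference, a minimal hive-site.xml of the kind normally dropped into the Spark conf directory only needs to point at the metastore; the host, port, and warehouse path below are placeholders, not our real values:

<configuration>
  <!-- placeholder metastore URI; replace with the actual metastore host/port -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host.example.com:9083</value>
  </property>
  <!-- usual default warehouse location; adjust if the cluster uses a different path -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
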
[apps@devdm003.dev1 ~]$ pyspark --master yarn --verbose
WARNING: User-defined SPARK_HOME (/opt/spark) overrides detected (/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/spark).
WARNING: Running pyspark from user-defined location.
Python 2.7.8 (default, Oct 22 2016, 09:02:55)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-17)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using properties file: /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/spark/conf/spark-defaults.conf
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.yarn.jars=hdfs://devdm001.dev1.turn.com:8020/user/spark/spark-2.1-bin-hadoop/*
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.shuffle.service.enabled=true
Adding default property: spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
Adding default property: spark.yarn.historyServer.address=http://devdm004.dev1.turn.com:18088
Adding default property: spark.dynamicAllocation.schedulerBacklogTimeout=1
Adding default property: spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
Adding default property: spark.yarn.config.gatewayPath=/opt/cloudera/parcels
Adding default property: spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
Adding default property: spark.shuffle.service.port=7337
Adding default property: spark.master=yarn
Adding default property: spark.authenticate=false
Adding default property: spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
Adding default property: spark.eventLog.dir=hdfs://devdm001.dev1.turn.com:8020/user/spark/applicationHistory
Adding default property: spark.dynamicAllocation.enabled=true
Adding default property: spark.dynamicAllocation.minExecutors=0
Adding default property: spark.dynamicAllocation.executorIdleTimeout=60
Parsed arguments:
master yarn
deployMode null
executorMemory null
executorCores null
totalExecutorCores null
propertiesFile /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/spark/conf/spark-defaults.conf
driverMemory null
driverCores null
driverExtraClassPath null
driverExtraLibraryPath /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
driverExtraJavaOptions null
supervise false
queue null
numExecutors null
files null
pyFiles null
archives null
mainClass null
primaryResource pyspark-shell
name PySparkShell
childArgs []
jars null
packages null
packagesExclusions null
repositories null
verbose true

Spark properties used, including those specified through
--conf and those from the properties file /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/spark/conf/spark-defaults.conf:
spark.executor.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.yarn.jars -> hdfs://devdm001.dev1.turn.com:8020/user/spark/spark-2.1-bin-hadoop/*
spark.driver.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.authenticate -> false
spark.yarn.historyServer.address -> http://devdm004.dev1.turn.com:18088
spark.yarn.am.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.eventLog.enabled -> true
spark.dynamicAllocation.schedulerBacklogTimeout -> 1
spark.yarn.config.gatewayPath -> /opt/cloudera/parcels
spark.serializer -> org.apache.spark.serializer.KryoSerializer
spark.dynamicAllocation.executorIdleTimeout -> 60
spark.dynamicAllocation.minExecutors -> 0
spark.shuffle.service.enabled -> true
spark.yarn.config.replacementPath -> {{HADOOP_COMMON_HOME}}/../../..
spark.shuffle.service.port -> 7337
spark.eventLog.dir -> hdfs://devdm001.dev1.turn.com:8020/user/spark/applicationHistory
spark.master -> yarn
spark.dynamicAllocation.enabled -> true


Main class:
org.apache.spark.api.python.PythonGatewayServer
Arguments:

System properties:
spark.executor.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.driver.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.yarn.jars -> hdfs://devdm001.dev1.turn.com:8020/user/spark/spark-2.1-bin-hadoop/*
spark.authenticate -> false
spark.yarn.historyServer.address -> http://devdm004.dev1.turn.com:18088
spark.yarn.am.extraLibraryPath -> /opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.eventLog.enabled -> true
spark.dynamicAllocation.schedulerBacklogTimeout -> 1
SPARK_SUBMIT -> true
spark.yarn.config.gatewayPath -> /opt/cloudera/parcels
spark.serializer -> org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled -> true
spark.dynamicAllocation.minExecutors -> 0
spark.dynamicAllocation.executorIdleTimeout -> 60
spark.app.name -> PySparkShell
spark.yarn.config.replacementPath -> {{HADOOP_COMMON_HOME}}/../../..
spark.submit.deployMode -> client
spark.shuffle.service.port -> 7337
spark.eventLog.dir -> hdfs://devdm001.dev1.turn.com:8020/user/spark/applicationHistory
spark.master -> yarn
spark.yarn.isPython -> true
spark.dynamicAllocation.enabled -> true
Classpath elements:

 

log4j:ERROR Could not find value for key log4j.appender.WARN
log4j:ERROR Could not instantiate appender named "WARN".
log4j:ERROR Could not find value for key log4j.appender.DEBUG
log4j:ERROR Could not instantiate appender named "DEBUG".
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/jars/avro-tools-1.7.6-cdh5.5.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/server/turn/deploy/160622/turn/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Traceback (most recent call last):
  File "/opt/spark/python/pyspark/shell.py", line 43, in <module>
    spark = SparkSession.builder\
  File "/opt/spark/python/pyspark/sql/session.py", line 179, in getOrCreate
    session._jsparkSession.sessionState().conf().setConfString(key, value)
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"

We installed Apache Spark 2.1 for business reasons and updated the SPARK_HOME variable through the safety valve.

(We ensured SPARK_HOME is set early in spark-env.sh so that the other PATH variables are derived from it correctly.)
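As a rough sketch of what that means in spark-env.sh (the HADOOP_CONF_DIR and Python lines are assumptions about a typical CDH gateway layout, not copied from our actual file):

# Point everything at the standalone Spark 2.1 install before any derived paths are built
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"
# Assumed CDH client config location, so the YARN/HDFS settings are picked up
export HADOOP_CONF_DIR=/etc/hadoop/conf
# Assumed interpreter; must match the Python available on the cluster nodes
export PYSPARK_PYTHON=python2.7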

 

I also learned that Spark 2.1 has no hard dependency on hive-site.xml, which confuses me even more as to why it appears to be looking for one.
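One diagnostic I am planning to try (my own idea, not from any Cloudera guide): start the shell with the in-memory catalog so HiveSessionState is never instantiated, and separately run the Scala shell, which tends to print the full Java cause behind the truncated Python error:

# If this comes up cleanly, the failure is confined to the Hive/metastore configuration
pyspark --master yarn --conf spark.sql.catalogImplementation=in-memory

# The Scala shell usually shows the complete nested stack trace for the same failure
spark-shell --master yarn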

 

Did anyone face a similar issue? Any suggestions? This is a Linux environment running CDH 5.5.4.

Cloudera Employee
Posts: 33
Registered: 04-05-2016

Re: PySpark - Error initializing SparkContext

By Spark 2.1, do you mean Cloudera's Spark 2.0 Release 1 or Apache Spark 2.1?

Regarding Cloudera's Spark 2.0 Release 1 or Release 2: note that the minimum required CDH version is CDH 5.7.x, but you are on CDH 5.5.4.

New Contributor
Posts: 4
Registered: 03-06-2017

Re: PySpark - Error initializing SparkContext

It is Apache Spark 2.1.

One reference we used (similar to our situation):

https://www.linkedin.com/pulse/running-spark-2xx-cloudera-hadoop-distro-cdh-deenar-toraskar-cfa
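For anyone following the same route, the general recipe for pointing a standalone Apache Spark 2.x build at a CDH cluster comes down to staging its jars on HDFS and launching against the CDH client configs. This is only a sketch with generic paths (the HDFS jar directory matches our spark.yarn.jars setting; the rest is not a quote from the article):

# Stage the Spark 2.1 jars where the YARN executors can fetch them
hdfs dfs -mkdir -p /user/spark/spark-2.1-bin-hadoop
hdfs dfs -put /opt/spark/jars/* /user/spark/spark-2.1-bin-hadoop/

# Launch against YARN, reusing the staged jars
pyspark --master yarn --conf spark.yarn.jars="hdfs:///user/spark/spark-2.1-bin-hadoop/*"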

New Contributor
Posts: 4
Registered: 03-06-2017

Re: PySpark - Error initializing SparkContext

The problem seems to be with configuration rather than a missing dependency; I am just not sure which configuration is missing.

Here is my configuration:

spark-defaults.conf:

spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.dir=hdfs://dtest.turn.com:8020/user/spark/applicationHistory
spark.eventLog.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.master=yarn
spark.yarn.jars=hdfs://dtest.turn.com:8020/user/spark/spark-2.1-bin-hadoop/*
spark.yarn.historyServer.address=http://dtest.turn.com:18088
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.5.4-1.cdh5.5.4.p0.9/lib/hadoop/lib/native
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
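
Nothing in this file points Spark SQL at Hive, so as an experiment (my own assumption, not something from Cloudera documentation) I am thinking of adding the two entries below and making sure hive-site.xml sits alongside this file in /opt/spark/conf; the warehouse path is just the usual default:

# Use the Hive catalog explicitly and give it a warehouse location
spark.sql.catalogImplementation=hive
spark.sql.warehouse.dir=/user/hive/warehouse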

 

 

 

 
