Created on 12-12-2019 03:35 PM - last edited on 12-12-2019 04:15 PM by ask_bill_brooks
Is there a way to run spark-submit (Spark v2.3.2 from HDP 3.1.0) while in a virtualenv? I have a situation where a Python file uses Python 3 (and some specific libraries) in a virtualenv, to isolate library versions from the rest of the system. I would like to run this file with /bin/spark-submit, but attempting to do so I get...
[me@myserver tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py
  File "/bin/hdp-select", line 255
    print "ERROR: Invalid package - " + name
                                            ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
        at org.apache.spark.launcher.Main.main(Main.java:118)

# also tried...
(venv) [me@myserver tests]$ export HADOOP_CONF_DIR=/etc/hadoop/conf; spark-submit --master yarn --deploy-mode cluster sparksubmit.test.py
19/12/12 13:50:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 13:50:20 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
        at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
        ....
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
I am not sure what to make of this or how to proceed further, and did not totally understand the error messages even after googling them.
Does anyone with more experience have any debugging tips or fixes for this?
Created 12-12-2019 10:35 PM
There seem to be a couple of issues here:
Issue-1. The first issue is related to Python 3. Python 3 does not support print statements without parentheses, and the script "/bin/hdp-select" uses Python 2 syntax. That is why you are getting this error:
File "/bin/hdp-select", line 255 print "ERROR: Invalid package - " + name
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
Please refer to the following threads for similar discussions:
https://community.cloudera.com/t5/Support-Questions/Spark-submit-error-with-Python3-on-Hortonworks-s...
https://community.cloudera.com/t5/Support-Questions/HDP3-0-livy-server-cannot-start/td-p/231126
Try using Python 2.7 (instead of Python 3), because the script "/bin/hdp-select" contains many "print" statements without parentheses, and Python 3 requires that every print call use parentheses:
# grep 'print ' /bin/hdp-select
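For example, a quick way to confirm that "/bin/hdp-select" only parses under Python 2 (a rough sketch; it assumes both a python2 and a python3 interpreter are on your PATH):
# count Python-2-style print statements in the script
grep -c 'print ' /bin/hdp-select
# byte-compile check: succeeds under Python 2, raises SyntaxError under Python 3
python2 -m py_compile /bin/hdp-select && echo "parses under Python 2"
python3 -m py_compile /bin/hdp-select || echo "SyntaxError under Python 3"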
Issue-2. The following line indicates that somewhere in your code, the "../venv/bin/activate" script, or the "sparksubmit.test.py" script, an incorrect path may have been set:
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
This is because the correct path should be "/usr/hdp/current/hadoop/lib".
NOTICE that "current" is missing in your case.
(In your environment it looks like that path segment is coming out blank, giving "/usr/hdp//hadoop/lib", which is consistent with the "hdp.version is not set" exception above.)
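As a sketch of the fix the exception message itself suggests (the version string below is just an example of an HDP 3.1.0 build number; substitute whatever hdp-select reports on your cluster):
hdp-select versions             # list the installed HDP version string(s)
export HDP_VERSION=3.1.0.0-78   # example value; use the version printed above
spark-submit sparksubmit.test.py
# alternatively, add a java-opts file under Spark's conf dir containing: -Dhdp.version=3.1.0.0-78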
Issue-3. The "ClassNotFoundException" errors are a side effect of the point above: because "current" is missing from the printed path "/usr/hdp//hadoop/lib", the correct lib directory is never found, so the required JARs are not included in the CLASSPATH:
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
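As a hedged sanity check (the exact jar filename and directory can vary between installs), you can confirm the Jersey client jar actually exists somewhere under the HDP tree:
find /usr/hdp/ -name 'jersey-client*.jar' 2>/dev/null
# if a jar turns up, the NoClassDefFoundError should go away once the
# /usr/hdp/<version>/hadoop/lib path resolves and lands on the classpath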
Created 12-12-2019 10:38 PM
In addition to my previous comment, please also refer to: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/running-spark-applications/content/setting_pat...
Created on 12-13-2019 12:03 PM - edited 12-13-2019 01:06 PM
1. I need to use Python 3, and would like to continue doing so in the future considering that Python 2 will stop being maintained in 2020 (I would think others have a similar desire as well), and am currently adding the option
export PYSPARK_PYTHON=/path/to/my/virtualenv/bin/python; spark-submit sparksubmit.test.py
as a workaround (else, this may be helpful: https://stackoverflow.com/a/51508990/8236733, or using the --py-files option).
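For reference, the fuller form of this workaround looks like the sketch below (the virtualenv path is a placeholder; PYSPARK_DRIVER_PYTHON only affects the driver side, i.e. client deploy mode):
export PYSPARK_PYTHON=/path/to/my/virtualenv/bin/python         # interpreter used by executors
export PYSPARK_DRIVER_PYTHON=/path/to/my/virtualenv/bin/python  # interpreter used by the driver (client mode)
spark-submit --master yarn --deploy-mode client sparksubmit.test.py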
2. I don't know where that path reference is coming from, since "../venv/bin/activate" just activates a virtualenv and the "sparksubmit.test.py" code is just
from os import environ
import time
import pprint
import platform

from pyspark.sql import SparkSession  # this import was missing from the original snippet

pp = pprint.PrettyPrinter(indent=4)

sparkSession = SparkSession.builder.appName("TEST").getOrCreate()
sparkSession._jsc.sc().setLogLevel("WARN")

print(platform.python_version())

def testfunc(num: int) -> str:
    return "type annotations look ok"
print(testfunc(1))

print("\n\nYou are using %d nodes in this session\n\n"
      % sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size())
pp.pprint(sparkSession.sparkContext._conf.getAll())
but that blank segment in "/usr/hdp//hadoop/lib" is interesting to see, especially since I use
export HADOOP_CONF_DIR=/etc/hadoop/conf
to set HADOOP_CONF_DIR in the terminal when trying to run the command. Furthermore, looking at my (client node) filesystem, I don't even see that path...
[airflow@airflowetl tests]$ ls -lha /usr/hdp/current/hadoop-
hadoop-client/ hadoop-httpfs
hadoop-hdfs-client/ hadoop-mapreduce-client/
hadoop-hdfs-datanode/ hadoop-mapreduce-historyserver/
hadoop-hdfs-journalnode/ hadoop-yarn-client/
hadoop-hdfs-namenode/ hadoop-yarn-nodemanager/
hadoop-hdfs-nfs3/ hadoop-yarn-registrydns/
hadoop-hdfs-portmap/ hadoop-yarn-resourcemanager/
hadoop-hdfs-secondarynamenode/ hadoop-yarn-timelinereader/
hadoop-hdfs-zkfc/ hadoop-yarn-timelineserver/
[airflow@airflowetl tests]$ ls -lha /usr/hdp/current/hadoop
ls: cannot access /usr/hdp/current/hadoop: No such file or directory
(note I am using HDP v3.1.0)
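For what it's worth, a way to double-check what hdp-select has actually linked under /usr/hdp/current (a sketch; "hadoop-client" is one of the package names visible in the listing above):
hdp-select status hadoop-client        # shows which version the hadoop-client link points at
ls -ld /usr/hdp/current/hadoop-client  # this link exists; a bare "hadoop" link does not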
Created on 12-16-2019 07:29 AM - edited 12-16-2019 07:31 AM