How to run spark-submit in virtualenv for pyspark?

rvillanueva — Fri, 13 Dec 2019 00:15:25 GMT

Is there a way to run spark-submit (Spark v2.3.2 from HDP 3.1.0) while in a virtualenv? I have a Python file that uses python3 (and some specific libs) in a virtualenv, to isolate lib versions from the rest of the system. I would like to run this file with /bin/spark-submit, but when I attempt to do so I get...

[me@myserver tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py
  File "/bin/hdp-select", line 255
    print "ERROR: Invalid package - " + name
                                    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
    at org.apache.spark.launcher.Main.main(Main.java:118)

# also tried...
(venv) [me@myserver tests]$ export HADOOP_CONF_DIR=/etc/hadoop/conf; spark-submit --master yarn --deploy-mode cluster sparksubmit.test.py
19/12/12 13:50:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 13:50:20 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
    at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
    ....
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig

I'm not sure what to make of this or how to proceed, and I did not fully understand the error messages even after googling them.

Does anyone with more experience have debugging tips or fixes for this?

Re: How to run spark-submit in virtualenv for pyspark?

jsensharma — Fri, 13 Dec 2019 06:35:52 GMT

@rvillanueva

There seem to be a couple of issues:

Issue-1. The first issue is related to Python 3. Python 3 does not support print statements without parentheses, which is why you are getting this error:

  File "/bin/hdp-select", line 255
    print "ERROR: Invalid package - " + name
                                    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?

Please refer to the following threads for similar discussions:

https://community.cloudera.com/t5/Support-Questions/Spark-submit-error-with-Python3-on-Hortonworks-sandbox-VM/td-p/230117
https://community.cloudera.com/t5/Support-Questions/HDP3-0-livy-server-cannot-start/td-p/231126

Try using Python 2.7 (instead of Python 3), because the script "/bin/hdp-select" contains many "print" statements without parentheses, and Python 3 requires every print call to use parentheses. You can list them with:

# grep 'print ' /bin/hdp-select
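
To see the failure in isolation, here is a minimal sketch that compiles the offending line from the traceback above without executing it. The helper name compiles_under_current_python is made up for illustration; only the two source strings come from the error message.

```python
def compiles_under_current_python(source: str) -> bool:
    """Return True if `source` parses under the interpreter running this script."""
    try:
        # compile() only parses the code; it never runs it, so `name` need not exist.
        compile(source, "<hdp-select line>", "exec")
        return True
    except SyntaxError:
        return False


# Python 2 style print statement, as reported at /bin/hdp-select line 255.
py2_line = 'print "ERROR: Invalid package - " + name'
# The same statement written as a function call, as the error message suggests.
py3_line = 'print("ERROR: Invalid package - " + name)'

if __name__ == "__main__":
    print(compiles_under_current_python(py2_line))
    print(compiles_under_current_python(py3_line))
```

Run with a Python 3 interpreter (for example the one inside the virtualenv), the first check reports False and the second True.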

Issue-2. The following line indicates that an incorrect path may have been set somewhere in your code, in "../venv/bin/activate", or in the "sparksubmit.test.py" script.

ls: cannot access /usr/hdp//hadoop/lib: No such file or directory

This is because the correct path should be "/usr/hdp/current/hadoop/lib".

Notice that "current" is missing in your case. It looks like that segment is coming through blank somewhere in your environment, which gives "/usr/hdp//hadoop/lib".

Issue-3. The "ClassNotFoundException" errors are a side effect of the point above: because "current" is missing from your printed path (it should be "/usr/hdp/current/hadoop/lib"), the correct lib directory is not found and the required JARs are not included in the CLASSPATH.

Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig
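
One plausible way for the doubled slash in Issue-2 to appear is an empty value being substituted into a path template. The sketch below assumes a hypothetical "/usr/hdp/<segment>/hadoop/lib" template purely for illustration; the real HDP launcher scripts may build the path differently. Both helper names are made up.

```python
def hadoop_lib_dir(segment: str) -> str:
    """Build a lib path from a hypothetical '/usr/hdp/<segment>/hadoop/lib' template."""
    return "/usr/hdp/" + segment + "/hadoop/lib"


def has_empty_segment(path: str) -> bool:
    """Return True if the path contains an empty component, i.e. a doubled slash."""
    return "" in path.strip("/").split("/")


if __name__ == "__main__":
    for segment in ("current", ""):
        path = hadoop_lib_dir(segment)
        print(path, has_empty_segment(path))
```

With an empty segment the template produces exactly the "/usr/hdp//hadoop/lib" string seen in the ls error, which is a quick way to recognise that some variable was unset rather than mistyped.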

Re: How to run spark-submit in virtualenv for pyspark?

jsensharma — Fri, 13 Dec 2019 06:38:50 GMT

@rvillanueva

In addition to my previous comment, please also refer to: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/running-spark-applications/content/setting_path_variables_for_python.html

Re: How to run spark-submit in virtualenv for pyspark?

rvillanueva — Fri, 13 Dec 2019 21:06:59 GMT

@jsensharma

1. I need to use Python 3 and would like to keep doing so, considering that Python 2 will stop being maintained in 2020 (I would think others have a similar desire). As a workaround, I am currently setting PYSPARK_PYTHON before calling spark-submit:

export PYSPARK_PYTHON=/path/to/my/virtualenv/bin/python; spark-submit sparksubmit.test.py

Otherwise, this may be helpful: https://stackoverflow.com/a/51508990/8236733 (or using the --py-files option).
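
For anyone launching jobs from a Python wrapper rather than the shell, the same PYSPARK_PYTHON workaround might look like the sketch below. The helper name build_spark_submit is made up, and the interpreter path and script name are the placeholders from the command above. It sets the variable without activating the virtualenv, so PATH is left alone; whether that also avoids the hdp-select error is an assumption to verify on your own cluster.

```python
import os
from typing import Dict, List, Tuple


def build_spark_submit(venv_python: str, script: str) -> Tuple[List[str], Dict[str, str]]:
    """Return a spark-submit command and an environment with PYSPARK_PYTHON set."""
    env = dict(os.environ)               # copy, so this process's environment is untouched
    env["PYSPARK_PYTHON"] = venv_python  # interpreter PySpark should use for the job
    cmd = ["spark-submit", script]
    return cmd, env


if __name__ == "__main__":
    # Placeholder paths from the post; adjust them for your setup.
    cmd, env = build_spark_submit("/path/to/my/virtualenv/bin/python", "sparksubmit.test.py")
    print(" ".join(cmd))
    print("PYSPARK_PYTHON=" + env["PYSPARK_PYTHON"])
```

The returned pair can be handed to subprocess.run(cmd, env=env, check=True) to actually launch the job.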

2. I don't know where that path reference is coming from, since "../venv/bin/activate" just activates a virtualenv and the "sparksubmit.test.py" code is just

from os import environ
import time
import pprint
import platform

from pyspark.sql import SparkSession

pp = pprint.PrettyPrinter(indent=4)

sparkSession = SparkSession.builder.appName("TEST").getOrCreate()
sparkSession._jsc.sc().setLogLevel("WARN")

print(platform.python_version())

def testfunc(num: int) -> str:
    return "type annotations look ok"

print(testfunc(1))

print("\n\nYou are using %d nodes in this session\n\n" % sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size())

pp.pprint(sparkSession.sparkContext._conf.getAll())

That blank segment in "/usr/hdp//hadoop/lib" is interesting to see, though, especially since I run

export HADOOP_CONF_DIR=/etc/hadoop/conf

to set HADOOP_CONF_DIR in the terminal before running the command. Furthermore, looking at my (client node) filesystem, I don't even see the path you mention...

[airflow@airflowetl tests]$ ls -lha /usr/hdp/current/hadoop-
hadoop-client/                  hadoop-httpfs
hadoop-hdfs-client/             hadoop-mapreduce-client/
hadoop-hdfs-datanode/           hadoop-mapreduce-historyserver/
hadoop-hdfs-journalnode/        hadoop-yarn-client/
hadoop-hdfs-namenode/           hadoop-yarn-nodemanager/
hadoop-hdfs-nfs3/               hadoop-yarn-registrydns/
hadoop-hdfs-portmap/            hadoop-yarn-resourcemanager/
hadoop-hdfs-secondarynamenode/  hadoop-yarn-timelinereader/
hadoop-hdfs-zkfc/               hadoop-yarn-timelineserver/
[airflow@airflowetl tests]$ ls -lha /usr/hdp/current/hadoop
ls: cannot access /usr/hdp/current/hadoop: No such file or directory

(note I am using HDP v3.1.0)

Re: How to run spark-submit in virtualenv for pyspark?

kshimpi — Mon, 16 Dec 2019 15:31:40 GMT

@rvillanueva

Please refer to this article:

https://community.cloudera.com/t5/Customer/Unable-to-start-Pyspark-jobs-when-running-with-Python-3/ta-p/272990