How to run spark-submit in virtualenv for pyspark?

Expert Contributor

Is there a way to run spark-submit (Spark v2.3.2 from HDP 3.1.0) while in a virtualenv? I have a Python file that uses python3 (and some specific libraries) in a virtualenv, to isolate library versions from the rest of the system. I would like to run this file with /bin/spark-submit, but when I try I get:

[me@myserver tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py
  File "/bin/hdp-select", line 255
    print "ERROR: Invalid package - " + name
                                    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
    at org.apache.spark.launcher.Main.main(Main.java:118)

# also tried...
(venv) [me@myserver tests]$ export HADOOP_CONF_DIR=/etc/hadoop/conf; spark-submit --master yarn --deploy-mode cluster sparksubmit.test.py
19/12/12 13:50:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 13:50:20 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
    at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
    ....
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig

I am not sure what to make of this or how to proceed; I did not fully understand the error messages even after searching for them.

Does anyone with more experience have debugging tips or fixes for this?

1 ACCEPTED SOLUTION

Expert Contributor

@jsensharma 

1. I need to use python3 and would like to keep doing so, since python2 will stop being maintained in 2020 (I would think others have a similar need). As a workaround, I am currently setting PYSPARK_PYTHON to the virtualenv's interpreter when submitting:

export PYSPARK_PYTHON=/path/to/my/virtualenv/bin/python; spark-submit sparksubmit.test.py

(Otherwise, this may be helpful: https://stackoverflow.com/a/51508990/8236733, or using the --py-files option.)

2. I don't know where that path reference is coming from, since "../venv/bin/activate" just activates a virtualenv and the "sparksubmit.test.py" code is just:

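from os import environ
import time
import pprint
import platform

# Missing from the snippet as posted; SparkSession is used below.
from pyspark.sql import SparkSession

pp = pprint.PrettyPrinter(indent=4)

sparkSession = SparkSession.builder.appName("TEST").getOrCreate()
sparkSession._jsc.sc().setLogLevel("WARN")

# Show which Python version the driver is running under.
print(platform.python_version())

# Function annotations are Python 3-only syntax, so this fails to parse under Python 2.
def testfunc(num: int) -> str:
    return "type annotations look ok"
print(testfunc(1))

# Relies on the private _jsc handle to query the JVM SparkContext.
print("\n\nYou are using %d nodes in this session\n\n" % sparkSession._jsc.sc().getExecutorMemoryStatus().keySet().size())

# Dump the effective Spark configuration.
pp.pprint(sparkSession.sparkContext._conf.getAll())
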
but that blank segment in "/usr/hdp//hadoop/lib" is interesting to see, especially since I use

export HADOOP_CONF_DIR=/etc/hadoop/conf

for HADOOP_CONF_DIR in the terminal when trying to run the command. Furthermore, looking at my (client node) filesystem, I don't even see that path:

[airflow@airflowetl tests]$ ls -lha /usr/hdp/current/hadoop-
hadoop-client/                  hadoop-httpfs
hadoop-hdfs-client/             hadoop-mapreduce-client/
hadoop-hdfs-datanode/           hadoop-mapreduce-historyserver/
hadoop-hdfs-journalnode/        hadoop-yarn-client/
hadoop-hdfs-namenode/           hadoop-yarn-nodemanager/
hadoop-hdfs-nfs3/               hadoop-yarn-registrydns/
hadoop-hdfs-portmap/            hadoop-yarn-resourcemanager/
hadoop-hdfs-secondarynamenode/  hadoop-yarn-timelinereader/
hadoop-hdfs-zkfc/               hadoop-yarn-timelineserver/
[airflow@airflowetl tests]$ ls -lha /usr/hdp/current/hadoop
ls: cannot access /usr/hdp/current/hadoop: No such file or directory
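
One possible reading of the blank segment, offered only as an assumption and not confirmed anywhere in this thread: the path is assembled from a version string that came back empty, which would fit the "hdp.version is not set" message in the first error. The sketch below only illustrates how an empty component produces the doubled slash. The helper names `hadoop_lib_dir` and `has_empty_segment` are hypothetical and are not part of HDP or Spark.

```python
def hadoop_lib_dir(hdp_version: str, root: str = "/usr/hdp") -> str:
    """Build a lib path the way a launcher script might (assumed layout)."""
    return "{}/{}/hadoop/lib".format(root, hdp_version)


def has_empty_segment(path: str) -> bool:
    """A doubled slash means one path component was empty."""
    return "//" in path


# An empty version string reproduces the path from the error message.
print(hadoop_lib_dir(""))
print(has_empty_segment(hadoop_lib_dir("")))
```

If this reading is right, the thing to check is whatever supplies the version (for example the output of hdp-select under the active interpreter), not HADOOP_CONF_DIR.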

 

 

(note I am using HDP v3.1.0)
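
The PYSPARK_PYTHON workaround from point 1 can also be driven from Python, which may be convenient when the job is launched by a scheduler. This is a minimal sketch, assuming spark-submit is on the PATH and that the virtualenv is not activated in the calling shell, so the system's default python still runs scripts such as /bin/hdp-select. The function names `build_submit_env`, `build_submit_command`, and `submit` are hypothetical helpers, not part of Spark.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional


def build_submit_env(venv_python: str,
                     base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy the environment and point PySpark at the virtualenv interpreter."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = venv_python
    return env


def build_submit_command(script: str, spark_submit: str = "spark-submit") -> List[str]:
    """Return the argument list for a plain spark-submit of one script."""
    return [spark_submit, script]


def submit(script: str, venv_python: str) -> None:
    """Run spark-submit with PYSPARK_PYTHON set; raises if it exits non-zero."""
    subprocess.run(build_submit_command(script),
                   env=build_submit_env(venv_python),
                   check=True)
```

Calling `submit("sparksubmit.test.py", "/path/to/my/virtualenv/bin/python")` would then mirror the export-and-submit one-liner above.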


4 REPLIES

Master Mentor

@rvillanueva 

There seem to be a couple of issues:

Issue-1. The first issue is related to Python 3, which does not support print statements without parentheses. That is why you are getting this error:

File "/bin/hdp-select", line 255 print "ERROR: Invalid package - " + name
^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?

Please refer to the following threads for similar discussions.

https://community.cloudera.com/t5/Support-Questions/Spark-submit-error-with-Python3-on-Hortonworks-s...
https://community.cloudera.com/t5/Support-Questions/HDP3-0-livy-server-cannot-start/td-p/231126

Try using Python 2.7 (instead of Python 3), because the script "/bin/hdp-select" contains many "print" statements without parentheses, and Python 3 requires parentheses in every 'print' call.

# grep 'print ' /bin/hdp-select
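
The same check as the grep above can be sketched in Python, which makes it easy to list the offending line numbers. This is an illustrative sketch only: the regular expression is a rough heuristic for Python 2 style print statements, and `find_py2_prints` is a hypothetical helper name.

```python
import re
from typing import List

# "print" followed by whitespace and then something other than "(".
PY2_PRINT = re.compile(r"^\s*print\s+[^\s(]")


def find_py2_prints(source: str) -> List[int]:
    """Return 1-based line numbers that look like Python 2 print statements."""
    return [number
            for number, line in enumerate(source.splitlines(), start=1)
            if PY2_PRINT.match(line)]


sample = (
    "import sys\n"
    'print "ERROR: Invalid package - " + name\n'
    'print("ok")\n'
)
print(find_py2_prints(sample))
```

To scan the real script, read it with `open("/bin/hdp-select").read()` and pass the text to `find_py2_prints`.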


Issue-2. The following line indicates that an incorrect path may have been set somewhere in your code, in "../venv/bin/activate", or in the "sparksubmit.test.py" script.

ls: cannot access /usr/hdp//hadoop/lib: No such file or directory

The correct path should be "/usr/hdp/current/hadoop/lib". Notice that "current" is missing in your case: somewhere in your environment that segment is coming through blank, giving "/usr/hdp//hadoop/lib".

Issue-3. The "ClassNotFoundException" errors are a side effect of the point above: because "current" is missing from the printed path, the correct lib directory is not found, so the required JARs are not included in the CLASSPATH.

Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig

 
