
How to use Python3.7 in virtualenv with Spark

New Contributor

Hi community!

I am having problems using Python 3.7 with the spark-submit command.

I have both Python 2.7 and Python 3.7 installed, and I created a virtualenv so that Python 3.7 is used as the interpreter. When I test my code with "spark-submit mycode.py", I get the following error:

SPARK_MAJOR_VERSION is set to 2, using Spark2
  File "/usr/bin/hdp-select", line 226
    print "ERROR: Invalid package - " + name
                                           ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
    at org.apache.spark.launcher.Main.main(Main.java:118)

I have already tried setting the HDP version with --conf options when calling spark-submit, but it did not work:

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dhdp.version=2.6.0.3-8" \
  --conf "spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.0.3-8" \
  --conf "spark.pyspark.python=/usr/local/bin/python3.7" \
  --conf "spark.pyspark.driver.python=/usr/local/bin/python3.7" \
  test2.py


If I try to execute the test code outside the virtualenv (with Python 2), it works properly.


I hope someone can help me figure out the problem.

Thanks

Cristina

2 REPLIES

Re: How to use Python3.7 in virtualenv with Spark

New Contributor

I have the same issue. Were you able to get this working?

Thanks.

Re: How to use Python3.7 in virtualenv with Spark

Explorer

I'm running RHEL and ran into similar problems due to the fun configuration of Python2 and 3 and SCL on RedHat.


The root cause is that the /usr/bin/hdp-select script was written for Python 2 only.

It uses Python 2 syntax, such as print statements, that Python 3 rejects, so it fails with a SyntaxError whenever it is run by a Python 3 interpreter, as happens inside your virtualenv.


To resolve, we had to modify the hdp-select script to be compatible with both versions.

I would attach mine, but it might break your environment because it contains a lot of hardcoded values, such as HDP component versions. So you'll need to do these steps manually.


Steps:
1. Make a backup of the file.

sudo cp -p /usr/bin/hdp-select /usr/bin/hdp-select_original


2. As root, edit the file.

3. Add parentheses around the arguments of all print statements. Example below. Change all occurrences from:

print "a", "b", var, 123

to:

print ("a","b", var, 123)

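Be careful with multi-line print statements, such as those ending with \ or using multi-line strings. I recommend editing in a text editor that supports syntax highlighting to avoid mistakes. Note that under Python 2, a parenthesized print with several comma-separated values prints them as a tuple, for example ('a', 'b', ...), unless the file starts with from __future__ import print_function; a print with a single value behaves the same on both versions.

Also be aware that Python is sensitive to indentation, so don't change any spaces or tabs at the start of a line.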

4. Change os.mkdir from:

    os.mkdir(current, 0755)

to:

    os.mkdir(current, 0o755)

5. Comment out the packages.sorted() call. Change it from:

packages.sorted()

to:

#packages.sorted()

(There are online tools for converting code from Python2 to 3 but they miss some of the above steps.)

6. Save and close the file.

7. Test that hdp-select still works from the shell. If it does, you should be able to run spark-submit without issue.
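
For reference, the edits in steps 3 to 5 can be sketched together as below. This is only an illustration of forms that behave the same on Python 2.7 and Python 3, not the actual contents of hdp-select: the leaves table, the package names, and the temporary directory are made up for the example. It formats multi-value prints into one string so the output is identical on both versions, and it uses sorted() as a possible alternative to commenting the sort out, in case sorted output matters to you.

```python
import os
import tempfile

# Hypothetical stand-in for the package table inside hdp-select.
leaves = {"spark2-client": "spark2", "hadoop-client": "hadoop"}
name = "no-such-package"

# Step 3: a single parenthesized expression prints identically on Python 2
# and 3, so format several values into one string before printing.
message = "ERROR: Invalid package - %s (%d known)" % (name, len(leaves))
print(message)

# Step 4: 0o755 is an octal literal on Python 2.6+ and Python 3.
current = os.path.join(tempfile.mkdtemp(), "current")
os.mkdir(current, 0o755)

# Step 5 alternative: sorted() returns a new sorted list on both versions.
packages = sorted(leaves)
for pkg in packages:
    print(pkg)
```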


A word of caution:

While these changes should be backwards compatible with Python 2, I am not sure what their longer-term impact is. They may cause problems with other HDP components, though that seems highly unlikely.

Making changes to scripts outside of Ambari has other risks - Ambari or some other installation or upgrade process might replace the script with the one from your HDP software bundle, so your spark-submit could stop working if/when that happens.
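
One way to gain some confidence on both points above is to check that the edited script still parses, once under each interpreter and again after any upgrade. The sketch below is a minimal, hypothetical helper (check_syntax and check_file are names made up here, not part of HDP). It only detects syntax errors for whichever interpreter runs it, not behavioral differences.

```python
def check_syntax(source, filename="<script>"):
    """Return None if source compiles under the running interpreter,
    otherwise a short 'file:line: message' string."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as err:
        return "%s:%s: %s" % (filename, err.lineno, err.msg)
    return None


def check_file(path):
    """Read a script from disk and syntax-check it without executing it."""
    with open(path) as handle:
        return check_syntax(handle.read(), path)
```

If check_file("/usr/bin/hdp-select") returns None when run under both Python 2 and Python 3, the script at least parses on both. If it starts reporting a print error again under Python 3, the script has probably been replaced with the original.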

I would file a bug report but we don't have Cloudera support at this time.