
How to use Python 3.7 in a virtualenv with Spark

New Contributor

Hi community!

I am facing some problems using Python 3.7 with the spark-submit command.

I have both Python 2.7 and Python 3.7 installed, and I created a virtualenv so that Python 3.7 is used as the interpreter.
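Roughly how the virtualenv is set up (the venv path is just an example; /usr/local/bin/python3.7 matches the interpreter referenced below):

virtualenv -p /usr/local/bin/python3.7 ~/venvs/py37
source ~/venvs/py37/bin/activate

When I test my code inside the virtualenv, I simply run "spark-submit mycode.py", but I get the following error: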

SPARK_MAJOR_VERSION is set to 2, using Spark2
  File "/usr/bin/hdp-select", line 226
    print "ERROR: Invalid package - " + name
                                            ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
        at org.apache.spark.launcher.Main.main(Main.java:118)

I have already tried setting the HDP version with --conf options when calling spark-submit, but it did not work:

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dhdp.version=2.6.0.3-8" \
  --conf "spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.0.3-8" \
  --conf "spark.pyspark.python=/usr/local/bin/python3.7" \
  --conf "spark.pyspark.driver.python=/usr/local/bin/python3.7" \
  test2.py


If I try to execute the test code outside the virtualenv (with Python 2), it works properly.


I hope someone can help me figure out the problem.

Thanks

Cristina

2 REPLIES

New Contributor

I have the same issue. Were you able to get this working?

Thanks.

Contributor

I'm running RHEL and ran into similar problems, thanks to the fun interplay of Python 2, Python 3, and SCL on Red Hat.


The root cause is that the /usr/bin/hdp-select script was written for Python 2.

The syntax differences between Python 2 and 3 are what produce these errors; unfortunately, the script as shipped is not compatible with both versions.
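You can reproduce the incompatibility in miniature, assuming both interpreters are on your PATH:

python2 -c 'print "hello"'   # Python 2 accepts the print statement and prints: hello
python3 -c 'print "hello"'   # Python 3 rejects it: SyntaxError: Missing parentheses in call to 'print'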


To resolve this, we had to modify the hdp-select script to be compatible with both versions.

I would attach mine, but it might break your environment, since it contains a lot of hardcoded values such as your HDP component versions. So you'll need to make these changes manually.


Steps:
1. Make a backup of the file.

sudo cp -p /usr/bin/hdp-select /usr/bin/hdp-select_original


2. As root, edit the file.

3. Add parentheses around all print statements. For example, change all occurrences of:

print "a", "b", var, 123

to:

print("a", "b", var, 123)

Be careful with multi-line print statements that end with \ or that use multi-line strings. I recommend editing in a text editor that supports syntax highlighting to avoid mistakes.

Also be aware that Python is sensitive to indentation, so take care not to change any spaces or tabs at the start of a line.
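Applied to the line flagged in the traceback above, line 226 would become:

print("ERROR: Invalid package - " + name)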

4. Change os.mkdir from:

    os.mkdir(current, 0755)

to:

    os.mkdir(current, 0o755)
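The bare 0755 octal literal is a syntax error in Python 3, while the 0o755 form is accepted by Python 2.6+ and Python 3 alike:

python3 -c 'print(0o755)'   # 493; valid in Python 2.6+ and Python 3
python3 -c 'print(0755)'    # SyntaxError in Python 3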

5. Comment out the packages.sorted() call. Change:

packages.sorted()

to:

#packages.sorted()

(There are online tools for converting code from Python 2 to 3, but they miss some of the above steps.)
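If the 2to3 tool that ships with Python is available, it can automate steps 3 and 4 (though not step 5). A sketch; review the diff it produces before applying anything:

2to3 -f print -f numliterals /usr/bin/hdp-select           # dry run: prints the proposed diff
sudo 2to3 -f print -f numliterals -w /usr/bin/hdp-select   # applies it, keeping a .bak backup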

6. Save and close the file.

7. Test that hdp-select still works from the shell. If so, you should be able to run spark-submit without issue.
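A quick way to verify, assuming hdp-select supports the versions subcommand (it simply lists the installed HDP versions):

python2 -m py_compile /usr/bin/hdp-select   # syntax check under Python 2
python3 -m py_compile /usr/bin/hdp-select   # syntax check under Python 3
hdp-select versions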


A word of caution:

While these changes should be backwards compatible with Python 2, I am not sure about their longer-term impacts; they could cause problems with other HDP components (though that seems highly unlikely).

Making changes to scripts outside of Ambari carries other risks: Ambari or some other installation or upgrade process might replace the script with the stock one from your HDP software bundle, so spark-submit could stop working if/when that happens.

I would file a bug report, but we don't have Cloudera support at this time.