I am facing problems using Python 3.7 with the spark-submit command.
I have both Python 2.7 and Python 3.7 installed, and I created a virtualenv in order to use Python 3.7 as the interpreter. When I test my code, I simply run "spark-submit mycode.py", but I get the following error:
SPARK_MAJOR_VERSION is set to 2, using Spark2
File "/usr/bin/hdp-select", line 226
    print "ERROR: Invalid package - " + name
                                      ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access /usr/hdp//adoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
    at org.apache.spark.launcher.Main.main(Main.java:118)
I have already tried setting the hdp version with --conf options when calling spark-submit (note that each --conf takes a key=value pair), but it did not work:
spark-submit --conf "spark.driver.extraJavaOptions=-Dhdp.version=18.104.22.168-8" --conf "spark.yarn.am.extraJavaOptions=-Dhdp.version=22.214.171.124-8" --conf "spark.pyspark.python=/usr/local/bin/python3.7" --conf "spark.pyspark.driver.python=/usr/local/bin/python3.7" test2.py
If I execute the test code outside the virtualenv (with Python 2), it works properly.
I hope someone can help me figure out the problem.
I'm running RHEL and ran into similar problems due to the fun configuration of Python 2, Python 3, and SCL on Red Hat.
The root cause is that the /usr/bin/hdp-select script was written for Python 2 only. The differences between Python 2 and 3 mean the script cannot run under a Python 3 interpreter.
To resolve this, we had to modify the hdp-select script so that it is compatible with both versions.
I would attach mine, but it might break your environment, since it contains a lot of hardcoded values such as your HDP component versions. So you'll need to do these steps manually.
1. Make a backup of the file.
sudo cp -p /usr/bin/hdp-select /usr/bin/hdp-select_original
2. As root, edit the file.
3. Add parentheses around all print statements. Example below. Change all occurrences from:
print "a", "b", var, 123
to:
print("a", "b", var, 123)
Be careful of multi-line print statements that end with \ or use multi-line strings. I recommend editing in a text editor that supports syntax highlighting to avoid mistakes.
Also be aware that Python is sensitive to indentation, so don't change any spaces or tabs at the start of a line.
4. Change os.mkdir calls that use Python 2 octal literals, from:
os.mkdir(path, 0755)
to:
os.mkdir(path, 0o755)
(Python 3 requires the 0o prefix for octal numbers.)
5. Comment out the packages.sorted() line.
(There are online tools for converting code from Python 2 to 3, but they miss some of the above steps.)
6. Save and close the file.
7. Test that hdp-select still works from the shell. If it does, you should be able to run spark-submit without issue.
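To illustrate the kind of edits involved, here is a minimal sketch of the Python 2-only constructs and their Python 3-compatible replacements. The names path and name here are placeholders for illustration, not the actual variables in hdp-select:

```python
import os
import tempfile

# Python 2-only forms seen in scripts like hdp-select:
#   print "ERROR: Invalid package - " + name    (print statement)
#   os.mkdir(current, 0755)                     (old octal literal)

name = "example"

# print as a function call is valid in both Python 2 and Python 3
print("ERROR: Invalid package - " + name)

# 0o755 is the octal syntax accepted by Python 3 (and Python 2.6+);
# a bare 0755 is a SyntaxError under Python 3
path = os.path.join(tempfile.mkdtemp(), "demo")
os.mkdir(path, 0o755)
```

Running this under the Python 3.7 virtualenv is a quick way to confirm the converted syntax parses before touching the real script.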
A word of caution:
While these changes should be backwards compatible with Python 2, I am not sure about their longer-term impact; they may cause problems with other HDP components (though that seems highly unlikely).
Making changes to scripts outside of Ambari has other risks: Ambari or some other installation or upgrade process might replace the script with the one from your HDP software bundle, and your spark-submit could stop working if/when that happens.
I would file a bug report, but we don't have Cloudera support at this time.