Created on 02-14-2017 10:49 AM - edited 09-16-2022 04:05 AM
I don't know who's responsible for writing topology.py, but it uses Python 2 syntax, so if I try to run PySpark with Python 3 using
export PYSPARK_PYTHON=python3
I get tons of stack traces.
Uri
Created 02-14-2017 10:50 AM
Created 02-14-2017 12:44 PM
Any workaround for earlier versions? I'm on 5.5 and I don't manage the cluster.
Created 02-14-2017 04:16 PM
Depending on how you are running the job, you may be able to override the topology script parameter and/or replace the topology.py script with one that is Python 3 compatible. If you're submitting jobs from the command line, you'd usually copy /etc/hadoop/conf to some custom directory /path/to/customized/conf, make changes there, then set HADOOP_CONF_DIR=/path/to/customized/conf and run your job.
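For anyone who wants to script the copy-and-override step, here is a minimal sketch in Python. The helper name make_custom_conf is made up for illustration, and the paths in the usage comment are the same placeholders as above. It only copies the directory and builds an environment with HADOOP_CONF_DIR set; editing the copied files is still up to you.

```python
import os
import shutil


def make_custom_conf(src_conf, dest_conf):
    """Copy a Hadoop conf directory and return an environment that points at the copy.

    make_custom_conf is a hypothetical helper, not part of Hadoop or Cloudera Manager.
    """
    # shutil.copytree requires that dest_conf does not exist yet.
    shutil.copytree(src_conf, dest_conf)
    # Start from the current environment and override only HADOOP_CONF_DIR.
    env = dict(os.environ)
    env["HADOOP_CONF_DIR"] = dest_conf
    return env


# Usage (placeholder paths; my_job.py is a hypothetical job script):
# import subprocess
# env = make_custom_conf("/etc/hadoop/conf", "/path/to/customized/conf")
# subprocess.call(["spark-submit", "my_job.py"], env=env)
```

Passing the environment to the child process rather than exporting it globally keeps the override limited to the one job you launch.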
Assuming you can change that topology script, here's the relevant portion of the diff that you can apply:
@@ -1,8 +1,8 @@
 #!/usr/bin/env python
 #
-# Copyright (c) 2010-2012 Cloudera, Inc. All rights reserved.
+# Copyright (c) 2016 Cloudera, Inc. All rights reserved.
 #
-
+
 '''
 This script is provided by CMF for hadoop to determine network/rack topology.
 It is automatically generated and could be replaced at any time. Any changes
@@ -12,8 +12,13 @@ made to it will be lost when this happens.
 import os
 import sys
 import xml.dom.minidom
-from string import join
-
+
+try:
+  xrange
+except NameError:
+  # support for python3, which basically renamed xrange to range
+  xrange = range
+
 def main():
   MAP_FILE = '{{CMF_CONF_DIR}}/topology.map'
   DEFAULT_RACK = '/default'
@@ -40,14 +45,14 @@ def main():
       map[node.getAttribute("name")] = node.getAttribute("rack")
   except:
     default_rack = "".join([ DEFAULT_RACK for _ in xrange(max_elements)])
-    print default_rack
+    print(default_rack)
     return -1
-
+
   default_rack = "".join([ DEFAULT_RACK for _ in xrange(max_elements)])
   if len(sys.argv)==1:
-    print default_rack
+    print(default_rack)
   else:
-    print join([map.get(i, default_rack) for i in sys.argv[1:]], " ")
+    print(" ".join([map.get(i, default_rack) for i in sys.argv[1:]]))
   return 0
 
 if __name__ == "__main__":
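To see the three changes from the diff working together (the xrange fallback, print as a function, and str.join instead of string.join), here is a small self-contained sketch that runs on both Python 2 and Python 3. It is not the real topology.py: the function name resolve_racks, the in-memory XML string, the "topology" root element, the "node" tag name, and the way max_elements is computed are all assumptions made for illustration, since the diff does not show those parts of the script.

```python
import xml.dom.minidom

try:
    xrange
except NameError:
    # Python 3 has no xrange; range is the equivalent.
    xrange = range

DEFAULT_RACK = "/default"


def resolve_racks(map_xml, hosts):
    """Return the rack of each host as one space-separated string.

    Illustrative only: the parsing details below are assumed, not taken
    from the actual script.
    """
    rack_of = {}
    max_elements = 1
    dom = xml.dom.minidom.parseString(map_xml)
    for node in dom.getElementsByTagName("node"):  # tag name is an assumption
        rack = node.getAttribute("rack")
        # Assumed: the default rack gets as many levels as the deepest rack.
        max_elements = max(max_elements, rack.count("/"))
        rack_of[node.getAttribute("name")] = rack
    default_rack = "".join([DEFAULT_RACK for _ in xrange(max_elements)])
    if not hosts:
        return default_rack
    # str.join replaces the Python 2 only string.join.
    return " ".join([rack_of.get(h, default_rack) for h in hosts])


SAMPLE_MAP = '<topology><node name="10.0.0.1" rack="/dc1/rack1"/></topology>'
print(resolve_racks(SAMPLE_MAP, ["10.0.0.1", "10.0.0.9"]))
```

With the sample map, the known host resolves to its rack and the unknown host falls back to the default rack repeated to the same depth.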
Created on 11-08-2018 05:58 PM - edited 11-08-2018 06:02 PM
Hi,
I'm the cluster manager, and our CDH version is 5.7.2.
I have the same problem. Can I change some parameters in CM to solve it?
Created 12-10-2018 09:23 AM
One easy option in old CDH versions is just to change the shebang at the beginning of the script to:
#!/usr/bin/env python2
This is actually more precise than the default, because the script is a Python 2 script and Python 3 is coming, so it is good to be specific about which interpreter version the script needs. We could be even more explicit and require python2.7.
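If you would rather apply the shebang change with a script than by hand, something along these lines might do. The helper name pin_shebang is invented for this example. It works on the script text only, so you would read your copy of topology.py, pass the contents through it, and write the result back yourself.

```python
def pin_shebang(script_text, interpreter="python2"):
    """Return script_text with its first-line shebang pointing at `interpreter`.

    pin_shebang is a hypothetical helper. Text without a shebang on the
    first line is returned unchanged.
    """
    lines = script_text.split("\n")
    if lines and lines[0].startswith("#!"):
        lines[0] = "#!/usr/bin/env " + interpreter
    return "\n".join(lines)


# Usage: pin a script to Python 2, or more explicitly to python2.7.
# pinned = pin_shebang(open("topology.py").read(), "python2.7")
```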