Support Questions

urilaserson · ‎02-14-2017

I don't know who's responsible for writing topology.py, but it uses Python 2 syntax, so if I try to run PySpark with Python 3 using

export PYSPARK_PYTHON=python3

I get tons of stacktraces.

Uri
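For anyone hitting the same stack traces, the failure can be reproduced outside the cluster. The sketch below is only illustrative: it checks for the three Python 2 constructs that appear in the script's diff later in this thread (the `print` statement, `from string import join`, and `xrange`), and the helper name `python3_problems` is made up for this example.

```python
import string


def python3_problems():
    """Return the Python 2 constructs from topology.py that fail on this interpreter."""
    problems = []

    # The print statement is a syntax error under Python 3.
    try:
        compile("print default_rack", "<topology.py>", "exec")
    except SyntaxError:
        problems.append("print statement")

    # string.join was removed in Python 3; str.join replaces it.
    if not hasattr(string, "join"):
        problems.append("string.join")

    # xrange was renamed to range in Python 3.
    try:
        xrange
    except NameError:
        problems.append("xrange")

    return problems


if __name__ == "__main__":
    print(python3_problems())
```

Under Python 2 the list should come back empty; under Python 3 all three constructs are reported.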

Darren · ‎02-14-2017

This was fixed in CM 5.9.

urilaserson · ‎02-14-2017

Any workaround for earlier versions? I'm on 5.5 and I don't manage the cluster.

Darren · ‎02-14-2017

Depending on how you are running the job, you may be able to override the topology script parameter and/or replace the topology.py script with one that is Python 3 compatible. If you're submitting jobs from the command line, you'd usually copy /etc/hadoop/conf to some custom directory /path/to/customized/conf, make changes there, then set HADOOP_CONF_DIR=/path/to/customized/conf and run your job.
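The copy-and-override steps could be scripted roughly as follows. This is a minimal sketch, not a tested procedure for any particular cluster: it assumes the copied directory contains a core-site.xml that sets the Hadoop property `net.topology.script.file.name` (the setting that names the rack script), and the function name `make_custom_conf` and its arguments are hypothetical.

```python
import os
import shutil
import xml.etree.ElementTree as ET


def make_custom_conf(src_conf, dst_conf, new_script):
    """Copy a Hadoop conf dir and point its topology setting at new_script."""
    # dst_conf must not exist yet; copytree creates it.
    shutil.copytree(src_conf, dst_conf)

    # Assumption: core-site.xml holds net.topology.script.file.name.
    core_site = os.path.join(dst_conf, "core-site.xml")
    tree = ET.parse(core_site)
    for prop in tree.getroot().findall("property"):
        value = prop.find("value")
        if prop.findtext("name") == "net.topology.script.file.name" and value is not None:
            value.text = new_script
    tree.write(core_site)

    # Environment for the job, with HADOOP_CONF_DIR set to the copy.
    return dict(os.environ, HADOOP_CONF_DIR=dst_conf)
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching the job, which has the same effect as exporting HADOOP_CONF_DIR in the shell first.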

Assuming you can change that topology script, here's the relevant portion of the diff that you can apply:

@@ -1,8 +1,8 @@
 #!/usr/bin/env python
 #
-# Copyright (c) 2010-2012 Cloudera, Inc. All rights reserved.
+# Copyright (c) 2016 Cloudera, Inc. All rights reserved.
 #
- 
+
 '''
 This script is provided by CMF for hadoop to determine network/rack topology.
 It is automatically generated and could be replaced at any time. Any changes
@@ -12,8 +12,13 @@ made to it will be lost when this happens.
 import os
 import sys
 import xml.dom.minidom
-from string import join
- 
+
+try:
+  xrange
+except NameError:
+  # support for python3, which basically renamed xrange to range
+  xrange = range
+
 def main():
   MAP_FILE = '{{CMF_CONF_DIR}}/topology.map'
   DEFAULT_RACK = '/default'
@@ -40,14 +45,14 @@ def main():
       map[node.getAttribute("name")] = node.getAttribute("rack")
   except:
     default_rack = "".join([ DEFAULT_RACK for _ in xrange(max_elements)])
-    print default_rack
+    print(default_rack)
     return -1
- 
+
   default_rack = "".join([ DEFAULT_RACK for _ in xrange(max_elements)])
   if len(sys.argv)==1:
-    print default_rack
+    print(default_rack)
   else:
-    print join([map.get(i, default_rack) for i in sys.argv[1:]], " ")
+    print(" ".join([map.get(i, default_rack) for i in sys.argv[1:]]))
   return 0

 if __name__ == "__main__":
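
To see the three changes from the diff working together outside the generated script, here is a small standalone sketch that runs under both Python 2 and Python 3. It is not the Cloudera script itself: the function `resolve_racks` and the sample mapping are invented for illustration, and the XML parsing of topology.map is left out.

```python
import sys

try:
    xrange
except NameError:
    # Python 3: xrange was renamed to range.
    xrange = range

DEFAULT_RACK = "/default"


def resolve_racks(rack_map, hosts, max_elements=1):
    """Return the space-separated racks for hosts, using the patched idioms."""
    default_rack = "".join([DEFAULT_RACK for _ in xrange(max_elements)])
    if not hosts:
        return default_rack
    # str.join replaces the removed string.join(list, sep).
    return " ".join([rack_map.get(h, default_rack) for h in hosts])


if __name__ == "__main__":
    sample = {"node1.example.com": "/rack1"}  # made-up mapping
    # print(...) with one argument behaves the same on Python 2 and 3.
    print(resolve_racks(sample, sys.argv[1:]))
```

Hosts missing from the mapping fall back to the default rack, which mirrors the `map.get(i, default_rack)` call in the diff.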

zbz · ‎11-08-2018

Hi,

I'm the cluster manager, and our CDH version is 5.7.2.

I have the same problem. Is there a parameter I can change in CM to solve it?

javicacheiro · ‎12-10-2018

One easy option in old CDH versions is just to change the shebang at the beginning of the script to:

#!/usr/bin/env python2

Actually this will be more precise than the default, because the script really is a Python 2 script and Python 3 is coming, so it is good to be specific about the interpreter version the script needs. We could be even more explicit and require python2.7.
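
If the script has to be patched on several hosts, the shebang change can be applied programmatically. The following is a rough sketch under the assumption that you are allowed to rewrite the file; `pin_shebang` is a hypothetical helper, not something shipped with CDH.

```python
def pin_shebang(script_text, interpreter="python2"):
    """Return script_text with its first-line shebang replaced to name interpreter."""
    lines = script_text.split("\n")
    new_shebang = "#!/usr/bin/env " + interpreter
    if lines and lines[0].startswith("#!"):
        # Replace the existing shebang in place.
        lines[0] = new_shebang
    else:
        # No shebang present: add one as the new first line.
        lines.insert(0, new_shebang)
    return "\n".join(lines)
```

Passing `interpreter="python2.7"` gives the more explicit variant mentioned above. Keep in mind that the script's own header says it is generated and can be replaced at any time, so the change may need to be reapplied.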
