
Use of Python version 3 scripts for pyspark with HDP 2.4

Solved


New Contributor

Hello,

We have a cluster running HDP 2.4.2.0 and we hit an issue when running a Python 3 script through spark-submit in client mode. When Spark starts, a Python exception is raised from hdp-select, and we could deduce that it is a Python 2 vs Python 3 problem.

A follow-up question: is there any trick, or a right way, to run Python 3 scripts with pyspark on HDP?

See the following trace:

File "/usr/bin/hdp-select", line 202
  print "ERROR: Invalid package - " + name
  ^
SyntaxError: Missing parentheses in call to 'print'

[...]

WARN ScriptBasedMapping: Exception running /etc/hadoop/conf/topology_script.py 172.28.15.90 
ExitCodeException exitCode=1:  File "/etc/hadoop/conf/topology_script.py", line 62
  print rack
  ^
SyntaxError: Missing parentheses in call to 'print'

   at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
   at org.apache.hadoop.util.Shell.run(Shell.java:487)
   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
   at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
   at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
   at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
   at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38)
   at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:292)
   at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:284)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:284)

   at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:196)
   at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
   at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
   at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
1 ACCEPTED SOLUTION


Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

Hi @Artem Ervits, @Michael Young, thanks for your replies.

After more investigation we found that the issue mentioned is not critical for our results: it seems to be raised by the underlying HDP stack and only pollutes our logs. We found our correct results amid the noise and could continue our development.

PYSPARK_PYTHON and LD_LIBRARY_PATH are correctly set. We also found a problem with PYTHONHASHSEED, but corrected it by setting it to a fixed value.

So I could mark the thread as resolved, but how do you explain the typical Python version error (the 'SyntaxError' on 'print' without parentheses in the hdp-select code) coming from HDP stack code? Could it be that HDP's Spark/YARN integration depends on other HDP stack modules that break Python 3 compatibility?
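For anyone hitting the PYTHONHASHSEED part of this: a minimal sketch of the fix, assuming a YARN client-mode submit (the value 0 and the script name are just illustrative). On Python 3, string hashing is randomized per process, so the seed has to match between the driver and the executors:

```shell
# Pin PYTHONHASHSEED for the driver process...
export PYTHONHASHSEED=0
# ...and propagate the same value to the YARN executors, e.g.:
#   spark-submit --conf spark.executorEnv.PYTHONHASHSEED=0 my_script.py
```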

8 REPLIES

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

Mentor

Have you tried setting the PYSPARK_PYTHON environment variable?

export PYSPARK_PYTHON=/usr/local/bin/python3.3

Here is the documentation on Spark environment variables: https://spark.apache.org/docs/1.6.0/configuration.html#environment-variables
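As a sketch (the interpreter path and script name below are just examples), you usually want the driver and the executors pointed at the same Python 3 before submitting:

```shell
# Use the same Python 3 interpreter for executors and the driver.
export PYSPARK_PYTHON=/usr/local/bin/python3.3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.3
# Then submit as usual, e.g.:
#   spark-submit --master yarn --deploy-mode client my_script.py
```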


Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

Hi Toral,

Can you please explain clearly how you solved this error? I'm getting the same one.

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

I had this issue; I modified the imports section of topology_script.py to be Python 3 compatible:

from __future__ import print_function   # print() syntax works on Python 2 as well
import sys, os
try:
    from string import join              # Python 2 only; removed in Python 3
except ImportError:
    join = lambda s: " ".join(s)         # equivalent space-join on Python 3
try:
    import ConfigParser                  # Python 2 module name
except ImportError:                      # catches the failed import on Python 3
    import configparser as ConfigParser  # (ModuleNotFoundError only exists on 3.6+)
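For reference, the join() shim above can be sanity-checked on its own under either interpreter (the rack value below is just an illustration, not from the real script):

```python
from __future__ import print_function  # print() syntax on Python 2 as well

try:
    from string import join            # Python 2 only; removed in Python 3
except ImportError:
    join = lambda s: " ".join(s)       # equivalent space-join on Python 3

# The rest of topology_script.py can keep calling join()/print()
# unchanged on either interpreter:
rack = join(["/default-rack"])
print(rack)  # -> /default-rack
```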

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

We've done this and changed the HDFS configuration in Ambari to have

net.topology.script.file.name=/etc/hadoop/conf/topology_script.py

The only problem is that when we restart HDFS this file gets overwritten. How do I stop this behaviour?

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

Dr. Breitweg, you'll need to make the change through Ambari rather than by manually editing the config file; please refer to the following page:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.0.3/bk_ambari-operations/content/set_rack_id_in...

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

It works on HDP 3.1.0.0 with Python 3.7.

Thanks! You've saved my day.