
Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

Hello,

We have a cluster running HDP 2.4.2.0 and we are facing an issue when running a Python 3 script through spark-submit in client mode. When Spark starts up, a Python exception is raised by hdp-select, and we deduced that it is a Python 2 vs. Python 3 problem.

As a follow-up question: is there a trick, or a right way, to run Python 3 scripts with pyspark on HDP?
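
For context, a sketch of the kind of invocation we use (interpreter path and script name are illustrative):

export PYSPARK_PYTHON=/usr/bin/python3
spark-submit --master yarn --deploy-mode client our_py3_script.py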

See the following trace:

File "/usr/bin/hdp-select", line 202
  print "ERROR: Invalid package - " + name
  ^
SyntaxError: Missing parentheses in call to 'print'

[...]

WARN ScriptBasedMapping: Exception running /etc/hadoop/conf/topology_script.py 172.28.15.90 
ExitCodeException exitCode=1:  File "/etc/hadoop/conf/topology_script.py", line 62
  print rack
  ^
SyntaxError: Missing parentheses in call to 'print'

   at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
   at org.apache.hadoop.util.Shell.run(Shell.java:487)
   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
   at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
   at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
   at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
   at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38)
   at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:292)
   at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:284)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:284)
   at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:196)
   at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
   at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
   at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
1 ACCEPTED SOLUTION

New Contributor

Hi @Artem Ervits, @Michael Young, thanks for your replies.

After further investigation, we found that the issue mentioned is not critical for our results: it seems to be raised by the underlying HDP stack and only pollutes our logs. We found our correct results amid the noise and could continue our development.

PYSPARK_PYTHON and LD_LIBRARY_PATH are correctly set. We also found a problem with PYTHONHASHSEED, but corrected it by setting it to a fixed value.
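
For anyone hitting the same PYTHONHASHSEED error, a sketch of what we mean by a fixed value (0 is illustrative; it just has to be identical on the driver and every executor, and the script name is hypothetical):

export PYTHONHASHSEED=0
spark-submit --conf spark.executorEnv.PYTHONHASHSEED=0 --master yarn --deploy-mode client our_py3_script.py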

So I could mark this thread as resolved, but how would you explain the typical Python version error (the SyntaxError on print without parentheses in the hdp-select code) coming from HDP stack code? Could it be some coupling between HDP's Spark / YARN integration and other HDP stack modules that breaks Python 3 compatibility?


8 REPLIES

Master Mentor

Super Guru

Have you tried setting the PYSPARK_PYTHON environment variable?

export PYSPARK_PYTHON=/usr/local/bin/python3.3

Here is the documentation for configuration information: https://spark.apache.org/docs/1.6.0/configuration.html#environment-variables
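
In YARN client mode you may also want to pin the driver-side interpreter; a minimal sketch, assuming the same interpreter path exists on every node (paths and script name illustrative):

export PYSPARK_PYTHON=/usr/local/bin/python3.3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.3
spark-submit --master yarn --deploy-mode client your_script.py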

New Contributor

Hi Toral,

Can you please explain clearly how you solved this error? I'm getting the same error.

Contributor

I had this issue; I modified the imports section of topology_script.py so that it is compatible with both Python 2 and Python 3:

from __future__ import print_function
import sys, os
try:
    # Python 2: string.join(words) joins with a single space
    from string import join
except ImportError:
    # Python 3: string.join was removed; emulate the default behaviour
    join = lambda s: " ".join(s)
try:
    # Python 2 module name
    import ConfigParser
except ImportError:
    # Python 3: the module was renamed. Catching ImportError (rather than
    # ModuleNotFoundError, which only exists in 3.6+) works on both versions.
    import configparser as ConfigParser
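
Note that once print_function is imported, any remaining print statements in the body of the script must use call syntax under both interpreters; for example, the line flagged in the trace above becomes:

print(rack)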

Contributor

We've done this and changed the HDFS configuration in Ambari to have

net.topology.script.file.name=/etc/hadoop/conf/topology_script.py

The only problem is that when we restart HDFS this file gets overwritten. How do I stop this behaviour?

Contributor

Dr. Breitweg, you'll need to make the change through Ambari rather than by manually editing the config file; please refer to the following page:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.0.3/bk_ambari-operations/content/set_rack_id_in...
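
If you really do need a custom script, one approach is to copy your edited script to a path Ambari does not manage and point the property at it through the Ambari API, so restarts no longer overwrite it. A sketch, assuming the configs.sh helper that ships with Ambari server (host, cluster name, credentials, and target path are all illustrative):

cp /etc/hadoop/conf/topology_script.py /usr/local/bin/custom_topology_script.py
/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin \
  set ambari-host.example.com MyCluster core-site \
  net.topology.script.file.name /usr/local/bin/custom_topology_script.py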

Explorer

This works on HDP 3.1.0.0 with Python 3.7.

Thanks! You've saved my day.