Archives of Support Questions (Read Only)

This is an archived board kept for historical reference. Information and links may no longer be available or relevant.
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Use of Python version 3 scripts for pyspark with HDP 2.4

avatar
New Member

Hello,

We have a cluster running HDP 2.4.2.0 and we hit an issue when running a Python 3 script through spark-submit in client mode. When Spark starts, a Python exception is raised from hdp-select, and we deduced that it is a Python 2 vs. Python 3 problem.

A follow-up question: is there a trick, or a right way, to run Python 3 scripts with pyspark on HDP?

See the following trace:

File "/usr/bin/hdp-select", line 202
  print "ERROR: Invalid package - " + name
  ^
SyntaxError: Missing parentheses in call to 'print'

[...]

WARN ScriptBasedMapping: Exception running /etc/hadoop/conf/topology_script.py 172.28.15.90 
ExitCodeException exitCode=1:  File "/etc/hadoop/conf/topology_script.py", line 62
  print rack
  ^
SyntaxError: Missing parentheses in call to 'print'

   at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
   at org.apache.hadoop.util.Shell.run(Shell.java:487)
   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
   at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
   at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
   at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
   at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38)
   at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:292)
   at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:284)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:284)

   at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:196)
   at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
   at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
   at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
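For what it's worth, the SyntaxError above is the classic symptom of a Python 2 print statement being parsed by a Python 3 interpreter. A minimal reproduction under Python 3, using the snippet from the trace (the variable name in the string is whatever hdp-select uses; here it only needs to compile, not run):

```python
import sys

# The Python 2 statement form, as it appears in /usr/bin/hdp-select line 202:
py2_src = 'print "ERROR: Invalid package - " + name\n'

if sys.version_info[0] >= 3:
    try:
        # compile() parses the source without executing it
        compile(py2_src, "<hdp-select>", "exec")
    except SyntaxError as err:
        print("Python 3 rejects it:", err.msg)

# The function form compiles under both Python 2.7 and Python 3:
compile('print("ERROR: Invalid package - " + name)\n', "<fixed>", "exec")
```

So any HDP helper script written for Python 2 will fail this way as soon as it is executed by a Python 3 interpreter.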
1 ACCEPTED SOLUTION

avatar
New Member

Hi @Artem Ervits, @Michael Young, thanks for your replies.

After more investigation we found that the issue mentioned is not critical for our results: it seems to be raised by the underlying HDP stack and only pollutes our logs. We found our correct results among the noise and could continue our development.

PYSPARK_PYTHON and LD_LIBRARY_PATH are correctly set. We also found a problem with PYTHONHASHSEED, but corrected it by setting it to a fixed value.

So I could mark the thread as resolved, but how do you explain the typical Python version error (the SyntaxError on print without parentheses in the hdp-select code) coming from HDP stack code? Could there be some coupling between HDP's Spark/YARN integration and other HDP stack modules that breaks Python 3 compatibility?
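For reference, the PYTHONHASHSEED workaround looked roughly like this (the value 0 is illustrative; any fixed integer works, as long as the driver and all executors agree):

```python
import os

# Under Python 3, string hashing is randomized per interpreter, so
# reduceByKey/distinct fail unless every process uses the same seed.
# Set it before Spark starts on the driver side:
os.environ["PYTHONHASHSEED"] = "0"

# For YARN executors the variable has to be propagated explicitly,
# e.g. via --conf spark.executorEnv.PYTHONHASHSEED=0 on spark-submit
# or the equivalent entry in spark-defaults.conf.
```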


8 REPLIES

avatar
Super Guru

Have you tried setting the PYSPARK_PYTHON environment variable?

export PYSPARK_PYTHON=/usr/local/bin/python3.3

Here is the documentation for configuration information: https://spark.apache.org/docs/1.6.0/configuration.html#environment-variables
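If you launch from a wrapper script, the variables can also be set in Python before pyspark is imported; a minimal sketch (the interpreter path is the one from above and may differ on your cluster):

```python
import os

# Point PySpark at a Python 3 interpreter. PYSPARK_PYTHON is used for
# the executors, PYSPARK_DRIVER_PYTHON for the driver; they should
# normally match to avoid a driver/worker version-mismatch error.
os.environ["PYSPARK_PYTHON"] = "/usr/local/bin/python3.3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/bin/python3.3"
```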


avatar
New Member

Hi Toral,

Can you please explain clearly how you solved this error? I'm getting the same one.

avatar
New Member

I had this issue; I fixed it by modifying the imports section of the topology_script to be compatible with both Python 2 and Python 3:

from __future__ import print_function  # makes print() work under Python 2
import sys, os
try:
    # Python 2: string.join(words) joins with a space by default
    from string import join
except ImportError:
    # Python 3: the string module has no join; emulate the old default
    join = lambda s: " ".join(s)
try:
    # Python 2 module name
    import ConfigParser
except ImportError:
    # Python 3 renamed the module to configparser
    # (catching ImportError rather than ModuleNotFoundError also
    # covers Python versions before 3.6)
    import configparser as ConfigParser
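With the shim in place, the old Python 2 names keep working under Python 3; a quick sanity check (using Python 3 stand-ins equivalent to the fallback branches above):

```python
# Python 3 equivalents of the shim's fallback branches:
join = lambda s: " ".join(s)
import configparser as ConfigParser

# join() behaves like the old string.join with its default separator
assert join(["rack", "01"]) == "rack 01"

# The rest of the script can keep using the Python 2 module name
parser = ConfigParser.ConfigParser()
```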

avatar
New Member

We've done this and changed the HDFS configuration in Ambari to have

net.topology.script.file.name=/etc/hadoop/conf/topology_script.py

The only problem is that when we restart HDFS, this file gets overwritten. How do I stop this behaviour?

avatar
New Member

Dr. Breitweg, you'll need to make the change in Ambari rather than manually editing the config file; please refer to the following page:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.0.3/bk_ambari-operations/content/set_rack_id_in...

avatar
New Member

It works on HDP 3.1.0.0 with Python 3.7.

Thanks! You've saved my day.