
Use of Python version 3 scripts for pyspark with HDP 2.4

Solved


New Contributor

Hello,

We have a cluster running HDP 2.4.2.0 and we hit an issue when running a Python 3 script through spark-submit in client mode. When Spark starts, a Python exception is raised from hdp-select, and we could deduce that it is a Python 2 vs Python 3 problem.

A follow-up question: is there any trick, or a right way, to run Python 3 scripts with pyspark on HDP?

See the following trace:

File "/usr/bin/hdp-select", line 202
  print "ERROR: Invalid package - " + name
  ^
SyntaxError: Missing parentheses in call to 'print'

[...]

WARN ScriptBasedMapping: Exception running /etc/hadoop/conf/topology_script.py 172.28.15.90 
ExitCodeException exitCode=1:  File "/etc/hadoop/conf/topology_script.py", line 62
  print rack
  ^
SyntaxError: Missing parentheses in call to 'print'

   at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
   at org.apache.hadoop.util.Shell.run(Shell.java:487)
   at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.runResolveCommand(ScriptBasedMapping.java:251)
   at org.apache.hadoop.net.ScriptBasedMapping$RawScriptBasedMapping.resolve(ScriptBasedMapping.java:188)
   at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:119)
   at org.apache.hadoop.yarn.util.RackResolver.coreResolve(RackResolver.java:101)
   at org.apache.hadoop.yarn.util.RackResolver.resolve(RackResolver.java:81)
   at org.apache.spark.scheduler.cluster.YarnScheduler.getRackForHost(YarnScheduler.scala:38)
   at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:292)
   at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$1.apply(TaskSchedulerImpl.scala:284)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:284)

   at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:196)
   at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:123)
   at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
   at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
1 ACCEPTED SOLUTION


Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

Hi @Artem Ervits, @Michael Young, thanks for your replies.

After more investigation we found that the issue mentioned is not critical for our results: it seems to be raised by the underlying HDP stack and only pollutes our logs. We found our correct results amid the noise and could continue our development.

PYSPARK_PYTHON and LD_LIBRARY_PATH are correctly set. We also found a problem with PYTHONHASHSEED, but corrected it by setting it to a fixed value.

So I could mark the thread as resolved, but how do you explain the typical Python version error (the 'SyntaxError' on 'print' without parentheses in the hdp-select code) coming from HDP stack code? Could it be that HDP's Spark/YARN integration depends on other HDP stack modules that break Python 3 compatibility?
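For anyone hitting the PYTHONHASHSEED part of this: a minimal sketch of the fix, assuming a YARN client-mode submit (the value 0 and the script name are just illustrative). On Python 3, string hashing is randomized per process, so the seed has to match between the driver and the executors:

```shell
# Pin PYTHONHASHSEED for the driver process...
export PYTHONHASHSEED=0
# ...and propagate the same value to the YARN executors, e.g.:
#   spark-submit --conf spark.executorEnv.PYTHONHASHSEED=0 my_script.py
```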

8 REPLIES

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

Mentor

Have you tried setting the PYSPARK_PYTHON environment variable?

export PYSPARK_PYTHON=/usr/local/bin/python3.3

Here is the documentation on Spark environment variables: https://spark.apache.org/docs/1.6.0/configuration.html#environment-variables
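As a sketch (the interpreter path and script name below are just examples), you usually want the driver and the executors pointed at the same Python 3 before submitting:

```shell
# Use the same Python 3 interpreter for executors and the driver.
export PYSPARK_PYTHON=/usr/local/bin/python3.3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.3
# Then submit as usual, e.g.:
#   spark-submit --master yarn --deploy-mode client my_script.py
```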


Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

Hi Toral,

Can you please explain clearly how you solved this error? I'm getting the same one.

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

I had this issue; I modified the imports section of topology_script.py to be Python 3 compatible:

from __future__ import print_function   # print() syntax works on Python 2 as well
import sys, os
try:
    from string import join              # Python 2 only; removed in Python 3
except ImportError:
    join = lambda s: " ".join(s)         # equivalent space-join on Python 3
try:
    import ConfigParser                  # Python 2 module name
except ImportError:                      # catches the failed import on Python 3
    import configparser as ConfigParser  # (ModuleNotFoundError only exists on 3.6+)
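For reference, the join() shim above can be sanity-checked on its own under either interpreter (the rack value below is just an illustration, not from the real script):

```python
from __future__ import print_function  # print() syntax on Python 2 as well

try:
    from string import join            # Python 2 only; removed in Python 3
except ImportError:
    join = lambda s: " ".join(s)       # equivalent space-join on Python 3

# The rest of topology_script.py can keep calling join()/print()
# unchanged on either interpreter:
rack = join(["/default-rack"])
print(rack)  # -> /default-rack
```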

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

We've done this and changed the HDFS configuration in Ambari to have

net.topology.script.file.name=/etc/hadoop/conf/topology_script.py

The only problem is that when we restart HDFS this file gets overwritten. How do I stop this behaviour?

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

Dr. Breitweg, you'll need to make the change through Ambari rather than by manually editing the config file; please refer to the following page:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.5.0.3/bk_ambari-operations/content/set_rack_id_in...

Re: Use of Python version 3 scripts for pyspark with HDP 2.4

New Contributor

It works on HDP 3.1.0.0 with Python 3.7.

Thanks! You've saved my day.