<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Question: PYSPARK with different python versions on yarn is failing with errors (Archives of Support Questions, Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136080#M18985</link>
    <description>&lt;P&gt;Hi, we have HDP 2.3.4 with Python 2.6.6 installed on our cluster. PySpark works perfectly with version 2.6.6. We have a use case that requires the pandas package, and for that we need Python 3, so we installed Python 3.4 in a different location and updated the variables below in spark-env.sh:&lt;/P&gt;&lt;PRE&gt;export PYSPARK_PYTHON=/opt/rh/rh-python34/root/usr/bin/python3.4

export LD_LIBRARY_PATH=/opt/rh/rh-python34/root/usr/lib64/ 

export PYTHONHASHSEED=0&lt;/PRE&gt;&lt;P&gt;We are using the spark-submit command below to submit the jobs.&lt;/P&gt;&lt;PRE&gt;/usr/hdp/current/spark-client/bin/spark-submit --master yarn-client --num-executors 10 --conf spark.executor.memory=1g --conf spark.yarn.queue=batch --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64 src/main/python/test.py&lt;/PRE&gt;&lt;P&gt;We get the error below when executing the job. Can you please let me know if there is another way of accomplishing this?&lt;/P&gt;&lt;PRE&gt;py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 78 in stage 4.0 failed 4 times, most recent failure: Lost task 78.3 in stage 4.0 (TID 1585, lxhdpwrktst006.lowes.com): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1704, in add_shuffle_key
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/rdd.py", line 74, in portable_hash
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
at org.apache.spark.api.python.PythonRunner$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$anon$1.&amp;lt;init&amp;gt;(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)&lt;/PRE&gt;</description>
    <pubDate>Thu, 11 Feb 2016 02:38:39 GMT</pubDate>
    <dc:creator>goutham_koneru</dc:creator>
    <dc:date>2016-02-11T02:38:39Z</dc:date>
    <item>
      <title>PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136080#M18985</link>
      <description>&lt;P&gt;Hi, we have HDP 2.3.4 with Python 2.6.6 installed on our cluster. PySpark works perfectly with version 2.6.6. We have a use case that requires the pandas package, and for that we need Python 3, so we installed Python 3.4 in a different location and updated the variables below in spark-env.sh:&lt;/P&gt;&lt;PRE&gt;export PYSPARK_PYTHON=/opt/rh/rh-python34/root/usr/bin/python3.4

export LD_LIBRARY_PATH=/opt/rh/rh-python34/root/usr/lib64/ 

export PYTHONHASHSEED=0&lt;/PRE&gt;&lt;P&gt;We are using the spark-submit command below to submit the jobs.&lt;/P&gt;&lt;PRE&gt;/usr/hdp/current/spark-client/bin/spark-submit --master yarn-client --num-executors 10 --conf spark.executor.memory=1g --conf spark.yarn.queue=batch --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64 src/main/python/test.py&lt;/PRE&gt;&lt;P&gt;We get the error below when executing the job. Can you please let me know if there is another way of accomplishing this?&lt;/P&gt;&lt;PRE&gt;py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 78 in stage 4.0 failed 4 times, most recent failure: Lost task 78.3 in stage 4.0 (TID 1585, lxhdpwrktst006.lowes.com): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1704, in add_shuffle_key
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/rdd.py", line 74, in portable_hash
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
at org.apache.spark.api.python.PythonRunner$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$anon$1.&amp;lt;init&amp;gt;(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)&lt;/PRE&gt;</description>
      <pubDate>Thu, 11 Feb 2016 02:38:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136080#M18985</guid>
      <dc:creator>goutham_koneru</dc:creator>
      <dc:date>2016-02-11T02:38:39Z</dc:date>
    </item>
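The `Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED` in the traceback comes from a guard in PySpark's partitioning hash. A minimal sketch of that guard (loosely modeled on `portable_hash` in `pyspark/rdd.py` from Spark 1.x; simplified here, not the actual source):

```python
import os
import sys

def portable_hash(x):
    """Hash used to assign a record to a shuffle partition.

    On Python 3, hash() of a str is salted per interpreter process, so two
    executors could disagree on which partition a key belongs to. PySpark
    therefore refuses to hash strings unless PYTHONHASHSEED is pinned.
    """
    if sys.version_info >= (3,) and "PYTHONHASHSEED" not in os.environ:
        raise RuntimeError(
            "Randomness of hash of string should be disabled via PYTHONHASHSEED")
    if x is None:
        return 0
    return hash(x)
```

The check only looks for the variable in the worker's environment, which is why exporting it in spark-env.sh on the driver node alone is not enough: the executor processes YARN launches never see it.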
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136081#M18986</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/2371/gouthamkoneru.html" nodeid="2371"&gt;@Goutham Koneru&lt;/A&gt;&lt;P&gt;did you install python3 and pandas on every host that runs Spark?&lt;/P&gt;</description>
      <pubDate>Thu, 11 Feb 2016 02:40:31 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136081#M18986</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-11T02:40:31Z</dc:date>
    </item>
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136082#M18987</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/393/aervits.html" nodeid="393"&gt;@Artem Ervits&lt;/A&gt; Yes they are installed on all the nodes in cluster. &lt;/P&gt;</description>
      <pubDate>Thu, 11 Feb 2016 02:42:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136082#M18987</guid>
      <dc:creator>goutham_koneru</dc:creator>
      <dc:date>2016-02-11T02:42:56Z</dc:date>
    </item>
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136083#M18988</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2371/gouthamkoneru.html" nodeid="2371"&gt;@Goutham Koneru&lt;/A&gt; what about this? &lt;A href="http://stackoverflow.com/questions/30279783/apache-spark-how-to-use-pyspark-with-python-3" target="_blank"&gt;http://stackoverflow.com/questions/30279783/apache-spark-how-to-use-pyspark-with-python-3&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 11 Feb 2016 02:45:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136083#M18988</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-11T02:45:07Z</dc:date>
    </item>
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136084#M18989</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/393/aervits.html" nodeid="393"&gt;@Artem Ervits&lt;/A&gt; PYTHON_SPARK variable is already set in spark-env.sh and is loaded by default now. I have also tried submitting the job by mentioning it with spark-submit but same error.&lt;/P&gt;&lt;PRE&gt;PYSPARK_PYTHON=/opt/rh/rh-python34/root/usr/bin/python3 /usr/hdp/current/spark-client/bin/spark-submit --master yarn-client --num-executors 120 --conf spark.executor.memory=1g --conf spark.yarn.queue=batch --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64  src/main/python/test.py&lt;/PRE&gt;</description>
      <pubDate>Thu, 11 Feb 2016 02:54:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136084#M18989</guid>
      <dc:creator>goutham_koneru</dc:creator>
      <dc:date>2016-02-11T02:54:39Z</dc:date>
    </item>
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136085#M18990</link>
      <description>&lt;P&gt;Tagging experts &lt;A rel="user" href="https://community.cloudera.com/users/194/jniemiec.html" nodeid="194"&gt;@Joseph Niemiec&lt;/A&gt; and &lt;A rel="user" href="https://community.cloudera.com/users/332/vshukla.html" nodeid="332"&gt;@vshukla&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 11 Feb 2016 03:19:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136085#M18990</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-11T03:19:55Z</dc:date>
    </item>
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136086#M18991</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/2371/gouthamkoneru.html" nodeid="2371"&gt;@Goutham Koneru&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The issue here is we need to pass PYTHONHASHSEED=0 to the executors as an environment variable.&lt;/P&gt;&lt;P&gt;One way to do that is to export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0 and then invoke spark-submit or pyspark.&lt;/P&gt;&lt;P&gt;With this change, my pyspark repro that used to hit this error runs successfully.&lt;/P&gt;&lt;P&gt;export PYSPARK_PYTHON=/usr/local/bin/python3.3&lt;/P&gt;&lt;P&gt;export PYTHONHASHSEED=0&lt;/P&gt;&lt;P&gt;export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0&lt;/P&gt;&lt;P&gt;bin/pyspark --master yarn-client --executor-memory 512m&lt;/P&gt;&lt;P&gt;n = sc.parallelize(range(1000)).map(str).countApproxDistinct()&lt;/P&gt;</description>
      <pubDate>Sun, 14 Feb 2016 15:35:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136086#M18991</guid>
      <dc:creator>rvenkatesh</dc:creator>
      <dc:date>2016-02-14T15:35:19Z</dc:date>
    </item>
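To see why pinning the seed matters across processes, here is a standalone demonstration, independent of Spark: hash the same string in two freshly started interpreters with PYTHONHASHSEED fixed, and the results agree, just as two YARN executors must agree on a key's partition.

```python
import os
import subprocess
import sys

def string_hash_in_new_process(seed):
    """Hash the same string in a fresh interpreter started with the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('spark'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With the seed pinned, every process (e.g. every YARN executor)
# computes the same hash for the same key.
print(string_hash_in_new_process("0") == string_hash_in_new_process("0"))  # True
```

Without a pinned seed, each Python 3 interpreter salts string hashes differently at startup, which is exactly the non-determinism PySpark's guard protects against.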
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136087#M18992</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/266/rvenkatesh.html" nodeid="266"&gt;@Ram Venkatesh&lt;/A&gt; Thank you it worked. This environment variable is mentioned in 1.0.0 guide but not in 1.5.2. Thank you for pointing it out.&lt;/P&gt;</description>
      <pubDate>Sun, 14 Feb 2016 22:02:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136087#M18992</guid>
      <dc:creator>goutham_koneru</dc:creator>
      <dc:date>2016-02-14T22:02:02Z</dc:date>
    </item>
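Putting the thread's resolution together, the setup that worked looks roughly like this (paths and queue names are from the original post; adjust for your own cluster):

```shell
# In spark-env.sh, or in the shell that invokes spark-submit
export PYSPARK_PYTHON=/opt/rh/rh-python34/root/usr/bin/python3.4
export LD_LIBRARY_PATH=/opt/rh/rh-python34/root/usr/lib64/
export PYTHONHASHSEED=0
# The missing piece: forward PYTHONHASHSEED to the YARN executors.
export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0

/usr/hdp/current/spark-client/bin/spark-submit \
  --master yarn-client \
  --num-executors 10 \
  --conf spark.executor.memory=1g \
  --conf spark.yarn.queue=batch \
  --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64 \
  src/main/python/test.py
```

SPARK_YARN_USER_ENV is the mechanism the accepted answer used on Spark 1.5.2; it forwards `NAME=VALUE` pairs into the executor environment on YARN.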
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136088#M18993</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/332/vshukla.html" nodeid="332"&gt;@vshukla&lt;/A&gt;&lt;P&gt; &lt;A rel="user" href="https://community.cloudera.com/users/528/rsriharsha.html" nodeid="528"&gt;@Ram Sriharsha&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 14 Feb 2016 22:11:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136088#M18993</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-14T22:11:34Z</dc:date>
    </item>
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136089#M18994</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2371/gouthamkoneru.html" nodeid="2371"&gt;@Goutham Koneru&lt;/A&gt; can you share a link where you found this?&lt;/P&gt;</description>
      <pubDate>Sat, 27 Feb 2016 01:16:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136089#M18994</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-27T01:16:46Z</dc:date>
    </item>
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136090#M18995</link>
      <description>&lt;P&gt;&lt;A href="http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/running-on-yarn.html" target="_blank"&gt;http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/running-on-yarn.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 27 Feb 2016 02:52:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136090#M18995</guid>
      <dc:creator>goutham_koneru</dc:creator>
      <dc:date>2016-02-27T02:52:01Z</dc:date>
    </item>
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136091#M18996</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/2371/gouthamkoneru.html" nodeid="2371"&gt;@Goutham Koneru&lt;/A&gt; thank you very much!&lt;/P&gt;</description>
      <pubDate>Sat, 27 Feb 2016 02:53:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136091#M18996</guid>
      <dc:creator>aervits</dc:creator>
      <dc:date>2016-02-27T02:53:26Z</dc:date>
    </item>
    <item>
      <title>Re: PYSPARK with different python versions on yarn is failing with errors.</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136092#M18997</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/2371/gouthamkoneru.html" nodeid="2371"&gt;@Goutham Koneru&lt;/A&gt;&lt;P&gt; &lt;A rel="user" href="https://community.cloudera.com/users/393/aervits.html" nodeid="393"&gt;@Artem Ervit&lt;/A&gt; &lt;A rel="user" href="https://community.cloudera.com/users/266/rvenkatesh.html" nodeid="266"&gt;@Ram Venkatesh&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Hi, i am trying to install python 3 too in my hdp 2.5.3 cluster? how does this affect the other copoennts other than spark? is this recommended to do in production? can i use anaconda instead?&lt;/P&gt;</description>
      <pubDate>Sun, 01 Apr 2018 22:13:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PYSPARK-with-different-python-versions-on-yarn-is-failing/m-p/136092#M18997</guid>
      <dc:creator>pmj</dc:creator>
      <dc:date>2018-04-01T22:13:51Z</dc:date>
    </item>
  </channel>
</rss>

