PYSPARK with different python versions on yarn is failing with errors.

New Contributor

Hi, we have HDP 2.3.4 with Python 2.6.6 installed on our cluster. PySpark works perfectly with the 2.6.6 version. We have a use case for the pandas package, and for that we need Python 3, so we have installed Python 3.4 in a different location and updated the variables below in spark-env.sh:

export PYSPARK_PYTHON=/opt/rh/rh-python34/root/usr/bin/python3.4 

export LD_LIBRARY_PATH=/opt/rh/rh-python34/root/usr/lib64/ 

export PYTHONHASHSEED=0

We are using the spark-submit command below to submit the jobs:

/usr/hdp/current/spark-client/bin/spark-submit --master yarn-client --num-executors 10 --conf spark.executor.memory=1g --conf spark.yarn.queue=batch --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64  src/main/python/test.py

We are getting the error below when executing the job. Can you please let me know if there is another way of accomplishing this?

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 78 in stage 4.0 failed 4 times, most recent failure: Lost task 78.3 in stage 4.0 (TID 1585, lxhdpwrktst006.lowes.com): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1704, in add_shuffle_key
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/rdd.py", line 74, in portable_hash
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
at org.apache.spark.api.python.PythonRunner$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
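
The root cause is that, starting with Python 3.3, string hashing is randomized per interpreter process unless PYTHONHASHSEED is fixed, so two executor processes would hash the same key differently; PySpark's portable_hash therefore refuses to partition keyed data when the seed is not pinned. A small standalone illustration in plain Python 3, with no Spark involved, using a helper (child_hash) defined only for this example:

import os, subprocess, sys

def child_hash(env):
    # Illustrative helper: spawn a fresh interpreter and return hash('spark') as computed there
    out = subprocess.check_output([sys.executable, "-c", "print(hash('spark'))"], env=env)
    return out.decode().strip()

env = dict(os.environ)
env.pop("PYTHONHASHSEED", None)
print(child_hash(env), child_hash(env))   # randomized: almost certainly two different values

env["PYTHONHASHSEED"] = "0"
print(child_hash(env), child_hash(env))   # pinned: the same value every time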

12 Replies

Re: PYSPARK with different python versions on yarn is failing with errors.

Mentor
@Goutham Koneru

Did you install Python 3 and pandas on every host that runs Spark?

Re: PYSPARK with different python versions on yarn is failing with errors.

New Contributor

@Artem Ervits Yes, they are installed on all the nodes in the cluster.
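
A quick way to double-check that from PySpark itself is to have every executor report its interpreter version and whether pandas imports. This is only a minimal sketch (the probe helper is defined just for the check, and it assumes an active SparkContext named sc, as in the pyspark shell); it deliberately deduplicates on the driver instead of calling distinct(), since a shuffle would hit the same PYTHONHASHSEED error shown above:

def probe(_):
    # Runs on each executor: report the interpreter version and pandas availability
    import sys
    try:
        import pandas
        pandas_version = pandas.__version__
    except ImportError:
        pandas_version = "missing"
    return (sys.version.split()[0], pandas_version)

# One probe task per partition; set() on the driver avoids a shuffle
results = sc.parallelize(range(20), 20).map(probe).collect()
print(set(results))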

Re: PYSPARK with different python versions on yarn is failing with errors.

Mentor

Re: PYSPARK with different python versions on yarn is failing with errors.

New Contributor

@Artem Ervits The PYSPARK_PYTHON variable is already set in spark-env.sh and is loaded by default now. I have also tried submitting the job with it set on the spark-submit command line, but I get the same error.

PYSPARK_PYTHON=/opt/rh/rh-python34/root/usr/bin/python3 /usr/hdp/current/spark-client/bin/spark-submit --master yarn-client --num-executors 120 --conf spark.executor.memory=1g --conf spark.yarn.queue=batch --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64  src/main/python/test.py

Re: PYSPARK with different python versions on yarn is failing with errors.

Mentor

Tagging experts @Joseph Niemiec and @vshukla

Re: PYSPARK with different python versions on yarn is failing with errors. (Accepted Solution)

New Contributor

Hi @Goutham Koneru

The issue here is that we need to pass PYTHONHASHSEED=0 to the executors as an environment variable.

One way to do that is to export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0 and then invoke spark-submit or pyspark.

With this change, my pyspark reproduction that used to hit this error now runs successfully:

export PYSPARK_PYTHON=/usr/local/bin/python3.3

export PYTHONHASHSEED=0

export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0

bin/pyspark --master yarn-client --executor-memory 512m

n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
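
Depending on the Spark version, an alternative to the shell export is to pass the variable through Spark's own configuration properties: spark.executorEnv.[Name] for the executors and spark.yarn.appMasterEnv.[Name] for the YARN application master. Whether these propagate PYTHONHASHSEED all the way to the Python workers can vary by release, so treat the following as a sketch to try against the original submit command rather than a confirmed fix:

/usr/hdp/current/spark-client/bin/spark-submit --master yarn-client \
  --num-executors 10 \
  --conf spark.executor.memory=1g \
  --conf spark.yarn.queue=batch \
  --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64 \
  --conf spark.executorEnv.PYTHONHASHSEED=0 \
  --conf spark.yarn.appMasterEnv.PYTHONHASHSEED=0 \
  src/main/python/test.py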

Re: PYSPARK with different python versions on yarn is failing with errors.

New Contributor

@Ram Venkatesh Thank you, it worked. This environment variable is mentioned in the Spark 1.0.0 documentation but not in the 1.5.2 documentation. Thank you for pointing it out.
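
To confirm the variable actually reached the executors, a minimal check (again assuming an active SparkContext named sc, with read_seed defined just for this sketch) is to have each task read it back from its own environment:

def read_seed(_):
    # Runs on the executor: read the environment the Python workers actually see
    import os
    return os.environ.get("PYTHONHASHSEED", "unset")

seeds = sc.parallelize(range(10), 10).map(read_seed).collect()
print(set(seeds))  # expect {'0'} once the variable is propagated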

Re: PYSPARK with different python versions on yarn is failing with errors.

Mentor

Re: PYSPARK with different python versions on yarn is failing with errors.

Mentor

@Goutham Koneru Can you share a link to where you found this?