
PySpark with different Python versions on YARN is failing with errors.

Hi, we have HDP 2.3.4 with Python 2.6.6 installed on our cluster. PySpark works perfectly with version 2.6.6. We have a use case that requires the pandas package, and for that we need Python 3, so we installed Python 3.4 in a separate location and updated the variables below in spark-env.sh:

export PYSPARK_PYTHON=/opt/rh/rh-python34/root/usr/bin/python3.4 

export LD_LIBRARY_PATH=/opt/rh/rh-python34/root/usr/lib64/ 

export PYTHONHASHSEED=0

We are using the spark-submit command below to submit jobs:

/usr/hdp/current/spark-client/bin/spark-submit --master yarn-client --num-executors 10 --conf spark.executor.memory=1g --conf spark.yarn.queue=batch --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64  src/main/python/test.py

We get the error below when executing the job. Can you please let me know if there is another way of accomplishing this?

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 78 in stage 4.0 failed 4 times, most recent failure: Lost task 78.3 in stage 4.0 (TID 1585, lxhdpwrktst006.lowes.com): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
    for obj in iterator:
  File "/usr/hdp/2.3.4.0-3485/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1704, in add_shuffle_key
  File "/grid10/hadoop/yarn/log/usercache/hdpbatch/appcache/application_1453942498577_5382/container_e28_1453942498577_5382_01_000058/pyspark.zip/pyspark/rdd.py", line 74, in portable_hash
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
at org.apache.spark.api.python.PythonRunner$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)

12 REPLIES

Mentor
@Goutham Koneru

Did you install Python 3 and pandas on every host that runs Spark?

@Artem Ervits Yes, they are installed on all the nodes in the cluster.

Mentor

@Artem Ervits The PYSPARK_PYTHON variable is already set in spark-env.sh and is loaded by default now. I have also tried submitting the job with it specified on the spark-submit command line, but I get the same error.

PYSPARK_PYTHON=/opt/rh/rh-python34/root/usr/bin/python3 /usr/hdp/current/spark-client/bin/spark-submit --master yarn-client --num-executors 120 --conf spark.executor.memory=1g --conf spark.yarn.queue=batch --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64  src/main/python/test.py

Mentor

Tagging experts @Joseph Niemiec and @vshukla

Cloudera Employee

Hi @Goutham Koneru

The issue here is that PYTHONHASHSEED=0 needs to be passed to the executors as an environment variable.

One way to do that is to export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0 before invoking spark-submit or pyspark.

With this change, my PySpark repro that used to hit this error runs successfully:

export PYSPARK_PYTHON=/usr/local/bin/python3.3

export PYTHONHASHSEED=0

export SPARK_YARN_USER_ENV=PYTHONHASHSEED=0

bin/pyspark --master yarn-client --executor-memory 512m

n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
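For context on why this setting matters: since Python 3.3, string hashes are randomized with a per-process seed, so two executor processes can hash the same key to different partitions unless the seed is pinned. You can see the effect locally with plain python3, no Spark needed:

```shell
# Each new Python 3 interpreter picks a random string-hash seed by default,
# so the same string usually hashes differently across processes:
python3 -c 'print(hash("spark"))'
python3 -c 'print(hash("spark"))'

# Pinning PYTHONHASHSEED makes the hash stable across processes, which is
# what Spark's hash partitioning of string keys requires:
PYTHONHASHSEED=0 python3 -c 'print(hash("spark"))'
PYTHONHASHSEED=0 python3 -c 'print(hash("spark"))'
```

The first two commands typically print different values; the last two always print the same value.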

@Ram Venkatesh Thank you, it worked. This environment variable is mentioned in the 1.0.0 guide but not in the 1.5.2 guide. Thank you for pointing it out.
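A note for anyone finding this later: the Spark configuration guide also documents spark.executorEnv.[EnvironmentVariableName] for passing environment variables to executors on a per-job basis, which avoids the shell export. A sketch against the command from the original question (not tested on this cluster):

```shell
# Untested sketch: pass PYTHONHASHSEED to the executors via Spark conf
# instead of the SPARK_YARN_USER_ENV shell export.
/usr/hdp/current/spark-client/bin/spark-submit \
  --master yarn-client \
  --conf spark.executorEnv.PYTHONHASHSEED=0 \
  --conf spark.executor.extraLibraryPath=/opt/rh/rh-python34/root/usr/lib64 \
  src/main/python/test.py
```

Note that this sets the variable only for the executors; the driver process still needs PYTHONHASHSEED=0 in its own environment (e.g. exported before spark-submit).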

Mentor

@Goutham Koneru can you share a link where you found this?

Mentor

@Goutham Koneru thank you very much!

Expert Contributor
@Goutham Koneru

@Artem Ervits @Ram Venkatesh

Hi, I am trying to install Python 3 in my HDP 2.5.3 cluster too. How does this affect components other than Spark? Is this recommended in production? Can I use Anaconda instead?
