Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

HDP 2.5 - Yarn - Spark 2 - No module named pyspark

Hello,

I am trying to port a Spark application from HDP 2.3 to HDP 2.5 and switch to Spark 2.

I always seem to run into an issue where the worker(s) cannot find pyspark:

Traceback (most recent call last):
  File "t.py", line 14, in <module>
    print (imsi_stayingtime.collect())
  File "/usr/hdp/current/spark2-client/python/pyspark/rdd.py", line 776, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 22, ip-10-0-0-61.eu-west-1.compute.internal): org.apache.spark.SparkException:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /hadoop/yarn/local/filecache/13/spark2-hdp-yarn-archive.tar.gz/spark-core_2.11-2.0.0.2.5.0.0-1245.jar
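
The PYTHONPATH reported by the worker contains only a jar, which is consistent with the "No module named pyspark" message. A rough way to check a PYTHONPATH string for entries that could provide the pyspark package might look like the sketch below. The helper name provides_pyspark is hypothetical, and the heuristic rests on an assumption: it only looks for an archive named pyspark.zip or a directory named python, as found in typical Spark distributions.

```python
import os


def provides_pyspark(pythonpath):
    """Heuristic: does any PYTHONPATH entry look like it can provide pyspark?"""
    for entry in pythonpath.split(os.pathsep):
        name = os.path.basename(entry.rstrip("/"))
        # Assumption: pyspark comes either from an archive named pyspark.zip
        # or from Spark's own "python" source directory.
        if name in ("pyspark.zip", "python"):
            return True
    return False


# The single entry reported in the worker error above.
reported = (
    "/hadoop/yarn/local/filecache/13/spark2-hdp-yarn-archive.tar.gz/"
    "spark-core_2.11-2.0.0.2.5.0.0-1245.jar"
)
print(provides_pyspark(reported))  # False
```

This only inspects entry names; it does not open the archives or confirm that the files exist on the worker node.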

I can easily reproduce it with a very simple Hive/Spark test app, e.g.:

import pyspark
from pyspark.sql import SparkSession
from operator import add

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName("Ttt...111") \
    .enableHiveSupport() \
    .getOrCreate()

report = spark.sql("select imsi,tacs,sum(estimated_staying_time) as total_group_stayingtime from ... where ... group by tacs,imsi")

imsi_stayingtime = report.select('imsi','total_group_stayingtime').rdd.reduceByKey(add)
print (imsi_stayingtime.collect())

I tried adding the zip files (addPyFile), changing the environment shell file, and changing spark.yarn.dist.files, but nothing seems to help.

All tips are extremely welcome indeed!

Tx

Peter

1 ACCEPTED SOLUTION

It seems that removing the call to master('yarn') when building the SparkSession makes the issue go away.
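
As a minimal sketch of that change, the session from the question could be built without hard-coding the master, leaving the master to be supplied from outside the script. This assumes the script is launched through spark-submit with --master yarn, which the thread does not confirm. The helper names build_session and submit_command are hypothetical and exist only for illustration.

```python
def build_session(app_name="Ttt...111"):
    """Build a Hive-enabled SparkSession without setting the master in code."""
    # Imported lazily so this file can be loaded on machines without pyspark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        # No .master('yarn') here: the master is left to the launcher.
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )


def submit_command(script="t.py"):
    """Command line that passes the master to spark-submit (an assumption)."""
    return ["spark-submit", "--master", "yarn", script]


print(" ".join(submit_command()))  # spark-submit --master yarn t.py
```

Calling build_session() inside t.py and then running the printed command is one way to try this; whether it resolves the error on a given cluster still has to be verified there.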
