HDP 2.5 - Yarn - Spark 2 - No module named pyspark

Hello

I am trying to port a Spark application from HDP 2.3 to HDP 2.5 and switch to Spark 2.

I keep running into an issue where the workers cannot find pyspark:

Traceback (most recent call last):
  File "t.py", line 14, in <module>
    print (imsi_stayingtime.collect())
  File "/usr/hdp/current/spark2-client/python/pyspark/rdd.py", line 776, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 22, ip-10-0-0-61.eu-west-1.compute.internal): org.apache.spark.SparkException: 
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /hadoop/yarn/local/filecache/13/spark2-hdp-yarn-archive.tar.gz/spark-core_2.11-2.0.0.2.5.0.0-1245.jar
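
The PYTHONPATH reported by the worker contains only the spark-core jar. On YARN, Spark normally ships pyspark.zip and a py4j-*-src.zip archive to the containers and puts them on the worker's PYTHONPATH, so their absence would explain the import failure. A minimal sketch of that check follows; the helper name missing_python_archives is hypothetical and not part of Spark, and the archive patterns are an assumption about a standard Spark 2 layout.

```python
import fnmatch
import os

# Archive names Spark normally distributes for Python workers on YARN.
# Assumption: standard layout; the py4j version is matched with a wildcard.
EXPECTED_PATTERNS = ("pyspark.zip", "py4j-*-src.zip")


def missing_python_archives(pythonpath, sep=":"):
    """Return the expected archive patterns that no PYTHONPATH entry matches."""
    # Compare only the file name of each entry, ignoring its directory.
    names = [
        os.path.basename(entry.rstrip("/"))
        for entry in pythonpath.split(sep)
        if entry
    ]
    return [
        pattern
        for pattern in EXPECTED_PATTERNS
        if not any(fnmatch.fnmatch(name, pattern) for name in names)
    ]


# The PYTHONPATH value copied from the error message above.
reported = (
    "/hadoop/yarn/local/filecache/13/spark2-hdp-yarn-archive.tar.gz/"
    "spark-core_2.11-2.0.0.2.5.0.0-1245.jar"
)
print(missing_python_archives(reported))
```

For the path in the error message, both patterns come back as missing, which matches the "No module named pyspark" message from the worker.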

I can easily reproduce it with a very simple Hive/Spark test app, e.g.:

import pyspark
from pyspark.sql import SparkSession
from operator import add

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName("Ttt...111") \
    .enableHiveSupport() \
    .getOrCreate()

report = spark.sql("select imsi,tacs,sum(estimated_staying_time) as total_group_stayingtime from ... where ... group by tacs,imsi")

imsi_stayingtime = report.select('imsi','total_group_stayingtime').rdd.reduceByKey(add)
print (imsi_stayingtime.collect())

I tried adding the zip files (addPyFile), changing the environment shell file and changing spark.yarn.dist.files, but nothing seems to help.

All tips are extremely welcome indeed!

Tx

Peter

Accepted Solutions

Re: HDP 2.5 - Yarn - Spark 2 - No module named pyspark

Removing the call to master('yarn') when building the SparkSession seems to make the issue go away.
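
A minimal sketch of what that change could look like: the session is built without master(), and the master is chosen when the job is launched. The thread does not say how the script was started, so the spark-submit invocation is an assumption, and build_session and submit_command are hypothetical helper names used only for illustration.

```python
import shlex


def build_session(app_name):
    """Build a Hive-enabled SparkSession without hard-coding the master."""
    # Imported lazily so this module also loads where pyspark is not installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .enableHiveSupport()
        .getOrCreate()
    )


def submit_command(script, master="yarn", deploy_mode="client"):
    """Return a spark-submit argv that selects the master at launch time."""
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        script,
    ]


if __name__ == "__main__":
    # Print the command line that would launch the test script on YARN.
    print(" ".join(shlex.quote(arg) for arg in submit_command("t.py")))
```

With this layout the script itself carries no cluster settings, so the same t.py can be run locally or on YARN by changing only the launch command.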
