I am trying to port a Spark application from HDP 2.3 to HDP 2.5 and switch to Spark 2.
I keep running into an issue where the YARN worker(s) cannot find pyspark:
```
Traceback (most recent call last):
  File "t.py", line 14, in <module>
    print (imsi_stayingtime.collect())
  File "/usr/hdp/current/spark2-client/python/pyspark/rdd.py", line 776, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times,
most recent failure: Lost task 2.3 in stage 1.0 (TID 22, ip-10-0-0-61.eu-west-1.compute.internal):
org.apache.spark.SparkException:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /hadoop/yarn/local/filecache/13/spark2-hdp-yarn-archive.tar.gz/spark-core_2.11-184.108.40.206.5.0.0-1245.jar
```
I can easily reproduce it with a very simple Hive/Spark test app:
```python
import pyspark
from pyspark.sql import SparkSession
from operator import add

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName("Ttt...111") \
    .enableHiveSupport() \
    .getOrCreate()

report = spark.sql("select imsi,tacs,sum(estimated_staying_time) as total_group_stayingtime "
                   "from ... where ... group by tacs,imsi")
imsi_stayingtime = report.select('imsi', 'total_group_stayingtime').rdd.reduceByKey(add)
print(imsi_stayingtime.collect())
```
I tried adding the zip files (via addPyFile), changing the environment shell file, and changing spark.yarn.dist.files, but nothing seems to help.
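For concreteness, this is roughly how I attempted those workarounds from spark-submit (--py-files being the submit-time equivalent of addPyFile); the paths assume the stock HDP 2.5 spark2-client layout and may differ on other clusters:

```shell
# Attempt 1: ship the bundled pyspark/py4j zips to the executors
spark-submit --master yarn \
  --py-files /usr/hdp/current/spark2-client/python/lib/pyspark.zip,/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip \
  t.py

# Attempt 2: distribute the same zips via spark.yarn.dist.files
spark-submit --master yarn \
  --conf spark.yarn.dist.files=/usr/hdp/current/spark2-client/python/lib/pyspark.zip,/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip \
  t.py
```

In both cases the Python workers on the executors still fail with "No module named pyspark".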
All tips are extremely welcome indeed!