<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: HDP 2.5 - Yarn - Spark 2 - No module named pyspark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDP-2-5-Yarn-Spark-2-No-module-named-pyspark/m-p/122959#M47214</link>
    <description>&lt;P&gt;It seems that removing the call to master('yarn') when building the SparkSession makes the issue go away.&lt;/P&gt;</description>
    <pubDate>Sun, 27 Nov 2016 05:34:50 GMT</pubDate>
    <dc:creator>peter_coppens</dc:creator>
    <dc:date>2016-11-27T05:34:50Z</dc:date>
    <item>
      <title>HDP 2.5 - Yarn - Spark 2 - No module named pyspark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDP-2-5-Yarn-Spark-2-No-module-named-pyspark/m-p/122958#M47213</link>
      <description>&lt;P&gt;
	Hello,&lt;/P&gt;&lt;P&gt;
	I am trying to port a Spark application from HDP 2.3 to HDP 2.5 and switch to Spark 2.&lt;/P&gt;&lt;P&gt;
	I keep running into an issue where the Python worker(s) cannot find pyspark:&lt;/P&gt;&lt;PRE&gt;Traceback (most recent call last):
  File "t.py", line 14, in &amp;lt;module&amp;gt;
    print (imsi_stayingtime.collect())
  File "/usr/hdp/current/spark2-client/python/pyspark/rdd.py", line 776, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 22, ip-10-0-0-61.eu-west-1.compute.internal): org.apache.spark.SparkException: 
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /hadoop/yarn/local/filecache/13/spark2-hdp-yarn-archive.tar.gz/spark-core_2.11-2.0.0.2.5.0.0-1245.jar
&lt;/PRE&gt;&lt;P&gt;I can easily reproduce it with a very simple Hive/Spark test app, e.g.:&lt;/P&gt;&lt;PRE&gt;import pyspark
from pyspark.sql import SparkSession
from operator import add

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName("Ttt...111") \
    .enableHiveSupport() \
    .getOrCreate()

report = spark.sql("select imsi,tacs,sum(estimated_staying_time) as total_group_stayingtime from ... where ... group by tacs,imsi")

imsi_stayingtime = report.select('imsi','total_group_stayingtime').rdd.reduceByKey(add)
print (imsi_stayingtime.collect())
&lt;/PRE&gt;&lt;P&gt;I tried adding the zip files (addPyFile), changing the environment shell file, and changing spark.yarn.dist.files, but nothing seems to help.&lt;/P&gt;&lt;P&gt;All tips are very welcome!&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Peter&lt;/P&gt;</description>
      <pubDate>Sun, 27 Nov 2016 00:17:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDP-2-5-Yarn-Spark-2-No-module-named-pyspark/m-p/122958#M47213</guid>
      <dc:creator>peter_coppens</dc:creator>
      <dc:date>2016-11-27T00:17:37Z</dc:date>
    </item>
    <item>
      <title>Re: HDP 2.5 - Yarn - Spark 2 - No module named pyspark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDP-2-5-Yarn-Spark-2-No-module-named-pyspark/m-p/122959#M47214</link>
      <description>&lt;P&gt;It seems that removing the call to master('yarn') when building the SparkSession makes the issue go away.&lt;/P&gt;</description>
      <pubDate>Sun, 27 Nov 2016 05:34:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/HDP-2-5-Yarn-Spark-2-No-module-named-pyspark/m-p/122959#M47214</guid>
      <dc:creator>peter_coppens</dc:creator>
      <dc:date>2016-11-27T05:34:50Z</dc:date>
    </item>
  </channel>
</rss>

