<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: pyspark toPandas() works locally but fails in cluster in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/pyspark-toPandas-works-locally-but-fails-in-cluster/m-p/346330#M234844</link>
    <description>&lt;P&gt;It seems like your Spark workers are pointing to the default/system installation of Python rather than your virtual environment. By setting the environment variables below, you can tell Spark to use your virtual environment. You can set these two configs in &amp;lt;spark_home_dir&amp;gt;/conf/spark-env.sh:&lt;/P&gt;&lt;PRE&gt;export PYSPARK_PYTHON=&amp;lt;Python_binaries_Path&amp;gt;
export PYSPARK_DRIVER_PYTHON=&amp;lt;Python_binaries_Path&amp;gt;&lt;/PRE&gt;</description>
    <pubDate>Sat, 25 Jun 2022 22:05:23 GMT</pubDate>
    <dc:creator>jagadeesan</dc:creator>
    <dc:date>2022-06-25T22:05:23Z</dc:date>
    <item>
      <title>pyspark toPandas() works locally but fails in cluster</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-toPandas-works-locally-but-fails-in-cluster/m-p/346266#M234818</link>
      <description>&lt;P&gt;I have an app where, after various processing steps in PySpark, I end up with a smaller dataset that I need to convert to pandas before uploading to Elasticsearch. I have&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;res = result.select("*").toPandas()&lt;/LI-CODE&gt;&lt;P&gt;On my local machine, when I use&lt;/P&gt;&lt;PRE&gt;spark-submit --master &lt;SPAN class="hljs-string"&gt;"local[*]"&lt;/SPAN&gt; app.py&lt;/PRE&gt;&lt;P&gt;it works perfectly fine. I also have a 2-worker cluster; when I run it on the cluster using&lt;/P&gt;&lt;PRE&gt;spark-submit --master MASTER_IP:&lt;SPAN class="hljs-number"&gt;7077&lt;/SPAN&gt; app.py&lt;/PRE&gt;&lt;P&gt;I get the following error:&lt;/P&gt;&lt;PRE&gt;Traceback (most recent call last):
  File &lt;SPAN class="hljs-string"&gt;"/home/nitinram/Elisity-Master/esaas/ml_model/landspeed_violation_spark/landspeed_violation_spark.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;173&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; &amp;lt;module&amp;gt;
    results = landspeed_calculator(ps_df_compiled,landspeed_params, sc)
  File &lt;SPAN class="hljs-string"&gt;"/home/nitinram/Elisity-Master/esaas/ml_model/landspeed_violation_spark/landspeed_violation_spark_utilities.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;273&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; landspeed_calculator
    res = result.select(&lt;SPAN class="hljs-string"&gt;"*"&lt;/SPAN&gt;).toPandas()
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;110&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; toPandas
    split_batches=self_destruct)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;286&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; _collect_as_arrow
    jsocket_auth_server.getResult()
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;1322&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; __call__
    answer, self.gateway_client, self.target_id, self.name)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;111&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; deco
    &lt;SPAN class="hljs-keyword"&gt;return&lt;/SPAN&gt; f(*a, **kw)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;328&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; get_return_value
    &lt;SPAN class="hljs-built_in"&gt;format&lt;/SPAN&gt;(target_id, &lt;SPAN class="hljs-string"&gt;"."&lt;/SPAN&gt;, name), value)
py4j.protocol.Py4JJavaError: An error occurred &lt;SPAN class="hljs-keyword"&gt;while&lt;/SPAN&gt; calling o2696.getResult.
: org.apache.spark.SparkException: Exception thrown &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:&lt;SPAN class="hljs-number"&gt;301&lt;/SPAN&gt;)
    at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:&lt;SPAN class="hljs-number"&gt;97&lt;/SPAN&gt;)
    at org.apache.spark.security.SocketAuthServer.getResult(SocketAuthServer.scala:&lt;SPAN class="hljs-number"&gt;93&lt;/SPAN&gt;)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:&lt;SPAN class="hljs-number"&gt;62&lt;/SPAN&gt;)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:&lt;SPAN class="hljs-number"&gt;43&lt;/SPAN&gt;)
    at java.base/java.lang.reflect.Method.invoke(Method.java:&lt;SPAN class="hljs-number"&gt;566&lt;/SPAN&gt;)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:&lt;SPAN class="hljs-number"&gt;244&lt;/SPAN&gt;)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:&lt;SPAN class="hljs-number"&gt;357&lt;/SPAN&gt;)
    at py4j.Gateway.invoke(Gateway.java:&lt;SPAN class="hljs-number"&gt;282&lt;/SPAN&gt;)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:&lt;SPAN class="hljs-number"&gt;132&lt;/SPAN&gt;)
    at py4j.commands.CallCommand.execute(CallCommand.java:&lt;SPAN class="hljs-number"&gt;79&lt;/SPAN&gt;)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:&lt;SPAN class="hljs-number"&gt;182&lt;/SPAN&gt;)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:&lt;SPAN class="hljs-number"&gt;106&lt;/SPAN&gt;)
    at java.base/java.lang.Thread.run(Thread.java:&lt;SPAN class="hljs-number"&gt;829&lt;/SPAN&gt;)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task &lt;SPAN class="hljs-number"&gt;0&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; stage &lt;SPAN class="hljs-number"&gt;206.0&lt;/SPAN&gt; failed &lt;SPAN class="hljs-number"&gt;4&lt;/SPAN&gt; times, most recent failure: Lost task &lt;SPAN class="hljs-number"&gt;0.3&lt;/SPAN&gt; &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; stage &lt;SPAN class="hljs-number"&gt;206.0&lt;/SPAN&gt; (TID &lt;SPAN class="hljs-number"&gt;114&lt;/SPAN&gt;) (&lt;SPAN class="hljs-number"&gt;10.5.24.113&lt;/SPAN&gt; executor &lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/worker.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;603&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/worker.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;449&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/worker.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;251&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/worker.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;71&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; read_command
    command = serializer._read_with_length(file)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;160&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; _read_with_length
    &lt;SPAN class="hljs-keyword"&gt;return&lt;/SPAN&gt; self.loads(obj)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;430&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; loads
    &lt;SPAN class="hljs-keyword"&gt;return&lt;/SPAN&gt; pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named &lt;SPAN class="hljs-string"&gt;'pandas'&lt;/SPAN&gt;

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:&lt;SPAN class="hljs-number"&gt;555&lt;/SPAN&gt;)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$&lt;SPAN class="hljs-number"&gt;1.&lt;/SPAN&gt;read(PythonArrowOutput.scala:&lt;SPAN class="hljs-number"&gt;101&lt;/SPAN&gt;)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$&lt;SPAN class="hljs-number"&gt;1.&lt;/SPAN&gt;read(PythonArrowOutput.scala:&lt;SPAN class="hljs-number"&gt;50&lt;/SPAN&gt;)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:&lt;SPAN class="hljs-number"&gt;508&lt;/SPAN&gt;)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:&lt;SPAN class="hljs-number"&gt;37&lt;/SPAN&gt;)
    at scala.collection.Iterator$$anon$&lt;SPAN class="hljs-number"&gt;11.&lt;/SPAN&gt;hasNext(Iterator.scala:&lt;SPAN class="hljs-number"&gt;491&lt;/SPAN&gt;)
    at scala.collection.Iterator$$anon$&lt;SPAN class="hljs-number"&gt;10.&lt;/SPAN&gt;hasNext(Iterator.scala:&lt;SPAN class="hljs-number"&gt;460&lt;/SPAN&gt;)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage69.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:&lt;SPAN class="hljs-number"&gt;43&lt;/SPAN&gt;)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$&lt;SPAN class="hljs-number"&gt;1.&lt;/SPAN&gt;hasNext(WholeStageCodegenExec.scala:&lt;SPAN class="hljs-number"&gt;759&lt;/SPAN&gt;)
    at scala.collection.Iterator$$anon$&lt;SPAN class="hljs-number"&gt;10.&lt;/SPAN&gt;hasNext(Iterator.scala:&lt;SPAN class="hljs-number"&gt;460&lt;/SPAN&gt;)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:&lt;SPAN class="hljs-number"&gt;140&lt;/SPAN&gt;)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:&lt;SPAN class="hljs-number"&gt;59&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:&lt;SPAN class="hljs-number"&gt;99&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:&lt;SPAN class="hljs-number"&gt;52&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.Task.run(Task.scala:&lt;SPAN class="hljs-number"&gt;131&lt;/SPAN&gt;)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$&lt;SPAN class="hljs-number"&gt;3&lt;/SPAN&gt;(Executor.scala:&lt;SPAN class="hljs-number"&gt;506&lt;/SPAN&gt;)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:&lt;SPAN class="hljs-number"&gt;1462&lt;/SPAN&gt;)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:&lt;SPAN class="hljs-number"&gt;509&lt;/SPAN&gt;)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:&lt;SPAN class="hljs-number"&gt;1128&lt;/SPAN&gt;)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:&lt;SPAN class="hljs-number"&gt;628&lt;/SPAN&gt;)
    at java.base/java.lang.Thread.run(Thread.java:&lt;SPAN class="hljs-number"&gt;829&lt;/SPAN&gt;)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;2454&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$&lt;SPAN class="hljs-number"&gt;2&lt;/SPAN&gt;(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;2403&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$&lt;SPAN class="hljs-number"&gt;2&lt;/SPAN&gt;$adapted(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;2402&lt;/SPAN&gt;)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:&lt;SPAN class="hljs-number"&gt;62&lt;/SPAN&gt;)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:&lt;SPAN class="hljs-number"&gt;55&lt;/SPAN&gt;)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:&lt;SPAN class="hljs-number"&gt;49&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;2402&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;1160&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$&lt;SPAN class="hljs-number"&gt;1&lt;/SPAN&gt;$adapted(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;1160&lt;/SPAN&gt;)
    at scala.Option.foreach(Option.scala:&lt;SPAN class="hljs-number"&gt;407&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;1160&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;2642&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;2584&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:&lt;SPAN class="hljs-number"&gt;2573&lt;/SPAN&gt;)
    at org.apache.spark.util.EventLoop$$anon$&lt;SPAN class="hljs-number"&gt;1.&lt;/SPAN&gt;run(EventLoop.scala:&lt;SPAN class="hljs-number"&gt;49&lt;/SPAN&gt;)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/worker.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;603&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/worker.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;449&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; read_udfs
    udfs.append(read_single_udf(pickleSer, infile, eval_type, runner_conf, udf_index=i))
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/worker.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;251&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/worker.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;71&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; read_command
    command = serializer._read_with_length(file)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;160&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; _read_with_length
    &lt;SPAN class="hljs-keyword"&gt;return&lt;/SPAN&gt; self.loads(obj)
  File &lt;SPAN class="hljs-string"&gt;"/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py"&lt;/SPAN&gt;, line &lt;SPAN class="hljs-number"&gt;430&lt;/SPAN&gt;, &lt;SPAN class="hljs-keyword"&gt;in&lt;/SPAN&gt; loads
    &lt;SPAN class="hljs-keyword"&gt;return&lt;/SPAN&gt; pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named &lt;SPAN class="hljs-string"&gt;'pandas'&lt;/SPAN&gt;

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:&lt;SPAN class="hljs-number"&gt;555&lt;/SPAN&gt;)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$&lt;SPAN class="hljs-number"&gt;1.&lt;/SPAN&gt;read(PythonArrowOutput.scala:&lt;SPAN class="hljs-number"&gt;101&lt;/SPAN&gt;)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$&lt;SPAN class="hljs-number"&gt;1.&lt;/SPAN&gt;read(PythonArrowOutput.scala:&lt;SPAN class="hljs-number"&gt;50&lt;/SPAN&gt;)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:&lt;SPAN class="hljs-number"&gt;508&lt;/SPAN&gt;)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:&lt;SPAN class="hljs-number"&gt;37&lt;/SPAN&gt;)
    at scala.collection.Iterator$$anon$&lt;SPAN class="hljs-number"&gt;11.&lt;/SPAN&gt;hasNext(Iterator.scala:&lt;SPAN class="hljs-number"&gt;491&lt;/SPAN&gt;)
    at scala.collection.Iterator$$anon$&lt;SPAN class="hljs-number"&gt;10.&lt;/SPAN&gt;hasNext(Iterator.scala:&lt;SPAN class="hljs-number"&gt;460&lt;/SPAN&gt;)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage69.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:&lt;SPAN class="hljs-number"&gt;43&lt;/SPAN&gt;)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$&lt;SPAN class="hljs-number"&gt;1.&lt;/SPAN&gt;hasNext(WholeStageCodegenExec.scala:&lt;SPAN class="hljs-number"&gt;759&lt;/SPAN&gt;)
    at scala.collection.Iterator$$anon$&lt;SPAN class="hljs-number"&gt;10.&lt;/SPAN&gt;hasNext(Iterator.scala:&lt;SPAN class="hljs-number"&gt;460&lt;/SPAN&gt;)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:&lt;SPAN class="hljs-number"&gt;140&lt;/SPAN&gt;)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:&lt;SPAN class="hljs-number"&gt;59&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:&lt;SPAN class="hljs-number"&gt;99&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:&lt;SPAN class="hljs-number"&gt;52&lt;/SPAN&gt;)
    at org.apache.spark.scheduler.Task.run(Task.scala:&lt;SPAN class="hljs-number"&gt;131&lt;/SPAN&gt;)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$&lt;SPAN class="hljs-number"&gt;3&lt;/SPAN&gt;(Executor.scala:&lt;SPAN class="hljs-number"&gt;506&lt;/SPAN&gt;)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:&lt;SPAN class="hljs-number"&gt;1462&lt;/SPAN&gt;)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:&lt;SPAN class="hljs-number"&gt;509&lt;/SPAN&gt;)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:&lt;SPAN class="hljs-number"&gt;1128&lt;/SPAN&gt;)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:&lt;SPAN class="hljs-number"&gt;628&lt;/SPAN&gt;)
    at java.base/java.lang.Thread.run(Thread.java:&lt;SPAN class="hljs-number"&gt;829&lt;/SPAN&gt;)&lt;/PRE&gt;&lt;P&gt;My environment: Spark is installed on all the machines, and I have a virtual environment with the other library dependencies. The same virtual environment exists on all the workers and the driver, and pandas is installed in that virtual environment on the driver and every worker node. I also have "spark.sql.execution.arrow.pyspark.enabled" set to "true". Any ideas on how to solve this? I checked whether result has 0 elements, but its shape is (78, 31).&lt;/P&gt;</description>
      <pubDate>Thu, 23 Jun 2022 21:13:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-toPandas-works-locally-but-fails-in-cluster/m-p/346266#M234818</guid>
      <dc:creator>nvelraj</dc:creator>
      <dc:date>2022-06-23T21:13:41Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark toPandas() works locally but fails in cluster</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-toPandas-works-locally-but-fails-in-cluster/m-p/346330#M234844</link>
      <description>&lt;P&gt;It seems like your Spark workers are pointing to the default/system installation of Python rather than your virtual environment. By setting the environment variables below, you can tell Spark to use your virtual environment. You can set these two configs in &amp;lt;spark_home_dir&amp;gt;/conf/spark-env.sh:&lt;/P&gt;&lt;PRE&gt;export PYSPARK_PYTHON=&amp;lt;Python_binaries_Path&amp;gt;
export PYSPARK_DRIVER_PYTHON=&amp;lt;Python_binaries_Path&amp;gt;&lt;/PRE&gt;
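&lt;P&gt;As a minimal sketch, assuming the virtual environment lives at /opt/venvs/app on the driver and every worker node (a hypothetical path; substitute your own layout), the settings might look like:&lt;/P&gt;&lt;PRE&gt;# Hypothetical path; the interpreter must exist at this exact location on every node.
export PYSPARK_PYTHON=/opt/venvs/app/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/venvs/app/bin/python&lt;/PRE&gt;&lt;P&gt;The same interpreters can also be set per job through the spark.pyspark.python and spark.pyspark.driver.python properties instead of editing spark-env.sh.&lt;/P&gt;</description>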
      <pubDate>Sat, 25 Jun 2022 22:05:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-toPandas-works-locally-but-fails-in-cluster/m-p/346330#M234844</guid>
      <dc:creator>jagadeesan</dc:creator>
      <dc:date>2022-06-25T22:05:23Z</dc:date>
    </item>
    <item>
      <title>Re: pyspark toPandas() works locally but fails in cluster</title>
      <link>https://community.cloudera.com/t5/Support-Questions/pyspark-toPandas-works-locally-but-fails-in-cluster/m-p/351285#M236208</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/98771"&gt;@nvelraj&lt;/a&gt;&lt;/P&gt;&lt;P&gt;The PySpark job works locally because the &lt;STRONG&gt;pandas library&lt;/STRONG&gt; is installed on your local system. When you run it on the cluster, the pandas library/module is not available to the workers, so you get the following error:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;ModuleNotFoundError: No module named 'pandas'&lt;/LI-CODE&gt;&lt;P&gt;To solve the issue, you need to install the pandas library/module on all machines or use a virtual environment.&lt;/P&gt;
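&lt;P&gt;A minimal sketch of the virtual-environment route, following the approach described in the PySpark package-management docs: pack the environment with venv-pack and ship it with the job. The archive name, the "environment" alias, and spark://MASTER_IP:7077 are placeholders, and --archives outside YARN requires a recent Spark release:&lt;/P&gt;&lt;PRE&gt;# Pack the currently active virtual environment (pip install venv-pack first).
venv-pack -o pyspark_venv.tar.gz

# The driver keeps using its local interpreter; executors use the shipped archive,
# which spark-submit unpacks under the alias given after the '#'.
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python

spark-submit --master spark://MASTER_IP:7077 \
  --archives pyspark_venv.tar.gz#environment \
  app.py&lt;/PRE&gt;</description>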
      <pubDate>Thu, 01 Sep 2022 04:15:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/pyspark-toPandas-works-locally-but-fails-in-cluster/m-p/351285#M236208</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2022-09-01T04:15:38Z</dc:date>
    </item>
  </channel>
</rss>

