<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: PySpark failure spark.SparkException: Job aborted due to stage failure in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/m-p/171148#M45797</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11559/serg30911.html" nodeid="11559"&gt;@Sergey Paramoshkin&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You may be hitting this bug in Spark codegen: &lt;A href="https://issues.apache.org/jira/browse/SPARK-18528" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-18528&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 27 Dec 2016 00:42:16 GMT</pubDate>
    <dc:creator>rajkumar_singh</dc:creator>
    <dc:date>2016-12-27T00:42:16Z</dc:date>
    <item>
      <title>PySpark failure spark.SparkException: Job aborted due to stage failure</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/m-p/171147#M45796</link>
      <description>&lt;P&gt;Hi! &lt;/P&gt;&lt;P&gt;I start Spark 2 with the following option: &lt;/P&gt;&lt;PRE&gt;SPARK_MAJOR_VERSION=2 pyspark --master yarn  --verbose&lt;/PRE&gt;&lt;P&gt;Spark starts; I then run a query through the SQL context and get the error below. The field definitely exists in the table, so that is not the problem.&lt;/P&gt;&lt;PRE&gt;SPARK_MAJOR_VERSION=2 pyspark --master yarn  --verbose
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: &lt;A href="http://continuum.io/thanks" target="_blank"&gt;http://continuum.io/thanks&lt;/A&gt; and &lt;A href="https://anaconda.org" target="_blank"&gt;https://anaconda.org&lt;/A&gt;
Using properties file: /usr/hdp/current/spark2-historyserver/conf/spark-defaults.conf
Adding default property: spark.history.kerberos.keytab=none
Adding default property: spark.history.fs.logDirectory=hdfs:///spark2-history/
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
Adding default property: spark.yarn.queue=default
Adding default property: spark.yarn.historyServer.address=en-002.msk.mts.ru:18081
Adding default property: spark.history.kerberos.principal=none
Adding default property: spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
Adding default property: spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
Adding default property: spark.eventLog.dir=hdfs:///spark2-history/
Adding default property: spark.history.ui.port=18081
Parsed arguments:
  master                  yarn
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          /usr/hdp/current/spark2-historyserver/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               null
  primaryResource         pyspark-shell
  name                    PySparkShell
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true


Spark properties used, including those specified through
 --conf and those from the properties file /usr/hdp/current/spark2-historyserver/conf/spark-defaults.conf:
  spark.yarn.queue -&amp;gt; default
  spark.history.kerberos.principal -&amp;gt; none
  spark.executor.extraLibraryPath -&amp;gt; /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
  spark.driver.extraLibraryPath -&amp;gt; /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
  spark.eventLog.enabled -&amp;gt; true
  spark.yarn.historyServer.address -&amp;gt; en-002.msk.mts.ru:18081
  spark.history.ui.port -&amp;gt; 18081
  spark.history.provider -&amp;gt; org.apache.spark.deploy.history.FsHistoryProvider
  spark.history.fs.logDirectory -&amp;gt; hdfs:///spark2-history/
  spark.history.kerberos.keytab -&amp;gt; none
  spark.eventLog.dir -&amp;gt; hdfs:///spark2-history/




Main class:
org.apache.spark.api.python.PythonGatewayServer
Arguments:


System properties:
spark.yarn.queue -&amp;gt; default
spark.history.kerberos.principal -&amp;gt; none
spark.executor.extraLibraryPath -&amp;gt; /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.driver.extraLibraryPath -&amp;gt; /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.yarn.historyServer.address -&amp;gt; en-002.msk.mts.ru:18081
spark.eventLog.enabled -&amp;gt; true
spark.history.ui.port -&amp;gt; 18081
SPARK_SUBMIT -&amp;gt; true
spark.history.provider -&amp;gt; org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory -&amp;gt; hdfs:///spark2-history/
spark.app.name -&amp;gt; PySparkShell
spark.history.kerberos.keytab -&amp;gt; none
spark.submit.deployMode -&amp;gt; client
spark.eventLog.dir -&amp;gt; hdfs:///spark2-history/
spark.master -&amp;gt; yarn
spark.yarn.isPython -&amp;gt; true
Classpath elements:








Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.2.5.0.0-1245
      /_/


Using Python version 2.7.12 (default, Jul  2 2016 17:42:40)
SparkSession available as 'spark'.
&amp;gt;&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt; ds = sqlContext.table('default.geo').limit(100000)
&amp;gt;&amp;gt;&amp;gt; ds.groupby('id').count().show(10)
[Stage 0:==========================================&amp;gt;                (5 + 2) / 7]16/11/09 18:11:56 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 7, wn-019): java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


&lt;/PRE&gt;&lt;P&gt;The traceback: &lt;/P&gt;&lt;PRE&gt; ERROR TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "&amp;lt;stdin&amp;gt;", line 1, in &amp;lt;module&amp;gt;
  File "/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/dataframe.py", line 287, in show
    print(self._jdf.showString(n, truncate))
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)


  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 10, wn-029): java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)


Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
        at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
        at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
        at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:211)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more
&lt;/PRE&gt;&lt;P&gt;My environment: &lt;/P&gt;&lt;P&gt;OS: RHEL 6.5&lt;/P&gt;&lt;P&gt;HDP: 2.5.0.0&lt;/P&gt;&lt;P&gt;Spark: 2.0&lt;/P&gt;&lt;P&gt;Python: 2.7 on Anaconda&lt;/P&gt;</description>
      <pubDate>Thu, 10 Nov 2016 13:43:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/m-p/171147#M45796</guid>
      <dc:creator>serg30911</dc:creator>
      <dc:date>2016-11-10T13:43:49Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark failure spark.SparkException: Job aborted due to stage failure</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/m-p/171148#M45797</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11559/serg30911.html" nodeid="11559"&gt;@Sergey Paramoshkin&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You may be hitting this bug in Spark codegen: &lt;A href="https://issues.apache.org/jira/browse/SPARK-18528" target="_blank"&gt;https://issues.apache.org/jira/browse/SPARK-18528&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Dec 2016 00:42:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/m-p/171148#M45797</guid>
      <dc:creator>rajkumar_singh</dc:creator>
      <dc:date>2016-12-27T00:42:16Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark failure spark.SparkException: Job aborted due to stage failure</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/m-p/171149#M45798</link>
      <description>&lt;P&gt;&lt;A href="https://community.hortonworks.com/users/11559/serg30911.html"&gt;@Sergey Paramoshkin&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Your RDD contains null values somewhere. The NullPointerException indicates that an aggregation was attempted against a null value. Check your data for nulls where non-null values are expected, especially in the columns being aggregated (for example, in a reduce task). In your case, it may be the id field.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Dec 2016 03:21:51 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/m-p/171149#M45798</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-12-27T03:21:51Z</dc:date>
    </item>
    <item>
      <title>Re: PySpark failure spark.SparkException: Job aborted due to stage failure</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/m-p/171150#M45799</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11559/serg30911.html" nodeid="11559"&gt;@Sergey Paramoshkin&lt;/A&gt; Were you able to fix this issue? &lt;/P&gt;</description>
      <pubDate>Fri, 06 Jan 2017 04:51:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/m-p/171150#M45799</guid>
      <dc:creator>sandyy006</dc:creator>
      <dc:date>2017-01-06T04:51:30Z</dc:date>
    </item>
  </channel>
</rss>