Created 11-10-2016 05:43 AM
Hi!
I launch Spark 2 by setting the version option:
SPARK_MAJOR_VERSION=2 pyspark --master yarn --verbose
Spark starts, but when I run a simple aggregation through the SQLContext I get an error. The field is definitely present in the table, so that is not the problem.
SPARK_MAJOR_VERSION=2 pyspark --master yarn --verbose
SPARK_MAJOR_VERSION is set to 2, using Spark2
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Using properties file: /usr/hdp/current/spark2-historyserver/conf/spark-defaults.conf
Adding default property: spark.history.kerberos.keytab=none
Adding default property: spark.history.fs.logDirectory=hdfs:///spark2-history/
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
Adding default property: spark.yarn.queue=default
Adding default property: spark.yarn.historyServer.address=en-002.msk.mts.ru:18081
Adding default property: spark.history.kerberos.principal=none
Adding default property: spark.history.provider=org.apache.spark.deploy.history.FsHistoryProvider
Adding default property: spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
Adding default property: spark.eventLog.dir=hdfs:///spark2-history/
Adding default property: spark.history.ui.port=18081
Parsed arguments:
  master                  yarn
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          /usr/hdp/current/spark2-historyserver/conf/spark-defaults.conf
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               null
  primaryResource         pyspark-shell
  name                    PySparkShell
  childArgs               []
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true
Spark properties used, including those specified through --conf and those from the properties file /usr/hdp/current/spark2-historyserver/conf/spark-defaults.conf:
  spark.yarn.queue -> default
  spark.history.kerberos.principal -> none
  spark.executor.extraLibraryPath -> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
  spark.driver.extraLibraryPath -> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
  spark.eventLog.enabled -> true
  spark.yarn.historyServer.address -> en-002.msk.mts.ru:18081
  spark.history.ui.port -> 18081
  spark.history.provider -> org.apache.spark.deploy.history.FsHistoryProvider
  spark.history.fs.logDirectory -> hdfs:///spark2-history/
  spark.history.kerberos.keytab -> none
  spark.eventLog.dir -> hdfs:///spark2-history/
Main class:
org.apache.spark.api.python.PythonGatewayServer
Arguments:
System properties:
  spark.yarn.queue -> default
  spark.history.kerberos.principal -> none
  spark.executor.extraLibraryPath -> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
  spark.driver.extraLibraryPath -> /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
  spark.yarn.historyServer.address -> en-002.msk.mts.ru:18081
  spark.eventLog.enabled -> true
  spark.history.ui.port -> 18081
  SPARK_SUBMIT -> true
  spark.history.provider -> org.apache.spark.deploy.history.FsHistoryProvider
  spark.history.fs.logDirectory -> hdfs:///spark2-history/
  spark.app.name -> PySparkShell
  spark.history.kerberos.keytab -> none
  spark.submit.deployMode -> client
  spark.eventLog.dir -> hdfs:///spark2-history/
  spark.master -> yarn
  spark.yarn.isPython -> true
Classpath elements:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0.2.5.0.0-1245
      /_/

Using Python version 2.7.12 (default, Jul 2 2016 17:42:40)
SparkSession available as 'spark'.
>>> ds = sqlContext.table('default.geo').limit(100000)
>>> ds.groupby('id').count().show(10)
[Stage 0:==========================================> (5 + 2) / 7]
16/11/09 18:11:56 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 7, wn-019): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
The full traceback:
ERROR TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/dataframe.py", line 287, in show
    print(self._jdf.showString(n, truncate))
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
  File "/usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o45.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 10, wn-029): java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:347)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:39)
	at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2189)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1925)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:1924)
	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2562)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:239)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:211)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:85)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
My environment:
OS: RHEL 6.5
HDP: 2.5.0.0
SPARK: 2.0
Python: 2.7 on Anaconda
Created 12-26-2016 07:21 PM
Somewhere your RDD contains null values. The NullPointerException indicates that an aggregation task hit a null value. Check your data for nulls where nulls should not be present, especially in the columns that are aggregated on (the grouping or reduce key, for example). In your case, it may be the id field.
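For example, a quick way to check that (a minimal PySpark sketch, reusing the default.geo table and the id column from the question; the names come from the original post, and this is a diagnostic, not a verified fix):

from pyspark.sql.functions import col

# Count how many rows have a null grouping key.
ds = sqlContext.table('default.geo').limit(100000)
null_ids = ds.filter(col('id').isNull()).count()
print('rows with null id: %d' % null_ids)

# Dropping rows with a null key before aggregating keeps them out of the groupBy.
ds.na.drop(subset=['id']).groupBy('id').count().show(10)

If the count is non-zero, cleaning or filtering those rows at the source is usually the more durable fix than dropping them in the query.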
Created 12-26-2016 04:42 PM
Possibly you are hitting this bug in Spark's code generation: https://issues.apache.org/jira/browse/SPARK-18528
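If that is the bug being hit, a commonly suggested workaround is to disable whole-stage code generation and re-run the query. This is only a sketch under that assumption, and the setting trades some performance for stability:

# Disable whole-stage codegen so the generated agg_doAggregateWithKeys
# path shown in the stack trace is not used, then retry the aggregation.
sqlContext.setConf('spark.sql.codegen.wholeStage', 'false')

ds = sqlContext.table('default.geo').limit(100000)
ds.groupby('id').count().show(10)

If the query succeeds with codegen off, that points at the codegen bug rather than at the data.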
Created 01-05-2017 08:51 PM
@Sergey Paramoshkin Were you able to fix this issue?