spark action functions (for example: groupBy) error with hive acid tables

New Contributor

Hi,

I am using some ACID tables on Spark 2.3 with the HiveWarehouseConnector (hive-warehouse-connector_2.11-1.0.0.3.0.1.0-136.jar). When an ACID table has a base file or only a few small delta files, the groupBy function works fine. However, when a table has accumulated many delta files, I get the error below. (Running a compaction on the ACID tables makes the problem go away, but these tables receive new data every day and I cannot compact them every day; see the sketch after the error.)

Code:

// build a HiveWarehouseSession on top of the existing SparkSession
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()

// read the ACID table through the connector
val thk_all = hive.executeQuery("select * from acid.evdo_tahakkuk")

// this action fails once the table has accumulated many delta files
thk_all.groupBy("durum").count().show(10)

Error:

19/01/18 13:37:33 WARN TaskSetManager: Lost task 39.0 in stage 0.0 (TID 43, analitik10.gelbim.gov.tr, executor 2): TaskKilled (Stage cancelled)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 43 in stage 0.0 failed 4 times, most recent failure: Lost task 43.3 in stage 0.0 (TID 47, analitik08.gelbim.gov.tr, executor 6): org.apache.spark.util.TaskCompletionListenerException: null
  at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:139)
  at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
  at org.apache.spark.scheduler.Task.run(Task.scala:125)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$failJobAndIndependentStages(DAGScheduler.scala:1602)
  at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
  at org.apache.spark.scheduler.DAGScheduler$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
  at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGScheduler$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
  at org.apache.spark.util.EventLoop$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$collectFromPlan(Dataset.scala:3273)
  at org.apache.spark.sql.Dataset$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$anonfun$head$1.apply(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset$anonfun$52.apply(Dataset.scala:3254)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
  ... 51 elided
Caused by: org.apache.spark.util.TaskCompletionListenerException: null
  at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:139)
  at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:117)
  at org.apache.spark.scheduler.Task.run(Task.scala:125)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

scala> 19/01/18 13:37:33 WARN TaskSetManager: Lost task 33.0 in stage 0.0 (TID 41, analitik13.gelbim.gov.tr, executor 3): TaskKilled (Stage cancelled)

Does anyone have any idea about this?
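For reference, the compaction that temporarily fixes it can in principle be triggered from the same session. The sketch below is untested and assumes that HWC's executeUpdate accepts Hive DDL such as ALTER TABLE ... COMPACT:

// Untested sketch: ask Hive to run a major compaction on the ACID table so the
// delta files get merged before the Spark job reads it. That executeUpdate accepts
// this DDL is an assumption, and the compaction itself runs asynchronously in Hive.
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.executeUpdate("alter table acid.evdo_tahakkuk compact 'major'")

// Once the compactor has finished, the same aggregation works again:
val thk_all = hive.executeQuery("select * from acid.evdo_tahakkuk")
thk_all.groupBy("durum").count().show(10)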

Re: spark action functions (for example: groupBy) error with hive acid tables

New Contributor

I have been having this problem intermittently as well. Did you find a solution?

In my case, I found the following error in container log files:

java.lang.NullPointerException
  at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReader.close(HiveWarehouseDataReader.java:105)

Looking at HiveWarehouseDataReader.java line 105, it looks like this:

columnarBatch.close();

So I thought this might be missing a null check, and after adding one the problem was solved:

// only close the batch if one was actually created for this reader
if (columnarBatch != null) {
    columnarBatch.close();
}

Since Hortonworks has not been responsive to a previous pull request I created for the spark_llap project, I'm not going to bother creating one for this. If Hortonworks/Cloudera still care about their code, they should be able to make the fix themselves.