Created 04-16-2019 03:54 PM
I'm trying to use Spark with Hive on HDP 3.0. As I've read in several articles, we now have to use the Hive Warehouse Connector (HWC). Everything works fine except for one problem:
I can't write a DataFrame directly from Spark to Hive using the Hive Warehouse Connector.
I have the following code:
import org.apache.spark.sql.SparkSession
import com.hortonworks.hwc.HiveWarehouseSession

val spark = SparkSession
  .builder()
  .appName("Spark Hive Test Job")
  .config("spark.sql.hive.hiveserver2.jdbc.url",
    "jdbc:hive2://s01.ndtstu.local:2181,m01.ndtstu.local:2181,m02.ndtstu.local:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;user=hive;password=hive?")
  .getOrCreate()

// Build the HWC session on top of the SparkSession
val hive = HiveWarehouseSession.session(spark).build()

hive.createDatabase("testADD", true)
hive.setDatabase("testADD")
hive.createTable("tabletest").ifNotExists()
  .column("k", "int")
  .column("value", "int")
  .create()

import spark.implicits._
val x = Seq((4, 4), (5, 5)).toDF("k", "value")

// Write the DataFrame to the Hive table through the HWC data source
x.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .mode("append")
  .option("database", "testADD")
  .option("table", "tabletest")
  .save()
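For reference, I submit the job roughly like this. This is just a sketch: the assembly jar name below is an example, the actual file under /usr/hdp/current/hive_warehouse_connector/ depends on your HDP build, and spark-hive-test.jar stands in for my application jar:

spark-submit \
  --class test.Test2 \
  --master yarn \
  --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar \
  spark-hive-test.jar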
When I run it, I get the following stack trace:
19/04/16 13:09:37 ERROR WriteToDataSourceV2Exec: Data source writer com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceWriter@7057dbda is aborting.
19/04/16 13:09:37 ERROR HiveWarehouseDataSourceWriter: Aborted DataWriter job 20190416130935-a22961fa-7dee-4eed-a41c-7f67bbf3bdae
19/04/16 13:09:37 ERROR WriteToDataSourceV2Exec: Data source writer com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataSourceWriter@7057dbda aborted.
Exception in thread "main" org.apache.spark.SparkException: Writing job aborted.
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:112)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:256)
    at test.Test2$.main(Test2.scala:40)
    at test.Test2.main(Test2.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, s05.ndtstu.local, executor 1): java.lang.AbstractMethodError: com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataWriterFactory.createDataWriter(II)Lorg/apache/spark/sql/sources/v2/writer/DataWriter;
    at org.apache.spark.sql.execution.datasources.v2.InternalRowDataWriterFactory.createDataWriter(WriteToDataSourceV2.scala:191)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:129)
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:82)
    ... 25 more
Caused by: java.lang.AbstractMethodError: com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataWriterFactory.createDataWriter(II)Lorg/apache/spark/sql/sources/v2/writer/DataWriter;
    at org.apache.spark.sql.execution.datasources.v2.InternalRowDataWriterFactory.createDataWriter(WriteToDataSourceV2.scala:191)
    at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2.scala:129)
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:79)
    at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$2.apply(WriteToDataSourceV2.scala:78)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
I can't find any information about this error.
Created 04-17-2019 09:40 AM
Finally I found the solution. The problem was an incorrect version of the HWC jar: a java.lang.AbstractMethodError indicates a binary incompatibility, meaning the connector was compiled against a different version of Spark's DataSource V2 API than the one running on the cluster.
Keep in mind that the HWC jar has to be compatible with the HDP version. In my case I was building the project with Maven in Eclipse, and the HWC dependency in my pom.xml did not match the HWC build shipped with my HDP release.
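For anyone hitting the same issue, this is a minimal sketch of what the corrected dependency looks like in my pom.xml. The version string below is only an example for an HDP 3.1.0 build; replace it with the HWC version that matches your cluster (you can check the assembly jar name under /usr/hdp/current/hive_warehouse_connector/ on any cluster node):

<repositories>
  <!-- Hortonworks public repo that hosts the HDP-aligned HWC artifacts -->
  <repository>
    <id>hortonworks</id>
    <url>https://repo.hortonworks.com/content/repositories/releases/</url>
  </repository>
</repositories>

<dependencies>
  <!-- The HWC version must match the HDP stack version;
       1.0.0.3.1.0.0-78 is an example value, not necessarily yours -->
  <dependency>
    <groupId>com.hortonworks.hive</groupId>
    <artifactId>hive-warehouse-connector_2.11</artifactId>
    <version>1.0.0.3.1.0.0-78</version>
    <!-- provided: the cluster supplies the jar at runtime via --jars -->
    <scope>provided</scope>
  </dependency>
</dependencies>

With the scope set to provided, the jar you compile against is only used at build time, and the cluster's own HWC assembly (passed to spark-submit with --jars) is used at runtime, so the two no longer conflict.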