Created 03-05-2018 01:41 PM
Hi All,
We are trying to use the new ORC features to speed up our data reads and writes from Spark 2.2.x in our HDP 2.6.3 cluster.
We have set the following two parameters in our Spark session:
sql("SET spark.sql.hive.convertMetastoreOrc=true") sql("SET spark.sql.orc.enabled=true")
The DDL for our existing table is
CREATE EXTERNAL TABLE `get_intg.material_usage_src_data`(
  `service_material_id` bigint,
  `locomotive_id` bigint,
  `creation_date` timestamp,
  `service_workorder_id` bigint,
  `service_sheet_id` bigint,
  `inventory_item_id` bigint,
  `part_number` string,
  `part_description` string,
  `quantity` double,
  `uom_code` string,
  `transaction_type` string,
  `serial_number_issued` string,
  `serial_number_removed` string,
  `position_applied` string,
  `reason` string,
  `incident_code` string,
  `incident_desc` string,
  `ic_main_assembly` string,
  `ic_subsystem` string,
  `ic_coe` string,
  `generic_description_name` string,
  `reason_changed_code` string,
  `customer_id` bigint,
  `fleet_id` bigint,
  `fleet_name` string,
  `road_number` string,
  `customer_name` string,
  `work_order_number` string,
  `model_number` string,
  `locomotive_type_code` string,
  `in_service_date` timestamp,
  `ptn_number` double,
  `axel` bigint,
  `service_sheet_comment` string,
  `service_type_code` string,
  `traction_motor_serial` string,
  `manufacture_year` date,
  `service_organization_id ` bigint,
  `etl_created_by` string,
  `etl_created_date` timestamp,
  `etl_updated_by` string,
  `etl_updated_date` timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://getnamenode/apps/hive/warehouse/get_intg.db/material_usage_src_data'
TBLPROPERTIES (
  'numFiles'='200',
  'totalSize'='426739872',
  'transient_lastDdlTime'='1520253415')
The exception we are getting is as below:
o1 = sqlContext.sql("select * from get_intg.material_usage_src_data")
o1.show(10)

18/03/05 08:19:21 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 7)
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
  at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:201)
  at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
18/03/05 08:19:21 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 7, localhost, executor driver): java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
However, when we set only one of the two parameters to true, there is no issue.
Can anyone please advise?
Thanks,
Jayadeep
Created 03-05-2018 02:37 PM
What is the Hive version in your cluster?
Created 03-05-2018 06:55 PM
In Spark 2.2, this happens for ORC files whose file schema contains dummy column names like `col1` instead of your real column names like `service_material_id`. Please check the file schema with the following command:
hive --orcfiledump thefile.orc
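If it is easier to check from Spark, reading a single data file directly also prints the physical schema stored in the ORC footer. A sketch; the file name under the table location is hypothetical:

// Inspect the schema recorded in the ORC file footer. Files written by
// the old Hive 1.2.1 writer print dummy names like col1 here instead of
// real column names such as service_material_id.
spark.read.orc("hdfs://getnamenode/apps/hive/warehouse/get_intg.db/material_usage_src_data/000000_0").printSchema()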
The workaround on HDP 2.6.3 is to regenerate those files with Hive 2.x.
BTW, it's fixed in Apache Spark 2.3, along with several more ORC issues resolved before 2.3. Please see SPARK-20901.
Created 03-05-2018 07:02 PM
In addition to that, Spark needs those options set before writing ORC in order to generate vectorizable ORC files. Otherwise, Spark will generate old Hive 1.2.1-style ORC files having dummy column names like `col1`.
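A minimal sketch of that flow from Spark itself, assuming the old files remain readable through the Hive serde path (with convertMetastoreOrc=false) and using a hypothetical new table name as the target:

// Read the old files via the Hive serde reader (non-vectorized, so the
// assertion above is avoided), then rewrite them with the new ORC writer
// so the new files carry the real column names.
sql("SET spark.sql.hive.convertMetastoreOrc=false")
sql("SET spark.sql.orc.enabled=true")
val df = sql("select * from get_intg.material_usage_src_data")
df.write.format("orc").mode("overwrite").saveAsTable("get_intg.material_usage_src_data_new")  // hypothetical target table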