ORC Improvements for Apache Spark 2.2 - Issues

Contributor

Hi All,

We are trying to use the new ORC features to speed up our data reads and writes from Spark 2.2.x in our HDP 2.6.3 cluster.

We have set the following two parameters in our Spark session:

sql("SET spark.sql.hive.convertMetastoreOrc=true")
sql("SET spark.sql.orc.enabled=true")

The DDL for our existing table is:

CREATE EXTERNAL TABLE `get_intg.material_usage_src_data`(                       
   `service_material_id` bigint,                                                 
   `locomotive_id` bigint,                                                       
   `creation_date` timestamp,                                                   
   `service_workorder_id` bigint,                                               
   `service_sheet_id` bigint,                                                   
   `inventory_item_id` bigint,                                                   
   `part_number` string,                                                         
   `part_description` string,                                                   
   `quantity` double,                                                           
   `uom_code` string,                                                           
   `transaction_type` string,                                                   
   `serial_number_issued` string,                                               
   `serial_number_removed` string,                                               
   `position_applied` string,                                                   
   `reason` string,                                                             
   `incident_code` string,                                                       
   `incident_desc` string,                                                       
   `ic_main_assembly` string,                                                   
   `ic_subsystem` string,                                                       
   `ic_coe` string,                                                             
   `generic_description_name` string,                                           
   `reason_changed_code` string,                                                 
   `customer_id` bigint,                                                         
   `fleet_id` bigint,                                                           
   `fleet_name` string,                                                         
   `road_number` string,                                                         
   `customer_name` string,                                                       
   `work_order_number` string,                                                   
   `model_number` string,                                                       
   `locomotive_type_code` string,                                               
   `in_service_date` timestamp,                                                 
   `ptn_number` double,                                                         
   `axel` bigint,                                                               
   `service_sheet_comment` string,                                               
   `service_type_code` string,                                                   
   `traction_motor_serial` string,                                               
   `manufacture_year` date,                                                     
   `service_organization_id ` bigint,                                           
   `etl_created_by` string,                                                     
   `etl_created_date` timestamp,                                                 
   `etl_updated_by` string,                                                     
   `etl_updated_date` timestamp)                                                 
 ROW FORMAT SERDE                                                               
   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                                   
 STORED AS INPUTFORMAT                                                           
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                             
 OUTPUTFORMAT                                                                   
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'                           
 LOCATION                                                                       
   'hdfs://getnamenode/apps/hive/warehouse/get_intg.db/material_usage_src_data' 
 TBLPROPERTIES (                                                                 
   'numFiles'='200',                                                             
   'totalSize'='426739872',                                                     
   'transient_lastDdlTime'='1520253415') 

Running the following query raises the exception shown below:

o1 = sqlContext.sql("select * from get_intg.material_usage_src_data")
o1.show(10)

18/03/05 08:19:21 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 7)
java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:156)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.scala:201)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.scala:114)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
18/03/05 08:19:21 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 7, localhost, executor driver): java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:156)

However, when we set only one of the two parameters to true, there is no issue.

Can anyone please advise?

Thanks,

Jayadeep

3 REPLIES

Re: ORC Improvements for Apache Spark 2.2 - Issues

What is the Hive version in your cluster?

Re: ORC Improvements for Apache Spark 2.2 - Issues

Expert Contributor

Hi, @Jayadeep Jayaraman

In Spark 2.2, this happens with ORC files whose file schema contains dummy column names such as `col1` instead of your actual column names such as `service_material_id`. Please check the file schema as follows.

hive --orcfiledump thefile.orc
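To make that check easier to repeat across many files, here is a minimal Python sketch that decides whether a schema string from the dump uses only dummy column names. It assumes the dump reports the schema as a `struct<name:type,...>` string and that dummy names look like `col1` or `_col1`; the pattern and the helper names `top_level_fields` and `has_dummy_names` are illustrative and not part of Hive or Spark.

```python
import re

# Assumption: dummy names look like `col1` or `_col1`; adjust to match your dump.
DUMMY_NAME = re.compile(r"^_?col\d+$")


def top_level_fields(struct_type):
    """Split 'struct<a:int,b:map<string,int>>' into (name, type) pairs."""
    text = struct_type.strip()
    if not (text.startswith("struct<") and text.endswith(">")):
        raise ValueError("expected a struct<...> type string")
    inner = text[len("struct<"):-1]
    fields, depth, start = [], 0, 0
    for i, ch in enumerate(inner):
        if ch in "<(":
            depth += 1          # entering a nested type or decimal(p,s)
        elif ch in ">)":
            depth -= 1
        elif ch == "," and depth == 0:
            fields.append(inner[start:i])  # comma between top-level fields
            start = i + 1
    if inner[start:]:
        fields.append(inner[start:])
    return [tuple(field.split(":", 1)) for field in fields]


def has_dummy_names(struct_type):
    """True when every top-level column name matches the dummy pattern."""
    names = [name for name, _ in top_level_fields(struct_type)]
    return bool(names) and all(DUMMY_NAME.match(name) for name in names)
```

If `has_dummy_names` returns True for the schema string of a file, that file is a candidate for the problem described above.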

On HDP 2.6.3, the workaround is to regenerate those files with Hive 2.x.

BTW, this is fixed in Apache Spark 2.3. There are several more ORC issues in versions before 2.3. Please see SPARK-20901.

Re: ORC Improvements for Apache Spark 2.2 - Issues

Expert Contributor

In addition, those options must be set before Spark writes ORC data so that it generates vectorizable ORC files. Otherwise, Spark generates old Hive 1.2.1-style ORC files with dummy column names such as `col1`.
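
A minimal PySpark-style sketch of that ordering, assuming `spark` is an existing SparkSession and `df` is the DataFrame to be written. The helper names `set_orc_options` and `write_orc` are hypothetical; the two option names are the ones from the original question.

```python
# The two options quoted in the question; values are passed as strings to SET.
ORC_OPTIONS = {
    "spark.sql.hive.convertMetastoreOrc": "true",
    "spark.sql.orc.enabled": "true",
}


def set_orc_options(spark, options=ORC_OPTIONS):
    """Run one SET statement per option and return the statements issued."""
    statements = ["SET {}={}".format(key, value)
                  for key, value in sorted(options.items())]
    for statement in statements:
        spark.sql(statement)
    return statements


def write_orc(spark, df, path):
    """Set the ORC options first, then write the DataFrame as ORC."""
    set_orc_options(spark)
    df.write.format("orc").mode("overwrite").save(path)
```

For example, `write_orc(spark, df, "/tmp/example_orc")` (a placeholder path) sets both options and only then writes the files.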