I am using Spark 2.1 to load data into an ORC-formatted Hive table. The HDP version is 2.6.1. Part of the DDL is shown below; `orc.bloom.filter.columns` is set to `hvr_last_upd_tms`.
`reset_level` double,
`software_subid` double,
`hvr_is_deleted` bigint,
`hvr_last_upd_tms` timestamp)
PARTITIONED BY (`occur_date_dt` date)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://getnamenode/apps/hive/warehouse/rmd.db/gets_tool_fault'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='hvr_last_upd_tms',
  'orc.stripe.size'='268435456',
  'spark.sql.partitionProvider'='catalog',
  'transient_lastDdlTime'='1501759358')
Time taken: 0.511 seconds, Fetched: 97 row(s)
I tried two options to load the data into the Hive table:
out = merge_coalesce_df.repartition(
    len(merge_coalesce_df.select('occur_date_dt').distinct().collect()),
    "occur_date_dt").sortWithinPartitions("hvr_last_upd_tms")
out.write.mode("append").format("orc").insertInto(table_name)
merge_coalesce_df.registerTempTable("input")
sqlContext.sql("SELECT * FROM input DISTRIBUTE BY occur_date_dt SORT BY hvr_last_upd_tms") \
    .write.format("orc").insertInto(table_name)
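For context, both options are meant to produce the same physical layout: rows routed to one task per `occur_date_dt` value, then sorted by `hvr_last_upd_tms` inside each task, with no global ordering across partitions. A minimal pure-Python emulation of that contract (the sample rows and the helper name `distribute_and_sort` are made up for illustration; this is what I expect the ORC files to contain):

```python
from collections import defaultdict

def distribute_and_sort(rows, partition_key, sort_key):
    """Emulate DISTRIBUTE BY partition_key / SORT BY sort_key:
    group rows by the partition key, then sort each group
    independently. There is no ordering guarantee across groups."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)
    return {k: sorted(v, key=lambda r: r[sort_key])
            for k, v in groups.items()}

# Hypothetical sample rows standing in for merge_coalesce_df
rows = [
    {"occur_date_dt": "2017-01-01", "hvr_last_upd_tms": "2017-01-05 08:01:28"},
    {"occur_date_dt": "2017-01-01", "hvr_last_upd_tms": "2017-01-02 18:32:09"},
    {"occur_date_dt": "2017-01-02", "hvr_last_upd_tms": "2017-01-04 10:01:46"},
]

partitions = distribute_and_sort(rows, "occur_date_dt", "hvr_last_upd_tms")
for part_rows in partitions.values():
    ts = [r["hvr_last_upd_tms"] for r in part_rows]
    assert ts == sorted(ts)  # each partition's file should come out sorted
```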
The DataFrameWriter method created a single file per partition, and on analyzing the ORC file dump I observed that the column on which the bloom filter was created is not sorted. Does this mean that when Spark creates the Hive ORC files it stores the data unsorted? How can I ensure that the column data is sorted when I store the data in Hive?
Column 81: count: 1229385 hasNull: false min: 2016-05-27 20:04:01.0 max: 2017-03-14 18:10:09.0
Column 81: count: 1180712 hasNull: false min: 2017-01-02 18:32:09.0 max: 2017-05-02 13:31:22.0
Column 81: count: 819630 hasNull: false min: 2017-01-04 10:01:46.0 max: 2017-07-04 12:14:09.0
Column 81: count: 508673 hasNull: false min: 2017-01-05 08:01:28.0 max: 2017-02-02 06:14:30.0
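Assuming these four entries are stripe statistics from the dump of a single ORC file, they already show the problem: if `hvr_last_upd_tms` were sorted within the file, each stripe's min would be at or above the previous stripe's max. A quick check over the min/max values copied from the output above:

```python
# (min, max) per stripe for column 81, copied from the dump above;
# the timestamp format sorts correctly as plain strings
stripes = [
    ("2016-05-27 20:04:01.0", "2017-03-14 18:10:09.0"),
    ("2017-01-02 18:32:09.0", "2017-05-02 13:31:22.0"),
    ("2017-01-04 10:01:46.0", "2017-07-04 12:14:09.0"),
    ("2017-01-05 08:01:28.0", "2017-02-02 06:14:30.0"),
]

def is_sorted_across_stripes(stats):
    """Sorted data implies non-overlapping, increasing stripe ranges:
    every stripe's min must be >= the previous stripe's max."""
    return all(prev_max <= cur_min
               for (_, prev_max), (cur_min, _) in zip(stats, stats[1:]))

print(is_sorted_across_stripes(stripes))  # -> False: the ranges overlap
```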
Thanks @bhagan, but the documentation says that clustering and sorting affect the READ path, not the WRITE path. In this particular case I am analyzing the write operation and why the data is not getting sorted.
Also, I am manually sorting the data in my application, so I don't think the SORT BY clause in the DDL should matter.
Hi @Benjamin Leonhardi, I was going through the presentation you uploaded to SlideShare (https://www.slideshare.net/BenjaminLeonhardi/hive-loading-data), but from Spark it does not seem to work as explained in the slides. Can you please take a look and advise?