Spark Hive ORC Bloom Filter Ordering

Contributor
Hi All,

I am using Spark 2.1 to load data into an ORC-formatted Hive table. The HDP version is 2.6.1. Part of the DDL is shown below; the table property orc.bloom.filter.columns is set to hvr_last_upd_tms.

  `reset_level` double,
  `software_subid` double,
  `hvr_is_deleted` bigint,
  `hvr_last_upd_tms` timestamp)
PARTITIONED BY (
  `occur_date_dt` date)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://getnamenode/apps/hive/warehouse/rmd.db/gets_tool_fault'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='hvr_last_upd_tms',
  'orc.stripe.size'='268435456',
  'spark.sql.partitionProvider'='catalog',
  'transient_lastDdlTime'='1501759358')

I tried two options to load the data into the Hive table.

Option 1

# One Spark partition per distinct date, each sorted on the bloom filter column.
num_parts = len(merge_coalesce_df.select('occur_date_dt').distinct().collect())
out = merge_coalesce_df.repartition(num_parts, "occur_date_dt") \
                       .sortWithinPartitions("hvr_last_upd_tms")

out.write.mode("append").format("orc").insertInto(table_name)

Option 2

# Same idea in SQL: DISTRIBUTE BY shuffles on the partition column, SORT BY sorts within each task.
merge_coalesce_df.registerTempTable("input")
sqlContext.sql("SELECT * FROM input DISTRIBUTE BY occur_date_dt SORT BY hvr_last_upd_tms").write.format("orc").insertInto(table_name)
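One more variant of the same idea, sketched here only (I have not verified that it changes the written order), adds the partition column to the SORT BY so that rows for one occur_date_dt stay contiguous within each task:

# Sketch: sort on the partition column first, then on the bloom filter column.
sqlContext.sql(
    "SELECT * FROM input "
    "DISTRIBUTE BY occur_date_dt "
    "SORT BY occur_date_dt, hvr_last_upd_tms"
).write.format("orc").insertInto(table_name)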

The DataFrame writer method created a single file per partition, and on analyzing the ORC file dump I observed that the column on which the bloom filter was created is not sorted: the min/max ranges below overlap, which they would not if the data were sorted. Does this mean that when Spark creates the Hive ORC files it stores the rows in an unsorted fashion? How can I ensure that the column data is sorted when I store the data in Hive?

    Column 81: count: 1229385 hasNull: false min: 2016-05-27 20:04:01.0 max: 2017-03-14 18:10:09.0
    Column 81: count: 1180712 hasNull: false min: 2017-01-02 18:32:09.0 max: 2017-05-02 13:31:22.0
    Column 81: count: 819630 hasNull: false min: 2017-01-04 10:01:46.0 max: 2017-07-04 12:14:09.0
    Column 81: count: 508673 hasNull: false min: 2017-01-05 08:01:28.0 max: 2017-02-02 06:14:30.0
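To double-check the file order independently of the ORC dump, a minimal PySpark sketch like the one below can be used. The file path is just an example under my table location, and monotonically_increasing_id only approximates physical row order, so treat it as a sanity check:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Example path: one physical ORC file inside one Hive partition directory.
path = ("hdfs://getnamenode/apps/hive/warehouse/rmd.db/gets_tool_fault/"
        "occur_date_dt=2017-01-01/part-00000")
df = sqlContext.read.orc(path)

# Compare each timestamp with the previous row in (approximate) file order.
w = Window.orderBy(F.monotonically_increasing_id())
out_of_order = (df
    .withColumn("prev_tms", F.lag("hvr_last_upd_tms").over(w))
    .filter(F.col("prev_tms") > F.col("hvr_last_upd_tms"))
    .count())
print("out-of-order rows:", out_of_order)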


3 Replies

Re: Spark Hive ORC Bloom Filter Ordering

Expert Contributor

Try creating your table with bucketing and sorting as described here:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-BucketedSorted...
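For illustration, here is a minimal sketch of the bucketed and sorted DDL from that page, adapted to the columns in this thread. The table name, bucketing column, and bucket count are placeholders I picked, and the statement would be issued from Hive:

CREATE TABLE rmd.gets_tool_fault_sorted (
  `hvr_is_deleted` bigint,
  `hvr_last_upd_tms` timestamp)
PARTITIONED BY (
  `occur_date_dt` date)
CLUSTERED BY (hvr_is_deleted)
SORTED BY (hvr_last_upd_tms ASC)
INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES (
  'orc.bloom.filter.columns'='hvr_last_upd_tms');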

Re: Spark Hive ORC Bloom Filter Ordering

Contributor

Thanks @bhagan, but the documentation mentions that clustering and sorting affect the READ and not the WRITE. In this particular case I am analyzing the write operation and why it is not getting sorted.

Also, I am manually sorting the data in my application, so I don't think the SORT BY clause in the DDL should matter.

Re: Spark Hive ORC Bloom Filter Ordering

Contributor

Hi @Benjamin Leonhardi, I was going through the presentation you uploaded to SlideShare, https://www.slideshare.net/BenjaminLeonhardi/hive-loading-data, but from Spark it does not seem to work as explained in the slides. Can you please take a look and advise?