Created 10-17-2016 08:14 AM
There is a lot of information about why it is necessary to avoid small files and large numbers of partitions in Hive. But what if I can't avoid them?
I have to store a Hive table with 10 years of history data. It currently contains 3710 daily partitions. Every partition is really small, from 80 to 15,000 records. In CSV format the partitions vary from 25 KB to 10 MB; in ORC format they vary from 10 KB to 2 MB, though I don't think the ORC format would be effective at that small a size. Queries against this table usually include a date or a range of dates, so daily partitioning is preferred.
What would be the optimal approach (in terms of performance) for such a large amount of small data?
Created 10-17-2016 06:48 PM
Disclaimer: this is without any knowledge of the data.
STEPS:
1. For such small sets of data, I would partition by YEAR.
2. I would insert the data ordered by timestamp (use Pig if Hive takes too long); see the sketch after step 6.
3. Table structure:
....
PARTITIONED BY (
  year STRING COMMENT '')
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="ZLIB",
  "orc.bloom.filter.columns"="time_stamp",
  "orc.create.index"="true",
  "orc.stripe.size"="268435456",
  "orc.row.index.stride"="12000",
  "orc.compress.size"="262144");
4. Collect statistics on the table (see the sketch after step 6).
5. Set a few config parameters in Hive:
set hive.optimize.index.filter=true;
set hive.exec.orc.skip.corrupt.data=true;
set hive.vectorized.execution.enabled=true;
set hive.exec.compress.output=true;
set hive.execution.engine=tez;
set tez.am.container.reuse.enabled=true;
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.cbo.enable=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
set hive.merge.tezfiles=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set mapreduce.map.speculative=true;
6. Query with both the YEAR extracted from the timestamp (try the regexp_replace function in Hive) and the TIMESTAMP itself.
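Putting steps 2 and 4 into concrete HiveQL, here is a minimal sketch. It assumes the table from step 3 is named history_orc, is loaded from a staging table history_staging, and has columns time_stamp and metric_value; all of these names are made up for illustration, so adjust them to your schema.
-- Step 2: load into the yearly partitions, ordered by timestamp so that
-- each ORC stripe covers a narrow, contiguous time range.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE history_orc PARTITION (year)
SELECT
  time_stamp,
  metric_value,
  CAST(year(time_stamp) AS STRING) AS year
FROM history_staging
ORDER BY time_stamp;

-- Step 4: collect basic and column statistics for all partitions.
ANALYZE TABLE history_orc PARTITION (year) COMPUTE STATISTICS;
ANALYZE TABLE history_orc PARTITION (year) COMPUTE STATISTICS FOR COLUMNS;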
Do not miss any of the steps above, and post back about the awesome results you get :)
Thanks
Created 10-21-2016 02:07 PM
@Alena Melnikova Good to hear that you are happy with the results :)
Answers:
1. You can go as low as 1k. Choose a balanced value based on the average number of rows you query.
2. I believe using the to_date function on the column will cause the ORC index to stop being used (I haven't tested that). Google "why function-based index?"
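To make the second point concrete, here is a query in the spirit of step 6, against the hypothetical history_orc table from the sketch above, that keeps the predicate on the raw time_stamp column so ORC predicate pushdown and the bloom filter can still be used:
-- Partition pruning on year plus a range predicate on the raw column;
-- wrapping the column as to_date(time_stamp) = '2016-10-17' may hide it
-- from the ORC row-group index and bloom filter (see the answer above).
SELECT *
FROM history_orc
WHERE year = '2016'
  AND time_stamp >= CAST('2016-10-17 00:00:00' AS TIMESTAMP)
  AND time_stamp <  CAST('2016-10-18 00:00:00' AS TIMESTAMP);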
Created 10-22-2016 09:11 AM
Got it, thanks!
Created 10-24-2016 09:17 PM
Great job @Alena Melnikova! Nice work with the data and visualization. Really helpful; it confirms some longstanding assumptions I've had.
Created 01-04-2018 11:54 AM
Hey everyone,
I have a somewhat similar question, which I posted here:
https://community.hortonworks.com/questions/155681/how-to-defragment-hdfs-data.html
I would really appreciate any ideas.