Created 10-17-2016 08:14 AM
There is a lot of information about why it is necessary to avoid small files and large numbers of partitions in Hive. But what if I can't avoid them?
I have to store a Hive table with 10 years of history data. It currently contains 3710 daily partitions. Every partition is really small, from 80 to 15,000 records. In CSV format the partitions vary from 25 KB to 10 MB; in ORC format they vary from 10 KB to 2 MB, though I don't think the ORC format would be effective at that small a size. Queries against this table usually include a date or a range of dates, so daily partitioning is preferred.
What would be the optimal approach (in terms of performance) for such a large amount of small data?
Created 10-17-2016 06:48 PM
Disclaimer: this is without any knowledge of the data.
STEPS:
1. For such small sets of data, I would partition by YEAR.
2. I would insert the data ordered by timestamp (use Pig if Hive takes too long); see the sketch after step 6.
3. Table structure:
....
PARTITIONED BY (
  year STRING COMMENT '')
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="ZLIB",
  "orc.bloom.filter.columns"="time_stamp",
  "orc.create.index"="true",
  "orc.stripe.size"="268435456",
  "orc.row.index.stride"="12000",
  "orc.compress.size"="262144");
4. Collect statistics on the table (see the sketch after step 6).
5. Set a few config parameters in Hive:
set hive.optimize.index.filter=true;
set hive.exec.orc.skip.corrupt.data=true;
set hive.vectorized.execution.enabled=true;
set hive.exec.compress.output=true;
set hive.execution.engine=tez;
set tez.am.container.reuse.enabled=true;
set hive.compute.query.using.stats=true;
set hive.stats.reliable=true;
set hive.cbo.enable=true;
set hive.optimize.sort.dynamic.partition=true;
set hive.optimize.ppd=true;
set hive.optimize.ppd.storage=true;
set hive.merge.tezfiles=true;
set hive.hadoop.supports.splittable.combineinputformat=true;
set mapreduce.map.speculative=true;
6. Query with both the YEAR extracted from the timestamp (try the regexp_replace function in Hive) and the TIMESTAMP itself.
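Putting steps 2 and 4 into concrete HiveQL, here is a minimal sketch. It assumes the table from step 3 is named history_orc, is loaded from a staging table history_staging, and has columns time_stamp and metric_value; all of these names are made up for illustration, so adjust them to your schema.
-- Step 2: load into the yearly partitions, ordered by timestamp so that
-- each ORC stripe covers a narrow, contiguous time range.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE history_orc PARTITION (year)
SELECT
  time_stamp,
  metric_value,
  CAST(year(time_stamp) AS STRING) AS year
FROM history_staging
ORDER BY time_stamp;

-- Step 4: collect basic and column statistics for all partitions.
ANALYZE TABLE history_orc PARTITION (year) COMPUTE STATISTICS;
ANALYZE TABLE history_orc PARTITION (year) COMPUTE STATISTICS FOR COLUMNS;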
Do not miss any of the steps above, and post back about the awesome results you get :)
Thanks
Created 10-21-2016 02:07 PM
@Alena Melnikova Good to hear that you are happy with the results :)
Answers:
1. You can go as low as 1k. Choose a balanced value based on the average number of rows you query.
2. I believe using the to_date function on the column will cause the ORC index to stop being used (I haven't tested that). Google "why function-based index?"
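To make the second point concrete, here is a query in the spirit of step 6, against the hypothetical history_orc table from the sketch above, that keeps the predicate on the raw time_stamp column so ORC predicate pushdown and the bloom filter can still be used:
-- Partition pruning on year plus a range predicate on the raw column;
-- wrapping the column as to_date(time_stamp) = '2016-10-17' may hide it
-- from the ORC row-group index and bloom filter (see the answer above).
SELECT *
FROM history_orc
WHERE year = '2016'
  AND time_stamp >= CAST('2016-10-17 00:00:00' AS TIMESTAMP)
  AND time_stamp <  CAST('2016-10-18 00:00:00' AS TIMESTAMP);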
Created 10-22-2016 09:11 AM
Got it, thanks!
Created 10-24-2016 09:17 PM
Great job @Alena Melnikova! Nice work with the data and visualization. Really helpful; it confirms some longstanding assumptions I've had.
Created 01-04-2018 11:54 AM
Hey everyone,
I have a somewhat similar question, which I posted here:
https://community.hortonworks.com/questions/155681/how-to-defragment-hdfs-data.html
I would really appreciate any ideas.