Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

The best approach to the thousands of small partitions

Solved Go to solution
Highlighted

The best approach to the thousands of small partitions

Rising Star

There is a lot of information how is necessary to avoid small files and a large number of partitions in Hive. But what if I can’t avoid them?

I have to store a Hive-table with 10 years of history data. It contains 3710 daily partitions at present day. Every partition is really small, from 80 to 15000 records. In csv format partitions vary from 25Kb to 10Mb. In ORC format partitions vary from 10Kb to 2Mb. Though I don’t think that ORC format would be effective for that small size. Queries to this table usually include date or period of dates, so daily partition is preferred.

What would be optimal approach (in terms of performance) for such a large amount of small data?

1 ACCEPTED SOLUTION

Accepted Solutions

Re: The best approach to the thousands of small partitions

Hi @Alena Melnikova

Disclaimer: Without having any knowledge of the data

STEPS:

1. For such small sets of data I would partition by YEAR.

2. I would insert the data ordering by timestamp. (Use PIG if Hive is taking more time)

3. Table structure:

....

PARTITIONED BY (

year STRING COMMENT '')

STORED AS ORC tblproperties("orc.compress"="ZLIB",

"orc.bloom.filter.columns"="time_stamp",

"orc.create.index"="true",

"orc.stripe.size"="268435456"

,"orc.row.index.stride"="12000",

"orc.compress.size"="262144"

);

4. Collect statistics on table.

5. Set few config parameters in hive

set hive.optimize.index.filter=true;

set hive.exec.orc.skip.corrupt.data=true;

set hive.vectorized.execution.enabled=true;

set hive.exec.compress.output=true;

set hive.execution.engine=tez;

set tez.am.container.reuse.enabled=TRUE;

set hive.compute.query.using.stats=true;

set stats.reliable=true; set hive.cbo.enable=true;

set hive.optimize.sort.dynamic.partition=true;

set hive.optimize.ppd=true;

set hive.optimize.ppd.storage=true;

set hive.merge.tezfiles=true;

set hive.hadoop.supports.splittable.combineinputformat=true;

set mapreduce.map.speculative=true;

6. Query with YEAR extracted from timestamp (try regex_replace function in hive) and TIMESTAMP

DO not miss any of the steps above and post us about the awesome results you get:)

Thanks

13 REPLIES 13

Re: The best approach to the thousands of small partitions

Rising Star

@jramakrishnan Thanks.

I have already read these links. There is no clear answer. What would be good file format for small partitions? csv, orc, smth else? HBase as an alternative metastore is fine, but my Hive 1.2.1 still uses MySQL. It was an idea about generate hash using the date. I would be glad if someone explains this idea in details.

Re: The best approach to the thousands of small partitions

The whole goal of having partitions is to allow Hive to limit the files it will have to look at in order to fulfill the SQL request you send into it. On the other hand, you also clearly understand that having too many small files to look at is a performance/scalability drag. With so few number of records for each day, I'd suggest partitioning at the month level (as a single string such as @Joseph Niemiec and @bpreachuk suggest in their answers to https://community.hortonworks.com/questions/29031/best-pratices-for-hive-partitioning-especially-by.....

This will allow you to keep your "original" dates as a column and let the partition months be a new virtual column. Of course, you'll need to train/explain to your query writers the benefit of using this virtual column of the partition name in the queries, but will then get the value of partitioning all while having 1/30th of the files and each of them being 30x bigger.

Good luck!

Re: The best approach to the thousands of small partitions

Rising Star

@Lester Martin Thank you,

I keep in reserve option with monthly partition (YYYY-MM). This complicates queries. But if it's the only way I'll have to use it.

Re: The best approach to the thousands of small partitions

Hi @Alena Melnikova

Disclaimer: Without having any knowledge of the data

STEPS:

1. For such small sets of data I would partition by YEAR.

2. I would insert the data ordering by timestamp. (Use PIG if Hive is taking more time)

3. Table structure:

....

PARTITIONED BY (

year STRING COMMENT '')

STORED AS ORC tblproperties("orc.compress"="ZLIB",

"orc.bloom.filter.columns"="time_stamp",

"orc.create.index"="true",

"orc.stripe.size"="268435456"

,"orc.row.index.stride"="12000",

"orc.compress.size"="262144"

);

4. Collect statistics on table.

5. Set few config parameters in hive

set hive.optimize.index.filter=true;

set hive.exec.orc.skip.corrupt.data=true;

set hive.vectorized.execution.enabled=true;

set hive.exec.compress.output=true;

set hive.execution.engine=tez;

set tez.am.container.reuse.enabled=TRUE;

set hive.compute.query.using.stats=true;

set stats.reliable=true; set hive.cbo.enable=true;

set hive.optimize.sort.dynamic.partition=true;

set hive.optimize.ppd=true;

set hive.optimize.ppd.storage=true;

set hive.merge.tezfiles=true;

set hive.hadoop.supports.splittable.combineinputformat=true;

set mapreduce.map.speculative=true;

6. Query with YEAR extracted from timestamp (try regex_replace function in hive) and TIMESTAMP

DO not miss any of the steps above and post us about the awesome results you get:)

Thanks

Re: The best approach to the thousands of small partitions

@Alena Melnikova taking a second glance at my answer I feel that you would not need any partition at all, provided you can compact yearly data into 1 file. So 10 files in total. A bit more steps but would work as gracefully as the yearly partition.

So try out solution 1 and when it works, try out solution 2 and then pick one:)

Thanks

Re: The best approach to the thousands of small partitions

Rising Star

@rbiswas Thank you! It's interesting idea. I'll test parititions by year (YYYY), by year and month as suggested @Lester Martin (YYYY-MM) and daily (YYYY-MM-DD). I'll share results here.

By the way, what would be difference between two of your approaches? Partitions by year and compact one year in one file gives the same 10 files.

Re: The best approach to the thousands of small partitions

@Alena Melnikova you got it, there is no difference apart from a very subtle one. On approach one we still kept a bit of dependency on partition pruning and on approach 2 it is entirely dependent on ordering of data via ORC index.

Re: The best approach to the thousands of small partitions

Rising Star

@rbiswas, @Lester Martin

I tested 4 variants of partitioning on 6 queries:

Daily partitions (calday=2016-10-20)
Year-month partitions (year_month=2016-10)
Year partitions (year=2016)
No partitions (but 10 files with yearly data)

It was created 4 tables following @rbiswas recommendations.

Here is yearly aggregate information about data. Just to give you idea about scale of data.

partition size records
year=2006 539.4 K 12 217
year=2007 2.8 M 75 584
year=2008 6.4 M 155 850
year=2009 9.1 M 228 247
year=2010 9.3 M 225 357
year=2011 8.5 M 196 280
year=2012 19.5 M 448 145
year=2013 113.4 M 2 494 787
year=2014 196.7 M 4 038 632
year=2015 204.3 M 4 047 002
year=2016 227.2 M 4 363 214

I run every query 5 times, cast the worst/best results and took the average of the remaining three.

The results are below:

8736-hw-partitions.png

Obviously, daily partitioning is the worst case. But it is not so clearly to the rest of the options. The results depend on the query. In the end I decided that the yearly partitioning in our case would be optimal. @rbiswas, thanks for the idea!

@rbiswas, I have couple of questions:

1. Given that I have less than 10,000 records per day would it be better to set orc.row.index.stride less than 12000?

2. In my table I have columns:

Order_date string (looks '2016-10-20'),
Order_time timestamp (looks '2016-10-20 12:45:55')

The table is sorted by order_time as you recommended and has a bloom filter index. But filter

WHERE to_date(order_time) BETWEEN ... any period 

works 15-20% slower than

WHERE order_date BETWEEN ... any period 

Actually I expected that using column with bloom filter speeds up query execution. Why it did not happen?

Don't have an account?
Coming from Hortonworks? Activate your account here