Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Hive (HDInsight) POS Partition Design

Highlighted

Hive (HDInsight) POS Partition Design

New Contributor

Hi,

I'm currently working on a project that includes sales data by country & retailer. There are ~30 different retailers for 24 months of data, not sure on the number of countries per but let's assume 10 per retailer on average.

The initial partitioning strategy we've gone with is by YearMonth, Retailer & Country. These current partitions sit at ~50 million records per, ORC format w/ snappy compression the files are ~300 MB each. The number of partitions assuming these numbers would be ~7200 which looks like it exceeds what I've seen tossed around as guidelines for a relational metastore.

Given the data skew where one retailer may be several times larger than another and several fold smaller in certain countries are we setting ourselves up for problems ahead? I wanted to get a sense of what best practices we may be able to apply given our situation.

Thanks in advance for any help.

Don't have an account?
Coming from Hortonworks? Activate your account here