I'm currently working on a project that includes sales data by country & retailer. There are ~30 different retailers for 24 months of data, not sure on the number of countries per but let's assume 10 per retailer on average.
The initial partitioning strategy we've gone with is by YearMonth, Retailer & Country. These current partitions sit at ~50 million records per, ORC format w/ snappy compression the files are ~300 MB each. The number of partitions assuming these numbers would be ~7200 which looks like it exceeds what I've seen tossed around as guidelines for a relational metastore.
Given the data skew where one retailer may be several times larger than another and several fold smaller in certain countries are we setting ourselves up for problems ahead? I wanted to get a sense of what best practices we may be able to apply given our situation.