I have been trying to evaluate the "sweet spot" or possibly smallest data volume that we start to see the benefits of partitioning in HIVE. To evaluate this test i have created a table which has the following fields:
BIGINT key <- generated to be unique for each record
VARCHAR(30) <- Randomly generated string
INT MOD_2 <- A number derived from the key which is set to 0 or 1 depending on if the modulus (remainder) of the key when devided by 2 is 0 or 1.
INT MOD_10 <- A number [0-9] based on the modulus of key and 10.
INT MOD_5 <- Defined similarly as above [0-4]
I have then decided to create two tables with the same number of records. One table is partitioned by MOD_10 to create 10 partitions. The size of each partition (there are 10 of them) is of around 90 MB (block size of 64 MB) each. The size of the non-partitioned table is of around 900 MB (10*90). Given this sort of a set up, I would expect a query that did something like this:
select count(1) from partitioned_table where mod_10=3 to return much faster than this select count(1) from table where mod_10=3. I am not seeing much of a performance difference. Seeing that this sort of query might be returned from the stats stored in the catalog i tried the following query on both tables:
select count(1) from table where mod_10=3 and rand()*5 > mod_5
In using the rand() function i hoped that the optimizer would recognize the first predicate as a filter on the partition and use this to filter to one of the partitions directly and then apply the 2nd predicate to just the relevant partition. For the table which is not partitioned i would need to evaluate both predicates on the full 900 MB which i assumed would lead to a much slower run time. After running this test dozens of times i found that the execution time to be approximately (to the second) the same. Is this expected behaviour? I would have expected partitioning to start to show performance improvement given this type of use case.
Any thoughts on this? What should i be looking for in the explain plan of the partitioned query to validate that it's in fact only scanning the mod_10=3 partition?
well, there are too many unknowns in your question . So, I
don’t think there can be straight forward answer provided.
Query performance depends on multiple factors: execution
engine, configuration of an engine (map memory, combined splits, etc), number
of files, etc…
On 1 GB of data query response time may vary from couple of
seconds to couple of milliseconds (best case scenario with TEZ, ORC, sorted
data and bloom filter). And BTW, in optimized storage situation, I would expect
better performance on un-partitioned table.
For more ideas, please refer to following articles about
Hive Query optimization: