Context: I have an issue with a MERGE statement that does not take advantage of the partitions of the destination table.
Looking for solutions, I stumbled upon this JIRA ticket, which introduces three new configuration options (in Hive 0.14):
- hive.tez.dynamic.partition.pruning: default `true`
- hive.tez.dynamic.partition.pruning.max.event.size: default `1*1024*1024L` (1 MB)
- hive.tez.dynamic.partition.pruning.max.data.size: default `100*1024*1024L` (100 MB)
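For reference, these options could be overridden per session with `SET` statements along these lines (the values shown are just the documented defaults, spelled out in bytes):

```sql
-- Session-level overrides (illustrative; values are the documented defaults)
SET hive.tez.dynamic.partition.pruning=true;
SET hive.tez.dynamic.partition.pruning.max.event.size=1048576;    -- 1*1024*1024 bytes
SET hive.tez.dynamic.partition.pruning.max.data.size=104857600;   -- 100*1024*1024 bytes
```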
Now I wonder: why shouldn't I just set these variables to the maximum possible value, to make sure that partition pruning always happens?
Pruning is disabled if the data size is too big, which I find counterintuitive, since not pruning will massively increase the amount of data read.
You can check your Hive logs for "Disabling dynamic pruning for" to see if this is what you are running into. It may also help to save the Hive logs at DEBUG level while running EXPLAIN on the statement and post them here; we might be able to see what the issue is.
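For example (table and column names here are hypothetical), an EXPLAIN on the MERGE might look like the following; if dynamic partition pruning is active, the plan should show an event operator feeding the partitioned table scan, and if it is dropped, the DEBUG logs should contain the "Disabling dynamic pruning for" message:

```sql
-- Hypothetical tables: inspect the plan for dynamic partition pruning
-- on the scan of the partitioned destination table.
EXPLAIN
MERGE INTO target_tbl AS t
USING source_tbl AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET val = s.val
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val);
```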
I believe these limits are safeguards related to the way dynamic partition pruning (DPP) is implemented in Hive. The pruning events (really the partition values from the small table) are all sent to the Tez ApplicationMaster (AM), which coordinates the Hive query. Bringing down the AM would bring down the query, so to be safe there is a limit on the amount of data sent to the AM per process (hive.tez.dynamic.partition.pruning.max.event.size) as well as on the total amount of data sent by all processes (hive.tez.dynamic.partition.pruning.max.data.size).
I do indeed see:
INFO [HiveServer2-Handler-Pool: Thread-107]: optimizer.RemoveDynamicPruningBySize (RemoveDynamicPruningBySize.java:process(61)) - Disabling dynamic pruning for: TS. Expected data size is too big: 1119008712
So if I understand correctly, this has to do with the event size and not the data size?
I did try setting the value very high to enable pruning; pruning did indeed occur, but locking all the partitions timed out. I will post an EXPLAIN ASAP.
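An override along these lines (the value here is illustrative, not a recommendation) raises the compile-time threshold above the ~1.1 GB expected data size reported in the log:

```sql
-- Illustrative only: raise the data-size threshold above the expected
-- data size (1119008712 bytes) reported by RemoveDynamicPruningBySize.
SET hive.tez.dynamic.partition.pruning.max.data.size=2147483648;
```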
From the code, it looks like it is the data size limit (TEZ_DYNAMIC_PARTITION_PRUNING_MAX_DATA_SIZE) that disables pruning here, based on the expected data size at planning time; the event size limit comes into play at execution time, within a single Map/Reduce process.
Then I interpret this setting as "if there is too much data, let's use it all instead of pruning it", which leaves me very confused 🙂 I suppose it is due to Hive's internal implementation, as you said.