
Why not set hive.tez.dynamic.partition.pruning.max.data.size


Expert Contributor

Context: I have an issue with a MERGE statement, which does not use the partitions of the destination table.

Looking for solutions, I stumbled upon this JIRA ticket, which introduced three new configuration options (in Hive 0.14):

hive.tez.dynamic.partition.pruning: default true
hive.tez.dynamic.partition.pruning.max.event.size: default 1*1024*1024L
hive.tez.dynamic.partition.pruning.max.data.size: default 100*1024*1024L

Now I wonder: why shouldn't I just set these variables to the maximum possible value, to make sure that partition pruning always happens?

Pruning is disabled if the data size is too big, but I find this counterintuitive, as not pruning will massively increase the amount of data read.
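For reference, here is how these settings could be adjusted per session; this is just a sketch assuming Hive on Tez, and the values shown are the defaults from the ticket, in bytes:

```sql
-- Session-level sketch; values are the documented defaults, in bytes.
SET hive.tez.dynamic.partition.pruning=true;
SET hive.tez.dynamic.partition.pruning.max.event.size=1048576;    -- 1*1024*1024
SET hive.tez.dynamic.partition.pruning.max.data.size=104857600;   -- 100*1024*1024
```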

Cheers,

4 REPLIES

Re: Why not set hive.tez.dynamic.partition.pruning.max.data.size

Cloudera Employee

You can check your Hive logs for "Disabling dynamic pruning for" to see if this is what you are running into. If you capture the Hive logs at DEBUG level while running an EXPLAIN of the statement and post them here, we might be able to see what the issue is.

I believe these limits exist as safeguards for the way dynamic partition pruning (DPP) is implemented in Hive. During DPP, the pruning events (really the partition values from the small table) are all sent to the Tez AM, which coordinates the Hive query. A flood of oversized events could overwhelm the AM, and bringing down the AM would bring down the query, so to be safe there is a limit on the amount of data sent to the AM per process (hive.tez.dynamic.partition.pruning.max.event.size) as well as on the total amount of data sent by all processes (hive.tez.dynamic.partition.pruning.max.data.size).
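The safeguard described above can be sketched roughly as follows. This is an illustration only, not Hive's actual code: it just models the two checks, a compile-time one on the total expected data size and a per-event one at runtime, using the default thresholds:

```python
# Illustrative sketch only -- NOT Hive's actual implementation.
# Models the two DPP safeguards described above, using the default limits.

MAX_EVENT_SIZE = 1 * 1024 * 1024    # hive.tez.dynamic.partition.pruning.max.event.size
MAX_DATA_SIZE = 100 * 1024 * 1024   # hive.tez.dynamic.partition.pruning.max.data.size

def should_disable_pruning(expected_data_size: int) -> bool:
    """Compile-time check: disable DPP for an operator when the optimizer's
    estimate of the total data sent to the Tez AM exceeds the overall cap."""
    return expected_data_size > MAX_DATA_SIZE

def event_too_big(event_size: int) -> bool:
    """Runtime check within a single task: reject one oversized pruning event."""
    return event_size > MAX_EVENT_SIZE

# A ~1.1 GB estimate (as in the log message this thread discusses) trips the
# 100 MB total cap, so pruning would be disabled at compile time.
print(should_disable_pruning(1119008712))  # True
```

The point of the sketch is that the limits protect the AM, not the scan: when the estimate exceeds the cap, Hive falls back to reading all partitions rather than risk the coordinator.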

Re: Why not set hive.tez.dynamic.partition.pruning.max.data.size

Expert Contributor

I indeed see

INFO  [HiveServer2-Handler-Pool: Thread-107]: optimizer.RemoveDynamicPruningBySize (RemoveDynamicPruningBySize.java:process(61)) - Disabling dynamic pruning for: TS. Expected data size is too big: 1119008712

So if I understand correctly, this has to do with the event size and not the data size?

I did try setting the value very high to enable pruning; pruning did indeed occur, but locking all the partitions timed out. Will post an EXPLAIN ASAP.


Re: Why not set hive.tez.dynamic.partition.pruning.max.data.size

Cloudera Employee

From the code, it looks like this message is driven by the data size (TEZ_DYNAMIC_PARTITION_PRUNING_MAX_DATA_SIZE); the event size comes into play at execution time, within a single Map/Reduce process.

Re: Why not set hive.tez.dynamic.partition.pruning.max.data.size

Expert Contributor

Then I interpret this setting as "if there is too much data, let's read it all instead of pruning it", which I find very confusing :) I suppose it is due to Hive's internal implementation, as you said.