TL;DR: how do I properly set hive.tez.container.size for a job with wildly different steps?
I have an 8-data-node HDP 2.6 cluster; all data nodes are identical, with 32 GB of RAM each.
I run only one Hive MERGE statement, once per day, and it spawns about 100k mappers.
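For context, the statement has roughly this shape (table and column names are made up for illustration; the real query differs):

```sql
-- Hypothetical sketch of the daily statement; actual tables/columns differ.
MERGE INTO target_table AS t
USING staging_table AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET payload = s.payload
WHEN NOT MATCHED THEN
  INSERT VALUES (s.id, s.payload);
```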
If I set hive.tez.container.size to 1 GB, many mappers can run in parallel (faster query), but I end up with one of these errors:
If I set hive.tez.container.size to a bigger value, far fewer containers run in parallel (longer query time), but the query eventually succeeds.
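Concretely, this is the knob I am tuning before each run (the values below are only illustrative of the two extremes):

```sql
-- Illustrative values only.
SET hive.tez.container.size=1024;   -- MB: high parallelism, but mappers fail
-- versus
SET hive.tez.container.size=8192;   -- MB: query succeeds, but far fewer
                                    -- containers fit on 32 GB nodes
```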
The problem is that I do not know in advance how big the data will be, so even if I find a good hive.tez.container.size by trial and error, it might not be good enough tomorrow, and eventually my servers' memory may simply be too small. Furthermore, sizing for the worst-case scenario feels like a waste of resources.
Is there any way to get a sort of dynamic Tez container size, so that the query is both fast and succeeds?