I have a big hive query, a MERGE statement with a 50GB source table and a 0 GB destination table.
When running it fails because the partition hosting /hadoop/yarn/local (this dir has its own partition) on one data node is filled up to 90%, ie. the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage threshold.
I could extend this partition, increase the setting or add another datanode, but the issue is likely to prop again when I need to merge 100GB, and then again for 250GB and so on.
What is the best way to sort this out? I could manually break the query (adding WHERE on dates for instance) but I have the feeling that there should be a cleaner solution.
Context: small (3 datanodes) HDP2.6 cluster on AWS.