Hive query with a GROUP BY clause is stuck in the reducer phase for a very long time when processing a large amount of data
ROOT CAUSE:
This happens when the GROUP BY clause is not optimized. By default, Hive sends all rows with the same group-by key to the same reducer. If the values of the group-by columns are skewed, one reducer may receive most of the shuffled data, and the query will be stuck on that reducer for a very long time.
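One way to confirm that skew is the cause is to look at how many rows fall under each group-by key. The query below is only a sketch: `my_table` and `group_col` are hypothetical names standing in for the table and group-by column of the slow query.

```sql
-- Hypothetical names: replace my_table and group_col with the table
-- and the group-by column(s) used in the slow query.
-- Lists the 20 keys with the most rows; a few keys holding most of
-- the rows points to data skew.
SELECT group_col,
       COUNT(*) AS row_count
FROM my_table
GROUP BY group_col
ORDER BY row_count DESC
LIMIT 20;
```

If one or two keys account for most of the rows, the reducer that handles those keys will receive most of the shuffled data.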
WORKAROUND:
In this case, increasing the Tez container memory will not help. We can avoid the data skew by setting the following properties before running the query:
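The property list itself is not shown here. As a sketch only, and as an assumption rather than the original list, these are two Hive settings commonly used for skewed GROUP BY keys on Tez:

```sql
-- Assumption: these settings are not taken from the text above.
-- They are standard Hive properties often used for this symptom.

-- Run the aggregation in two stages: rows are first spread across
-- reducers for partial aggregation, then merged by the group-by key.
SET hive.groupby.skewindata=true;

-- Let Tez adjust the number of reducers at runtime based on the
-- observed output size of the upstream vertices.
SET hive.tez.auto.reducer.parallelism=true;

-- Then run the original GROUP BY query in the same session.
```

Both settings apply only to the current session, so they need to be issued in the same session as the query, or placed at the top of the script that runs it.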