We have a SQL query that normally runs for about 2 minutes across a few stages, including 8 mappers and 1000 reducers. One day, 7 of the 8 mappers finished in under a minute, but the last one was stuck for 2.5 hours. The logs in that container stopped after 1-2 minutes (the task was killed, so I could not retrieve the logs, but there was nothing suspicious in what I saw). The counters show no activity at all for a very long time: CPU usage, network, and I/O were all low. Meanwhile, the 1000 reducers were up and waiting, occupying all the resources and keeping the remaining tasks waiting.
Re-running the SQL multiple times with the same data (no skew), it always completes within 1-2 minutes. What could cause a mapper to get stuck silently? Is there a way to debug this (the logs were not very helpful)?
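For the next occurrence, my plan is to grab a thread dump from the stuck container's JVM before it gets killed. A minimal sketch, assuming passwordless SSH to the NodeManager host, JDK tools (jps/jstack) on its PATH, and the container ID appearing on the child JVM's command line; the host name and container ID below are placeholders:

```python
# Sketch: capture a jstack thread dump from a stuck Tez container.
# Assumptions: passwordless SSH to the NodeManager host, jps/jstack
# available there, and the container ID present on the JVM command line.
import subprocess

def dump_stuck_container(node_host: str, container_id: str) -> str:
    # Find the PID of the JVM whose command line mentions the container ID.
    find_pid = f"jps -ml | grep {container_id} | awk '{{print $1}}'"
    pid = subprocess.check_output(["ssh", node_host, find_pid], text=True).strip()
    if not pid:
        raise RuntimeError(f"no JVM found for {container_id} on {node_host}")
    # jstack shows what every thread is blocked on (lock, socket read, etc.).
    return subprocess.check_output(["ssh", node_host, f"jstack {pid}"], text=True)

if __name__ == "__main__":
    print(dump_stuck_container("worker-17", "container_1234567890123_0042_01_000002"))
```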
This happens occasionally for different SQL queries. Is there a way or tool to detect such a situation and kill/restart the job?
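For reference, this is the kind of watchdog I have been considering, as a sketch only: it assumes the YARN ResourceManager REST API is reachable and unsecured, and rm-host plus the 30-minute threshold are placeholders, not values from our cluster.

```python
# Sketch of a watchdog that kills YARN applications running far longer
# than expected. Assumes an unsecured ResourceManager REST API;
# rm-host and the 30-minute threshold are placeholders.
import time
import requests

RM = "http://rm-host:8088"
MAX_RUNTIME_MS = 30 * 60 * 1000  # our queries normally finish in ~2 minutes

def kill_stuck_apps():
    resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
    for app in (resp.json().get("apps") or {}).get("app", []):
        # elapsedTime is reported by the RM in milliseconds.
        if app["elapsedTime"] > MAX_RUNTIME_MS:
            print(f"killing {app['id']} ({app['name']}) after {app['elapsedTime']} ms")
            requests.put(f"{RM}/ws/v1/cluster/apps/{app['id']}/state",
                         json={"state": "KILLED"})

if __name__ == "__main__":
    while True:
        kill_stuck_apps()
        time.sleep(60)
```

A blunt instrument, obviously: it cannot tell a stuck job from a legitimately long one, which is why I am asking whether something better exists.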
Also, within the same job, the Tez UI shows 223 other reducers in the running state starting at 07:08 (no logs after 07:08), yet the YARN UI shows only one container running under this application, and that container is not consuming any CPU. The Tez counters, for the mapper and for the job as a whole, do not change, so I believe there is no activity.
The table has 1000 files, ~6MB each, 6GB in total.
When the cluster has no free resources, the mapper count comes out at 8, but when there are free resources, it goes up to 500.
The SQL is a simple SELECT ... GROUP BY over ~10 columns, effectively deduplicating the records, nothing more.
While we can lower tez.grouping.max-size to force 500 mappers and get the job done quickly, I am not sure why only one mapper got stuck when there were 8 mappers. The data is not skewed, so is the last mapper doing something extra that made it hang?
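For context on the 8-vs-500 swing, here is a very rough model of how I understand Tez split grouping (the real logic lives in Tez's split grouper; the min/max sizes below are illustrative, not our actual settings):

```python
# Very rough model of Tez split grouping: the desired task count comes
# from cluster headroom, then the bytes-per-task is clamped to
# [tez.grouping.min-size, tez.grouping.max-size]. Numbers are illustrative.
MB, GB = 1 << 20, 1 << 30
TOTAL = 6 * GB  # 1000 files x ~6MB

def grouped_mappers(desired_tasks, min_size, max_size):
    bytes_per_task = TOTAL // max(desired_tasks, 1)
    bytes_per_task = min(max(bytes_per_task, min_size), max_size)
    return max(1, TOTAL // bytes_per_task)

# Busy cluster: little headroom -> few desired tasks -> big groups.
print(grouped_mappers(8, 12 * MB, 1 * GB))    # -> 8
# Idle cluster: lots of headroom -> many small groups.
print(grouped_mappers(500, 12 * MB, 1 * GB))  # -> 500
# Capping max-size near 12MB stops grouping from merging past that.
print(grouped_mappers(8, 12 * MB, 12 * MB))   # -> ~512 regardless of headroom
```

If this model is roughly right, capping tez.grouping.max-size would pin the mapper count near 500 regardless of cluster headroom, but it still would not explain why one of the 8 mappers hangs.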