Created 02-08-2023 07:37 PM
We have a SQL that normally runs for 2 mins including a few stages, including 8 mappers and 1000 reducers. One day, among the 8 mappers, 7 finished in <1 min but 1 stuck for 2.5 hours. The logs in that container stopped after 1-2 mins (the task was killed so I could not produce the logs but there were nothing suspicious). The counters show that there were no activities at all for a very long time, CPU usage low, network, IO, all low. The 1000 reducers were up and waiting, using all the resources and kept the rest of the tasks waiting.
Re-running the SQL with the same data (no skew) multiple times and it always completes within 1-2mins. What could cause a mapper to stuck silently? Is there a way to debug this (the logs were not very helpful)?
This happens occasionally for different SQLs, is there a way or tool to detect such situation and restart / kill the job?
Thanks for the help!
Created 02-09-2023 04:49 PM
It happened again. Here is the last logs before stuck:
2023-02-10 07:08:04,545 [WARN] [TezChild] |identity.HadoopFileSystemOwner|: found no group information for hive (auth:SIMPLE), using hive as primary group
2023-02-10 07:08:04,545 [INFO] [TezChild] |s3n.S3NativeFileSystem|: Opening 's3://<removed>' for reading
2023-02-10 07:08:04,561 [INFO] [TezChild] |orc.ReaderImpl|: Reading ORC rows from s3://<removed>
2023-02-10 07:08:04,734 [INFO] [TezChild] |exec.Utilities|: PLAN PATH = hdfs://ha-nn-uri/tmp/hive/hive/348e72b2-d465-4310-ac8b-76a445e56f00/hive_2023-02-10_07-05-10_262_8811434878222725580-1824/hive/_tez_scratch_dir/68ceb5bd-f173-4bf2-826d-a5af9b017d07/map.xml
2023-02-10 07:08:05,073 [WARN] [TezChild] |identity.HadoopFileSystemOwner|: found no group information for hive (auth:SIMPLE), using hive as primary group
2023-02-10 07:08:05,073 [INFO] [TezChild] |s3n.S3NativeFileSystem|: Opening 's3://<removed>' for reading
2023-02-10 07:08:05,089 [INFO] [TezChild] |orc.ReaderImpl|: Reading ORC rows from s3://<removed>
2023-02-10 07:08:05,246 [INFO] [TezChild] |exec.Utilities|: PLAN PATH = hdfs://ha-nn-uri/tmp/hive/hive/348e72b2-d465-4310-ac8b-76a445e56f00/hive_2023-02-10_07-05-10_262_8811434878222725580-1824/hive/_tez_scratch_dir/68ceb5bd-f173-4bf2-826d-a5af9b017d07/map.xml
2023-02-10 07:08:05,539 [WARN] [TezChild] |identity.HadoopFileSystemOwner|: found no group information for hive (auth:SIMPLE), using hive as primary group
2023-02-10 07:08:05,539 [INFO] [TezChild] |s3n.S3NativeFileSystem|: Opening 's3://<removed>' for reading
2023-02-10 07:08:05,555 [INFO] [TezChild] |orc.ReaderImpl|: Reading ORC rows from s3://<removed>
2023-02-10 07:08:05,711 [INFO] [TezChild] |exec.Utilities|: PLAN PATH = hdfs://ha-nn-uri/tmp/hive/hive/348e72b2-d465-4310-ac8b-76a445e56f00/hive_2023-02-10_07-05-10_262_8811434878222725580-1824/hive/_tez_scratch_dir/68ceb5bd-f173-4bf2-826d-a5af9b017d07/map.xml
2023-02-10 07:08:06,021 [INFO] [TezChild] |exec.GroupByOperator|: Hash Tbl flush: #hash table = 774033
2023-02-10 07:08:06,043 [INFO] [TezChild] |impl.PipelinedSorter|: Reducer 2: Span2.length = 994310, perItem = 253
2023-02-10 07:08:06,118 [INFO] [TezChild] |impl.PipelinedSorter|: Newly allocated block size=268435392, index=3, Number of buffers=4, currentAllocatableMemory=310378752, currentBufferSize=268435392, total=1149239296
2023-02-10 07:08:06,118 [INFO] [TezChild] |impl.PipelinedSorter|: Reducer 2: Getting memory from next block in the list, recordsWritten=2054476
2023-02-10 07:08:06,118 [INFO] [TezChild] |impl.PipelinedSorter|: Reducer 2: reserved.remaining()=268435392, reserved.metasize=16777216
2023-02-10 07:08:06,118 [INFO] [TezChild] |impl.PipelinedSorter|: Reducer 2: New Span3.length = 1048576, perItem = 16, counter:2054476
2023-02-10 08:23:52,594 [INFO] [TezChild] |exec.GroupByOperator|: Hash Table flushed: new size = 696629You can see that the logs jump from 07:08 to 08:23 the no logs whatsoever for hours. All other 7 reducers completed within 3 mins (around 07:08).
Any help is highly appreciated. The same query ran under 30 mins in Spark but is stuck for more than 2.5 hours when using Tez.
Created 02-09-2023 05:04 PM
Also the Tez UI shows in the same job, there are 223 other reducers running starting at 07:08 (no logs after 07:08), but in the YARN UI, there is only one container running under this application. That container is not consuming any CPU. The Tez counters for the mapper or for the job as a whole does not change so I think there are no activities.
Created 02-09-2023 07:18 PM
More debugging shows:
While we can set tez.grouping.max-size to force 500 mappers and get it done quickly, I'm not sure why only one mapper stuck when there were 8 mappers? The data is not skewed, so is the last mapper doing something else that made it stuck?
Created 02-26-2023 05:25 AM
Try analyzing the jstacks for the container in a stuck state.
To identify and capture the jstacks for a stuck container you may follow this article 
 
					
				
				
			
		
