We have a SQL query that normally runs for about 2 minutes across a few stages, including 8 mappers and 1000 reducers. One day, 7 of the 8 mappers finished in under a minute, but the last one was stuck for 2.5 hours. The logs in that container stopped after 1-2 minutes (the task was killed, so I could not retrieve the logs, but there was nothing suspicious in what I saw). The counters show no activity at all for a very long time: CPU usage, network, and I/O were all low. Meanwhile, the 1000 reducers were up and waiting, occupying all the resources and keeping the remaining tasks waiting.
Re-running the SQL multiple times with the same data (no skew), it always completes within 1-2 minutes. What could cause a mapper to get stuck silently? Is there a way to debug this (the logs were not very helpful)?
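For the next occurrence, my plan is to grab a thread dump from the stuck container's JVM before it gets killed. A minimal sketch, assuming passwordless SSH to the NodeManager host, JDK tools (jps/jstack) on its PATH, and the container ID appearing on the child JVM's command line; the host name and container ID below are placeholders:

```python
# Sketch: capture a jstack thread dump from a stuck Tez container.
# Assumptions: passwordless SSH to the NodeManager host, jps/jstack
# available there, and the container ID present on the JVM command line.
import subprocess

def dump_stuck_container(node_host: str, container_id: str) -> str:
    # Find the PID of the JVM whose command line mentions the container ID.
    find_pid = f"jps -ml | grep {container_id} | awk '{{print $1}}'"
    pid = subprocess.check_output(["ssh", node_host, find_pid], text=True).strip()
    if not pid:
        raise RuntimeError(f"no JVM found for {container_id} on {node_host}")
    # jstack shows what every thread is blocked on (lock, socket read, etc.).
    return subprocess.check_output(["ssh", node_host, f"jstack {pid}"], text=True)

if __name__ == "__main__":
    print(dump_stuck_container("worker-17", "container_1234567890123_0042_01_000002"))
```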
This happens occasionally for different SQL queries. Is there a way or tool to detect such a situation and kill/restart the job?
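For reference, this is the kind of watchdog I have been considering, as a sketch only: it assumes the YARN ResourceManager REST API is reachable and unsecured, and rm-host plus the 30-minute threshold are placeholders, not values from our cluster.

```python
# Sketch of a watchdog that kills YARN applications running far longer
# than expected. Assumes an unsecured ResourceManager REST API;
# rm-host and the 30-minute threshold are placeholders.
import time
import requests

RM = "http://rm-host:8088"
MAX_RUNTIME_MS = 30 * 60 * 1000  # our queries normally finish in ~2 minutes

def kill_stuck_apps():
    resp = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"})
    for app in (resp.json().get("apps") or {}).get("app", []):
        # elapsedTime is reported by the RM in milliseconds.
        if app["elapsedTime"] > MAX_RUNTIME_MS:
            print(f"killing {app['id']} ({app['name']}) after {app['elapsedTime']} ms")
            requests.put(f"{RM}/ws/v1/cluster/apps/{app['id']}/state",
                         json={"state": "KILLED"})

if __name__ == "__main__":
    while True:
        kill_stuck_apps()
        time.sleep(60)
```

A blunt instrument, obviously: it cannot tell a stuck job from a legitimately long one, which is why I am asking whether something better exists.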
Also, within the same job, the Tez UI shows 223 other reducers in the running state starting at 07:08 (no logs after 07:08), yet the YARN UI shows only one container running under this application, and that container is not consuming any CPU. The Tez counters, for the mapper and for the job as a whole, do not change, so I believe there is no activity.
The table has 1000 files, ~6MB each, 6GB in total.
When the cluster has no free resources, the mapper count comes out at 8, but when there are free resources, it goes up to 500.
The SQL is a simple SELECT ... GROUP BY over ~10 columns, effectively deduplicating the records, nothing more.
While we can lower tez.grouping.max-size to force 500 mappers and get the job done quickly, I am not sure why only one mapper got stuck when there were 8 mappers. The data is not skewed, so is the last mapper doing something extra that made it hang?
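For context on the 8-vs-500 swing, here is a very rough model of how I understand Tez split grouping (the real logic lives in Tez's split grouper; the min/max sizes below are illustrative, not our actual settings):

```python
# Very rough model of Tez split grouping: the desired task count comes
# from cluster headroom, then the bytes-per-task is clamped to
# [tez.grouping.min-size, tez.grouping.max-size]. Numbers are illustrative.
MB, GB = 1 << 20, 1 << 30
TOTAL = 6 * GB  # 1000 files x ~6MB

def grouped_mappers(desired_tasks, min_size, max_size):
    bytes_per_task = TOTAL // max(desired_tasks, 1)
    bytes_per_task = min(max(bytes_per_task, min_size), max_size)
    return max(1, TOTAL // bytes_per_task)

# Busy cluster: little headroom -> few desired tasks -> big groups.
print(grouped_mappers(8, 12 * MB, 1 * GB))    # -> 8
# Idle cluster: lots of headroom -> many small groups.
print(grouped_mappers(500, 12 * MB, 1 * GB))  # -> 500
# Capping max-size near 12MB stops grouping from merging past that.
print(grouped_mappers(8, 12 * MB, 12 * MB))   # -> ~512 regardless of headroom
```

If this model is roughly right, capping tez.grouping.max-size would pin the mapper count near 500 regardless of cluster headroom, but it still would not explain why one of the 8 mappers hangs.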