Support Questions

ganzuoni · ‎06-05-2026

I'm copying several directories with several files each

Yarn application full termination seems affected by the number of object transferred

In this case we have 9 minutes between 100% completion and job termination

26/06/05 10:47:20 INFO mapreduce.Job: map 100% reduce 0%
26/06/05 10:56:25 INFO mapreduce.Job: Job job_1780210407885_1954 completed successfully
with
Files Copied=68744
DIR_COPY=23058

while here few seconds

26/06/05 10:43:11 INFO mapreduce.Job:  map 100% reduce 0%
26/06/05 10:43:32 INFO mapreduce.Job: Job job_1780210407885_1958 completed successfully
with
Files Copied=12300
DIR_COPY=191

Is the any explanation for such a behaviour?

tnair · ‎06-05-2026

@ganzuoni Thank you for sharing the log excerpts and details regarding the distcp job.

The likely root cause is not Yarn container cleanup itself, but the DistCp/MapReduce job finalization phase taking longer due to the much higher object and directory count. The larger job had ~120x more directories than the smaller job, so post-copy operations such as directory metadata handling, output commit/cleanup, target validation, permission/timestamp preservation, and JobHistory finalization can take significantly longer even after the map phase shows 100%.
The NodeManager/RM logs also support this, since the AM container remained alive until the application moved to FINAL_SAVING/FINISHING. The repeated RM proxy calls every 10 seconds only indicate that the AM web endpoint was still being polled while the job was finalizing; they do not appear to be the cause of the delay.

3. Identify the ApplicationMaster container and review only the AM logs if the full log output is too large.
yarn application -status application_1780210407885_1976

>>Pay particular attention to whether any of these options were used:
-p,-delete,-atomic,-update,-overwrite,-direct

6. Check the ResourceManager logs around the delayed finalization window.
grep -E "application_1780210407885_1976|appattempt_1780210407885_1976|UNREGISTERED|FINAL_SAVING|FINISHING|AM Released Container" \
/var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log

7. Check the NodeManager logs on the AM host around the same time window.
From the RM log, the AM web endpoint appears to be on:
almapwrk15.data.com:34620
On that NodeManager host, run:
grep -E "application_1780210407885_1976|container_e22_1780210407885_1976_01_000001|succeeded|Removed completed containers|ContainerLaunch" \
/var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log

10. Suggested conclusion after validation:
The current evidence points to the MapReduce ApplicationMaster spending additional time in the post-map finalization phase, most likely due to the high number of files/directories and associated metadata/commit/cleanup operations. The RM proxy polling appears to be observational only and does not appear to be the cause. The next confirmation should come from the AM container logs around the gap between map 100% and the AM unregistering from YARN.

View solution in original post

ganzuoni · ‎06-05-2026

Checking node manager logs the time spent is between container removal and container success declaration

2026-06-05 11:08:15,452 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_e22_1780210407885_1976_01_000013]
2026-06-05 11:17:55,672 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_e22_1780210407885_1976_01_000001 succeeded

ganzuoni · ‎06-05-2026

And checking RM there is a huge number of call to the AM , one each 10 seconds:

2026-06-05 11:09:39,050 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=arcgis   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1780210407885_1976    CONTAINERID=container_e22_1780210407885_1976_01_000014  RESOURCE=<memory:2048, vCores:1>        QUEUENAME=arcgis
2026-06-05 11:09:40,102 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
2026-06-05 11:09:50,116 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
.....
2026-06-05 11:17:41,013 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1780210407885_1976_000001 with final state: FINISHING, and exit status: -1000
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1780210407885_1976_000001 State change from RUNNING to FINAL_SAVING on event = UNREGISTERED
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1780210407885_1976 with final state: FINISHING
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1780210407885_1976 State change from RUNNING to FINAL_SAVING on event = ATTEMPT_UNREGISTERED
2026-06-05 11:17:49,433 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating info for app: application_1780210407885_1976
2026-06-05 11:17:49,433 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1780210407885_1976_000001 State change from FINAL_SAVING to FINISHING on event = ATTEMPT_UPDATE_SAVED

ganzuoni · ‎06-05-2026

Anyway, everything seems related to logs collection for JobHistory

DianaTorres · ‎06-05-2026

@ganzuoni Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our YARN experts @vafs @vchalla who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.

Regards,

Diana Torres,
Senior Community Moderator

Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:
Community Guidelines
How to use the forum

tnair · ‎06-05-2026

@ganzuoni Thank you for sharing the log excerpts and details regarding the distcp job.

The likely root cause is not Yarn container cleanup itself, but the DistCp/MapReduce job finalization phase taking longer due to the much higher object and directory count. The larger job had ~120x more directories than the smaller job, so post-copy operations such as directory metadata handling, output commit/cleanup, target validation, permission/timestamp preservation, and JobHistory finalization can take significantly longer even after the map phase shows 100%.
The NodeManager/RM logs also support this, since the AM container remained alive until the application moved to FINAL_SAVING/FINISHING. The repeated RM proxy calls every 10 seconds only indicate that the AM web endpoint was still being polled while the job was finalizing; they do not appear to be the cause of the delay.

3. Identify the ApplicationMaster container and review only the AM logs if the full log output is too large.
yarn application -status application_1780210407885_1976

>>Pay particular attention to whether any of these options were used:
-p,-delete,-atomic,-update,-overwrite,-direct

6. Check the ResourceManager logs around the delayed finalization window.
grep -E "application_1780210407885_1976|appattempt_1780210407885_1976|UNREGISTERED|FINAL_SAVING|FINISHING|AM Released Container" \
/var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log

7. Check the NodeManager logs on the AM host around the same time window.
From the RM log, the AM web endpoint appears to be on:
almapwrk15.data.com:34620
On that NodeManager host, run:
grep -E "application_1780210407885_1976|container_e22_1780210407885_1976_01_000001|succeeded|Removed completed containers|ContainerLaunch" \
/var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log

10. Suggested conclusion after validation:
The current evidence points to the MapReduce ApplicationMaster spending additional time in the post-map finalization phase, most likely due to the high number of files/directories and associated metadata/commit/cleanup operations. The RM proxy polling appears to be observational only and does not appear to be the cause. The next confirmation should come from the AM container logs around the gap between map 100% and the AM unregistering from YARN.

ganzuoni · ‎06-08-2026

Thank you for the great insights

These are the disctp options

distcp -skipcrccheck -copybuffersize 32768 -update -pugp  -bandwidth 50 -strategy dynamic

and yes, total execution time increases almost linearly with nr of files/directories

Support Questions

Slow distcp job termination