Created 06-05-2026 02:19 AM
I'm copying several directories with several files each
Yarn application full termination seems affected by the number of object transferred
In this case we have 9 minutes between 100% completion and job termination
26/06/05 10:47:20 INFO mapreduce.Job: map 100% reduce 0%
26/06/05 10:56:25 INFO mapreduce.Job: Job job_1780210407885_1954 completed successfully
with
Files Copied=68744
DIR_COPY=23058while here few seconds
26/06/05 10:43:11 INFO mapreduce.Job: map 100% reduce 0%
26/06/05 10:43:32 INFO mapreduce.Job: Job job_1780210407885_1958 completed successfully
with
Files Copied=12300
DIR_COPY=191Is the any explanation for such a behaviour?
Created on 06-05-2026 03:13 PM - edited 06-05-2026 03:25 PM
@ganzuoni Thank you for sharing the log excerpts and details regarding the distcp job.
The likely root cause is not Yarn container cleanup itself, but the DistCp/MapReduce job finalization phase taking longer due to the much higher object and directory count. The larger job had ~120x more directories than the smaller job, so post-copy operations such as directory metadata handling, output commit/cleanup, target validation, permission/timestamp preservation, and JobHistory finalization can take significantly longer even after the map phase shows 100%.
The NodeManager/RM logs also support this, since the AM container remained alive until the application moved to FINAL_SAVING/FINISHING. The repeated RM proxy calls every 10 seconds only indicate that the AM web endpoint was still being polled while the job was finalizing; they do not appear to be the cause of the delay.
1. Review the ApplicationMaster logs for the delayed application.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "commit|committer|cleanup|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|_temporary|_SUCCESS|unregister|finish|final"
2. Check specifically whether the AM container was spending time in job commit or cleanup before unregistering from YARN.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "commitJob|cleanupJob|job commit|job cleanup|unregister|FinalApplicationStatus|succeeded|FINISHING|FINAL_SAVING"
3. Identify the ApplicationMaster container and review only the AM logs if the full log output is too large.
yarn application -status application_1780210407885_1976
>Then use the AM container ID from the output/logs and run:
yarn logs -applicationId application_1780210407885_1976 \
-containerId container_e22_1780210407885_1976_01_000001 \
| egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"
4. Confirm the exact DistCp command/options used for the slower job.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "distcp|-p|-delete|-atomic|-update|-overwrite|-direct|preserve|options"
>>Pay particular attention to whether any of these options were used:
-p,-delete,-atomic,-update,-overwrite,-direct
5. Check whether metadata preservation may be increasing finalization time.
If -p was used, confirm which attributes were preserved, for example permissions, ownership, group, timestamps, ACLs, or XAttrs. These can add many filesystem metadata operations when the job has many directories/files.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "preserve|permission|owner|group|timestamp|acl|xattr|chown|chmod|setTimes"
6. Check the ResourceManager logs around the delayed finalization window.
grep -E "application_1780210407885_1976|appattempt_1780210407885_1976|UNREGISTERED|FINAL_SAVING|FINISHING|AM Released Container" \
/var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log
7. Check the NodeManager logs on the AM host around the same time window.
From the RM log, the AM web endpoint appears to be on:
almapwrk15.data.com:34620
On that NodeManager host, run:
grep -E "application_1780210407885_1976|container_e22_1780210407885_1976_01_000001|succeeded|Removed completed containers|ContainerLaunch" \
/var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log
8. Compare with the faster application to confirm whether the delay scales with object/directory count.
yarn logs -applicationId application_1780210407885_1958 \
| egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"
9. If the target is object storage, confirm whether commit/rename/delete operations are slow.
For object-store targets, rename/delete/list operations can be more expensive than on HDFS. Check for object-store related messages:
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "s3a|abfs|wasb|ozone|ofs|object|rename|delete|listStatus|copy|multipart|commit"
10. Suggested conclusion after validation:
The current evidence points to the MapReduce ApplicationMaster spending additional time in the post-map finalization phase, most likely due to the high number of files/directories and associated metadata/commit/cleanup operations. The RM proxy polling appears to be observational only and does not appear to be the cause. The next confirmation should come from the AM container logs around the gap between map 100% and the AM unregistering from YARN.
Created 06-05-2026 02:31 AM
Checking node manager logs the time spent is between container removal and container success declaration
2026-06-05 11:08:15,452 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_e22_1780210407885_1976_01_000013]
2026-06-05 11:17:55,672 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_e22_1780210407885_1976_01_000001 succeeded
Created 06-05-2026 03:03 AM
And checking RM there is a huge number of call to the AM , one each 10 seconds:
2026-06-05 11:09:39,050 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=arcgis OPERATION=AM Released Container TARGET=SchedulerApp RESULT=SUCCESS APPID=application_1780210407885_1976 CONTAINERID=container_e22_1780210407885_1976_01_000014 RESOURCE=<memory:2048, vCores:1> QUEUENAME=arcgis
2026-06-05 11:09:40,102 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
2026-06-05 11:09:50,116 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
.....
2026-06-05 11:17:41,013 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1780210407885_1976_000001 with final state: FINISHING, and exit status: -1000
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1780210407885_1976_000001 State change from RUNNING to FINAL_SAVING on event = UNREGISTERED
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1780210407885_1976 with final state: FINISHING
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1780210407885_1976 State change from RUNNING to FINAL_SAVING on event = ATTEMPT_UNREGISTERED
2026-06-05 11:17:49,433 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating info for app: application_1780210407885_1976
2026-06-05 11:17:49,433 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1780210407885_1976_000001 State change from FINAL_SAVING to FINISHING on event = ATTEMPT_UPDATE_SAVED
Created 06-05-2026 03:04 AM
Anyway, everything seems related to logs collection for JobHistory
Created 06-05-2026 02:42 PM
@ganzuoni Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our YARN experts @vafs @vchalla who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres,Created on 06-05-2026 03:13 PM - edited 06-05-2026 03:25 PM
@ganzuoni Thank you for sharing the log excerpts and details regarding the distcp job.
The likely root cause is not Yarn container cleanup itself, but the DistCp/MapReduce job finalization phase taking longer due to the much higher object and directory count. The larger job had ~120x more directories than the smaller job, so post-copy operations such as directory metadata handling, output commit/cleanup, target validation, permission/timestamp preservation, and JobHistory finalization can take significantly longer even after the map phase shows 100%.
The NodeManager/RM logs also support this, since the AM container remained alive until the application moved to FINAL_SAVING/FINISHING. The repeated RM proxy calls every 10 seconds only indicate that the AM web endpoint was still being polled while the job was finalizing; they do not appear to be the cause of the delay.
1. Review the ApplicationMaster logs for the delayed application.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "commit|committer|cleanup|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|_temporary|_SUCCESS|unregister|finish|final"
2. Check specifically whether the AM container was spending time in job commit or cleanup before unregistering from YARN.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "commitJob|cleanupJob|job commit|job cleanup|unregister|FinalApplicationStatus|succeeded|FINISHING|FINAL_SAVING"
3. Identify the ApplicationMaster container and review only the AM logs if the full log output is too large.
yarn application -status application_1780210407885_1976
>Then use the AM container ID from the output/logs and run:
yarn logs -applicationId application_1780210407885_1976 \
-containerId container_e22_1780210407885_1976_01_000001 \
| egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"
4. Confirm the exact DistCp command/options used for the slower job.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "distcp|-p|-delete|-atomic|-update|-overwrite|-direct|preserve|options"
>>Pay particular attention to whether any of these options were used:
-p,-delete,-atomic,-update,-overwrite,-direct
5. Check whether metadata preservation may be increasing finalization time.
If -p was used, confirm which attributes were preserved, for example permissions, ownership, group, timestamps, ACLs, or XAttrs. These can add many filesystem metadata operations when the job has many directories/files.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "preserve|permission|owner|group|timestamp|acl|xattr|chown|chmod|setTimes"
6. Check the ResourceManager logs around the delayed finalization window.
grep -E "application_1780210407885_1976|appattempt_1780210407885_1976|UNREGISTERED|FINAL_SAVING|FINISHING|AM Released Container" \
/var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log
7. Check the NodeManager logs on the AM host around the same time window.
From the RM log, the AM web endpoint appears to be on:
almapwrk15.data.com:34620
On that NodeManager host, run:
grep -E "application_1780210407885_1976|container_e22_1780210407885_1976_01_000001|succeeded|Removed completed containers|ContainerLaunch" \
/var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log
8. Compare with the faster application to confirm whether the delay scales with object/directory count.
yarn logs -applicationId application_1780210407885_1958 \
| egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"
9. If the target is object storage, confirm whether commit/rename/delete operations are slow.
For object-store targets, rename/delete/list operations can be more expensive than on HDFS. Check for object-store related messages:
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "s3a|abfs|wasb|ozone|ofs|object|rename|delete|listStatus|copy|multipart|commit"
10. Suggested conclusion after validation:
The current evidence points to the MapReduce ApplicationMaster spending additional time in the post-map finalization phase, most likely due to the high number of files/directories and associated metadata/commit/cleanup operations. The RM proxy polling appears to be observational only and does not appear to be the cause. The next confirmation should come from the AM container logs around the gap between map 100% and the AM unregistering from YARN.
Created 06-08-2026 12:27 AM
Thank you for the great insights
These are the disctp options
distcp -skipcrccheck -copybuffersize 32768 -update -pugp -bandwidth 50 -strategy dynamicand yes, total execution time increases almost linearly with nr of files/directories