Support Questions

Find answers, ask questions, and share your expertise

Slow distcp job termination

avatar
Explorer

I'm copying several directories with several files each

Yarn application full termination seems affected by the number of object transferred

In this case we have 9 minutes between 100% completion and job termination

26/06/05 10:47:20 INFO mapreduce.Job: map 100% reduce 0%
26/06/05 10:56:25 INFO mapreduce.Job: Job job_1780210407885_1954 completed successfully
with
Files Copied=68744
DIR_COPY=23058

while here few seconds

26/06/05 10:43:11 INFO mapreduce.Job:  map 100% reduce 0%
26/06/05 10:43:32 INFO mapreduce.Job: Job job_1780210407885_1958 completed successfully
with
Files Copied=12300
DIR_COPY=191

 Is the any explanation for such a behaviour?

1 ACCEPTED SOLUTION

avatar
Cloudera Employee

@ganzuoni Thank you for sharing the log excerpts and details regarding the distcp job.

The likely root cause is not Yarn container cleanup itself, but the DistCp/MapReduce job finalization phase taking longer due to the much higher object and directory count. The larger job had ~120x more directories than the smaller job, so post-copy operations such as directory metadata handling, output commit/cleanup, target validation, permission/timestamp preservation, and JobHistory finalization can take significantly longer even after the map phase shows 100%.
The NodeManager/RM logs also support this, since the AM container remained alive until the application moved to FINAL_SAVING/FINISHING. The repeated RM proxy calls every 10 seconds only indicate that the AM web endpoint was still being polled while the job was finalizing; they do not appear to be the cause of the delay.

1. Review the ApplicationMaster logs for the delayed application.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "commit|committer|cleanup|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|_temporary|_SUCCESS|unregister|finish|final"

2. Check specifically whether the AM container was spending time in job commit or cleanup before unregistering from YARN.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "commitJob|cleanupJob|job commit|job cleanup|unregister|FinalApplicationStatus|succeeded|FINISHING|FINAL_SAVING"

3. Identify the ApplicationMaster container and review only the AM logs if the full log output is too large.
yarn application -status application_1780210407885_1976

>Then use the AM container ID from the output/logs and run:
yarn logs -applicationId application_1780210407885_1976 \
-containerId container_e22_1780210407885_1976_01_000001 \
| egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"

4. Confirm the exact DistCp command/options used for the slower job.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "distcp|-p|-delete|-atomic|-update|-overwrite|-direct|preserve|options"

>>Pay particular attention to whether any of these options were used:
-p,-delete,-atomic,-update,-overwrite,-direct

5. Check whether metadata preservation may be increasing finalization time.
If -p was used, confirm which attributes were preserved, for example permissions, ownership, group, timestamps, ACLs, or XAttrs. These can add many filesystem metadata operations when the job has many directories/files.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "preserve|permission|owner|group|timestamp|acl|xattr|chown|chmod|setTimes"

6. Check the ResourceManager logs around the delayed finalization window.
grep -E "application_1780210407885_1976|appattempt_1780210407885_1976|UNREGISTERED|FINAL_SAVING|FINISHING|AM Released Container" \
/var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log

7. Check the NodeManager logs on the AM host around the same time window.
From the RM log, the AM web endpoint appears to be on:
almapwrk15.data.com:34620
On that NodeManager host, run:
grep -E "application_1780210407885_1976|container_e22_1780210407885_1976_01_000001|succeeded|Removed completed containers|ContainerLaunch" \
/var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log

8. Compare with the faster application to confirm whether the delay scales with object/directory count.
yarn logs -applicationId application_1780210407885_1958 \
| egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"

9. If the target is object storage, confirm whether commit/rename/delete operations are slow.
For object-store targets, rename/delete/list operations can be more expensive than on HDFS. Check for object-store related messages:
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "s3a|abfs|wasb|ozone|ofs|object|rename|delete|listStatus|copy|multipart|commit"

10. Suggested conclusion after validation:
The current evidence points to the MapReduce ApplicationMaster spending additional time in the post-map finalization phase, most likely due to the high number of files/directories and associated metadata/commit/cleanup operations. The RM proxy polling appears to be observational only and does not appear to be the cause. The next confirmation should come from the AM container logs around the gap between map 100% and the AM unregistering from YARN.

View solution in original post

6 REPLIES 6

avatar
Explorer

Checking node manager logs the time spent is between container removal and container success declaration 

2026-06-05 11:08:15,452 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_e22_1780210407885_1976_01_000013]
2026-06-05 11:17:55,672 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_e22_1780210407885_1976_01_000001 succeeded

avatar
Explorer

And checking RM there is a huge number of call to the AM , one each 10 seconds:

2026-06-05 11:09:39,050 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=arcgis   OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1780210407885_1976    CONTAINERID=container_e22_1780210407885_1976_01_000014  RESOURCE=<memory:2048, vCores:1>        QUEUENAME=arcgis
2026-06-05 11:09:40,102 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
2026-06-05 11:09:50,116 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
.....
2026-06-05 11:17:41,013 INFO org.apache.hadoop.yarn.server.webproxy.WebAppProxyServlet: null is accessing unchecked http://almapwrk15.data.com:34620/ws/v1/mapreduce/jobs/job_1780210407885_1976 which is the app master GUI of application_1780210407885_1976 owned by arcgis
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Updating application attempt appattempt_1780210407885_1976_000001 with final state: FINISHING, and exit status: -1000
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1780210407885_1976_000001 State change from RUNNING to FINAL_SAVING on event = UNREGISTERED
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Updating application application_1780210407885_1976 with final state: FINISHING
2026-06-05 11:17:49,422 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1780210407885_1976 State change from RUNNING to FINAL_SAVING on event = ATTEMPT_UNREGISTERED
2026-06-05 11:17:49,433 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Updating info for app: application_1780210407885_1976
2026-06-05 11:17:49,433 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1780210407885_1976_000001 State change from FINAL_SAVING to FINISHING on event = ATTEMPT_UPDATE_SAVED

avatar
Explorer

Anyway, everything seems related to logs collection for JobHistory

avatar
Community Manager

@ganzuoni Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our YARN experts @vafs @vchalla who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Senior Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

avatar
Cloudera Employee

@ganzuoni Thank you for sharing the log excerpts and details regarding the distcp job.

The likely root cause is not Yarn container cleanup itself, but the DistCp/MapReduce job finalization phase taking longer due to the much higher object and directory count. The larger job had ~120x more directories than the smaller job, so post-copy operations such as directory metadata handling, output commit/cleanup, target validation, permission/timestamp preservation, and JobHistory finalization can take significantly longer even after the map phase shows 100%.
The NodeManager/RM logs also support this, since the AM container remained alive until the application moved to FINAL_SAVING/FINISHING. The repeated RM proxy calls every 10 seconds only indicate that the AM web endpoint was still being polled while the job was finalizing; they do not appear to be the cause of the delay.

1. Review the ApplicationMaster logs for the delayed application.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "commit|committer|cleanup|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|_temporary|_SUCCESS|unregister|finish|final"

2. Check specifically whether the AM container was spending time in job commit or cleanup before unregistering from YARN.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "commitJob|cleanupJob|job commit|job cleanup|unregister|FinalApplicationStatus|succeeded|FINISHING|FINAL_SAVING"

3. Identify the ApplicationMaster container and review only the AM logs if the full log output is too large.
yarn application -status application_1780210407885_1976

>Then use the AM container ID from the output/logs and run:
yarn logs -applicationId application_1780210407885_1976 \
-containerId container_e22_1780210407885_1976_01_000001 \
| egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"

4. Confirm the exact DistCp command/options used for the slower job.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "distcp|-p|-delete|-atomic|-update|-overwrite|-direct|preserve|options"

>>Pay particular attention to whether any of these options were used:
-p,-delete,-atomic,-update,-overwrite,-direct

5. Check whether metadata preservation may be increasing finalization time.
If -p was used, confirm which attributes were preserved, for example permissions, ownership, group, timestamps, ACLs, or XAttrs. These can add many filesystem metadata operations when the job has many directories/files.
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "preserve|permission|owner|group|timestamp|acl|xattr|chown|chmod|setTimes"

6. Check the ResourceManager logs around the delayed finalization window.
grep -E "application_1780210407885_1976|appattempt_1780210407885_1976|UNREGISTERED|FINAL_SAVING|FINISHING|AM Released Container" \
/var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log

7. Check the NodeManager logs on the AM host around the same time window.
From the RM log, the AM web endpoint appears to be on:
almapwrk15.data.com:34620
On that NodeManager host, run:
grep -E "application_1780210407885_1976|container_e22_1780210407885_1976_01_000001|succeeded|Removed completed containers|ContainerLaunch" \
/var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log

8. Compare with the faster application to confirm whether the delay scales with object/directory count.
yarn logs -applicationId application_1780210407885_1958 \
| egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"

9. If the target is object storage, confirm whether commit/rename/delete operations are slow.
For object-store targets, rename/delete/list operations can be more expensive than on HDFS. Check for object-store related messages:
yarn logs -applicationId application_1780210407885_1976 \
| egrep -i "s3a|abfs|wasb|ozone|ofs|object|rename|delete|listStatus|copy|multipart|commit"

10. Suggested conclusion after validation:
The current evidence points to the MapReduce ApplicationMaster spending additional time in the post-map finalization phase, most likely due to the high number of files/directories and associated metadata/commit/cleanup operations. The RM proxy polling appears to be observational only and does not appear to be the cause. The next confirmation should come from the AM container logs around the gap between map 100% and the AM unregistering from YARN.

avatar
Explorer

Thank you for the great insights

These are the disctp options

distcp -skipcrccheck -copybuffersize 32768 -update -pugp  -bandwidth 50 -strategy dynamic

and yes, total execution time increases almost linearly with nr of files/directories