Member since: 12-26-2019
Posts: 17
Kudos Received: 0
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 396 | 06-05-2026 03:13 PM |
06-05-2026 03:13 PM
@ganzuoni Thank you for sharing the log excerpts and details regarding the distcp job.

The likely root cause is not YARN container cleanup itself, but the DistCp/MapReduce job finalization phase taking longer due to the much higher object and directory count. The larger job had ~120x more directories than the smaller job, so post-copy operations such as directory metadata handling, output commit/cleanup, target validation, permission/timestamp preservation, and JobHistory finalization can take significantly longer even after the map phase shows 100%.

The NodeManager/RM logs also support this, since the AM container remained alive until the application moved to FINAL_SAVING/FINISHING. The repeated RM proxy calls every 10 seconds only indicate that the AM web endpoint was still being polled while the job was finalizing; they do not appear to be the cause of the delay.

1. Review the ApplicationMaster logs for the delayed application.

    yarn logs -applicationId application_1780210407885_1976 \
      | egrep -i "commit|committer|cleanup|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|_temporary|_SUCCESS|unregister|finish|final"

2. Check specifically whether the AM container was spending time in job commit or cleanup before unregistering from YARN.

    yarn logs -applicationId application_1780210407885_1976 \
      | egrep -i "commitJob|cleanupJob|job commit|job cleanup|unregister|FinalApplicationStatus|succeeded|FINISHING|FINAL_SAVING"

3. Identify the ApplicationMaster container and review only the AM logs if the full log output is too large.

    yarn application -status application_1780210407885_1976

   Then use the AM container ID from the output/logs and run:

    yarn logs -applicationId application_1780210407885_1976 \
      -containerId container_e22_1780210407885_1976_01_000001 \
      | egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"

4. Confirm the exact DistCp command/options used for the slower job.

    yarn logs -applicationId application_1780210407885_1976 \
      | egrep -i "distcp|-p|-delete|-atomic|-update|-overwrite|-direct|preserve|options"

   Pay particular attention to whether any of these options were used: -p, -delete, -atomic, -update, -overwrite, -direct

5. Check whether metadata preservation may be increasing finalization time. If -p was used, confirm which attributes were preserved, for example permissions, ownership, group, timestamps, ACLs, or XAttrs. These can add many filesystem metadata operations when the job has many directories/files.

    yarn logs -applicationId application_1780210407885_1976 \
      | egrep -i "preserve|permission|owner|group|timestamp|acl|xattr|chown|chmod|setTimes"

6. Check the ResourceManager logs around the delayed finalization window.

    grep -E "application_1780210407885_1976|appattempt_1780210407885_1976|UNREGISTERED|FINAL_SAVING|FINISHING|AM Released Container" \
      /var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log

7. Check the NodeManager logs on the AM host around the same time window. From the RM log, the AM web endpoint appears to be on: almapwrk15.data.com:34620

   On that NodeManager host, run:

    grep -E "application_1780210407885_1976|container_e22_1780210407885_1976_01_000001|succeeded|Removed completed containers|ContainerLaunch" \
      /var/log/hadoop-yarn/yarn-yarn-nodemanager-*.log

8. Compare with the faster application to confirm whether the delay scales with object/directory count.

    yarn logs -applicationId application_1780210407885_1958 \
      | egrep -i "commit|cleanup|committer|CopyCommitter|OutputCommitter|JobHistory|history|rename|delete|unregister|finish|final"

9. If the target is object storage, confirm whether commit/rename/delete operations are slow. For object-store targets, rename/delete/list operations can be more expensive than on HDFS. Check for object-store related messages:

    yarn logs -applicationId application_1780210407885_1976 \
      | egrep -i "s3a|abfs|wasb|ozone|ofs|object|rename|delete|listStatus|copy|multipart|commit"

10. Suggested conclusion after validation: The current evidence points to the MapReduce ApplicationMaster spending additional time in the post-map finalization phase, most likely due to the high number of files/directories and associated metadata/commit/cleanup operations. The RM proxy polling appears to be observational only and does not appear to be the cause. The next confirmation should come from the AM container logs around the gap between map 100% and the AM unregistering from YARN.
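The checks above re-run `yarn logs` for every pattern, which can be slow for a large application. One option is to save the aggregated logs once and query the local copy. The following is a minimal sketch under stated assumptions: the helper names (`count_keywords`, `first_last_match`) and the file names in the usage comments are hypothetical, and `first_last_match` assumes each log line starts with a date field and a time field, which may not hold for every log layout.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are illustrative and not part of any Hadoop tool.

# Print "<keyword><TAB><count>" for each keyword, where count is the number
# of lines in the saved log that mention it (case-insensitive).
count_keywords() {
  local log_file="$1"; shift
  local kw
  for kw in "$@"; do
    printf '%s\t%s\n' "$kw" "$(grep -ci -e "$kw" "$log_file")"
  done
}

# Print the first two fields (ASSUMED to be date and time) of the first and
# last lines matching an extended regex, to bracket the finalization window.
first_last_match() {
  local pattern="$1" log_file="$2"
  grep -Ei -e "$pattern" "$log_file" \
    | awk 'NR == 1 { first = $1 " " $2 }
           { last = $1 " " $2 }
           END { if (NR) { print "first: " first; print "last: " last } }'
}

# Usage (file names are hypothetical):
#   yarn logs -applicationId application_1780210407885_1976 > slow_app.log
#   yarn logs -applicationId application_1780210407885_1958 > fast_app.log
#   count_keywords slow_app.log commitJob cleanupJob rename delete
#   count_keywords fast_app.log commitJob cleanupJob rename delete
#   first_last_match "commit|cleanup|unregister" slow_app.log
```

Comparing the per-keyword counts for the slow and fast applications gives a rough indication of whether commit/cleanup activity scales with the directory count. The first and last timestamps show how long that activity lasted.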
08-21-2024 09:03 AM
Based on the error shared, I see that CM is 7.11.3 and the CDP Runtime is 7.1.7. There is a known bug, discussed in the following documentation:

Link: https://docs.cloudera.com/cdp-private-cloud-base/7.1.9/manager-release-notes/topics/cm-known-issues-7113.html#:~:text=OPSAPS%2D69357%3A%20Python%20incompatibility%20issues%20when%20Cloudera%20Manager%20(Python%203.x%20compatible)%20manages%20a%20cluster%20with%20Cloudera%20Runtime%207.1.7%20(Python%202%20compatible)

OPSAPS-69357: Python incompatibility issues when Cloudera Manager (Python 3.x compatible) manages a cluster with Cloudera Runtime 7.1.7 (Python 2 compatible)

If Cloudera Manager is compatible with Python 3, then the scripts packaged with it are also ported to Python 3 syntax. So, using Cloudera Manager (7.11.3 or any other Cloudera Manager version ported to Python 3.x) to manage a cluster with Cloudera Runtime 7.1.7 (Python 2 compatible) causes Python incompatibility issues, because the process assumes a Python 2 environment but the scripts packaged with this Cloudera Manager are ported to Python 3 syntax.
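As a quick sanity check on an affected host, it can help to see which Python interpreters are on the PATH and which major version each reports. This is a minimal sketch: the helper name `python_major_from_banner` is hypothetical, and the output only shows what is installed on the PATH. It does not show which interpreter a given Cloudera Manager or Runtime process actually invokes.

```shell
#!/usr/bin/env bash
# Sketch only: reports interpreter versions found on PATH, nothing CM-specific.

# Extract the major version from a "Python X.Y.Z" banner string.
python_major_from_banner() {
  printf '%s\n' "$1" | sed -n 's/^Python \([0-9][0-9]*\)\..*/\1/p'
}

for interp in python python2 python3; do
  if command -v "$interp" >/dev/null 2>&1; then
    # Python 2 prints its version banner to stderr, so merge both streams.
    banner="$("$interp" --version 2>&1)"
    echo "$interp -> major $(python_major_from_banner "$banner") ($banner)"
  else
    echo "$interp -> not found on PATH"
  fi
done
```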