Created on 10-13-2015 05:41 AM - edited 09-16-2022 02:43 AM
Using CDH 5.4.7-1.cdh5.4.7.p0.3, when I run multiple MapReduce jobs one after another, one of the jobs eventually fails with this stack trace:
2015-10-13 14:22:28,187 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: { hdfs://hdfs-nameservice/user/hdfs/.staging/job_1444734646472_0003/libjars/htrace-core-3.1.0-incubating.jar, 1444738926587, FILE, null } failed: Rename cannot overwrite non empty destination directory /yarn/nm/usercache/hdfs/filecache/945
java.io.IOException: Rename cannot overwrite non empty destination directory /yarn/nm/usercache/hdfs/filecache/945
at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:909)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-10-13 14:22:28,188 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://hdfs-nameservice/user/hdfs/.staging/job_1444734646472_0003/libjars/htrace-core-3.1.0-incubating.jar(->/yarn/nm/usercache/hdfs/filecache/945/htrace-core-3.1.0-incubating.jar) transitioned from DOWNLOADING to FAILED
2015-10-13 14:22:28,188 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_e10_1444734646472_0003_01_000001 transitioned from LOCALIZING to LOCALIZATION_FAILED
2015-10-13 14:22:28,188 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl: Container container_e10_1444734646472_0003_01_000001 sent RELEASE event on a resource request { hdfs://hdfs-nameservice/user/hdfs/.staging/job_1444734646472_0003/libjars/htrace-core-3.1.0-incubating.jar, 1444738926587, FILE, null } not present in cache.
2015-10-13 14:22:28,188 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Unknown localizer with localizerId container_e10_1444734646472_0003_01_000001 is sending heartbeat. Ordering it to DIE
The number in the path varies, but restarting the failed job does not get rid of the error.
I have tried setting yarn.nodemanager.localizer.cache.target-size-mb to 0, restarting YARN, and waiting until after the cleanup, but it doesn't help.
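For reference, that setting corresponds to a yarn-site.xml fragment along these lines. This is only a sketch: it assumes the property is set directly in yarn-site.xml on the NodeManager hosts (on a Cloudera Manager deployment it would more likely be applied through the NodeManager configuration in the UI), and the cleanup interval property is shown only because the cleanup mentioned above is governed by it; its value here is the stock default as far as I know, not something taken from this cluster.

```xml
<!-- Sketch of the attempted workaround; where this is set is an assumption. -->
<property>
  <!-- Target size of the NodeManager localizer cache, in MB. -->
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>0</value>
</property>
<property>
  <!-- How often the localizer cache cleanup runs, in milliseconds. -->
  <!-- 600000 (10 minutes) is believed to be the default, not a tuned value. -->
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>600000</value>
</property>
```

The NodeManagers have to be restarted for the change to take effect, which is why the restart and the wait for cleanup are part of the attempt described above.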
The file /yarn/nm/usercache/hdfs/filecache/812 does not seem to exist before/after running the job.
Has anybody experienced this, or have an explanation as to why it happens?
Created 10-19-2015 01:24 AM
The error was first encountered in an older version of CDH, and it disappeared once we also updated the client to the same version as the cluster.
Created 10-19-2015 05:36 AM
Congratulations on solving your issue. Feel free to mark your previous response as the solution to the issue in case it can help others in the future.