Created 09-25-2025 12:57 AM
Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for user/myname/.cm/distcp-staging/2025-09-21-05-14-47-c4f9/intermediate.1
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:447)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:152)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:133)
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:3566)
    at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:3360)
    at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:3336)
    at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2899)
    at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2938)
    at com.cloudera.enterprise.distcp.util.DistCpUtils.sortListing(DistCpUtils.java:427)
    at com.cloudera.enterprise.distcp.mapred.StatusReducer.lambda$deleteMissing$1(StatusReducer.java:152)
    at com.cloudera.enterprise.distcp.mapred.StatusReducerProgress.track(StatusReducerProgress.java:211)
    at com.cloudera.enterprise.distcp.mapred.StatusReducerProgress.trackSortSourceListing(StatusReducerProgress.java:223)
    at com.cloudera.enterprise.distcp.mapred.StatusReducer.deleteMissing(StatusReducer.java:151)
    at com.cloudera.enterprise.distcp.mapred.StatusReducer.cleanup(StatusReducer.java:89)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:179)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:628)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
This is the error I am getting. At first I suspected that I didn't have enough space in my local dirs, but all of them have enough space. It shouldn't be a permission issue either, because it only fails sometimes; most of the time the replication is successful, yet this issue persists at least once a week. Can someone help with where I should look next?
Created 09-25-2025 01:25 PM
Hi @cravani @james_jones @ggangadharan, do you have any insights here? Thanks!
Regards,
Diana Torres
Created 09-26-2025 07:31 AM
Hi @ishashrestha ,
Since the issue only happens intermittently, it is most likely that one of the worker nodes has a local disk issue.
Because the replication runs as a MapReduce job, and YARN creates its containers with local scratch dirs, one of the nodes probably has this problem.
Please check yarn.nodemanager.local-dirs and mapreduce.cluster.local.dir to find the location of the scratch dirs, then confirm that each worker node has enough disk space and the correct permissions on those directories.
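For example, a quick manual check on each worker node could look like the sketch below (illustrative only: the config file location and the /yarn/nm directory are assumptions for a typical CM-managed cluster, so substitute the paths from your own configuration):

# Find the configured scratch directories (config location may differ on your cluster)
grep -A1 "yarn.nodemanager.local-dirs" /etc/hadoop/conf/yarn-site.xml
grep -A1 "mapreduce.cluster.local.dir" /etc/hadoop/conf/mapred-site.xml

# Check free space on the filesystems backing those directories
df -h /yarn/nm          # replace /yarn/nm with each directory found above

# Check ownership and permissions (containers normally run under the yarn user)
ls -ld /yarn/nm /yarn/nm/usercache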
Let me know if this is the case.
Best Regards
Created on 09-26-2025 07:38 AM - edited 09-26-2025 07:44 AM
@Shmoo I initially thought the issue might be related to space in the local directory, but after checking, I noticed it sometimes fails and sometimes passes on the same node, even when the file sizes are similar and there is enough space in the directory. Is there anything else I might be missing that I should check?
Created 09-26-2025 07:45 AM
Hi @ishashrestha ,
Well, another thing is that the path user/myname/.cm/distcp-staging/... suggests this DistCp job was initiated or managed by Cloudera Manager (CM).
This process uses a separate staging directory, but it still relies on the NodeManager's local directories for intermediate sorting, which is where the error is occurring (SequenceFile$Sorter.sort).
Confirm that the user myname is correctly mapped and has the necessary permissions across the cluster. While you dismissed permissions, an intermittent Kerberos ticket issue or a transient user mapping problem on one specific node could cause this weekly failure.
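As a quick sanity check (illustrative only; "myname" stands in for the actual job user, and Kerberos details depend on your setup), you could verify the account and its group resolution on each worker node:

# Confirm the OS account exists and has consistent groups on this node
id myname

# Confirm how Hadoop itself resolves the user's groups
hdfs groups myname

# If the cluster is kerberized, confirm there is a valid, non-expired ticket for the job user
klist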
The next steps should focus on reviewing the NodeManager health and logs for the specific nodes that failed, checking the status of the local scratch directories on those nodes, and correlating the failure time with any scheduled system maintenance.
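A sketch of what that review could look like (the log path is an assumption; on CM-managed nodes the NodeManager log usually lives under /var/log/hadoop-yarn):

# List all nodes with their health reports; an unhealthy node often reports "local-dirs are bad"
yarn node -list -all

# Drill into one suspect node, using the Node-Id from the listing above
yarn node -status <node-id>

# On the suspect worker, search the NodeManager log for disk checker complaints
grep -iE "DiskChecker|local-dirs" /var/log/hadoop-yarn/*NODEMANAGER*.log*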
Best Regards
Created 09-26-2025 07:56 AM
@Shmoo Thank you for the details. I’ll review the points you mentioned and check accordingly.