distcp leaves unfinished file in S3

Master Collaborator


I have a job that copies data from S3 to S3 using distcp. From time to time it leaves an unfinished file, LOAD00000092.csv.gz.____distcpSplit____0.83821251, in the S3 bucket. The job runs fine on YARN, and no error is logged during the copy or during container execution.



Is there any way to configure distcp to avoid using splits? And why is this happening?

Any tips or advice on how to overcome this are welcome.




Master Guru
What are you passing in your command-line arguments to DistCp?

The split feature is a new one that is activated only if you pass a
positive integer via the -blocksperchunk flag.

Master Collaborator
No options are passed. The command is just like: hadoop distcp s3a:location1 s3a:location2

Master Collaborator

@Harsh J 


hadoop distcp s3a://<BUCKET>/2019/01/DETAIL_USAGE/* s3a://<BUCKET>/usage_for_spark_production/2019/01/

I do not pass any special parameters to the tool, just the source directory with an asterisk and the destination directory.

Usually it copies without a problem, but twice it has left behind a file with the weird suffix.
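
In case it helps anyone hitting the same thing, here is a minimal sketch for spotting such leftovers in a list of object keys after a copy, so they can be reviewed before downstream jobs read the directory. It is not part of distcp. The naming pattern is only inferred from the one file name reported above, and reading the two trailing numbers as a chunk offset and a chunk length in bytes is an assumption. The helper names (parse_split_leftover, find_split_leftovers) are hypothetical, and the key list would come from whatever listing tool you already use.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed pattern, inferred from the single file name reported in this thread:
#   <target name>.____distcpSplit____<number>.<number>
SPLIT_RE = re.compile(
    r"^(?P<target>.+)\.____distcpSplit____(?P<first>\d+)\.(?P<second>\d+)$"
)


class SplitLeftover(NamedTuple):
    key: str      # full key as listed
    target: str   # name the finished file would presumably have had
    first: int    # assumption: chunk offset
    second: int   # assumption: chunk length in bytes


def parse_split_leftover(key: str) -> Optional[SplitLeftover]:
    """Return the parsed parts if the key looks like a split leftover, else None."""
    match = SPLIT_RE.match(key)
    if match is None:
        return None
    return SplitLeftover(
        key=key,
        target=match.group("target"),
        first=int(match.group("first")),
        second=int(match.group("second")),
    )


def find_split_leftovers(keys: Iterable[str]) -> List[SplitLeftover]:
    """Keep only the keys that match the assumed leftover pattern."""
    return [hit for hit in map(parse_split_leftover, keys) if hit is not None]


if __name__ == "__main__":
    # Illustrative listing: the second name is the one from this thread,
    # the first is a made-up neighbour for contrast.
    listing = [
        "LOAD00000091.csv.gz",
        "LOAD00000092.csv.gz.____distcpSplit____0.83821251",
    ]
    for hit in find_split_leftovers(listing):
        print(f"leftover: {hit.key} (target {hit.target})")
```

The function only reports matches and does not delete anything, since it is not yet clear from this thread whether the leftover is the only copy of that file's data.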
