Master
Posts: 430
Registered: ‎07-01-2015

distcp leaves unfinished file in S3

Hi,

I have a job that copies data from S3 to S3 using distcp. From time to time it leaves an unfinished file, LOAD00000092.csv.gz.____distcpSplit____0.83821251, in the S3 bucket. The job runs fine on YARN, and no error is logged during the copy or during the container execution.

[Attachment: image.png]

Is there any way to configure distcp to avoid using splits? And why is this happening?

Any tips or advice on how to overcome this are welcome.

Thanks

T

Posts: 1,886
Kudos: 425
Solutions: 300
Registered: ‎07-31-2013

Re: distcp leaves unfinished file in S3

What are you passing in your command-line arguments to DistCp?

The split feature is a new one that is activated only if you pass a
positive integer via the -blocksperchunk flag.
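
For reference, here is a minimal sketch of how the flag changes the command line. The `distcp_cmd` helper is hypothetical and only prints the command so the two forms can be compared; it assumes a DistCp build that includes `-blocksperchunk`.

```shell
# Hypothetical helper: print (do not run) a DistCp command line.
# Chunked copying is requested only when a positive block count
# is supplied as the third argument.
distcp_cmd() {
  src=$1
  dst=$2
  blocks=${3:-0}
  if [ "$blocks" -gt 0 ]; then
    echo "hadoop distcp -blocksperchunk $blocks $src $dst"
  else
    echo "hadoop distcp $src $dst"
  fi
}
```

Calling `distcp_cmd SRC DST` prints the plain form, while `distcp_cmd SRC DST 4` prints the form that asks DistCp to split files into 4-block chunks.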
Master
Posts: 430
Registered: ‎07-01-2015

Re: distcp leaves unfinished file in S3

No options are passed. The command is simply: hadoop distcp s3a:location1 s3a:location2
Master
Posts: 430
Registered: ‎07-01-2015

Re: distcp leaves unfinished file in S3

@Harsh J 

 

hadoop distcp s3a://<BUCKET>/2019/01/DETAIL_USAGE/* s3a://<BUCKET>/usage_for_spark_production/2019/01/

I do not pass any special params to the tool, just the source directory with an asterisk and the destination directory.

Usually it copies without a problem, but it has happened twice that it left a file with the weird suffix.
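
In case it helps anyone checking for the same leftovers, here is a rough sketch for spotting them after a run. The `find_distcp_chunks` helper is hypothetical; it filters a recursive listing for the chunk marker seen in the file name above and assumes paths contain no spaces.

```shell
# Hypothetical helper: read `hadoop fs -ls -R` output on stdin and print
# the path (last field) of every entry whose name contains the chunk
# marker "____distcpSplit____".
find_distcp_chunks() {
  awk '/____distcpSplit____/ { print $NF }'
}

# Usage (not executed here):
#   hadoop fs -ls -R 's3a://<BUCKET>/usage_for_spark_production/2019/01/' | find_distcp_chunks
```

An empty result means no file with that suffix was found under the listed path.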
