I have a job that copies data from S3 to S3 using distcp. From time to time it leaves an unfinished file such as LOAD00000092.csv.gz.____distcpSplit____0.83821251 in the S3 bucket. The job runs fine on YARN: no error is logged during the copy, and no error is logged during the container execution.
Is there any way to configure distcp to avoid using splits? And why is this happening?
Any tips or advice on how to overcome this are welcome.
hadoop distcp s3a://<BUCKET>/2019/01/DETAIL_USAGE/* s3a://<BUCKET>/usage_for_spark_production/2019/01/
I do not pass any special parameters to the tool, just the source directory with an asterisk and the destination directory.
Usually it copies without a problem, but twice it has left a file with the weird suffix.
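A minimal sketch for spotting such leftovers after a run: filter a recursive listing of the destination for the ____distcpSplit____ marker seen in the file name above. The helper name find_split_leftovers is hypothetical, and the usage line assumes the hadoop CLI is already configured for s3a access to the bucket.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a file listing on stdin and prints only the
# lines that contain the ____distcpSplit____ marker from the leftover file.
find_split_leftovers() {
  # -F treats the marker as a fixed string, not a regex.
  # "|| true" keeps the exit status 0 when nothing matches.
  grep -F '____distcpSplit____' || true
}

# Usage (assumption: run after the distcp job has finished):
# hadoop fs -ls -R 's3a://<BUCKET>/usage_for_spark_production/2019/01/' | find_split_leftovers
```

An empty result means no file with the suffix was found in the listing; any line printed points to a copy that should be rechecked.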