
distcp from S3 to S3 leaves an incomplete file behind

Master Collaborator

Hi,

I am running a daily job that moves data from one S3 location to another (in the same AWS account).

The command is 

 

hadoop distcp 's3a://BUCKET/system/2019/03/USAGE/*' s3a://BUCKET/system/usage_for_spark_production/2019/03/

Until now it has always completed successfully.

But today one incomplete file was left behind in the target S3 folder:

 

s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000003.csv.gz.____distcpSplit____0.74517393:0+33554432
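For reference, the ____distcpSplit____ marker in that name appears to be the suffix DistCp uses for intermediate chunk files when a file is copied in parallel pieces (e.g. with the -blocksperchunk option), and the trailing 0+33554432 looks like an offset/length pair (33554432 bytes = 32 MB). As a hedged sketch, leftovers like this can be spotted by filtering a listing for that marker; the bucket and path below are the placeholders from the command above, and the hadoop fs lines are illustrative only:

```shell
# Example leftover chunk name, taken from the target folder above:
name='LOAD00000003.csv.gz.____distcpSplit____0.74517393:0+33554432'

# DistCp chunk leftovers are recognizable by the ____distcpSplit____ marker:
if echo "$name" | grep -q '____distcpSplit____'; then
  echo "leftover distcp chunk: $name"
fi

# In practice, the target listing can be filtered the same way, e.g.:
#   hadoop fs -ls 's3a://BUCKET/system/usage_for_spark_production/2019/03/' \
#     | grep '____distcpSplit____'
# and the fragments removed once the corresponding complete file is verified:
#   hadoop fs -rm 's3a://BUCKET/system/usage_for_spark_production/2019/03/*____distcpSplit____*'
```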

I checked the mapper's log, looking for any interruption or error during the copy, but LOAD00000003.csv.gz appears to have been copied fine:

2019-04-03 08:39:29,880 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: fs.s3a.server-side-encryption-key is deprecated. Instead, use fs.s3a.server-side-encryption.key
2019-04-03 08:39:29,927 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000028.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000028.csv.gz
2019-04-03 08:39:30,053 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:34,959 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD000000A6.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD000000A6.csv.gz
2019-04-03 08:39:34,988 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:36,171 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000090.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000090.csv.gz
2019-04-03 08:39:36,199 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:40,808 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000012.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000012.csv.gz
2019-04-03 08:39:40,875 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:45,224 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000085.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000085.csv.gz
2019-04-03 08:39:45,251 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:49,505 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000003.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000003.csv.gz
2019-04-03 08:39:49,541 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:52,815 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000092.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000092.csv.gz
2019-04-03 08:39:52,843 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:57,574 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD0000002F.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD0000002F.csv.gz
2019-04-03 08:39:57,599 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:40:02,429 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000009.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000009.csv.gz
2019-04-03 08:40:02,462 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:40:07,037 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1554271289453_0104_m_000000_0 is done. And is in the process of committing
2019-04-03 08:40:07,045 INFO [main] org.apache.hadoop.mapred.Task: Task attempt_1554271289453_0104_m_000000_0 is allowed to commit now

I also checked the ResourceManager and NameNode logs for any ERROR or KILL entries, but found nothing unusual.

Can somebody help me explain this?

Thanks for any tips and advice,

T

1 REPLY

Re: distcp from S3 to S3 leaves an incomplete file behind

Master Collaborator
Could this be related to the fact that S3 is eventually consistent?