
distcp from S3 to S3 leaves an incomplete file behind

Hi,

 I am running a daily job for moving data from one S3 location to another S3 location (in the same AWS account).

The command is:

hadoop distcp s3a://BUCKET/system/2019/03/USAGE/* s3a://BUCKET/system/usage_for_spark_production/2019/03/

Until today it had always run 100% OK.

But today one incomplete file was left behind in the target S3 folder:

s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000003.csv.gz.____distcpSplit____0.74517393:0+33554432
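For now I plan to clean such fragments up by hand. A minimal sketch (the ____distcpSplit____ marker is taken from the leftover name above; I would double-check the listing before deleting anything):

# list any leftover distcp split fragments under the target prefix
hadoop fs -ls 's3a://BUCKET/system/usage_for_spark_production/2019/03/' | grep '____distcpSplit____'

# remove them once verified
hadoop fs -rm 's3a://BUCKET/system/usage_for_spark_production/2019/03/*____distcpSplit____*'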

I checked the mapper's log, looking for some kind of interruption or error during the copy, but LOAD00000003.csv.gz appears to have been copied OK:

2019-04-03 08:39:29,880 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: fs.s3a.server-side-encryption-key is deprecated. Instead, use fs.s3a.server-side-encryption.key
2019-04-03 08:39:29,927 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000028.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000028.csv.gz
2019-04-03 08:39:30,053 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:34,959 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD000000A6.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD000000A6.csv.gz
2019-04-03 08:39:34,988 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:36,171 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000090.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000090.csv.gz
2019-04-03 08:39:36,199 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:40,808 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000012.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000012.csv.gz
2019-04-03 08:39:40,875 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:45,224 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000085.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000085.csv.gz
2019-04-03 08:39:45,251 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:49,505 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000003.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000003.csv.gz
2019-04-03 08:39:49,541 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:52,815 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000092.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000092.csv.gz
2019-04-03 08:39:52,843 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:39:57,574 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD0000002F.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD0000002F.csv.gz
2019-04-03 08:39:57,599 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:40:02,429 INFO [main] org.apache.hadoop.tools.mapred.CopyMapper: Copying s3a://BUCKET/system/2019/03/DETAIL_USAGE/LOAD00000009.csv.gz to s3a://BUCKET/system/usage_for_spark_production/2019/03/LOAD00000009.csv.gz
2019-04-03 08:40:02,462 INFO [main] org.apache.hadoop.tools.mapred.RetriableFileCopyCommand: Creating temp file: s3a://BUCKET/system/usage_for_spark_production/2019/03/.distcp.tmp.attempt_1554271289453_0104_m_000000_0
2019-04-03 08:40:07,037 INFO [main] org.apache.hadoop.mapred.Task: Task:attempt_1554271289453_0104_m_000000_0 is done. And is in the process of committing
2019-04-03 08:40:07,045 INFO [main] org.apache.hadoop.mapred.Task: Task attempt_1554271289453_0104_m_000000_0 is allowed to commit now
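
According to the log the task committed cleanly, so to confirm that every source file actually made it across I can compare the two listings. A rough sketch (assuming the size is column 5 of the hadoop fs -ls output, which may differ between versions):

# print "basename size" for both sides, then diff; any output means a mismatch
hadoop fs -ls 's3a://BUCKET/system/2019/03/USAGE/' | awk 'NF>7 {n=split($NF,p,"/"); print p[n], $5}' | sort > /tmp/src.txt
hadoop fs -ls 's3a://BUCKET/system/usage_for_spark_production/2019/03/' | awk 'NF>7 {n=split($NF,p,"/"); print p[n], $5}' | sort > /tmp/dst.txt
diff /tmp/src.txt /tmp/dst.txt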

I also checked the ResourceManager and NameNode logs, looking for any ERROR or KILL events, but nothing unusual was there.
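One more check I can run is pulling the aggregated logs for the whole application (id inferred from the attempt id above; this assumes log aggregation is enabled) and grepping for retries:

# fetch the aggregated task logs for the distcp job and look for anything suspicious
yarn logs -applicationId application_1554271289453_0104 | grep -iE 'retry|error|kill|fail'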

Can somebody help me explain this?

Thanks for any tips and advice,

T


Re: distcp from S3 to S3 leaves an incomplete file behind

Could this somehow be related to the fact that S3 is eventually consistent?
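
In the meantime, to make the daily job safer to re-run, I am considering adding -update, which should skip files whose size already matches at the target (checksums cannot be compared between two S3A locations, so only sizes are checked). A rough sketch, not yet tested on our cluster:

hadoop distcp -update s3a://BUCKET/system/2019/03/USAGE/ s3a://BUCKET/system/usage_for_spark_production/2019/03/

Note that with -update and a directory source, the contents of the source directory are copied into the target, so the resulting layout should match what the glob produced before.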