
Distcp failing from HDFS to S3 with length mismatch


Hello everyone,

When running distcp from hdfs:// to s3a://, after a while I get an error like:

Caused by: Mismatch in length of source:hdfs://clustername/hbase/WALs/,16020,1491913605286/ and target:s3a://bucket-backup/hbase/.distcp.tmp.attempt_local1903592397_0001_m_000000_0

It then quickly fails with:

17/04/11 15:50:54 INFO mapreduce.Job: Job job_local1903592397_0001 failed with state FAILED due to: NA
17/04/11 15:50:54 INFO mapreduce.Job: Counters: 28
        File System Counters
                FILE: Number of bytes read=723868
                FILE: Number of bytes written=764685
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2169097700
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=469
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
                S3A: Number of bytes read=0
                S3A: Number of bytes written=2169097700
                S3A: Number of read operations=471
                S3A: Number of large read operations=0
                S3A: Number of write operations=97
        Map-Reduce Framework
                Map input records=40
                Map output records=0
                Input split bytes=156
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=1376
                Total committed heap usage (bytes)=521142272
        File Input Format Counters
                Bytes Read=13228
        File Output Format Counters
                Bytes Written=8

Any ideas? We have HBase running on top of this HDFS setup, and it is actively writing. Is that a problem for distcp?
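For anyone puzzling over the error itself, here is a simplified sketch (not DistCp's actual code) of the post-copy check that trips here: DistCp records each source file's length when it plans the job, then compares that to the bytes it wrote to the target. A file still being appended to, such as an open HBase WAL, can grow between those two reads and fail the check.

```python
# Simplified sketch of DistCp's length verification (illustrative only,
# not the real implementation).

def verify_copy(source_len_at_plan_time: int, bytes_copied: int) -> None:
    """Raise, as DistCp does, when the planned source length and the
    bytes actually copied to the target disagree."""
    if source_len_at_plan_time != bytes_copied:
        raise IOError(
            f"Mismatch in length of source ({source_len_at_plan_time}) "
            f"and target ({bytes_copied})"
        )

# A WAL that grew from 1024 to 2048 bytes mid-copy fails the check:
# verify_copy(1024, 2048)  raises IOError
```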



From that link I see that having open files could be an issue. Does this mean I can't back up with distcp, since HBase is running on top and can never be stopped? I can't run a CopyTable to the local filesystem because the data is just too large for that. Are there any other sensible alternatives for backing up to S3?
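One approach that avoids copying open files is to distcp from an HDFS snapshot (an immutable view) and exclude the WAL directory with a filters file. This is only a sketch; the snapshot name, filter patterns, and bucket path are assumptions to adapt, and the -filters flag needs a Hadoop version that supports it (2.8+):

```shell
# 1. Allow and take a snapshot so distcp reads an immutable view
#    (run these against your cluster; shown here as comments):
#      hdfs dfsadmin -allowSnapshot /hbase
#      hdfs dfs -createSnapshot /hbase pre-backup

# 2. Build a filters file excluding in-flight WALs and temp files
#    (patterns are examples, adjust to your layout):
cat > /tmp/distcp-filters.txt <<'EOF'
.*/WALs/.*
.*\.tmp
EOF

# 3. Copy the snapshot, not the live tree:
#      hadoop distcp -filters /tmp/distcp-filters.txt \
#          hdfs://clustername/hbase/.snapshot/pre-backup \
#          s3a://bucket-backup/hbase
```

Because the snapshot's file lengths are frozen at creation time, the source can no longer grow mid-copy, which is what triggers the length mismatch on live WALs.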

Maybe @stevel can help here.


Hi @Vasco Pinho, did you get this distcp working in the end? I am also hitting errors with distcp.