Expert Contributor
Posts: 263
Registered: ‎01-25-2017

DistCp unbalanced copy between the mappers

Hi,

 

When I run DistCp to copy data between clusters, almost all the mappers finish within minutes to an hour, but the last one takes more than 40 hours.

The listing includes files that were already copied to the other cluster as well as new ones that still need to be copied.

File sizes vary: some files are in the GB range, while others range from KB to MB.

 

Any suggestions?

Posts: 1,565
Kudos: 287
Solutions: 239
Registered: ‎07-31-2013

Re: DistCp unbalanced copy between the mappers

Try to use the '-strategy dynamic' option, which converts the copies to pull from a work list instead of being pre-planned. This should yield more uniform work distribution.

This is covered in the documentation: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/D... and http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/D...
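
As a rough illustration of the suggestion above, here is a minimal Python sketch that assembles a DistCp command line using '-strategy dynamic'. The helper name build_distcp_command, the NameNode hosts, the paths, and the map count are all hypothetical placeholders. The '-m' and '-update' flags are optional extras that were not part of the original advice; '-update' is shown only because the listing reportedly contains files that already exist on the target, and whether it suits a given job is an assumption.

```python
import shlex


def build_distcp_command(source, target, max_maps=None, update=False):
    """Return the argv list for a DistCp run that uses the dynamic strategy.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    # '-strategy dynamic' lets mappers pull chunks of the file list as they
    # finish, instead of receiving a fixed share planned up front.
    argv = ["hadoop", "distcp", "-strategy", "dynamic"]
    if max_maps is not None:
        # '-m' caps the number of simultaneous map tasks.
        argv += ["-m", str(max_maps)]
    if update:
        # '-update' skips files that already match on the target
        # (assumption: this fits the job's path layout).
        argv.append("-update")
    argv += [source, target]
    return argv


# Placeholder cluster addresses and paths, not taken from the thread.
example = build_distcp_command(
    "hdfs://source-nn:8020/data",
    "hdfs://backup-nn:8020/data",
    max_maps=50,
    update=True,
)
print(shlex.join(example))
```

The resulting list can be passed to subprocess.run on a host where the hadoop client is installed, or the printed line can be pasted into a shell.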
Backline Customer Operations Engineer
Expert Contributor
Posts: 263
Registered: ‎01-25-2017

Re: DistCp unbalanced copy between the mappers

Hi Harsh,

 

I'm trying to do a first DistCp run by creating snapshot S0 on the source and copying S0 to the backup cluster. Because the folder being copied contains more than 3,000,000 files and 70 TB, the running DistCp log is flooding the application master's local file system. Is there a way to solve this? As a workaround, I'm thinking of running DistCp on each subfolder separately, then creating the S0 snapshot on the source and copying it with DistCp. Any other smart ideas?

Expert Contributor
Posts: 263
Registered: ‎01-25-2017

Re: DistCp unbalanced copy between the mappers

Any insights?
