DistCp unbalanced copy between the mappers

Super Collaborator

Hi,

When I run DistCp to copy data between clusters, almost all the mappers finish within minutes to an hour, but the last one takes more than 40 hours.

The copy listing includes files that were already copied to the other cluster as well as new ones that still need to be copied.

File sizes vary: some are in the GB range, others range from KB to MB.

Any suggestions?

3 REPLIES

Re: DistCp unbalanced copy between the mappers

Master Guru
Try the '-strategy dynamic' option. Instead of pre-planning a fixed set of files for each mapper, it lets mappers pull copies from a shared work list as they finish, which should yield a more uniform work distribution.

This is covered in the documentation: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/D... and http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/D...
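
To illustrate why pulling from a work list can even out the load, here is a toy Python simulation. It is not DistCp's actual code, only a simplified model under stated assumptions: each file is its own unit of work, a file already present on the target costs nothing to process, and the pre-planned split balances the listed bytes rather than the bytes that really need copying. The function names and the example numbers are made up for illustration.

```python
import heapq

def work(f):
    """Cost of one file: zero if it is already on the target (assumption)."""
    size, already_copied = f
    return 0 if already_copied else size

def static_plan(files, num_mappers):
    """Pre-plan contiguous groups with roughly equal *listed* bytes per mapper."""
    target = sum(size for size, _ in files) / num_mappers
    groups = [[] for _ in range(num_mappers)]
    idx, acc = 0, 0
    for f in files:
        if acc >= target and idx < num_mappers - 1:
            idx, acc = idx + 1, 0
        groups[idx].append(f)
        acc += f[0]
    return groups

def static_makespan(files, num_mappers):
    """Finish time of the slowest mapper when work is fixed up front."""
    return max(sum(work(f) for f in g) for g in static_plan(files, num_mappers))

def dynamic_makespan(files, num_mappers):
    """Finish time when each file goes to whichever mapper is free first."""
    free_at = [0] * num_mappers
    heapq.heapify(free_at)
    for f in files:
        t = heapq.heappop(free_at)
        heapq.heappush(free_at, t + work(f))
    return max(free_at)

# Hypothetical listing: 300 files already copied, then 100 new ones (size 100 each).
files = [(100, True)] * 300 + [(100, False)] * 100
print("static :", static_makespan(files, 4))   # 10000
print("dynamic:", dynamic_makespan(files, 4))  # 2500
```

In this model the pre-planned split hands all the new files to a single mapper while the other three only skip files, which resembles the "one mapper runs for 40 hours" symptom. The pull-based variant spreads the same work across all four.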

Re: DistCp unbalanced copy between the mappers

Super Collaborator

Hi Harsh,

I'm trying to do the first DistCp run by creating a snapshot S0 on the source and copying S0 to the backup cluster. But since the DistCp'ed folder contains more than 3,000,000 files and 70 TB, the log of the running DistCp is flooding the application master's local file system. Is there a way to solve this? As a workaround, I'm thinking of running DistCp on each subfolder separately, then creating the S0 snapshot on the source and DistCp'ing it. Any other smart ideas?
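
The workaround described above could be scripted roughly as follows. This is only a sketch of the ordering of steps, not a confirmed fix for the log flooding. The cluster URIs and subfolder names are hypothetical placeholders, the helper only builds the command lines without running them, and it assumes snapshots are already allowed on the source directory.

```python
def build_backup_commands(src_root, dst_root, subfolders, snapshot="S0"):
    """Return argv lists: one DistCp per subfolder, then the snapshot, then a final pass."""
    common = ["hadoop", "distcp", "-update", "-strategy", "dynamic"]
    # Step 1: copy each subfolder as its own, smaller DistCp job.
    cmds = [common + [f"{src_root}/{sub}", f"{dst_root}/{sub}"] for sub in subfolders]
    # Step 2: create the snapshot on the source (directory must be snapshottable).
    cmds.append(["hdfs", "dfs", "-createSnapshot", src_root, snapshot])
    # Step 3: copy from the read-only snapshot path; -update skips files already present.
    cmds.append(common + [f"{src_root}/.snapshot/{snapshot}", dst_root])
    return cmds

# Hypothetical paths, for illustration only.
for cmd in build_backup_commands(
    "hdfs://source-nn/data", "hdfs://backup-nn/data", ["2017", "2018"]
):
    print(" ".join(cmd))
```

Each printed line can be reviewed and run by hand or passed to `subprocess.run`. Keeping every job small limits how much any single application master has to log, though whether that is enough for 3,000,000 files would need to be checked on the actual cluster.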

Re: DistCp unbalanced copy between the mappers

Super Collaborator

Any insights?