Distcp using Webhdfs brings down the Job tracker

New Contributor

We are having a strange issue while trying to copy a large amount of data to another cluster using distcp.

 
Some numbers:
 
Data size = 320 GB
Number of mappers = ~70
Total number of nodes in our cluster = 82
Source cluster: Hadoop 0.20
Destination cluster: Hadoop 2.0.2
 
When we kick off this job, all the mappers complete successfully except the last one, which takes far too long. When it finally completes or fails, it freezes the Job tracker for close to 15 minutes, after which all the task trackers get restarted, which in turn restarts every job that was running on the cluster at the time.
 
We have multiple distcp jobs transferring data to S3, as well as to other clusters with the same Hadoop setup, and have not faced this issue. The only difference between this process and the others is that here we are using webhdfs. Is webhdfs the bottleneck?
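
For reference, a distcp run over webhdfs of the kind described here generally has the shape sketched below. This is only an illustration: the hostnames, ports, and paths are hypothetical placeholders, and which side of the copy is addressed through webhdfs is an assumption.

```shell
#!/bin/sh
# Hypothetical placeholders: substitute real NameNode hosts, ports, and paths.
SRC="webhdfs://src-nn.example.com:50070/data/export"  # assumed: source read over WebHDFS (NameNode HTTP port)
DST="hdfs://dst-nn.example.com:8020/data/import"      # assumed: destination written over native HDFS RPC
MAPS=70                                               # upper bound on simultaneous map tasks (distcp -m)

CMD="hadoop distcp -m $MAPS $SRC $DST"

# Dry run: only print the command. To launch the copy, run: eval "$CMD"
echo "$CMD"
```

Printing the command first makes it easy to confirm the URI schemes and the mapper limit before submitting the job.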
 
We also tried reducing the file size and checked the network bandwidth for saturation and the load on the machines, but still could not get around this issue.
 
Also, is there any other way to transfer data, instead of using webhdfs, when the two Hadoop versions are not the same?
 
Thanks
1 REPLY

Re: Distcp using Webhdfs brings down the Job tracker

Master Collaborator

I have moved this post to the HDFS discussion board in hopes that someone in here can assist you with this question.  I think we may need to change the name of the original board you posted this to as it is confusing.  It is called "Data Ingestion...", but is specifically aimed at tools like Sqoop, Flume, etc.  WebHDFS and distcp are HDFS components, so I hope you can find some help in this board.

 

Regards,

 

Clint