I want to copy data between two clusters. Any suggestions or methods for copying the data faster than the usual speed?
You can use distcp over a distributed protocol like hdfs or webhdfs. Make sure to run enough mappers (the -m option) to fully saturate the drives and datanodes and get max speed.
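As a rough illustration of the advice above, here is a small Python sketch that assembles a distcp command line with an explicit mapper count. The NameNode hostnames, port, paths, and the mapper count of 40 are made-up placeholders, and the helper name build_distcp_command is hypothetical; the right number of mappers depends on your clusters.

```python
def build_distcp_command(src, dst, num_maps):
    """Return the argv list for a distcp run with an upper bound on mappers.

    -m sets the maximum number of simultaneous copy tasks (maps).
    """
    return ["hadoop", "distcp", "-m", str(num_maps), src, dst]


# Hypothetical source and destination URIs; replace with your own clusters.
cmd = build_distcp_command(
    "hdfs://nn1.example.com:8020/data",
    "hdfs://nn2.example.com:8020/data",
    num_maps=40,
)
print(" ".join(cmd))
```

You could paste the printed line into a shell, or hand the list to subprocess.run(cmd, check=True) on a machine that has the Hadoop client installed.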
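I'd add one more thing: if the two clusters are not adjacent network-wise, but instead sit on a network you have to share with other people, use -bandwidth to limit how much bandwidth distcp actually uses. The value is per map, in MB/second, so the total is roughly that value times the number of mappers.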
Why? Between two large clusters, the IO bandwidth off disk will actually be higher than the network backbone bandwidth, so an unthrottled distcp can saturate the backbone. This will make your network ops team very unhappy. And yes, it has happened at scale.
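To pick a -bandwidth value, it can help to work backwards from the share of the link you are allowed to use. The sketch below is only back-of-the-envelope arithmetic under some assumptions: -bandwidth is applied per map in MB/second, all maps run at once, and the link speed, share, and mapper count are invented example numbers. The function name per_map_bandwidth_mb is hypothetical.

```python
import math


def per_map_bandwidth_mb(link_gbps, share, num_maps):
    """Estimate a per-map -bandwidth value (MB/s) for a shared link.

    link_gbps: backbone capacity in gigabits per second (assumed known)
    share:     fraction of the link distcp may use, e.g. 0.25
    num_maps:  number of simultaneous maps (the -m value)
    """
    link_mb_per_s = link_gbps * 1000 / 8      # Gbit/s -> MB/s (decimal units)
    budget_mb_per_s = link_mb_per_s * share   # what distcp may use in total
    # Round down so the total stays under budget; never go below 1 MB/s.
    return max(1, math.floor(budget_mb_per_s / num_maps))


# Example: 10 Gbit/s backbone, distcp allowed a quarter of it, 50 maps.
bw = per_map_bandwidth_mb(link_gbps=10, share=0.25, num_maps=50)
print(f"hadoop distcp -m 50 -bandwidth {bw} <src> <dst>")
```

With these example numbers the budget is 312.5 MB/s, which works out to 6 MB/s per map. Treat the result as a starting point and confirm the actual limit with whoever runs the network.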