I have two Hadoop clusters and I want to copy the data from cluster 1 to cluster 2. I searched articles and forums to find out which tool I should use, and I found that Falcon can do this, but I do not understand how to use it. Could someone point me to a guide or article, or walk me through a practical example of setting up this workflow with Falcon?
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
Copying Data from Cluster1 to Cluster2
hadoop distcp hdfs://cluster1:8020/data/in/hdfs/ hdfs://cluster2:8020/new/path/in/hdfs/
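If the copy has to be repeated, for example to keep cluster2 in sync, a few standard DistCp options are worth knowing. The sketch below reuses the example paths from the command above; the map count of 20 is an arbitrary assumption to tune for your clusters, and the function only prints the command so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# Sketch only: builds a DistCp command for a repeatable copy.
# Paths are the example paths from this answer; MAPS=20 is an assumed value.
SRC="hdfs://cluster1:8020/data/in/hdfs/"
DST="hdfs://cluster2:8020/new/path/in/hdfs/"
MAPS=20

distcp_cmd() {
  # -update : copy only files that are missing or differ at the destination
  # -pugp   : preserve user, group and permission on the copied files
  # -m      : upper limit on the number of simultaneous map tasks
  echo hadoop distcp -update -pugp -m "$MAPS" "$1" "$2"
}

# Prints the command; remove "echo" in distcp_cmd to actually run it.
distcp_cmd "$SRC" "$DST"
```

Note that with -update the contents of the source directory are copied into the target directory, whereas without it (and with an existing target) the source directory itself is created under the target, so check the resulting layout on a small test path first.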
Copying between 2 HA clusters
To use distcp between two HA clusters, identify the currently active NameNode on each cluster and run distcp as you would between two clusters without HA:
hadoop distcp hdfs://active1:8020/path hdfs://active2:8020/path
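One way to find the active NameNode is the hdfs haadmin -getServiceState command, which prints "active" or "standby" for a given NameNode service ID. The sketch below is an assumption-laden illustration: the IDs nn1 and nn2 are hypothetical (the real ones are listed under dfs.ha.namenodes.<nameservice> in hdfs-site.xml), and the command has to be run with the client configuration of the cluster being queried.

```shell
#!/usr/bin/env bash
# Sketch only: find which NameNode service ID is currently active.
# NN_IDS is hypothetical; take the real IDs from hdfs-site.xml.
NN_IDS="nn1 nn2"

# Thin wrapper around the real CLI call.
get_state() {
  hdfs haadmin -getServiceState "$1"
}

# Prints the first service ID whose state is "active";
# returns non-zero if none is found.
find_active() {
  for id in $NN_IDS; do
    if [ "$(get_state "$id" 2>/dev/null)" = "active" ]; then
      echo "$id"
      return 0
    fi
  done
  return 1
}

# Usage (run once per cluster, with that cluster's configuration):
#   active_id="$(find_active)" || echo "no active NameNode found"
```

The result is a service ID, not a host name; the host and port to put in the distcp URI are the RPC address configured for that ID.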
For more details, see the Apache DistCp documentation.