Can anyone provide the syntax and a sample example for checking the difference between two snapshots and moving that difference to a target cluster using distcp?
I have two clusters, clusterA and clusterB. I recently built clusterB and am moving all the data from clusterA to clusterB. Before moving the data, I took a snapshot on clusterA. While the data was being transferred, clusterA was still active, so the data changed. Now I want to move only the changed data from clusterA to clusterB. Can someone provide the syntax, with a simple example, for getting the difference and moving the changed data?
Thanks in advance.
@SBandaru - Below is an excellent article on HCC explaining distcp with Snapshots:
From the article:
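# Syntax: enable snapshots on a directory
hdfs dfsadmin -allowSnapshot <path>
# Enable snapshots on the source and take the first snapshot
hdfs dfsadmin -allowSnapshot /data/a
hdfs dfs -createSnapshot /data/a s1
# Initial full copy from the snapshot to the target
hadoop distcp /data/a/.snapshot/s1 /data/a_target
# Enable snapshots on the target and take a snapshot with the same name
hdfs dfsadmin -allowSnapshot /data/a_target
hdfs dfs -createSnapshot /data/a_target s1
# Later: take a second snapshot on the source and list what changed
hdfs dfs -createSnapshot /data/a s2
hdfs snapshotDiff /data/a s1 s2
# Copy only the changes between s1 and s2 to the target
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
# Snapshot the target as s2 so it is ready for the next round
hdfs dfs -createSnapshot /data/a_target s2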
That's it. You've completed the cycle. Rinse and repeat.
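For the "rinse and repeat" part, the incremental round could be wrapped in a small shell function. This is only a sketch, not something from the article: the names sync_round, run and DRY_RUN are made up for illustration, and it assumes the target already holds the older snapshot and has not been modified since that snapshot was taken, which distcp -diff requires. With DRY_RUN=1 (the default here) it only prints the commands, so you can check them before running anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the article): one incremental sync round.
# DRY_RUN=1 prints the commands; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: sync_round SRC_DIR DST_DIR OLD_SNAPSHOT NEW_SNAPSHOT
sync_round() {
  local src="$1" dst="$2" old="$3" new="$4"
  # 1. Take the new snapshot on the source.
  run hdfs dfs -createSnapshot "$src" "$new" &&
  # 2. Show what changed between the two snapshots.
  run hdfs snapshotDiff "$src" "$old" "$new" &&
  # 3. Copy only those changes; -diff must be combined with -update.
  run hadoop distcp -update -diff "$old" "$new" "$src" "$dst" &&
  # 4. Snapshot the target under the same name for the next round.
  run hdfs dfs -createSnapshot "$dst" "$new"
}
```

With the paths used above, the second round would be sync_round /data/a /data/a_target s1 s2, the next one sync_round /data/a /data/a_target s2 s3, and so on. Each step runs only if the previous one succeeded, so a failed distcp does not leave a misleading snapshot on the target.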
Guys, I am facing a challenge: when I run snapshotDiff from a remote cluster, it fails with a "snapshot not found" error even though the snapshot exists. Is there any solution for this? We built a DR cluster and are running distcp from the DR side to use the DR resources instead of overloading PROD. Is there any way this can be achieved?