distcp update difference between two snapshot syntax

Hi,

Can anyone provide the syntax and a sample example for checking the difference between two snapshots and moving that difference to a target cluster using distcp?

AIM:

I have two clusters, clusterA and clusterB. I recently built clusterB and am moving all the data from clusterA to clusterB. Before moving the data I took a snapshot on clusterA. Because clusterA stayed active during the transfer, the data there has changed. Now I want to move only the changed data from clusterA to clusterB. Can someone provide the syntax, with a simple example, for getting the difference and moving the changed data?

Thanks in advance.

1 ACCEPTED SOLUTION

@SBandaru - Below is an excellent article on HCC explaining distcp with Snapshots:

https://community.hortonworks.com/articles/71775/managing-hadoop-dr-with-distcp-and-snapshots.html

From the article:

  • Source must support 'snapshots'
hdfs dfsadmin -allowSnapshot <path>
  • Target is "read-only"
  • Target, after the initial baseline 'distcp' sync, needs to support snapshots.

Process

  • Identify the source and target 'parent' directory
    • Do not initially create the destination directory; allow the first distcp to do that. For example, if I want to sync source `/data/a` with `/data/a_target`, do *NOT* pre-create the 'a_target' directory.
  • Allow snapshots on the source directory
hdfs dfsadmin -allowSnapshot /data/a
  • Create a Snapshot of /data/a
hdfs dfs -createSnapshot /data/a s1
  • Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to the following command.
hadoop distcp /data/a/.snapshot/s1 /data/a_target
  • Allow snapshots on the newly created target directory
hdfs dfsadmin -allowSnapshot /data/a_target
  • At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
  • Create a snapshot in /data/a_target with the same name as the snapshot used to build the baseline
hdfs dfs -createSnapshot /data/a_target s1
  • Add some content to the source directory /data/a. Make changes (adds, deletes, etc.) that need to be replicated to /data/a_target.
  • Take a new snapshot of /data/a
hdfs dfs -createSnapshot /data/a s2
  • Just for fun, check on what's changed between the two snapshots
hdfs snapshotDiff /data/a s1 s2
  • Ok, now let's migrate the changes to /data/a_target
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
  • When that's completed, finish the cycle by creating a matching snapshot on /data/a_target
hdfs dfs -createSnapshot /data/a_target s2

That's it. You've completed the cycle. Rinse and repeat.
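
For the "rinse and repeat" part, here is a minimal sketch of one incremental cycle wrapped in a shell function. It only chains the three commands from the steps above. The names `run`, `incremental_sync` and the `DRY_RUN` variable are my own illustration (not part of Hadoop), and the snapshot name `s3` is just the assumed next one after `s2`. By default it only prints the commands, so you can check them before running anything against a real cluster.

```shell
#!/bin/sh
# Sketch of one incremental cycle: snapshot the source, copy the diff,
# then snapshot the target. DRY_RUN=1 (the default here) only prints the
# commands; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: incremental_sync SRC DST PREV_SNAPSHOT NEXT_SNAPSHOT
incremental_sync() {
    src=$1; dst=$2; prev=$3; next=$4
    # Take a new point-in-time snapshot on the source.
    run hdfs dfs -createSnapshot "$src" "$next" || return 1
    # Copy only what changed between PREV and NEXT (-diff is used with -update).
    run hadoop distcp -diff "$prev" "$next" -update "$src" "$dst" || return 1
    # Create the matching snapshot on the target so it can be the next baseline.
    run hdfs dfs -createSnapshot "$dst" "$next" || return 1
}

# Example: the cycle after s2, assuming s2 already exists on both sides.
incremental_sync /data/a /data/a_target s2 s3
```

In dry-run mode the last line just echoes the three commands in order. The function stops at the first failing step, so a failed distcp does not leave a target snapshot that claims to match the source.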

10 REPLIES


Guys, there is a challenge I am facing. When I run snapshotDiff from a remote cluster, it fails with a "snapshot not found" error even though the snapshot is available. Is there any solution for this? We built a DR cluster and are running distcp from DR to use the DR resources instead of overloading PROD. Any idea how this can be achieved?