distcp update difference between two snapshot syntax
- Labels: Apache Hadoop
Created ‎03-30-2017 09:15 PM
Hi,
Can anyone provide the syntax and a sample example for finding the difference between two snapshots and copying that difference to a target cluster using distcp?
AIM:
I have two clusters, clusterA and clusterB. I recently built clusterB and am moving all the data from clusterA to clusterB. Before moving the data, I took a snapshot on clusterA. Because clusterA was still active during the transfer, the data changed in the meantime. Now I want to move only the changed data from clusterA to clusterB. Can someone provide the syntax with a simple example showing how to get the difference and move the changed data?
Thanks in advance.
Created ‎03-31-2017 11:47 PM
@SBandaru - Below is an excellent article on HCC explaining distcp with Snapshots:
https://community.hortonworks.com/articles/71775/managing-hadoop-dr-with-distcp-and-snapshots.html
From the article:
- Source must support 'snapshots'
hdfs dfsadmin -allowSnapshot <path>
- Target is "read-only"
- After the initial baseline 'distcp' sync, the target needs to support snapshots.
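For the snapshot requirement, a small helper can check whether a directory is already snapshottable before calling -allowSnapshot. This is only a sketch: `is_snapshottable` is a made-up helper name, and it assumes the path is the last whitespace-separated field on each line printed by `hdfs lsSnapshottableDir`, so paths containing spaces are not handled.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of Hadoop): reads `hdfs lsSnapshottableDir`
# style output on stdin and returns 0 if the given path is listed, 1 otherwise.
# Assumption: the directory path is the last field on each line.
is_snapshottable() {
  local path="$1"
  awk -v p="$path" '$NF == p { found = 1 } END { exit !found }'
}

# Usage (needs a Hadoop client on the PATH):
#   hdfs lsSnapshottableDir | is_snapshottable /data/a \
#     || hdfs dfsadmin -allowSnapshot /data/a
```

The same check can be run against the target directory once the baseline copy has created it.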
Process
- Identify the source and target 'parent' directory
- Do not initially create the destination directory, allow the first distcp to do that. For example: If I want to sync source `/data/a` with `/data/a_target`, do *NOT* pre-create the 'a_target' directory.
- Allow snapshots on the source directory
hdfs dfsadmin -allowSnapshot /data/a
- Create a Snapshot of /data/a
hdfs dfs -createSnapshot /data/a s1
- Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to the following command.
hadoop distcp /data/a/.snapshot/s1 /data/a_target
- Allow snapshots on the newly created target directory
hdfs dfsadmin -allowSnapshot /data/a_target
- At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
- Create a snapshot in /data/a_target whose name matches the snapshot used to build the baseline
hdfs dfs -createSnapshot /data/a_target s1
- Add some content to the source directory /data/a. Make changes, adds, deletes, etc. that need to be replicated to /data/a_target.
- Take a new snapshot of /data/a
hdfs dfs -createSnapshot /data/a s2
- Just for fun, check what's changed between the two snapshots
hdfs snapshotDiff /data/a s1 s2
- Ok, now let's migrate the changes to /data/a_target
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
- When that's completed, finish the cycle by creating a matching snapshot on /data/a_target
hdfs dfs -createSnapshot /data/a_target s2
That's it. You've completed the cycle. Rinse and repeat.
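Since the incremental part of the cycle (new source snapshot, diff, distcp, matching target snapshot) repeats every time, it could be wrapped in a small script. The sketch below is one way to do that, not something from the article: `run`, `sync_cycle`, and the `DRY_RUN` variable are illustrative names, and the hdfs/hadoop commands are the same ones listed in the steps above. It assumes the baseline copy and the matching first snapshot on the target already exist.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# One incremental cycle. Arguments (all illustrative):
#   $1 source dir, $2 target dir, $3 previous snapshot, $4 new snapshot
# Each step runs only if the previous one succeeded.
sync_cycle() {
  local src="$1" dst="$2" prev="$3" next="$4"
  run hdfs dfs -createSnapshot "$src" "$next" &&
  run hdfs snapshotDiff "$src" "$prev" "$next" &&
  run hadoop distcp -diff "$prev" "$next" -update "$src" "$dst" &&
  run hdfs dfs -createSnapshot "$dst" "$next"
}
```

Calling `DRY_RUN=1 sync_cycle /data/a /data/a_target s1 s2` only prints the four commands, which is a cheap way to review them before running against a real cluster. For the next cycle the arguments would become s2 and s3, and so on.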
Created ‎07-16-2018 11:33 AM
Guys, there is a challenge I am facing: when I run snapshotDiff from a remote cluster, it fails with a "snapshot not found" error even though the snapshot is available. Is there any solution for this? We built a DR cluster and are running distcp from DR to use the DR resources instead of overloading PROD. Any suggestions on how this can be achieved?
