rdiff usage in distcp
Labels: Apache Hadoop
Created 09-01-2022 11:01 PM
We have a requirement to copy only the changed data (the diff) from a source cluster to a destination cluster.
We have been using the distcp -diff option successfully. However, as you may already know, -diff only works if the destination has not been modified since the snapshot was taken. Of late we have had to deal with cases where the destination data is modified. To handle this, we use -rdiff on the destination to restore it to a known snapshot, one that also exists on the source. After the restore step, we copy the data from the source using -diff.
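As a rough sketch of how one might detect the "destination was modified" case before deciding to run -rdiff: a snapshot diff report marks each changed entry with +, -, M or R. The helper name report_has_changes and the sample report text below are hypothetical and only illustrate the check; on a real cluster the report would come from hdfs snapshotDiff.

```shell
# Hypothetical helper: succeeds (exit 0) if a snapshot diff report read from
# stdin lists at least one created (+), deleted (-), modified (M) or
# renamed (R) entry.
report_has_changes() {
  grep -Eq '^[-+MR][[:space:]]'
}

# Illustrative report text, not captured from a real cluster.
sample_report=$(printf 'Difference between snapshot s0 and current directory under directory /destination:\nM\t.\n+\t./new-file\n')

if printf '%s\n' "$sample_report" | report_has_changes; then
  echo "destination changed since s0: run -rdiff before -diff"
else
  echo "destination unchanged since s0: -diff can run directly"
fi

# Against a real cluster the report would come from:
#   hdfs snapshotDiff /destination s0 .
```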
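Here is the algorithm:
1> Create snapshot s1 on the destination.
hdfs dfs -createSnapshot /destination s1
2> Perform rdiff to revert the destination to s0.
hadoop distcp -update -rdiff s1 s0 /destination /destination
3> Recreate snapshot s0 on /destination.
hdfs dfs -deleteSnapshot /destination s0
hdfs dfs -createSnapshot /destination s0
4> Create snapshot s2 on the source.
hdfs dfs -createSnapshot /source s2
5> Perform diff.
hadoop distcp -update -diff s0 s2 /source /destination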
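The five steps can be put together as one script, sketched below under a few assumptions. The paths and snapshot names come from the steps above. The run and sync_cycle helpers and the DRY_RUN switch are hypothetical, added so the commands can be printed and reviewed before anything touches a cluster. The -update flag is included because the DistCp documentation states that -diff and -rdiff are valid only together with it.

```shell
#!/usr/bin/env bash
# Sketch of one sync cycle. SRC, DST and the snapshot names follow the steps
# above; DRY_RUN and the run/sync_cycle helpers are hypothetical.
set -e

SRC=${SRC:-/source}
DST=${DST:-/destination}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

sync_cycle() {
  # 1. Capture the current (modified) state of the destination as s1.
  run hdfs dfs -createSnapshot "$DST" s1
  # 2. Revert the destination to s0 by applying the s0..s1 diff in reverse.
  run hadoop distcp -update -rdiff s1 s0 "$DST" "$DST"
  # 3. Recreate s0 on the destination (the step this question asks about).
  run hdfs dfs -deleteSnapshot "$DST" s0
  run hdfs dfs -createSnapshot "$DST" s0
  # 4. Snapshot the source as s2.
  run hdfs dfs -createSnapshot "$SRC" s2
  # 5. Copy the s0..s2 changes from the source to the destination.
  run hadoop distcp -update -diff s0 s2 "$SRC" "$DST"
}

sync_cycle
```

Running it as-is only prints the six commands; setting DRY_RUN=0 would execute them. Since createSnapshot fails when the name already exists, a repeated cycle would need fresh snapshot names or cleanup of s1 and s2 first.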
This works perfectly for us. However, in step 3 above I recreate snapshot s0 on the destination, based on this post: https://community.cloudera.com/t5/Support-Questions/distcp-snapshot-managemnt/td-p/53733
Is recreating the snapshot really required?
