09-01-2022
11:01 PM
We have a requirement to copy only the changed data (the diff) from a source cluster to a destination cluster. We have been using the distcp -diff option successfully. However, as you may already know, -diff only works if the destination data has not been modified since the snapshot was taken. Of late, we have had to deal with cases where the destination data has been modified.

To handle those changes, we use -rdiff to restore the destination to a known snapshot, that is, a snapshot that also exists on the source side. After the restore step, we copy the data from the source using the -diff option.

Here is the algorithm:

1> Create snapshot s1 on the destination.
2> Perform the rdiff: hadoop distcp -rdiff s1 s0 /destination /destination
3> Recreate snapshot s0 on /destination: delete s0, then create s0.
4> Create snapshot s2 on the source.
5> Perform the diff: hadoop distcp -diff s0 s2 /source /destination

This works perfectly for us. However, in step 3 I recreate snapshot s0 on the destination based on this post: https://community.cloudera.com/t5/Support-Questions/distcp-snapshot-managemnt/td-p/53733

I want to make sure: is this really required?
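For reference, the five steps can be sketched as a small dry-run script. It only prints the commands and does not touch either cluster. The function name plan_sync is hypothetical, the paths and snapshot names are the placeholders from the steps above, and the -update flag is an assumption on my part: as far as I know, DistCp only accepts -diff and -rdiff together with -update, so please check that against your Hadoop version.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands for the five steps instead of running them.
# plan_sync is a hypothetical helper; paths and snapshot names are placeholders.
plan_sync() {
  local src="$1" dst="$2" base="$3" tmp="$4" next="$5"
  # 1> Snapshot the (modified) destination.
  echo "hdfs dfs -createSnapshot $dst $tmp"
  # 2> Roll the destination back to the base snapshot (-update is assumed to be required).
  echo "hadoop distcp -update -rdiff $tmp $base $dst $dst"
  # 3> Recreate the base snapshot on the destination (the step in question).
  echo "hdfs dfs -deleteSnapshot $dst $base"
  echo "hdfs dfs -createSnapshot $dst $base"
  # 4> Snapshot the source.
  echo "hdfs dfs -createSnapshot $src $next"
  # 5> Copy the source changes between the base and the new snapshot.
  echo "hadoop distcp -update -diff $base $next $src $dst"
}

plan_sync /source /destination s0 s1 s2
```

The two lines under step 3 are the ones I am asking about; removing them from the function gives the variant without recreating s0.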
Labels:
- Apache Hadoop