rdiff usage in distcp

We have a requirement to copy only the changed data (the diff) from the source cluster to the destination cluster.

We have been using the -diff option successfully. As you may already know, -diff only works if the destination data has not been modified since the snapshot was taken. Lately, however, we have had to deal with the destination data being modified. To handle this, we use -rdiff on the destination to restore its data to a known snapshot, meaning a snapshot that also exists on the source side. After the restore step, we copy the data from the source using the -diff option.


Here is the algorithm:

1> Create snapshot s1 on the destination.

2> Perform rdiff

            hadoop distcp -update -rdiff s1 s0 /destination /destination

3> Recreate snapshot s0 on /destination

            hdfs dfs -deleteSnapshot /destination s0

            hdfs dfs -createSnapshot /destination s0

4> Create snapshot s2 on the source

            hdfs dfs -createSnapshot /source s2

5> Perform diff

            hadoop distcp -update -diff s0 s2 /source /destination
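
For reference, the five steps above can be sketched as a small shell script. This is only a sketch under a few assumptions: `run`, `sync_with_restore`, and the `DRY_RUN`, `SRC`, and `DST` variables are hypothetical names introduced here, /source and /destination are assumed to be snapshottable directories, and `-update` is passed because DistCp accepts `-diff` and `-rdiff` only together with it. By default the script only prints the commands rather than running them.

```shell
#!/usr/bin/env bash
# Sketch of the restore-then-sync workflow from the steps above.
# SRC, DST, DRY_RUN, run and sync_with_restore are illustrative names.

SRC="${SRC:-/source}"
DST="${DST:-/destination}"

# Print the command; execute it only when DRY_RUN is explicitly set to 0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

sync_with_restore() {
  # 1> Capture the current (modified) state of the destination as s1.
  run hdfs dfs -createSnapshot "$DST" s1 || return 1
  # 2> Revert the destination to s0 by applying the s0..s1 diff in reverse.
  run hadoop distcp -update -rdiff s1 s0 "$DST" "$DST" || return 1
  # 3> Recreate s0 on the destination (the step the question is about).
  run hdfs dfs -deleteSnapshot "$DST" s0 || return 1
  run hdfs dfs -createSnapshot "$DST" s0 || return 1
  # 4> Snapshot the source as s2.
  run hdfs dfs -createSnapshot "$SRC" s2 || return 1
  # 5> Apply the source's s0..s2 diff to the destination.
  run hadoop distcp -update -diff s0 s2 "$SRC" "$DST" || return 1
}
```

Calling `sync_with_restore` with `DRY_RUN` unset prints the six commands in order; `DRY_RUN=0 sync_with_restore` would actually execute them and stop at the first one that fails.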


This works perfectly for us. However, in step 3 above I recreate the snapshot s0 on the destination, based on this post.

Is recreating s0 really required?