Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

The documentation states that I need two snapshots on the target in order to restore from a snapshot.

When I use -rdiff s1 s0 source_path target_path, can you please clarify whether s1 and s0 are snapshots on the source or on the destination?

Please find my comments:

This option is valid only with -update option and the following conditions should be satisfied.

1. Two snapshots <newSnapshot> and <oldSnapshot> have been created on the target FS, and <oldSnapshot> is older than <newSnapshot>. No change has been made on target since <newSnapshot> was created on the target.

-- Why do I need two snapshots on the target? Why must no changes have been made on the target? Isn't the whole reason for using -rdiff that the target has changed since snapshot s0?

-- In your comments above you stated that after the rdiff I can create s0 on the target. Isn't s0 already on the target, and don't I need it there in order to perform the rdiff?

Cloudera Employee
Posts: 13
Registered: ‎08-20-2015

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

I think the confusion is that, in a cluster, we not only have snapshots but also a "current state", which is continuously being modified. When we say we revert to a snapshot, we mean that we modify the "current state" to make it look the same as the content of that snapshot.

 

If we take a snapshot x of the "current state", then x is identical to the "current state" as long as we make no further changes. Suppose we take snapshot x, then make some changes, which produce a new "current state", and we now want to revert those changes and bring the "current state" back to x. We first take a snapshot y of the new "current state", both to make the snapshot diff easy to calculate and to make sure that no further changes are made after y. After we revert the current state to x, we can choose to delete snapshot y.

 

After we bring the "current state" back to x, its content is the same as x. However, due to subtle implementation details of snapshots, the snapshot diff calculation would still report a difference between x and the "current state". To fix that, we can now delete snapshot x and create a new snapshot x of the "current state", so that the "current state" is again the same as snapshot x.
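
The sequence described above can be sketched as a small shell function that only prints the commands, so the order can be reviewed before anything is run. This is an illustration, not a tested procedure: the function name revert_plan, the paths /source and /target, and the snapshot names x and y are placeholders, and the -update -rdiff form is taken from the DistCp documentation quoted earlier in this thread.

```shell
#!/bin/bash
# Sketch only: prints the commands for reverting a target directory to an
# older snapshot. Nothing is executed. All names and paths are placeholders.
revert_plan() {
  local src="$1" dst="$2" old="$3" tmp="$4"
  # 1. Freeze the changed "current state" of the target as a new snapshot.
  echo "hdfs dfs -createSnapshot $dst $tmp"
  # 2. Apply the diff between tmp and old in reverse on the target.
  echo "hadoop distcp -update -rdiff $tmp $old $src $dst"
  # 3. The temporary snapshot is optional once the revert has succeeded.
  echo "hdfs dfs -deleteSnapshot $dst $tmp"
  # 4. Recreate old so that it matches the reverted "current state".
  echo "hdfs dfs -deleteSnapshot $dst $old"
  echo "hdfs dfs -createSnapshot $dst $old"
}

revert_plan /source /target x y
```

Printing the plan first is a deliberate choice here: the steps are order-sensitive, and a dry run makes it easy to confirm which snapshot is the newer one before touching the target.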

 

Please quote the statement from the doc if you want to comment on it, so that we can see exactly which statement you were referring to.

 

Thanks.

 

Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

My target folder is not modified directly; it is only kept in sync with the source folder using distcp.

So if I understood you correctly, when I issue -rdiff s1 s0 source_file destination_file, s1 and s0 are two snapshots on the destination, and the command will revert the destination to s0 and align s0 on both the source and the destination.

Or should the rdiff arguments be s1 s0 destination_folder source_file?
Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

Any help here would be much appreciated.

Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

I got it.

 

The snapshot restore command should be:

 

hadoop distcp -update -rdiff s1 s0 source destination

Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

So, just to make things clearer and more useful for anyone who is going to use this feature, or who is using distcp -diff in their current CDH version:

 

1- If you use the snapshot diff in a version prior to CDH5.10 or CDH5.9.1, distcp was able to recover from a failed run by listing the whole source directory and running the copy from scratch. In the newer versions distcp no longer recovers this way: if a run fails or is interrupted, all subsequent runs will fail as well.

 

2- To overcome this, you have to use the snapshot restore to bring your destination HDFS folder back to its state before the distcp failure.

 

3- The distcp snapshot restore command should look like this: hadoop distcp -update -rdiff s1 s0 source_folder destination_folder. Here s1 is a snapshot on the destination that is newer than s0. After distcp -rdiff succeeds, the destination is restored to s0, which is its state before the distcp failure.

 

4- The most challenging part is managing the snapshot cycle across distcp -diff and distcp -rdiff.
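
One possible way to decide whether a restore is needed at all is to check whether the destination has changed since s0. The following is a minimal sketch under stated assumptions: it assumes that the report printed by hdfs snapshotDiff lists one entry per line, starting with +, -, M or R followed by a tab, and the function name count_changes is hypothetical. The function only reads a report from standard input; the hdfs call is shown in a comment and is not executed.

```shell
#!/bin/bash
# Sketch only: counts the entries in a snapshot diff report read from stdin.
# Assumption: each entry line starts with +, -, M or R followed by a tab.
count_changes() {
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  grep -c -E "^[-+MR]$(printf '\t')" || true
}

# Intended use (not executed here); "." stands for the current state:
#   changes=$(hdfs snapshotDiff /fawzedestination s0 . | count_changes)
#   [ "$changes" -eq 0 ] && echo "destination unchanged since s0"
```

If the count is zero, the destination still matches s0 and a plain -diff run should be possible; otherwise the restore from point 2 applies.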

 

Here is an example of how I'm doing this (I'm still working to improve it):

 

========================================

 

#!/bin/bash -x

# Snapshot the source, then copy the changes between s0 and s1 to the destination.
# -diff is only valid together with -update.
hdfs dfs -createSnapshot /fawzesource s1
if hadoop distcp -update -diff s0 s1 /fawzesource /fawzedestination
then
    # Success: rotate the snapshots so that s0 marks the synced state on both sides.
    hdfs dfs -createSnapshot /fawzedestination s1
    hdfs dfs -deleteSnapshot /fawzesource s0
    hdfs dfs -renameSnapshot /fawzesource s1 s0
    hdfs dfs -deleteSnapshot /fawzedestination s0
    hdfs dfs -renameSnapshot /fawzedestination s1 s0
else
    # Failure: snapshot the partially copied destination as s2 and revert it to s0.
    hdfs dfs -deleteSnapshot /fawzedestination s2
    hdfs dfs -createSnapshot /fawzedestination s2
    if hadoop distcp -update -rdiff s2 s0 /fawzesource /fawzedestination
    then
        # Recreate s0 on the destination so that it matches the reverted state.
        hdfs dfs -deleteSnapshot /fawzedestination s2
        hdfs dfs -deleteSnapshot /fawzedestination s0
        hdfs dfs -createSnapshot /fawzedestination s0
    fi
fi

Cloudera Employee
Posts: 13
Registered: ‎08-20-2015

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

Hi Fawze,

 

Sorry I was out for a few days. Nice summary!

 

--Yongjun

Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

@Yongjun Zhang after testing the snapshot cycle, we found that managing it for both diff and rdiff is a complex task, and there are a lot of cases in which we would lose data, so we decided to stay with the current solution and use diff only.

 

Is this a valid approach in CDH5.10 and above? Can I force distcp not to use rdiff and instead automatically list all the source files when the destination has changed since the last snapshot?

Cloudera Employee
Posts: 13
Registered: ‎08-20-2015

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

Hi Fawze,

Thanks for the update.

Did you try to find out why the data is missing? Was the missing file newly created at the source after snapshot s0 was created, or is the file in snapshot s0 of the source but not in the target after you do the rdiff?

Would you please elaborate on the cases in "there are alot of cases that we will missing data", if you know them?

I hope we can come to an understanding of why there are missing files.

Thanks.
Expert Contributor
Posts: 158
Registered: ‎01-25-2017

Re: Killing the Distcp which running over snapshot listing all snapshottable path in the next run

Hi @Yongjun Zhang, the missing data is not caused by distcp; it is caused by the snapshot management. Since we run several delete, create, and rename snapshot commands, sometimes one of them is skipped due to a network issue, and this results in data inconsistency.
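
A minimal sketch of one way to keep a skipped command from going unnoticed: wrap each snapshot command in a helper that retries it a few times and returns a failure status, so the script can stop instead of continuing with an inconsistent set of snapshots. The helper name run_step and the retry count are hypothetical, and this has not been tested against a real cluster.

```shell
#!/bin/bash
# Sketch only: run one command, retrying up to a given number of attempts,
# and return non-zero if every attempt fails so the caller can stop.
run_step() {
  local attempts="$1"; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i/$attempts failed: $*" >&2
  done
  return 1
}

# Intended use (not executed here); stop the cycle at the first failed step:
#   run_step 3 hdfs dfs -deleteSnapshot /fawzesource s0 || exit 1
#   run_step 3 hdfs dfs -renameSnapshot /fawzesource s1 s0 || exit 1
```

One caveat with this design: if a command actually succeeded on the cluster but the client saw a network error, the retry itself may fail (for example, deleting a snapshot that is already gone), so a failed step still needs a manual check of the snapshot list before the cycle is resumed.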
