Created 03-30-2017 09:15 PM
Can anyone provide the syntax and a sample example for checking the difference between two snapshots and moving that difference to a target cluster using distcp?
I have two clusters, clusterA and clusterB. I recently built clusterB and am moving all the data from clusterA to clusterB. Before moving the data I took a snapshot on clusterA. While the data was being transferred, clusterA was still active, so the data there changed. Now I want to move only the changed data from clusterA to clusterB. Can someone provide the syntax with a simple example of how to get the difference and move the changed data?
Thanks in advance.
Created 03-31-2017 11:47 PM
@SBandaru - Below is an excellent article on HCC explaining distcp with Snapshots:
From the article:
hdfs dfsadmin -allowSnapshot <path>
hdfs dfsadmin -allowSnapshot /data/a
hdfs dfs -createSnapshot /data/a s1
hadoop distcp /data/a/.snapshot/s1 /data/a_target
hdfs dfsadmin -allowSnapshot /data/a_target
hdfs dfs -createSnapshot /data/a_target s1
hdfs dfs -createSnapshot /data/a s2
hdfs snapshotDiff /data/a s1 s2
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
hdfs dfs -createSnapshot /data/a_target s2
That's it. You've completed the cycle. Rinse and repeat.
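The repeating part of that cycle (new source snapshot, diff, incremental distcp, matching target snapshot) can be sketched as a small shell function. This is only a sketch under assumptions: the function name sync_round and the HDFS_CMD / HADOOP_CMD variables are hypothetical and not part of Hadoop, and by default the commands are echoed rather than executed so the sequence can be reviewed first. It assumes the initial full copy and the s1 snapshot on both sides already exist.

```shell
#!/usr/bin/env bash
# One incremental sync round, following the sequence quoted from the article.
# HDFS_CMD and HADOOP_CMD are hypothetical overrides: by default every command
# is only echoed (dry run). Export HDFS_CMD=hdfs and HADOOP_CMD=hadoop to run.

sync_round() {
  local src="$1" dst="$2" from="$3" to="$4"
  local hdfs="${HDFS_CMD:-echo hdfs}" hadoop="${HADOOP_CMD:-echo hadoop}"

  # 1. Take the new snapshot on the source directory.
  $hdfs dfs -createSnapshot "$src" "$to" || return 1
  # 2. Report what changed between the two source snapshots.
  $hdfs snapshotDiff "$src" "$from" "$to" || return 1
  # 3. Copy only that difference to the target.
  $hadoop distcp -update -diff "$from" "$to" "$src" "$dst" || return 1
  # 4. Snapshot the target so the next round can start from "$to".
  $hdfs dfs -createSnapshot "$dst" "$to" || return 1
}
```

For example, sync_round /data/a /data/a_target s1 s2 prints the four commands for the round described above; the next round would be called with s2 and s3.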
Created 03-30-2017 09:33 PM
Let's say s1 was the earlier snapshot. You will need to create the latest snapshot (say s2) on the source cluster, like this:
/usr/hdp/current/hadoop-hdfs-client/bin/hdfs dfs -createSnapshot /tmp/source s2
And then run distcp like below:
/usr/hdp/current/hadoop-client/bin/hadoop distcp -update -diff s1 s2 /tmp/source /tmp/target
Hope this helps
Created 03-31-2017 02:50 AM
Thanks for the quick response. I tried it the same way, but I'm getting the error message below. Any help is highly appreciated.
17/03/30 21:39:38 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getSnapshotDiffReport over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.SnapshotException): Cannot find the snapshot of directory /tmp/sbandaru with name sbandaru
at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.getSnapshotByName(
at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiff(
at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(
at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.getSnapshotDiffReport(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getSnapshotDiffReport(
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getSnapshotDiffReport(
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$
at org.apache.hadoop.ipc.RPC$
at org.apache.hadoop.ipc.Server$Handler$
at org.apache.hadoop.ipc.Server$Handler$
at Method)
at
at
at org.apache.hadoop.ipc.Server$
17/03/30 21:57:46 WARN tools.DistCp: Failed to compute snapshot diff on hdfs://
org.apache.hadoop.hdfs.protocol.SnapshotException: Cannot find the snapshot of directory /tmp/sbandaru with name sbandaru
at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.getSnapshotByName(
at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiff(
at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(
at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.getSnapshotDiffReport(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getSnapshotDiffReport(
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getSnapshotDiffReport(
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$
Created 03-31-2017 03:01 AM
@SBandaru - It's not able to find the snapshot of the directory:
Cannot find the snapshot of directory /tmp/sbandaru with name sbandaru
Can you please post how you created the snapshot, where the snapshot is located, and the command you issued to run distcp?
Created 03-31-2017 03:52 AM
Below is the requested information.
[sbandaru@hadoop ~]$ hdfs dfs -ls /user/sbandaru/.snapshot
Found 3 items
drwxr-x--- - sbandaru sbandaru 0 2017-03-30 11:38 /user/sbandaru/.snapshot/afterdistcp
drwxr-x--- - sbandaru sbandaru 0 2016-11-08 19:57 /user/sbandaru/.snapshot/sbandaru
drwxr-x--- - sbandaru sbandaru 0 2016-11-08 19:57 /user/sbandaru/.snapshot/sbandaru2
[sbandaru@hadoop ~]$
[sbandaru@hadoop ~]$ hadoop --loglevel DEBUG distcp -update -diff sbandaru afterdistcp /user/sbandaru hdfs://
I created the snapshots on the /user/sbandaru directory, and I'm trying to get the difference between the old and new snapshots and move that difference to the location /tmp/sbandaru.
Created 03-31-2017 02:52 PM
"WARN tools.DistCp: Failed to compute snapshot diff on hdfs://"
The line above is part of the error message, and the hdfs:// location in it is my target. Why is it trying to find the snapshot in the target location?
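The warning matches what the Hadoop DistCp documentation lists as requirements for -diff: both snapshots must exist on the source, and the target must already hold a snapshot with the same name as the "from" snapshot, with no changes made on the target since it was taken. The snapshots are passed by name only, not as .snapshot paths. A minimal pre-check for the existence part, assuming snapshots are visible as directories under <dir>/.snapshot/<name> (the function name and the HDFS_CMD override are hypothetical, not part of Hadoop):

```shell
#!/usr/bin/env bash
# Check that the snapshots needed by "distcp -update -diff FROM TO SRC DST"
# exist before starting the copy.
# HDFS_CMD is a hypothetical override for the hdfs binary (default: hdfs).

check_diff_preconditions() {
  local src="$1" dst="$2" from="$3" to="$4"
  local hdfs="${HDFS_CMD:-hdfs}" status=0

  # Both snapshots must exist on the source directory.
  "$hdfs" dfs -test -d "$src/.snapshot/$from" ||
    { echo "source $src has no snapshot named $from" >&2; status=1; }
  "$hdfs" dfs -test -d "$src/.snapshot/$to" ||
    { echo "source $src has no snapshot named $to" >&2; status=1; }

  # The target must already have the FROM snapshot under the same name.
  "$hdfs" dfs -test -d "$dst/.snapshot/$from" ||
    { echo "target $dst has no snapshot named $from" >&2; status=1; }

  return "$status"
}
```

With the names from this thread, the call would be check_diff_preconditions /user/sbandaru <target-dir> sbandaru afterdistcp, where <target-dir> is the full target URI. This only checks that the snapshots exist; it does not verify that the target is unchanged since the "from" snapshot.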
Created 09-11-2017 04:56 PM
I have the same issue when trying to compute the diff.
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
/data/a_target is on another cluster. s1 (yesterday's snapshot) and s2 (today's snapshot) sit side by side in the first cluster's location, of course. I wonder if -diff needs only the snapshot name, and not the absolute path.
Created 09-11-2017 05:08 PM
Hmm... so it does appear you need to provide just the snapshot names for s1 and s2. Interesting.
Created 04-05-2017 09:02 PM
@SBandaru - Is your issue resolved, or do you need any further help here?