Support Questions


distcp -update: syntax for copying the difference between two snapshots

avatar

Hi,

Can anyone provide the syntax and a sample example for checking the difference between two snapshots and moving only that changed data to the target cluster using distcp?

AIM:

I have two clusters, ClusterA and ClusterB. I recently built ClusterB and am moving all the data from ClusterA to ClusterB. Before moving the data I took a snapshot on ClusterA. Since ClusterA is still active, the data changed during the transfer. Now I want to move only the changed data from ClusterA to ClusterB. Can someone provide the syntax with a simple example of how to get the difference and move the changed data?

Thanks in advance.

1 ACCEPTED SOLUTION

avatar

@SBandaru - Below is an excellent article on HCC explaining distcp with Snapshots:

https://community.hortonworks.com/articles/71775/managing-hadoop-dr-with-distcp-and-snapshots.html

From the article:

  • Source must support 'snapshots'
hdfs dfsadmin -allowSnapshot <path>
  • Target is "read-only"
  • After the initial baseline 'distcp' sync, the target needs to support snapshots.

Process

  • Identify the source and target 'parent' directory
    • Do not create the destination directory up front; let the first distcp create it. For example: if I want to sync source `/data/a` with `/data/a_target`, do *NOT* pre-create the 'a_target' directory.
  • Allow snapshots on the source directory
hdfs dfsadmin -allowSnapshot /data/a
  • Create a Snapshot of /data/a
hdfs dfs -createSnapshot /data/a s1
  • Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to the following command.
hadoop distcp /data/a/.snapshot/s1 /data/a_target
  • Allow snapshots on the newly created target directory
hdfs dfsadmin -allowSnapshot /data/a_target
  • At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
  • Create a snapshot in /data/a_target with the same name as the snapshot used to build the baseline
hdfs dfs -createSnapshot /data/a_target s1
  • Add some content to the source directory /data/a. Make changes, adds, deletes, etc. that need to be replicated to /data/a_target.
  • Take a new snapshot of /data/a
hdfs dfs -createSnapshot /data/a s2
  • Just for fun, check on what's changed between the two snapshots
hdfs snapshotDiff /data/a s1 s2
  • Ok, now let's migrate the changes to /data/a_target
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
  • When that's completed, finish the cycle by creating a matching snapshot on /data/a_target
hdfs dfs -createSnapshot /data/a_target s2

That's it. You've completed the cycle. Rinse and repeat.
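
To keep the target in sync from here, the same three steps repeat with the next pair of snapshot names. A minimal sketch of one more iteration (the s3 name is just an illustrative choice, not from the article):

# take the next snapshot on the source
hdfs dfs -createSnapshot /data/a s3
# copy only what changed between s2 and s3 to the target
hadoop distcp -diff s2 s3 -update /data/a /data/a_target
# close the cycle with a matching snapshot on the target
hdfs dfs -createSnapshot /data/a_target s3
# optionally drop snapshots that are no longer needed for future diffs
hdfs dfs -deleteSnapshot /data/a s1
hdfs dfs -deleteSnapshot /data/a_target s1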


10 REPLIES

avatar

@SBandaru -

Let's say s1 was the earlier snapshot. You will need to create the latest snapshot (say s2) on the source cluster like this:

/usr/hdp/current/hadoop-hdfs-client/bin/hdfs dfs -createSnapshot /tmp/source s2

And then run distcp as below:

/usr/hdp/current/hadoop-client/bin/hadoop distcp -update -diff s1 s2  /tmp/source /tmp/target

Hope this helps
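
For the cross-cluster case described in the question, the same two commands should work with a fully qualified target URI, provided the target directory was seeded from the s1 snapshot, is snapshottable, and already holds a snapshot named s1. A rough sketch (clusterB-nn:8020 is a placeholder NameNode address):

# on ClusterA: take the new snapshot
/usr/hdp/current/hadoop-hdfs-client/bin/hdfs dfs -createSnapshot /tmp/source s2
# copy only the s1 -> s2 changes to ClusterB
/usr/hdp/current/hadoop-client/bin/hadoop distcp -update -diff s1 s2 /tmp/source hdfs://clusterB-nn:8020/tmp/target
# on ClusterB: record the new state so the next run can diff from s2
hdfs dfs -createSnapshot /tmp/target s2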

avatar

@Namit Maheshwari

Thanks for the quick response. I have tried the same way but I'm getting the below error message. Any help is highly appreciated.

17/03/30 21:39:38 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getSnapshotDiffReport over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.SnapshotException): Cannot find the snapshot of directory /tmp/sbandaru with name sbandaru
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.getSnapshotByName(DirectorySnapshottableFeature.java:285)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiff(DirectorySnapshottableFeature.java:257)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(SnapshotManager.java:372)
        at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.getSnapshotDiffReport(FSDirSnapshotOp.java:155)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(FSNamesystem.java:7674)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getSnapshotDiffReport(NameNodeRpcServer.java:1792)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getSnapshotDiffReport(ClientNamenodeProtocolServerSideTranslatorPB.java:1149)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2273)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2267)

17/03/30 21:57:46 WARN tools.DistCp: Failed to compute snapshot diff on hdfs://hadoop.hortonworks.com:8020/tmp/sbandaru
org.apache.hadoop.hdfs.protocol.SnapshotException: Cannot find the snapshot of directory /tmp/sbandaru with name sbandaru
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.getSnapshotByName(DirectorySnapshottableFeature.java:285)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiff(DirectorySnapshottableFeature.java:257)
        at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(SnapshotManager.java:372)
        at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.getSnapshotDiffReport(FSDirSnapshotOp.java:155)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(FSNamesystem.java:7674)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getSnapshotDiffReport(NameNodeRpcServer.java:1792)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getSnapshotDiffReport(ClientNamenodeProtocolServerSideTranslatorPB.java:1149)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)

avatar

@SBandaru - It's not able to find the snapshot of the directory:

Cannot find the snapshot of directory /tmp/sbandaru with name sbandaru

Can you please post how you created the snapshot, what the location of the snapshot was, and the command you issued for running distcp?

avatar

@Namit Maheshwari

Below is the requested information.

[sbandaru@hadoop ~]$ hdfs dfs -ls /user/sbandaru/.snapshot
Found 3 items
drwxr-x---   - sbandaru sbandaru          0 2017-03-30 11:38 /user/sbandaru/.snapshot/afterdistcp
drwxr-x---   - sbandaru sbandaru          0 2016-11-08 19:57 /user/sbandaru/.snapshot/sbandaru
drwxr-x---   - sbandaru sbandaru          0 2016-11-08 19:57 /user/sbandaru/.snapshot/sbandaru2
[sbandaru@hadoop ~]$


[sbandaru@hadoop ~]$ hadoop --loglevel DEBUG distcp -update -diff sbandaru afterdistcp /user/sbandaru hdfs://hadoop.hortonworks.com:8020/tmp/sbandaru

I have created the snapshots on the /user/sbandaru directory, and then I'm trying to get the difference between the old and new snapshots and move that difference to the location /tmp/sbandaru.

avatar

@Namit Maheshwari

"WARN tools.DistCp: Failed to compute snapshot diff on hdfs://hadoop.hortonworks.com:8020/tmp/sbandaru"

The above is part of the error message and refers to my target location. Why is it trying to find the snapshot in the target location?
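
With -diff, distcp does consult the target: before applying the source-side delta it checks that the target directory holds the 'from' snapshot (here, 'sbandaru') and has not been modified since, and that check is what fails in the warning above. The fix is the target-side preparation from the accepted solution; a sketch for this layout, assuming /tmp/sbandaru was originally copied from /user/sbandaru/.snapshot/sbandaru and has not changed since:

# on the cluster that hosts the target /tmp/sbandaru
hdfs dfsadmin -allowSnapshot /tmp/sbandaru
hdfs dfs -createSnapshot /tmp/sbandaru sbandaru
# then rerun the diff-based copy from the source cluster
hadoop distcp -update -diff sbandaru afterdistcp /user/sbandaru hdfs://hadoop.hortonworks.com:8020/tmp/sbandaru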

avatar
Contributor

I have the same issue when trying to compute the diff.

hadoop distcp -diff s1 s2 -update /data/a /data/a_target

/data/a_target is on another cluster. s1 (yesterday's snapshot) and s2 (today's snapshot) on the first cluster are side by side, of course. I wonder if -diff needs the snapshot name only, and not the absolute path.
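
That is how it reads in practice: -diff takes bare snapshot names, which distcp resolves under each directory's .snapshot folder, and passing the .snapshot path itself appears to be treated as a (non-existent) snapshot name. A small sketch of the working form (clusterB-nn:8020 is a placeholder NameNode address):

# s1 and s2 are snapshot names, not paths; the target must already hold s1
hadoop distcp -diff s1 s2 -update /data/a hdfs://clusterB-nn:8020/data/a_target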

avatar
Contributor

Hmm... so it *does* appear you need to provide *just* the snapshot name for s1 and s2. Interesting.

avatar

@SBandaru - Is your issue resolved? Or do you need any further help here?