distcp update difference between two snapshot syntax
- Labels: Apache Hadoop
Created ‎03-30-2017 09:15 PM
Hi,
Can anyone provide the syntax and a sample example for checking the difference between two snapshots and moving that difference to a target cluster using distcp?
AIM:
I have two clusters, clusterA and clusterB. I recently built clusterB and am moving all the data from clusterA to clusterB. Before moving the data I took a snapshot on clusterA. While the data was being transferred, clusterA was still active, so its data changed. Now I want to move only the changed data from clusterA to clusterB. Can someone provide the syntax, with a simple example, for getting the difference and moving the changed data?
Thanks in advance.
Created ‎03-31-2017 11:47 PM
@SBandaru - Below is an excellent article on HCC explaining distcp with Snapshots:
https://community.hortonworks.com/articles/71775/managing-hadoop-dr-with-distcp-and-snapshots.html
From the article:
- Source must support 'snapshots'
hdfs dfsadmin -allowSnapshot <path>
- Target is "read-only"
- Target, after initial baseline 'distcp' sync needs to support snapshots.
Process
- Identify the source and target 'parent' directory
- Do not initially create the destination directory, allow the first distcp to do that. For example: If I want to sync source `/data/a` with `/data/a_target`, do *NOT* pre-create the 'a_target' directory.
- Allow snapshots on the source directory
hdfs dfsadmin -allowSnapshot /data/a
- Create a Snapshot of /data/a
hdfs dfs -createSnapshot /data/a s1
- Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to the following command.
hadoop distcp /data/a/.snapshot/s1 /data/a_target
- Allow snapshots on the newly created target directory
hdfs dfsadmin -allowSnapshot /data/a_target
- At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
- Create a snapshot in /data/a_target with the same name as the snapshot used to build the baseline
hdfs dfs -createSnapshot /data/a_target s1
- Add some content to the source directory /data/a. Make the changes (adds, deletes, etc.) that need to be replicated to /data/a_target.
- Take a new snapshot of /data/a
hdfs dfs -createSnapshot /data/a s2
- Just for fun, check what's changed between the two snapshots
hdfs snapshotDiff /data/a s1 s2
- Ok, now let's migrate the changes to /data/a_target
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
- When that's completed, finish the cycle by creating a matching snapshot on /data/a_target
hdfs dfs -createSnapshot /data/a_target s2
That's it. You've completed the cycle. Rinse and repeat.
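The cycle above can be sketched as a small script. This is only an illustration under stated assumptions: the `run`, `baseline_sync` and `incremental_sync` helpers and the `DRY_RUN` variable are hypothetical names, not part of Hadoop, and by default the script only prints the commands so it can be reviewed before anything touches a cluster.

```shell
# Sketch of the baseline + incremental cycle from the steps above.
# Assumption: DRY_RUN is a hypothetical switch; it defaults to 1, so
# commands are printed but not executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command, and execute it only when DRY_RUN=0.
run() {
  printf '%s\n' "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# One-time baseline: snapshot the source, copy from that snapshot,
# then create the matching snapshot on the target.
# The target directory must not exist before the first distcp.
baseline_sync() (
  src=$1 dst=$2 snap=$3
  run hdfs dfsadmin -allowSnapshot "$src"
  run hdfs dfs -createSnapshot "$src" "$snap"
  run hadoop distcp "$src/.snapshot/$snap" "$dst"
  run hdfs dfsadmin -allowSnapshot "$dst"
  run hdfs dfs -createSnapshot "$dst" "$snap"
)

# Repeating step: take a new source snapshot, show the diff, copy only
# the diff, then create the matching snapshot on the target.
incremental_sync() (
  src=$1 dst=$2 from=$3 to=$4
  run hdfs dfs -createSnapshot "$src" "$to"
  run hdfs snapshotDiff "$src" "$from" "$to"
  run hadoop distcp -update -diff "$from" "$to" "$src" "$dst"
  run hdfs dfs -createSnapshot "$dst" "$to"
)
```

With these definitions, `baseline_sync /data/a /data/a_target s1` prints the five baseline commands, and `incremental_sync /data/a /data/a_target s1 s2` prints the four commands of one incremental round.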
Created ‎03-30-2017 09:33 PM
Let's say s1 was the earlier snapshot. You will need to create the latest snapshot (say s2) on the source cluster, like:
/usr/hdp/current/hadoop-hdfs-client/bin/hdfs dfs -createSnapshot /tmp/source s2
And then run distcp like below:
/usr/hdp/current/hadoop-client/bin/hadoop distcp -update -diff s1 s2 /tmp/source /tmp/target
Hope this helps
Created ‎03-31-2017 02:50 AM
Thanks for the quick response. I tried it that way, but I'm getting the error message below. Any help is highly appreciated.
17/03/30 21:39:38 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getSnapshotDiffReport over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.SnapshotException): Cannot find the snapshot of directory /tmp/sbandaru with name sbandaru
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.getSnapshotByName(DirectorySnapshottableFeature.java:285)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiff(DirectorySnapshottableFeature.java:257)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(SnapshotManager.java:372)
    at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.getSnapshotDiffReport(FSDirSnapshotOp.java:155)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(FSNamesystem.java:7674)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getSnapshotDiffReport(NameNodeRpcServer.java:1792)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getSnapshotDiffReport(ClientNamenodeProtocolServerSideTranslatorPB.java:1149)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2273)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2269)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2267)
17/03/30 21:57:46 WARN tools.DistCp: Failed to compute snapshot diff on hdfs://hadoop.hortonworks.com:8020/tmp/sbandaru
org.apache.hadoop.hdfs.protocol.SnapshotException: Cannot find the snapshot of directory /tmp/sbandaru with name sbandaru
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.getSnapshotByName(DirectorySnapshottableFeature.java:285)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.DirectorySnapshottableFeature.computeDiff(DirectorySnapshottableFeature.java:257)
    at org.apache.hadoop.hdfs.server.namenode.snapshot.SnapshotManager.diff(SnapshotManager.java:372)
    at org.apache.hadoop.hdfs.server.namenode.FSDirSnapshotOp.getSnapshotDiffReport(FSDirSnapshotOp.java:155)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getSnapshotDiffReport(FSNamesystem.java:7674)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getSnapshotDiffReport(NameNodeRpcServer.java:1792)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getSnapshotDiffReport(ClientNamenodeProtocolServerSideTranslatorPB.java:1149)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
Created ‎03-31-2017 03:01 AM
@SBandaru - It's not able to find the snapshot of the directory:
Cannot find the snapshot of directory /tmp/sbandaru with name sbandaru
Can you please post how you created the snapshot, where the snapshot is located, and the command you issued to run distcp?
Created ‎03-31-2017 03:52 AM
Below is the requested information.
[sbandaru@hadoop ~]$ hdfs dfs -ls /user/sbandaru/.snapshot
Found 3 items
drwxr-x--- - sbandaru sbandaru 0 2017-03-30 11:38 /user/sbandaru/.snapshot/afterdistcp
drwxr-x--- - sbandaru sbandaru 0 2016-11-08 19:57 /user/sbandaru/.snapshot/sbandaru
drwxr-x--- - sbandaru sbandaru 0 2016-11-08 19:57 /user/sbandaru/.snapshot/sbandaru2
[sbandaru@hadoop ~]$
[sbandaru@hadoop ~]$ hadoop --loglevel DEBUG distcp -update -diff sbandaru afterdistcp /user/sbandaru hdfs://hadoop.hortonworks.com:8020/tmp/sbandaru
I created the snapshots on the /user/sbandaru directory, and I'm trying to get the difference between the old and new snapshots and move that difference to the location /tmp/sbandaru.
Created ‎03-31-2017 02:52 PM
"WARN tools.DistCp: Failed to compute snapshot diff on hdfs://hadoop.hortonworks.com:8020/tmp/sbandaru"
The above is part of the error message, and the path in it is my target location. Why is it trying to find the snapshot in the target location?
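A likely explanation, going by the DistCp documentation rather than anything confirmed in this thread: `-update -diff <from> <to>` expects the target directory to already hold a snapshot named `<from>`, and the target to be unchanged since that snapshot was taken. In the command above that would be a snapshot named `sbandaru` under /tmp/sbandaru. The check below is a minimal sketch of verifying this up front. `snapshot_exists`, `check_diff_preconditions` and the `HDFS_CMD` override are hypothetical names introduced here for illustration; only `hdfs dfs -test -d` is a real client command.

```shell
# Pre-flight check sketch for `hadoop distcp -update -diff`.
# Assumption: HDFS_CMD is a hypothetical override so the check can be
# exercised without a cluster; it defaults to the real `hdfs` client.
HDFS_CMD="${HDFS_CMD:-hdfs}"

# Succeeds if <dir>/.snapshot/<name> exists as a directory.
snapshot_exists() {
  "$HDFS_CMD" dfs -test -d "$1/.snapshot/$2"
}

# Checks the snapshots that -diff relies on: <from> and <to> on the
# source, and <from> on the target. Prints what is missing and exits
# non-zero if anything is.
check_diff_preconditions() (
  src=$1 dst=$2 from=$3 to=$4
  status=0
  snapshot_exists "$src" "$from" || { echo "missing on source: $from"; status=1; }
  snapshot_exists "$src" "$to" || { echo "missing on source: $to"; status=1; }
  snapshot_exists "$dst" "$from" || { echo "missing on target: $from"; status=1; }
  exit "$status"
)
```

For the case in this thread the call would be `check_diff_preconditions /user/sbandaru hdfs://hadoop.hortonworks.com:8020/tmp/sbandaru sbandaru afterdistcp`. The check does not verify that the target is unmodified since the snapshot; it only confirms the snapshots are present.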
Created ‎09-11-2017 04:56 PM
I have the same issue when trying to compute the diff.
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
/data/a_target is on another cluster. s1 (yesterday's snapshot) and s2 (today's snapshot) sit side by side in the first cluster's location, of course. I wonder if -diff needs only the snapshot name, and not the absolute path.
Created ‎09-11-2017 05:08 PM
Hmm... so it does appear you need to provide just the snapshot name for s1 and s2. Interesting.
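If a script already carries full snapshot paths, a small helper can reduce them to the bare names that `-diff` takes. This is a sketch only; `snapshot_name` is a hypothetical helper, and it assumes the path has no trailing slash.

```shell
# Reduce a snapshot reference to the bare snapshot name.
# Accepts either "s1" or a path such as "/data/a/.snapshot/s1".
snapshot_name() {
  case "$1" in
    */.snapshot/*) printf '%s\n' "${1##*/.snapshot/}" ;;
    *) printf '%s\n' "$1" ;;
  esac
}
```

For example, `hadoop distcp -update -diff "$(snapshot_name /data/a/.snapshot/s1)" "$(snapshot_name /data/a/.snapshot/s2)" /data/a /data/a_target` passes `s1` and `s2` to `-diff`.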
Created ‎04-05-2017 09:02 PM
@SBandaru - Is your issue resolved, or do you need any further help here?
