Support Questions

bandarusridhar1 · ‎03-30-2017

Hi,

Can anyone provide the syntax and a simple example for checking the difference between two snapshots and moving only that changed data to a target cluster using distcp?

AIM:

I have two clusters, clusterA and clusterB. I recently built clusterB and am moving all the data from clusterA to clusterB. Before moving the data I took a snapshot on clusterA. Because clusterA was still active during the transfer, its data changed in the meantime. Now I want to move only the changed data from clusterA to clusterB. Can someone provide the syntax with a simple example showing how to get the difference and move the changed data?

Thanks in advance.

namaheshwari · ‎03-31-2017

@SBandaru - Below is an excellent article on HCC explaining distcp with Snapshots:

https://community.hortonworks.com/articles/71775/managing-hadoop-dr-with-distcp-and-snapshots.html

From the article:

Source must support 'snapshots'

hdfs dfsadmin -allowSnapshot <path>

Target is "read-only"
Target, after the initial baseline 'distcp' sync, needs to support snapshots.

Process

Identify the source and target 'parent' directory
- Do not create the destination directory up front; let the first distcp create it. For example: if I want to sync source `/data/a` with `/data/a_target`, do *NOT* pre-create the 'a_target' directory.
Allow snapshots on the source directory

hdfs dfsadmin -allowSnapshot /data/a

Create a Snapshot of /data/a

hdfs dfs -createSnapshot /data/a s1

Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to the following command.

hadoop distcp /data/a/.snapshot/s1 /data/a_target
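For the two-cluster case in the question, the same baseline copy would typically be written with fully qualified HDFS URIs on both sides. The sketch below only prints the command it would run. The NameNode addresses `clusterA-nn:8020` and `clusterB-nn:8020` and the helper name `baseline_cmd` are assumptions for illustration and are not taken from the article.

```shell
# Hypothetical NameNode addresses; replace them with the real ones.
SRC_NN="${SRC_NN:-hdfs://clusterA-nn:8020}"
DST_NN="${DST_NN:-hdfs://clusterB-nn:8020}"

# Print the baseline distcp command for a given snapshot.
# $1 = source directory, $2 = snapshot name, $3 = target directory
baseline_cmd() {
  echo "hadoop distcp ${SRC_NN}$1/.snapshot/$2 ${DST_NN}$3"
}

baseline_cmd /data/a s1 /data/a_target
```

Copying from the `.snapshot/s1` path rather than the live directory is what keeps the baseline consistent while the source cluster stays active.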

Allow snapshots on the newly created target directory

hdfs dfsadmin -allowSnapshot /data/a_target

At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
Create a snapshot in /data/a_target with the same name as the snapshot used to build the baseline

hdfs dfs -createSnapshot /data/a_target s1

Add some content to the source directory /data/a. Make the changes (adds, deletes, etc.) that need to be replicated to /data/a_target.
Take a new snapshot of /data/a

hdfs dfs -createSnapshot /data/a s2

Just for fun, check what has changed between the two snapshots

hdfs snapshotDiff /data/a s1 s2
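The report from snapshotDiff lists one entry per changed path, prefixed with `+` (created), `-` (deleted), `M` (modified) or `R` (renamed). As a rough sketch, the entries can be counted by type before deciding whether to run the sync. The sample report below is made up for illustration and was not captured from a real cluster; `summarize_diff` and `sample_diff` are hypothetical helper names.

```shell
# Count entries by change type in snapshotDiff-style output read from stdin.
summarize_diff() {
  awk '
    /^[+MR-][ \t]/ { count[$1]++ }
    END {
      printf "created=%d deleted=%d modified=%d renamed=%d\n",
             count["+"], count["-"], count["M"], count["R"]
    }'
}

# Illustrative sample report (assumed content, not real cluster output).
sample_diff() {
  printf 'Difference between snapshot s1 and snapshot s2 under directory /data/a:\n'
  printf 'M\t.\n'
  printf '+\t./new_file.txt\n'
  printf -- '-\t./old_file.txt\n'
  printf 'R\t./report.csv -> ./report_2017.csv\n'
}

sample_diff | summarize_diff
```

Against a real cluster the pipeline would be `hdfs snapshotDiff /data/a s1 s2 | summarize_diff`.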

Ok, now let's migrate the changes to /data/a_target

hadoop distcp -diff s1 s2 -update /data/a /data/a_target

When that's completed, finish the cycle by creating a matching snapshot on /data/a_target

hdfs dfs -createSnapshot /data/a_target s2

That's it. You've completed the cycle. Rinse and repeat.
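The repeatable part of the cycle (new source snapshot, diff, incremental distcp, matching target snapshot) could be wrapped in a small script along these lines. This is a sketch, not part of the article: the `run` and `sync_cycle` helpers and the `DRY_RUN` switch are assumptions added for illustration. With `DRY_RUN=1` (the default here) the commands are only printed, so the sequence can be reviewed before anything touches HDFS.

```shell
# With DRY_RUN=1 commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# One incremental cycle.
# $1 = source dir, $2 = target dir, $3 = previous snapshot, $4 = new snapshot
sync_cycle() {
  src="$1"; dst="$2"; prev="$3"; next="$4"
  # Freeze the current state of the source.
  run hdfs dfs -createSnapshot "$src" "$next" || return 1
  # Show what changed since the previous snapshot.
  run hdfs snapshotDiff "$src" "$prev" "$next" || return 1
  # Copy only the changes; the target must still match snapshot $prev.
  run hadoop distcp -update -diff "$prev" "$next" "$src" "$dst" || return 1
  # Record the new state on the target for the next cycle.
  run hdfs dfs -createSnapshot "$dst" "$next"
}

sync_cycle /data/a /data/a_target s1 s2
```

The next round would be `sync_cycle /data/a /data/a_target s2 s3`, and so on. The target snapshot is created last so that it exists only when the copy succeeded.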

Sreedhar_ch · ‎07-16-2018

Guys, I am facing a challenge: when I run snapshotDiff from a remote cluster, it fails with a "snapshot not found" error even though the snapshot is available. Is there a solution for this? We built a DR cluster and are running distcp from DR to use the DR resources instead of overloading PROD. Any suggestions on how this can be achieved?