
dstreev · ‎12-15-2016

The Problem

Traditional 'distcp' from one directory to another, or from cluster to cluster, is quite useful for moving massive amounts of data once. But what happens when you need to "update" a target directory or cluster with only the changes made since the last 'distcp' ran? That becomes a very tricky scenario. 'distcp' offers an '-update' flag, which is supposed to move only the files that have changed. In this case, 'distcp' pulls a list of files and directories from both the source and the target, compares them, and then builds a migration plan.

The problem: it's an expensive and time-consuming task, and the process is not atomic. First, gathering a list of files and directories along with their metadata is expensive when the source holds millions of file and directory objects. This cost is incurred on both the source and target NameNodes, putting quite a bit of pressure on those systems.

It's then up to 'distcp' to reconcile the differences between the source and target, which is also very expensive. Only when that is complete does the process start to move data. And if data changes while the process is running, those changes can impact the transfer and lead to failures and a partial migration.

The Solution

The process needs to be atomic, and it needs to be efficient. With Hadoop 2.0, HDFS introduced "snapshots." An HDFS snapshot is a point-in-time copy of a directory's metadata. The copy is stored in a hidden location and maintains references to all of the immutable filesystem objects. Creating a snapshot is atomic, and because the underlying HDFS objects are immutable, an image of a directory's metadata doesn't require an additional copy of the underlying data.

Another feature of snapshots is the ability to efficiently calculate the changes between any two snapshots of the same directory. Using 'hdfs snapshotDiff', you can build a list of "changes" between two point-in-time references.

For Example

[hdfs@m3 ~]$ hdfs snapshotDiff /user/dstreev/stats s1 s2
Difference between snapshot s1 and snapshot s2 under directory /user/dstreev/stats:
M       .
+       ./attempt
M       ./namenode/fs_state/2016-12.txt
M       ./namenode/nn_info/2016-12.txt
M       ./namenode/top_user_ops/2016-12.txt
M       ./scheduler/queue_paths/2016-12.txt
M       ./scheduler/queue_usage/2016-12.txt
M       ./scheduler/queues/2016-12.txt
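
A report like this can be summarized before deciding whether a sync is worth running. The following is only a sketch and not part of the original workflow: `summarize_diff` is a hypothetical helper that reads 'hdfs snapshotDiff' output on stdin and counts entries by their leading marker (`+` created, `-` deleted, `M` modified, `R` renamed).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of HDFS): count snapshotDiff entries by type.
# Reads the report on stdin. The "Difference between ..." header line is
# ignored because its first field is not one of the four markers.
summarize_diff() {
  awk '
    $1 == "+" { created++ }
    $1 == "-" { deleted++ }
    $1 == "M" { modified++ }
    $1 == "R" { renamed++ }
    END {
      printf "created=%d deleted=%d modified=%d renamed=%d\n",
             created, deleted, modified, renamed
    }'
}

# Usage (assumes snapshots s1 and s2 exist on the directory):
#   hdfs snapshotDiff /user/dstreev/stats s1 s2 | summarize_diff
```

For the report shown above, this would print `created=1 deleted=0 modified=7 renamed=0`.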

Let's take the 'distcp' update concept and supercharge it with the efficiency of snapshots. Now you have a solution that scales far beyond the original 'distcp -update' and, in the process, removes the burden and load previously placed on the NameNodes.

Pre-Requisites and Requirements

Source must support 'snapshots'

hdfs dfsadmin -allowSnapshot <path>

Target is "read-only"
After the initial baseline 'distcp' sync, the target must also support snapshots.
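
Before enabling snapshots, it can be handy to check whether a directory is already snapshottable. This is a minimal sketch: `is_snapshottable` is a hypothetical helper, and it assumes that 'hdfs lsSnapshottableDir' prints one directory per line with the path as the last field. The command only lists directories the current user is permitted to snapshot, so a "no" answer may simply mean you lack permission.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given directory is listed as snapshottable.
# Assumption: `hdfs lsSnapshottableDir` prints one directory per line and the
# path is the last whitespace-separated field.
is_snapshottable() {
  local dir="$1"
  hdfs lsSnapshottableDir |
    awk -v d="$dir" '$NF == d { found = 1 } END { exit !found }'
}

# Usage:
#   is_snapshottable /data/a || hdfs dfsadmin -allowSnapshot /data/a
```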

Process

Identify the source and target 'parent' directory
- Do not create the destination directory up front; let the first distcp do that. For example: if I want to sync source `/data/a` with `/data/a_target`, do *NOT* pre-create the 'a_target' directory.
Allow snapshots on the source directory

hdfs dfsadmin -allowSnapshot /data/a

Create a Snapshot of /data/a

hdfs dfs -createSnapshot /data/a s1

Distcp the baseline copy (from the atomic snapshot). Note: /data/a_target does NOT exist prior to the following command.

hadoop distcp /data/a/.snapshot/s1 /data/a_target

Allow snapshots on the newly created target directory

hdfs dfsadmin -allowSnapshot /data/a_target

At this point /data/a_target should be considered "read-only". Do NOT make any changes to the content here.
Create a snapshot in /data/a_target whose name matches the snapshot used to build the baseline

hdfs dfs -createSnapshot /data/a_target s1

Add some content to the source directory /data/a. Make the changes (adds, deletes, etc.) that need to be replicated to /data/a_target.
Take a new snapshot of /data/a

hdfs dfs -createSnapshot /data/a s2

Just for fun, check what's changed between the two snapshots

hdfs snapshotDiff /data/a s1 s2

Ok, now let's migrate the changes to /data/a_target

hadoop distcp -diff s1 s2 -update /data/a /data/a_target

When that's completed, finish the cycle by creating a matching snapshot on /data/a_target

hdfs dfs -createSnapshot /data/a_target s2

That's it. You've completed the cycle. Rinse and repeat.
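
Since the cycle repeats, the three incremental steps lend themselves to a small wrapper. The following is a sketch rather than a hardened tool: `sync_cycle` and its argument names are hypothetical, and it assumes the baseline steps are already done so that the previous snapshot exists under the same name on both source and target.

```shell
#!/usr/bin/env bash
# Sketch of one incremental cycle. Function and argument names are hypothetical.
# Assumes snapshot $prev already exists on BOTH the source and the target.
sync_cycle() {
  local src="$1" dst="$2" prev="$3" next="$4"

  # 1. Freeze the current state of the source.
  hdfs dfs -createSnapshot "$src" "$next" || return 1

  # 2. Ship only the changes between the two source snapshots.
  hadoop distcp -diff "$prev" "$next" -update "$src" "$dst" || return 1

  # 3. Record the same point in time on the target, ready for the next cycle.
  hdfs dfs -createSnapshot "$dst" "$next" || return 1
}

# Usage, matching the walkthrough:
#   sync_cycle /data/a /data/a_target s1 s2
```

The function returns early when 'distcp' fails, so the target never gets a snapshot named after a copy that did not complete.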

A Few Hints

Remember, snapshots need to be managed manually. They will stay around forever unless you clean them up with:

hdfs dfs -deleteSnapshot <path> <snapshotName>

As long as a snapshot exists, the data exists. Deleting data from a directory that has a snapshot, even with -skipTrash, doesn't free up space. Space can be reclaimed only when all "references" to that data are gone.
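
One way to keep snapshots from piling up is to retain only the newest few. The sketch below is an illustration, not an established tool: `snapshots_to_prune` is a hypothetical helper, and it assumes snapshot names sort chronologically (date-stamped names, for instance). Whatever you prune, keep the most recent snapshot that source and target have in common, because the next 'distcp -diff' depends on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read snapshot names on stdin (one per line) and print
# those that fall outside the newest $1.
# Assumption: names sort chronologically, e.g. s20161201, s20161202, ...
snapshots_to_prune() {
  local keep="$1"
  sort | awk -v keep="$keep" '
    { names[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print names[i] }'
}

# Usage (echoes the delete commands instead of running them; review first):
#   hdfs dfs -ls /data/a/.snapshot \
#     | awk 'NF >= 8 { n = split($NF, p, "/"); print p[n] }' \
#     | snapshots_to_prune 3 \
#     | while read -r name; do echo hdfs dfs -deleteSnapshot /data/a "$name"; done
```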

Initial migrations of data between systems are very expensive in terms of network I/O, and you probably don't want to have to do that again, ever. I recommend keeping a snapshot of the original copy on each system, OR of some major checkpoint you can go back to, in the event the process is compromised.

If 'distcp' can't validate that the snapshots (by name) on the source and the target are the same, and that the data at the target hasn't changed since the snapshot, the process will fail. If the failure is because the target directory has been updated, you'll need to use the above baseline snapshots to restore it without having to migrate all that data again. Then start the process up again.
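
To catch a modified target before 'distcp' does, you can diff the target's last snapshot against its current state. This is a sketch under assumptions: `target_unchanged` is a hypothetical helper, and it relies on 'hdfs snapshotDiff' accepting `.` as shorthand for the directory's current state, which is worth confirming on your Hadoop version.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only if the target directory has not
# changed since snapshot $2. Relies on "." meaning "current state" in
# `hdfs snapshotDiff` (confirm this on your Hadoop version).
# Returns 0 if unchanged, 1 if changed, 2 if the diff itself failed.
target_unchanged() {
  local dst="$1" snap="$2" report
  report=$(hdfs snapshotDiff "$dst" "$snap" .) || return 2
  # Any line starting with a change marker (+, -, M, R) means the target drifted.
  ! printf '%s\n' "$report" | grep -Eq '^[-+MR][[:space:]]'
}

# Usage:
#   target_unchanged /data/a_target s1 || echo "target modified since s1"
```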

mtdeguzis · ‎09-09-2017

Just so everyone is aware:

The snapshots must have the same name on both sides for the diff distcp to work:

Cannot find the snapshot of directory /group/bti/snapshot with name /group/bti/.snapshot/s20170908-080603.486

#LOF:
/group/bti/snapshot/.snapshot/s20170908-212827.054

Due to default naming conventions, the folders will not be the same. The default folder names created are seemingly time-stamped down to the millisecond. Name each created folder with today's date, such as "s20170908", so when the diff distcp runs, it can find and update the same-day folder on the LOF side.

Sreedhar_ch · ‎07-16-2018

There is a challenge I am facing: when I run snapshotDiff from a remote cluster, it fails with a "snapshot not found" error even though the snapshot is available. Is there any solution for this? We built a DR cluster and are running distcp from DR to utilize the DR resources instead of overloading PROD. Any suggestions on how this can be achieved?

Cloudera Community