Created on 12-15-2016 01:40 PM
Traditional 'distcp' from one directory to another or from cluster to cluster is quite useful in moving massive amounts of data, once. But what happens when you need to "update" a target directory or cluster with only the changes made since the last 'distcp' had run. That becomes a very tricky scenario. 'distcp' offers an '-update' flag, which is suppose to move only the files that have changed. In this case 'distcp' will pull a list of files and directories from the source and targets, compare them and then build a migration plan.
The problem: It's an expensive and time-consuming task. Furthermore, the process is not atomic. First, the cost of gathering a list of files and directories, along with their metadata is expensive when you're considering sources with millions of file and directory objects. And this cost is incurred on both the source and target namenode's, resulting in quite a bit of pressure on those systems.
It's up to 'distcp' to reconcile the difference between the source and target, which is very expensive. When it's finally complete, only then does the process start to move data. And if data changes while the process is running, those changes can impact the transfer and lead to failure and partial migration.
The process needs to be atomic, and it needs to be efficient. With Hadoop 2.0, HDFS introduce "snapshots." HDFS "snapshots" are a point-in-time copy of the directories metadata. The copy is stored in a hidden location and maintains references to all of the immutable filesystem objects. Creating a snapshot is atomic, and the characteristics of HDFS (being immutable) means that an image of a directories metadata doesn't require an addition copy of the underlying data.
Another feature of snapshots is the ability to efficiently calculate changes between 'any' two snapshots on the same directory. Using 'hdfs snapshotDiff ', you can build a list of "changes" between these two point-in-time references.
[hdfs@m3 ~]$ hdfs snapshotDiff /user/dstreev/stats s1 s2 Difference between snapshot s1 and snapshot s2 under directory /user/dstreev/stats: M . + ./attempt M ./namenode/fs_state/2016-12.txt M ./namenode/nn_info/2016-12.txt M ./namenode/top_user_ops/2016-12.txt M ./scheduler/queue_paths/2016-12.txt M ./scheduler/queue_usage/2016-12.txt M ./scheduler/queues/2016-12.txt
Let's take the 'distcp' update concept and supercharge it with the efficiency of snapshots. Now you have a solution that will scale far beyond the original 'distcp -update.' and in the process remove the burden and load from the namenode's previously encountered.
hdfs dfsadmin -allowSnapshot <path>
hdfs dfsadmin -allowSnapshot /data/a
hdfs dfs -createSnapshot /data/a s1
hadoop distcp /data/a/.snapshot/s1 /data/a_target
hdfs dfsadmin -allowSnapshot /data/a_target
hdfs dfs -createSnapshot /data/a_target s1
hdfs dfs -createSnapshot /data/a s2
hdfs snapshotDiff /data/a s1 s2
hadoop distcp -diff s1 s2 -update /data/a /data/a_target
hdfs dfs -createSnapshot /data/a_target s2
That's it. You've completed the cycle. Rinse and repeat.
Remember, snapshots need to be managed manually. They will stay around forever unless you clean them up with:
hdfs dfs -deleteSnapshot
As long as a snapshot exists, the data exists. Deleting, even with skipTrash, data from a directory that has a snapshot, doesn't free up space. Only when all "references" to that data are gone, can space be reclaimed.
Initial migrations of data between systems are very expensive in regards to network I/O. And you probably don't want to have to do that again, ever. I recommend keeping a snapshot of the original copy on each system OR some major checkpoint you can go back to, in the event the process is compromised.
If 'distcp' can't validate that the snapshot (by name) between the source and the target are the same and that the data at the target hasn't changed since the snapshot, the process will fail. If the failure is because the directory has been updated, you'll need to use the above baseline snapshots to restore it without having to migrate all that data again. And then start the process up again.
Created on 09-09-2017 02:41 PM
Just so everyone is aware:
The snapshot created dirs must be named the same on both sides to do the diff distcp:
Cannot find the snapshot of directory /group/bti/snapshot with name /group/bti/.snapshot/s20170908-080603.486 #LOF: /group/bti/snapshot/.snapshot/s20170908-212827.054
Due to default naming conventions, the folders will not be the same. The default folder names created are seemingly time-stamped to the second. Name each created folder with todays day, such as "s20170908" so when the diff distcp runs, it can find and update the same-day folder on the LOF side.
Created on 07-16-2018 11:34 AM
There is challenge I am facing .. when I am running the snapshotdiff from a remote cluster it is failing with snapshot not found error even though it is available .. do we have any solution for this .. we built a DR cluster and running distcp from DR to utilize the DR resources instead of overloading the PROD .. any solution how this can be achived..