distcp update vs distcp update with snapshots
Labels: Apache Hadoop
Created ‎12-01-2016 05:09 PM
Hi,
I am trying to understand the benefits of using distcp -update with HDFS snapshot diffs compared with plain distcp -update.
As I understand it, -update without any snapshot options replicates only the data modified at the source and doesn't touch files that already exist at the destination. What additional benefits would one realize by using HDFS snapshots in distcp-based replication?
Thanks
Vijay
Created ‎12-02-2016 09:09 PM
There are two main benefits:
1. It avoids unnecessary copying of renamed files and directories. If a large directory is renamed on the source side, "distcp -update" cannot detect the rename, so it copies the whole renamed directory as if it were new.
2. It generates the copy list more efficiently. "distcp -update" has to scan the whole directory tree and detect identical files during the copy process, so copy list generation can take a long time for a big directory. The snapshot-diff-based approach can greatly reduce this work in an incremental sync scenario.
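For illustration, one round of an incremental sync based on snapshot diffs might look like the sketch below. It is a dry run that only prints the commands. The NameNode URIs, the path, and the snapshot names s1 and s2 are hypothetical. It also assumes that snapshots are already allowed on both directories (hdfs dfsadmin -allowSnapshot), that an initial full copy has been done, and that both sides already hold a snapshot named s1 with identical content.

```shell
#!/usr/bin/env bash
# Dry-run sketch of one incremental sync round using snapshot diffs.
# All URIs, paths, and snapshot names are hypothetical placeholders.
SRC="hdfs://nn1:8020/data/warehouse"
DST="hdfs://nn2:8020/data/warehouse"
OLD_SNAP="s1"   # baseline snapshot, assumed to exist on BOTH source and target
NEW_SNAP="s2"   # snapshot to be created for this round

# By default the commands are only echoed. Set RUN="" to execute them.
RUN="${RUN-echo}"

incremental_sync() {
  # 1. Freeze the current state of the source in a new snapshot.
  $RUN hdfs dfs -createSnapshot "$SRC" "$NEW_SNAP"

  # 2. Apply the changes between the two source snapshots to the target.
  #    -diff is only valid together with -update.
  $RUN hadoop distcp -update -diff "$OLD_SNAP" "$NEW_SNAP" "$SRC" "$DST"

  # 3. Snapshot the target under the same name so that it can serve as
  #    the baseline for the next round.
  $RUN hdfs dfs -createSnapshot "$DST" "$NEW_SNAP"
}

incremental_sync
```

In the next round, s2 takes the role of the old snapshot and a new s3 is created, so renames and deletes recorded in the diff are replayed on the target instead of being rediscovered by a full scan.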
However, snapshot-based distcp requires very careful snapshot management on both the source and target clusters. For example, the target must not be modified between two copies; otherwise the diff may not be applied correctly.
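One way to guard against a modified target is to compare its last synced snapshot with its current state before each run. The sketch below assumes the report printed by hdfs snapshotDiff lists each change on a line starting with +, -, M, or R followed by whitespace; check this against the output of your Hadoop version. The path, the snapshot name, and the helper report_is_clean are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: refuse to sync if the target changed since the baseline snapshot.
# Path and snapshot name are hypothetical placeholders.
set -o pipefail   # a failing hdfs command must not look like "no changes"

DST="/data/warehouse"
OLD_SNAP="s1"

# Reads a snapshotDiff report on stdin and succeeds only if it lists no
# change entries. ASSUMPTION: change entries start with +, -, M, or R
# followed by whitespace.
report_is_clean() {
  ! grep -Eq '^[+MR-][[:space:]]'
}

# Only attempt the real check where the hdfs CLI is installed.
if command -v hdfs >/dev/null 2>&1; then
  # "." stands for the current state of the directory.
  if hdfs snapshotDiff "$DST" "$OLD_SNAP" . | report_is_clean; then
    echo "target unchanged since $OLD_SNAP"
  else
    echo "target changed since $OLD_SNAP or check failed; skip distcp -diff" >&2
  fi
fi
```

If the check fails, the safer options are to restore the target to the baseline snapshot's content or to fall back to a plain distcp -update for that round.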
