distcp update vs distcp update with snapshots

Contributor

Hi,

I am trying to understand the benefits of using distcp -update with HDFS snapshot diffs compared with plain distcp -update.

As I understand it, -update without any snapshot options copies only the data that has been modified at the source and doesn't touch files that already exist unchanged at the destination. What additional benefits does one get from using HDFS snapshots in distcp-based replication?

Thanks

Vijay

1 ACCEPTED SOLUTION

Re: distcp update vs distcp update with snapshots

New Contributor

Mainly two benefits:

1. Avoiding unnecessary copies of renamed files and directories. If a large directory is renamed on the source side, "distcp -update" cannot detect the rename and therefore copies the whole renamed directory as if it were new.

2. More efficient copy list generation. "distcp -update" has to scan the whole directory tree and detect identical files during the copy process, so generating the copy list can take a long time for a big directory. The snapshot-diff-based approach can greatly reduce this work in an incremental sync scenario.
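
To make the rename case concrete, here is a toy Python model (not DistCp code). It assumes a namespace can be represented as a dict of path to file size and that files are compared by size only, which is a simplification; the paths and helper names are made up for illustration.

```python
def update_copy_list(source, target):
    """Path-by-path comparison, roughly what a plain -update style sync does.

    A file is scheduled for copy if it is missing at the same path on the
    target or its size differs (simplified; no checksum comparison here).
    """
    return sorted(p for p, size in source.items() if target.get(p) != size)


def apply_rename(files, old_prefix, new_prefix):
    """Return a copy of the namespace with a directory prefix renamed."""
    return {
        (new_prefix + p[len(old_prefix):] if p.startswith(old_prefix) else p): size
        for p, size in files.items()
    }


# Target is in sync with the source as of the previous copy.
target = {"/data/logs/a": 100, "/data/logs/b": 200, "/data/x": 5}
# A directory is then renamed on the source; no file contents change.
source = apply_rename(target, "/data/logs/", "/data/archive/")

# Without rename information, every file under the new name looks new.
naive = update_copy_list(source, target)

# With a diff that records the rename, the rename is replayed on the
# target first, and nothing is left to copy.
with_diff = update_copy_list(source, apply_rename(target, "/data/logs/", "/data/archive/"))

print(naive)
print(with_diff)
```

In this model the path-by-path comparison schedules both renamed files for copy, while replaying the rename first leaves an empty copy list.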

However, snapshot-based distcp requires very careful snapshot management on both the source and target clusters. For example, the target directory must not be modified between two copies; otherwise the diff may not be applied correctly.
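
As a rough sketch of what one incremental round could look like, the Python helper below only builds the command lines and does not run anything. It assumes both directories are already snapshottable (hdfs dfsadmin -allowSnapshot), that a snapshot named s1 exists on both sides from the previous copy, and that the target has not changed since then. The function name, cluster URIs, and snapshot names are hypothetical.

```python
import shlex


def snapshot_sync_commands(src, dst, prev_snap, curr_snap):
    """Build the commands for one snapshot-diff sync round (dry run only)."""
    return [
        # 1. Freeze the current state of the source.
        ["hdfs", "dfs", "-createSnapshot", src, curr_snap],
        # 2. Copy only what changed between the two source snapshots.
        ["hadoop", "distcp", "-update", "-diff", prev_snap, curr_snap, src, dst],
        # 3. Snapshot the target under the same name so the next round
        #    has a matching baseline on both sides.
        ["hdfs", "dfs", "-createSnapshot", dst, curr_snap],
    ]


for argv in snapshot_sync_commands(
    "hdfs://nn1/data", "hdfs://nn2/data", "s1", "s2"
):
    print(" ".join(shlex.quote(a) for a in argv))
```

The third step is where the snapshot management mentioned above comes in: if the target snapshot is skipped, or the target is written to between rounds, the next diff no longer has a consistent baseline.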
