distcp update vs distcp update with snapshots

Hi,

I am trying to understand the benefits of using distcp -update with HDFS snapshot diffs compared with plain distcp -update.

As I understand it, -update without any snapshot options copies only files that are new or modified at the source and doesn't touch files that already exist unchanged at the destination. What additional benefits would one realise from using HDFS snapshots in distcp-based replication?

Thanks

Vijay

1 ACCEPTED SOLUTION

Mainly two benefits:

1. Avoiding unnecessary copies of renamed files and directories. If a large directory is renamed on the source side, "distcp -update" cannot detect the rename, so it copies the whole renamed directory as if it were a new one.

2. More efficient copy list generation. "distcp -update" has to scan the whole directory and detect identical files during the copy process, so generating the copy list can take a long time for a big directory. In an incremental sync scenario, the snapshot-diff-based approach can greatly reduce this work, because only the changes recorded between two snapshots need to be considered.
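
To make the rename case concrete, here is a rough sketch of how one might inspect what a snapshot diff records. It only prints the commands rather than running them. The path /data/src, the directory names logs_2023 and archive, and the snapshot names s1 and s2 are hypothetical placeholders, not taken from any real cluster.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# SRC and the directory/snapshot names are hypothetical placeholders.
SRC="/data/src"

rename_demo_cmds() {
  # The directory must be snapshottable first (requires admin rights).
  echo "hdfs dfsadmin -allowSnapshot $SRC"
  # Snapshot of the state before the rename.
  echo "hdfs dfs -createSnapshot $SRC s1"
  # Rename a large directory on the source side.
  echo "hdfs dfs -mv $SRC/logs_2023 $SRC/archive"
  # Snapshot of the state after the rename.
  echo "hdfs dfs -createSnapshot $SRC s2"
  # The diff report lists the rename as a single rename entry,
  # rather than as a delete plus a newly created directory.
  echo "hdfs snapshotDiff $SRC s1 s2"
}

rename_demo_cmds
```

Run against a real cluster, the last command would be expected to report the rename as one entry, which is the information snapshot-based distcp uses to rename on the target instead of recopying the data.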

However, snapshot-based distcp requires very careful snapshot management on both the source and target clusters. For example, the target must not be modified between two copies; otherwise the diff may not be applied correctly.
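
One possible way to keep the snapshots on both sides in step is sketched below. It assumes both directories are already snapshottable, that an initial full copy has been made, and that a snapshot with the same name (s1 here) exists on both source and target. The paths, the NameNode address nn2:8020, and the helper names run and sync_cycle are hypothetical. By default the script only prints the commands; setting DRY_RUN=0 would execute them.

```shell
#!/bin/sh
# Sketch of one incremental sync cycle with snapshot-based distcp.
# SRC, DST, and the snapshot names are hypothetical placeholders.
SRC="/data/src"
DST="hdfs://nn2:8020/data/dst"

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# sync_cycle PREV NEXT
#   PREV: snapshot that already exists on both source and target
#   NEXT: name of the new snapshot to create in this cycle
sync_cycle() {
  prev="$1"
  next="$2"
  # 1. Freeze the current source state.
  run hdfs dfs -createSnapshot "$SRC" "$next" || return 1
  # 2. Apply only the changes between PREV and NEXT to the target.
  #    -diff must be used together with -update, and the target must
  #    be unmodified since PREV was taken there.
  run hadoop distcp -update -diff "$prev" "$next" "$SRC" "$DST" || return 1
  # 3. Only after a successful copy, record the same state on the target
  #    so that NEXT can serve as the baseline of the following cycle.
  run hdfs dfs -createSnapshot "$DST" "$next"
}

sync_cycle s1 s2
```

The target snapshot is created only after the copy succeeds, so a failed run leaves the previous pair of snapshots as the last consistent baseline.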
