Understanding Distcp -delete option removing directories

I am trying to understand the distcp -delete option. Basically, each time I run distcp I would like to overwrite the destination directories.

The -overwrite option only works at the file level: a file in the destination that has the same content as a file in the source but a different name is not overwritten. I would like the same behaviour at the directory level as well.

From the Hadoop docs, the -delete option is described as:

Delete the files existing in the dst but not in src

https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

Thank you

1 REPLY

@na

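I think that's expected behaviour. For your scenario, I would suggest using DistCp with snapshot differences instead. The snapshot diff already carries deletions and renames over to the target, so -delete is left out (it cannot be combined with -diff), and -diff takes the two snapshot names to compare:

distcp -update -diff <fromSnapshot> <toSnapshot> /source /destination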

How to Use This Feature

To use this feature, you should first make sure all assumptions are met. Typical steps are described as follows:

  1. Create snapshot s0 in the source directory.
  2. Issue a default distcp command that copies everything from s0 to the target directory (command line is like distcp -update <sourceDir>/.snapshot/s0 <targetDir>).
  3. Create snapshot s0 in the target dir.
  4. Make some changes in the source dir.
  5. Create a new snapshot s1, and issue a distcp command like distcp -update -diff s0 s1 <sourceDir> <targetDir> to copy all changes between s0 and s1 to the target directory.
  6. Create a snapshot with the same name s1 in the target dir.
  7. Repeat steps 4 to 6 with a new snapshot name—for example, s2.
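
The steps above could be scripted roughly as follows. This is only a sketch: the paths /source and /destination, the DRY_RUN switch, and the helper names run, initial_copy and incremental_sync are my own placeholders, not anything from the Hadoop docs. It assumes you are allowed to enable snapshots on both directories and that the target is not modified between runs. By default it only prints the commands so you can review them first.

```shell
#!/usr/bin/env bash
# Sketch of the snapshot-diff workflow. SRC, DST and DRY_RUN are
# hypothetical names chosen for this example.
set -euo pipefail

SRC=${SRC:-/source}
DST=${DST:-/destination}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Steps 1-3: snapshot the source, do the baseline copy, snapshot the target.
initial_copy() {
  run hdfs dfsadmin -allowSnapshot "$SRC"
  run hdfs dfs -createSnapshot "$SRC" s0
  run hadoop distcp -update "$SRC/.snapshot/s0" "$DST"
  run hdfs dfsadmin -allowSnapshot "$DST"
  run hdfs dfs -createSnapshot "$DST" s0
}

# Steps 5-6: take a new source snapshot, copy only the changes between
# the two snapshots, then create the matching snapshot on the target.
incremental_sync() {
  local from=$1 to=$2
  run hdfs dfs -createSnapshot "$SRC" "$to"
  run hadoop distcp -update -diff "$from" "$to" "$SRC" "$DST"
  run hdfs dfs -createSnapshot "$DST" "$to"
}

initial_copy
# ... changes are made in the source directory (step 4) ...
incremental_sync s0 s1
```

For later rounds you would call incremental_sync s1 s2, and so on. Set DRY_RUN=0 once the printed commands look right for your cluster.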

Hope this helps you.