Duplicate/Sync directories in HDFS


I have a folder in HDFS that will have files coming in every day. I want to duplicate the folder so that whenever a new file arrives in the original folder, it is also copied to the duplicate folder.


Basically, I want to sync one folder with another in HDFS.


How can we achieve that in Hadoop?



To do it in near real time (NRT), or to monitor for newly arrived files, I would look at building a long-lived streaming application, probably in Spark.

If you can get by with syncing on a schedule, I would look at a cron/distcp combo or Cloudera BDR.


@mbigelow I would go with syncing on a regular schedule, but I am confused about how to use distcp and cron together.


Could you please give me an example of how we can achieve this?


Thank you

There is just no scheduler built into distcp. You set up all your options and run the distcp job, and it runs once. So you need something else to tell it to run on a schedule. Cron is my default, but this could be Oozie as well, or anything else that can schedule jobs.

I'll work up an example later.
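
As a rough illustration of the cron/distcp combination described above, here is a minimal sketch of a small Python wrapper that cron could call. The two directory paths, the script name, and the `--run` switch are hypothetical and only for illustration; `-update` and `-delete` are standard distcp options. Without `--run` the script only prints the command, so it can be checked before scheduling.

```python
#!/usr/bin/env python3
"""Run one HDFS directory sync with distcp; intended to be called from cron."""
import subprocess
import sys

# Hypothetical paths, replace with your own source and duplicate folders.
SOURCE_DIR = "/data/incoming"
TARGET_DIR = "/data/incoming_copy"


def build_distcp_command(source, target, delete_missing=False):
    """Return the distcp command line as a list of arguments."""
    # -update copies only files that are missing or differ at the target.
    command = ["hadoop", "distcp", "-update"]
    if delete_missing:
        # -delete removes target files that no longer exist in the source.
        command.append("-delete")
    command += [source, target]
    return command


if __name__ == "__main__":
    command = build_distcp_command(SOURCE_DIR, TARGET_DIR)
    if "--run" in sys.argv[1:]:
        # Launch the distcp job once and pass its exit code back to cron.
        sys.exit(subprocess.run(command).returncode)
    # Dry run: only show what would be executed.
    print(" ".join(command))
```

Assuming the script is saved as /home/hadoop/sync_hdfs.py (a made-up location), a crontab entry along these lines would run the sync at the top of every hour and append the output to a log file:

0 * * * * /usr/bin/python3 /home/hadoop/sync_hdfs.py --run >> /home/hadoop/sync_hdfs.log 2>&1

The interval is a judgment call: distcp launches a MapReduce job each time, so the schedule should leave enough room for one run to finish before the next one starts.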