Duplicate/Sync directories in HDFS

Explorer

I have a folder in HDFS that receives new files every day. I want to duplicate it so that whenever a new file arrives in the original folder, it is also copied (synced) to the duplicate folder.

Basically, I want to keep one HDFS folder in sync with another.

How can we achieve that in Hadoop?

 

3 REPLIES

Champion
To do it in near real time (NRT), or to monitor for newly arrived files, I would look at building a long-lived streaming application, probably in Spark.

If you can get by with just syncing on a schedule, I would look at using a cron/distcp combination or Cloudera BDR.

Explorer

@mbigelow I would go with syncing and scheduling the sync on a regular basis. But I am confused about how distcp and cron are used together.

Could you please give me an example of how we can achieve this?

 

Thank you

Champion
There is just no scheduler built into distcp. You set up all your options and run the distcp job, and it runs once. So you need something else to tell it to run on a schedule. Cron is my default, but this could be Oozie as well, or anything else that can schedule jobs.

I'll work up an example later.
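
In the meantime, here is a minimal sketch of the cron/distcp combination described above. It is not a tested production setup. It assumes the `hadoop` command is on the PATH of the user that cron runs as; the two HDFS paths and the script location are hypothetical placeholders. The script runs a single `hadoop distcp -update`, which copies only files that are missing from the target or differ from the source, and cron takes care of repeating it.

```python
#!/usr/bin/env python3
"""Run one incremental HDFS-to-HDFS sync with distcp; schedule it with cron.

Example crontab entry (hypothetical script path), running at the top of
every hour:
    0 * * * * /usr/bin/python3 /path/to/sync_hdfs.py >> /tmp/sync_hdfs.log 2>&1
"""
import subprocess
import sys

# Hypothetical paths: replace with your own source and duplicate folders.
SOURCE_DIR = "hdfs:///data/incoming"
TARGET_DIR = "hdfs:///data/incoming_copy"


def build_distcp_command(source, target, delete_extra=False):
    """Return the argument list for a single distcp run.

    -update copies only files that are missing from the target or that
    differ from the source, so repeated runs transfer just the new files.
    """
    command = ["hadoop", "distcp", "-update"]
    if delete_extra:
        # -delete removes files from the target that no longer exist in
        # the source; leave it off if the copy should only ever grow.
        command.append("-delete")
    command += [source, target]
    return command


def main():
    # distcp runs once and exits; cron is what makes it recurring.
    result = subprocess.run(build_distcp_command(SOURCE_DIR, TARGET_DIR))
    return result.returncode


if __name__ == "__main__":
    sys.exit(main())
```

How often to run it is a trade-off: a shorter cron interval keeps the duplicate folder fresher but launches more distcp jobs, and each run should finish before the next one starts.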