Backing up HDFS production data

Expert Contributor

Hi experts,

This question is mostly related to DR and backup.

We already have two clusters (exactly the same in configuration; one is the master and the other is a hot standby). To mitigate the risk further, we are thinking of a 'cold backup' where we can store the HDFS data much like previous tape-based backup solutions, and we want it stored in our own data center (not in the cloud).

We do not want to invest in another cluster or use a distcp-based approach; we want to back up only the HDFS data.

What would be the best solution/approach/design for this?

Let me know if more input is required.

Many thanks,

SS

1 ACCEPTED SOLUTION

Super Guru

@Smart Solutions

The two main options for replicating the HDFS structure are Falcon and distcp. The distcp command is not very feature-rich: you give it a path in HDFS and a destination cluster, and it will copy everything to the same path on the destination. If the copy fails, you will need to start it again, and so on.
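For reference, a basic distcp invocation looks something like the sketch below; the NameNode hostnames and paths are placeholders, not from your setup.

# Copy a directory tree from the production cluster to the backup cluster.
# -update copies only files that are missing or changed on the target;
# -p preserves replication, block size, permissions, and related attributes.
hadoop distcp -update -p \
  hdfs://nn-prod.example.com:8020/data/warehouse \
  hdfs://nn-backup.example.com:8020/backup/warehouse

If the job dies partway through, rerunning the same command with -update picks up where it left off rather than recopying everything.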

Another method for maintaining a replica of your HDFS structure is Falcon. It offers more data-movement options and lets you manage the lifecycle of the data on both sides more effectively.
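At a high level, Falcon mirroring is driven by cluster and feed entity definitions that you submit and schedule through the Falcon CLI. A rough outline, where the XML file names and the feed name are hypothetical:

# Register the source and target clusters, then submit and schedule
# the replication feed that mirrors the HDFS paths between them.
falcon entity -type cluster -submit -file primary-cluster.xml
falcon entity -type cluster -submit -file backup-cluster.xml
falcon entity -type feed -submit -file hdfs-mirror-feed.xml
falcon entity -type feed -schedule -name hdfs-mirror-feed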

If you're moving Hive table structures, there is some added complexity in making sure the tables are created on the DR side, but moving the actual files works the same way.
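As a sketch of that extra step, Hive's EXPORT/IMPORT statements can carry the table definition along with its data through an HDFS staging path (the table and path names below are made up):

# On the production cluster: write the table metadata and data to a staging dir.
hive -e "EXPORT TABLE sales TO '/staging/sales_export';"
# Move /staging/sales_export to the DR cluster (distcp, Falcon mirror, etc.), then:
hive -e "IMPORT TABLE sales FROM '/staging/sales_export';"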

Since you have excluded distcp as an option, I suggest looking at Falcon.

Check this: http://hortonworks.com/hadoop-tutorial/mirroring-datasets-between-hadoop-clusters-with-apache-falcon...

+++++++

If any response addressed your question, please vote and accept the best answer.
