
Backing up HDFS production data

Expert Contributor

Hi experts,

This question is mainly about disaster recovery (DR) and backup.

We already have two clusters (identical in configuration; one is the primary and the other is a hot standby). To mitigate the risk further, we are considering a 'cold backup' where we can store the HDFS data much like older tape-based backup solutions, and we want to keep it in our own data center (not in the cloud).

We do not want to invest in another cluster and use a distcp-based approach; we want to back up only the HDFS data.

What would be the best solution, approach, or design for this?

Let me know if more input is required.

Many thanks,

SS

1 ACCEPTED SOLUTION

Super Guru

@Smart Solutions

The two main options for replicating the HDFS structure are Falcon and distcp. The distcp command is not very feature-rich: you give it a path in the HDFS structure and a destination cluster, and it will copy everything to the same path on the destination. If the copy fails, you will need to start it again, and so on.
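
For a sense of what that looks like in practice, here is a minimal sketch; the host names, ports, and paths are placeholders you would replace with your own:

    # Copy /data/projects from the production cluster to the backup cluster.
    # -update skips files already present and unchanged at the target;
    # -p preserves attributes such as permissions and replication.
    hadoop distcp -update -p \
        hdfs://prod-nn.example.com:8020/data/projects \
        hdfs://backup-nn.example.com:8020/data/projects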

Another method for maintaining a replica of your HDFS structure is Falcon. It offers more data-movement options, and you can manage the lifecycle of all of the data on both sides more effectively.
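
The rough Falcon workflow, assuming the Falcon client is installed and you have already written the cluster and feed entity XML definitions (the file and feed names below are made up), looks like this:

    # Register the source and target clusters, then submit and
    # schedule a replication feed between them.
    falcon entity -type cluster -submit -file primary-cluster.xml
    falcon entity -type cluster -submit -file backup-cluster.xml
    falcon entity -type feed -submit -file replication-feed.xml
    falcon entity -type feed -schedule -name replicationFeed

The feed entity is where you declare the HDFS paths, the replication frequency, and the retention policy on each cluster.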

If you're moving Hive table structures, there is some added complexity in making sure the tables are created on the DR side, but moving the actual files is done the same way.
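
One hedged sketch for handling the table definitions (the database and table names here are placeholders, and the exported DDL may need manual adjustment, e.g. for locations): export the DDL on the production side and replay it on the DR cluster, then move the files with distcp or Falcon as above.

    # On the production cluster: capture the table definition.
    hive -e "SHOW CREATE TABLE mydb.mytable" > mytable.ddl
    echo ";" >> mytable.ddl    # terminate the statement for hive -f

    # On the DR cluster: re-create the table, then move the files.
    hive -f mytable.ddl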

You excluded distcp as an option, so I suggest looking at Falcon.

Check this: http://hortonworks.com/hadoop-tutorial/mirroring-datasets-between-hadoop-clusters-with-apache-falcon...

+++++++

If any response addressed your question, please vote and accept the best answer.

