Hadoop backup to another cluster... newbie question

Rising Star

Hello,

Sorry for my question, but I'm a newbie and still don't know about Falcon, Hive and so on... still reading and learning...

We have a Hadoop cluster with a 100 TB HDFS filesystem and 20 datanodes, and we use HBase and Kafka. Everything is working, but now we need to back up the data so we can restore it if something happens to our cluster.

For this, I'm deploying a new cluster with only 2 machines, one of them connected to an EMC CX4-480 for storage. We don't need speed in this cluster; it's going to be used only as a backup to keep our HDFS and HBase data safe.

At first, since HBase data is stored in HDFS, I thought that replicating the HDFS structure to the second cluster would be enough, but I have been reading in these forums that different tools exist depending on what you want to back up... this leads me to confusion 😞

Is there an easy explanation of how to back up the data to this cluster? Basically HDFS, HBase and Kafka.

BTW, could these two backup nodes be under the same Ambari deployment as the main cluster, or do I need to deploy a complete new installation on one of them (Ambari, HBase Master, ZooKeeper...)?

As you can read, I'm quite a newbie... sorry 😞

Thanks a lot,

Silvio

1 ACCEPTED SOLUTION

Rising Star

Silvio,

For backup and DR purposes, you can use distcp / Falcon to cover HDFS data, and HBase replication can be used to maintain a second backup cluster.

The preferred approach to cover HDFS, Hive and HBase is as follows:

1. Enable and use HDFS snapshots

2. Using distcp2, replicate the HDFS snapshots between clusters (see the command sketch after this list). The current version of Falcon doesn't support HDFS snapshots, so distcp2 needs to be used; if the functionality becomes available in Falcon, it can be leveraged instead

3. For Hive metadata, Falcon can help replicate the metastore

4. For HBase data, please see https://hbase.apache.org/0.94/replication.html

5. For Kafka, use Kafka's native MirrorMaker functionality (also shown in the sketch below)
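For steps 1, 2 and 5, a minimal command-line sketch is below. The NameNode hostnames (nn-prod, nn-backup), the /data/important path, the snapshot name and the MirrorMaker config file names are placeholders made up for illustration; substitute your own values.

# Step 1: allow and take HDFS snapshots on the source directory (run as the HDFS superuser)
hdfs dfsadmin -allowSnapshot /data/important
hdfs dfs -createSnapshot /data/important snap-daily

# Step 2: copy the read-only, consistent snapshot to the backup cluster with distcp
hadoop distcp \
  hdfs://nn-prod:8020/data/important/.snapshot/snap-daily \
  hdfs://nn-backup:8020/backup/data/important

# Step 5: mirror Kafka topics with MirrorMaker; consumer.properties points MirrorMaker
# at the production cluster, producer.properties at the backup cluster
kafka-mirror-maker.sh --consumer.config consumer.properties \
  --producer.config producer.properties --whitelist ".*"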

Hope this helps!!


9 REPLIES


Rising Star

Hi Vijaya, thanks for your answer.

But if HBase data is in HDFS (I can see HBase tables as folders in the HDFS structure), and I replicate HDFS, am I not replicating HBase data too? I think I still have much to read and learn...

On the other hand, does distcp2 have a performance impact?

Regards,

Silvio

Rising Star

@Silvio del Val

HBase stores its data underneath in HDFS, but if you copy at the HDFS level, you would only get the raw data and would miss all the HBase-level metadata, such as table information. As I mentioned above, you can use HDFS snapshots with distcp2 only for HDFS data, i.e. data stored directly in HDFS. For data stored in HBase, you can either use HBase snapshots or, if affordability is not an issue, HBase replication.

distcp2 simply spawns MapReduce jobs under the hood, so it does take up cluster resources in the form of YARN containers. Hence, ensure distcp jobs are run at off-peak hours when cluster utilisation is at a minimum.
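If you go the HBase snapshot route instead of replication, a rough sketch of taking a snapshot and shipping it to the secondary cluster with the ExportSnapshot tool would look like the following; the table name, snapshot name, backup NameNode address and HBase root directory are made-up placeholders.

# In the hbase shell on the production cluster: take a snapshot of a table
snapshot 'my_table', 'my_table-snap-20160101'

# From the command line: export the snapshot to the backup cluster's HBase root directory.
# Like distcp, this runs a MapReduce job, so schedule it at off-peak hours.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot my_table-snap-20160101 \
  -copy-to hdfs://nn-backup:8020/apps/hbase/data \
  -mappers 4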

Rising Star

Ok, so if we only needed to back up the HBase tables, we could use HBase replication and distcp2 would not be necessary, right? I suppose HBase replication copies the underlying HDFS data too.

Rising Star

@Silvio del Val

Either HBase replication (https://hbase.apache.org/0.94/replication.html) or HBase snapshots (http://hbase.apache.org/0.94/book/ops.snapshots.html) with the ExportSnapshot tool can help you get HBase data replicated to your secondary cluster.

HBase uses HDFS as the underlying file system. So yes, if you replicate HBase, all the data stored by those HBase tables would be replicated to your secondary cluster.
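For the replication option, a rough sketch of the setup in the hbase shell is below. The shell syntax differs slightly between HBase versions, the peer id, ZooKeeper quorum and table/column family names are invented for illustration, and older releases also require hbase.replication=true in hbase-site.xml on both clusters.

# On the production cluster, in the hbase shell:
# 1. Register the backup cluster as a replication peer
#    (format: "zk_quorum:zk_port:zk_parent_znode" of the backup cluster)
add_peer '1', 'zk1-backup,zk2-backup,zk3-backup:2181:/hbase'

# 2. Enable replication on the column families you want shipped
disable 'my_table'
alter 'my_table', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'my_table'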

Rising Star

Ok, thank you very much for your answers.

Looking at http://blog.cloudera.com/blog/2013/11/approaches-to-backup-and-disaster-recovery-in-hbase/, I think HBase replication is my solution: almost no impact, incremental backups... On the other hand, we are currently creating snapshots of tables on a daily basis.

I am creating a new cluster for this. I was thinking about a 2-node cluster: one Master node with all the master roles (HBase Master, ZooKeeper...) and one DataNode with enough storage for the backup data.

My question is:

- Should it be totally independent, with all roles installed, or can I connect it to my main cluster under the Ambari umbrella? I need it only for backup; I'm not going to use it for production if something happens to my main production cluster.

Regards,

Silvio

Rising Star

@Silvio del Val, at present, Ambari supports managing only one cluster per Ambari instance. So, in your case, you would need another Ambari deployment in the target cluster to manage it.

Rising Star

Yes, I know. I was thinking about "config groups" in Ambari. Using those, maybe I could use an independent HDFS filesystem for the backup and the same ZooKeepers for replication... but maybe that's too complex...

Yes, maybe a whole new cluster would be the best solution... I think I'll do that.

Thank you very much for your support
