Created 06-20-2016 12:21 PM
Hello,
Sorry for my question, but I'm a newbie and still don't know much about Falcon, Hive and so on... still reading and learning...
We have a Hadoop cluster with a 100 TB HDFS filesystem across 20 datanodes, and we use HBase and Kafka. Everything is working, but now we need to back up the data so we can restore it if something happens to our cluster.
For this, I'm deploying a new cluster with only 2 machines, one of them connected to an EMC CX4-480 for storage. We don't need speed in this cluster; it's only for backup, to keep our HDFS and HBase data safe.
At first, since HBase is stored in HDFS, I thought replicating the HDFS structure to the second cluster would be enough, but I have been reading these forums and different tools exist depending on what you want to back up... this leads me to confusion 😞
Is there an easy explanation of how to back up data to this cluster? Basically HDFS, HBase and Kafka.
BTW, could these two backup nodes be managed under the same Ambari deployment as the main cluster, or do I need to deploy a complete new installation on one of them (Ambari, HBase master, ZooKeeper...)?
As you can read, I'm quite a newbie... sorry 😞
Thanks a lot,
Silvio
Created 06-20-2016 12:48 PM
Silvio,
For backup and DR purposes, you can use distcp / Falcon to cover HDFS data, and HBase replication to maintain a second backup cluster.
The preferred approach to cover HDFS, Hive and HBase is as follows:
1. Enable and use HDFS snapshots
2. Use distcp2 to replicate HDFS snapshots between clusters. The present version of Falcon doesn't support HDFS snapshots, hence distcp2 needs to be used; if that functionality becomes available in Falcon, it can be leveraged instead (see the commands sketched after this list)
3. For Hive metadata, Falcon can help replicate the metastore
4. For HBase data, please see https://hbase.apache.org/0.94/replication.html
5. For Kafka, use Kafka's native MirrorMaker functionality
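As a rough sketch of steps 1, 2 and 5 (the paths, snapshot name, hostnames and topic pattern below are placeholders, adjust them to your environment):
# 1. Allow snapshots on a directory and take one (as the HDFS superuser)
hdfs dfsadmin -allowSnapshot /data
hdfs dfs -createSnapshot /data backup-20160620
# 2. Copy the snapshot contents to the backup cluster with distcp
hadoop distcp hdfs://prod-nn:8020/data/.snapshot/backup-20160620 hdfs://backup-nn:8020/backups/data/backup-20160620
# 5. Mirror Kafka topics to brokers on the backup cluster
kafka-mirror-maker.sh --consumer.config source-consumer.properties --producer.config target-producer.properties --whitelist 'your-topics.*'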
Hope this helps!!
Created 06-20-2016 12:59 PM
Hi Vijaya, thanks for your answer.
But if HBase data is in HDFS (I can see HBase tables as folders in the HDFS structure), then if I replicate HDFS, am I not replicating HBase data too? I think I still have much to read and learn...
On the other hand, does distcp2 imply a performance impact?
Regards,
Silvio
Created 06-20-2016 01:34 PM
HBase does store its data in HDFS underneath. But if you copy at the HDFS level, you would only get the raw files and would miss all the HBase-level metadata such as table information. As I mentioned above, use HDFS snapshots with distcp2 only for data stored directly in HDFS. For data stored in HBase, you can either use HBase snapshots or, if the cost of a replica cluster is not an issue, HBase replication.
distcp2 simply spawns MapReduce jobs under the hood, so it does consume cluster resources in the form of YARN containers. Hence, ensure distcp jobs run at off-peak hours, when cluster utilisation is at a minimum.
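For the HBase snapshot route, a minimal sketch (the table name, snapshot name and backup NameNode address are placeholders):
# In the hbase shell on the source cluster, take a snapshot of the table
snapshot 'my_table', 'my_table-snap-20160620'
# Then, from the command line, export it to the backup cluster (this also runs a MapReduce job)
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot my_table-snap-20160620 -copy-to hdfs://backup-nn:8020/hbase -mappers 4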
Created 06-20-2016 02:23 PM
OK, so if we only need to back up HBase tables, we could use HBase replication and distcp2 would not be necessary, right? I suppose HBase replication copies the underlying HDFS data too.
Created 06-20-2016 03:06 PM
Either HBase replication (https://hbase.apache.org/0.94/replication.html) or HBase snapshots (http://hbase.apache.org/0.94/book/ops.snapshots.html) with the ExportSnapshot tool can help you get HBase data replicated to your secondary cluster.
HBase uses HDFS as the underlying file system. So yes, if you replicate HBase, all the data stored by those HBase tables would be replicated to your secondary cluster.
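If you go the replication route, a minimal sketch of enabling it for one table from the hbase shell on the source cluster (the peer id, ZooKeeper quorum, table and column family names are placeholders; on the 0.94/0.98 line, hbase.replication also has to be set to true in hbase-site.xml on both clusters):
# Register the backup cluster as a replication peer
add_peer '1', 'backup-zk1,backup-zk2,backup-zk3:2181:/hbase'
# Mark the column family for replication
disable 'my_table'
alter 'my_table', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'my_table'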
Created 06-20-2016 03:23 PM
Ok, thank you very much for your answers.
Looking at http://blog.cloudera.com/blog/2013/11/approaches-to-backup-and-disaster-recovery-in-hbase/, I think "HBase replication" is my solution: almost no impact, incremental backups... Besides, we are already creating daily snapshots of the tables.
I am creating a new cluster for this. I was thinking about a 2-node cluster: one master node with all master roles (HBase master, ZooKeeper...) and one datanode with enough storage for the backup data.
My question is:
- Should it be totally independent, with all roles installed, or can I connect it to my main cluster under the Ambari umbrella? I need it only for backup; I'm not going to use it for production if something happens to my main production cluster.
Regards,
Silvio
Created 06-20-2016 03:48 PM
@Silvio del Val, at present Ambari supports managing only one cluster per Ambari instance. So, in your case, you would need another Ambari deployment to manage the target cluster.
Created 06-20-2016 04:31 PM
Yes, I know. I was thinking about "config groups" in Ambari. Using those, maybe I could run an independent HDFS filesystem for the backup and use the same ZooKeepers for replication... but that may be too complex...
Yes, maybe a whole new cluster would be the best solution... I think I'll do that.
Thank you very much for your support
Created 06-20-2016 01:17 PM
For HDFS backup, you can use a distcp job.
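A minimal example, assuming prod-nn and backup-nn are your active NameNodes (placeholder hostnames); -update makes repeated runs copy only changed files:
hadoop distcp -update hdfs://prod-nn:8020/data hdfs://backup-nn:8020/backups/data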
For HBase, please check this - https://community.hortonworks.com/questions/17836/which-is-best-method-for-taking-backup-of-hbase-da...