Community Articles

mkumar13 · ‎07-19-2016

This article is the first in a series of three. The upcoming articles will cover some code and the mechanisms in the latest version of HBase that support HBase Replication.

HBase Replication

HBase Replication can address cluster safety, data safety, read/write separation, operations and maintenance, and user operation errors. It is also easy to manage and configure, and it provides strong support for online applications.

HBase replication is currently rarely used in the industry, for several reasons. HDFS already keeps multiple copies of each block, which helps protect the data underlying HBase, and relatively few companies run large clusters. Another reason is that the data is often not very important, for example data from a logging system, or a secondary store of historical data used to offload a large number of read requests. Such data can be lost, or is also backed up elsewhere (for example in a database cluster). In such cases a slave Replication cluster is dispensable and its importance does not show. If your HBase platform carries only such low-security, non-essential services, you need not spend time reading the following discussion of Replication clusters.

Very important applications, both online and offline, now run on HBase, so the safety of HBase data is also very important. A single cluster commonly suffers from the following problems:

Mistakes by data administrators, such as irreversible DDL operations.
Corruption of the underlying HDFS file blocks.
Excessive short-term read pressure on the cluster; adding servers to handle such a spike is a waste of resources.
System upgrades, maintenance, and problem diagnosis lengthen the time the cluster is unavailable.
Atomicity of double writes is difficult to guarantee.
Unpredictable events (e.g. a data center power outage, large-scale hardware damage, network disconnection).
Offline MapReduce jobs cause noticeable delays for online reads and writes.

If you worry about the problems above, a Replication master/standby cluster setup is a good choice. We have done some simple research in this area; the following briefly describes the problems we encountered in use and the approaches we took.

Comparison of popular online backup schemes

A backup solution for a redundant data center can be analyzed from several angles: consistency, transactions, latency, throughput, data loss, and failover. Several options are currently available:

Simple backup: In simple backup mode the cluster is dumped on a schedule, usually as a snapshot taken at a set timestamp. With a careful design this can be done with little or no interference to the online data center. The disadvantage is that if an unexpected event occurs before the next snapshot, all data written since the last snapshot is lost, which many people cannot accept.
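The data-loss window of scheduled snapshots can be illustrated with a toy model. This is only a sketch: `SnapshotBackedStore` and its methods are hypothetical names invented for illustration, not HBase APIs, and the in-memory dictionary stands in for a real cluster dump.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotBackedStore:
    """Toy key-value store whose only backup is a periodic snapshot (hypothetical)."""
    data: dict = field(default_factory=dict)
    snapshot: dict = field(default_factory=dict)

    def put(self, key, value):
        self.data[key] = value

    def take_snapshot(self):
        # Copy the current state; this stands in for a scheduled dump.
        self.snapshot = dict(self.data)

    def crash_and_restore(self):
        # Everything written after the last snapshot is lost on restore.
        lost = {k: v for k, v in self.data.items() if self.snapshot.get(k) != v}
        self.data = dict(self.snapshot)
        return lost


store = SnapshotBackedStore()
store.put("row1", "a")
store.take_snapshot()
store.put("row2", "b")            # written after the snapshot
lost = store.crash_and_restore()  # "row2" cannot be recovered
```

The shorter the snapshot interval, the smaller the window, but the window never closes completely with this approach.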

Master-slave mode: Compared with simple backup, this mode has many more advantages. It guarantees eventual consistency of the data; data reaches the standby cluster from the primary cluster with low latency; writes are asynchronous, so they put little pressure on the primary cluster and have minimal impact on its performance; less data is lost when an incident occurs; and data on the primary cluster is also preserved on the standby cluster. It is usually implemented with a well-built log system plus checkpoints. It allows read/write separation: the primary cluster can serve reads and writes, while the standby cluster generally serves only reads.
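The "log plus checkpoint" idea can be sketched as follows. This is a minimal in-memory model under stated assumptions: the `Primary` and `Standby` classes, the `pull` method, and the batch size are all hypothetical and do not reflect how any particular system implements replication.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []          # append-only write log

    def put(self, key, value):
        # The write is acknowledged once it is in the local log;
        # it does not wait for the standby (asynchronous replication).
        self.log.append((key, value))
        self.data[key] = value


class Standby:
    def __init__(self):
        self.data = {}
        self.checkpoint = 0    # index of the next log entry to apply

    def pull(self, primary, batch_size=2):
        # Apply the next batch of log entries and advance the checkpoint.
        batch = primary.log[self.checkpoint:self.checkpoint + batch_size]
        for key, value in batch:
            self.data[key] = value
        self.checkpoint += len(batch)
        return len(batch)


primary, standby = Primary(), Standby()
for i in range(3):
    primary.put(f"row{i}", i)

lag_before = len(primary.log) - standby.checkpoint  # standby is behind
while standby.pull(primary):                        # catch up batch by batch
    pass
```

Because the standby resumes from its checkpoint, an interruption only delays replication; entries already applied are not re-read. The lag between log length and checkpoint is the data at risk if the primary is lost.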

Master-master mode: The principle is broadly the same as master-slave mode. The difference is that the two clusters act as standby for each other, and both can serve reads and writes.
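One practical concern with two clusters replicating to each other is that an edit must not bounce back and forth forever. The sketch below tags each edit with the cluster where it originated and forwards only locally originated edits. The tagging scheme, the `Cluster` class, and its methods are assumptions made for illustration; conflicting writes to the same key are not handled.

```python
class Cluster:
    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.data = {}
        self.outbox = []       # edits waiting to be shipped to the peer

    def put(self, key, value):
        # Local client write: apply it and queue it for the peer.
        self._apply(key, value, origin=self.cluster_id)

    def _apply(self, key, value, origin):
        self.data[key] = value
        # Only edits that originated here are forwarded;
        # a replicated edit is never sent back to its source.
        if origin == self.cluster_id:
            self.outbox.append((key, value, origin))

    def ship_to(self, peer):
        shipped = len(self.outbox)
        for key, value, origin in self.outbox:
            peer._apply(key, value, origin)
        self.outbox.clear()
        return shipped


a, b = Cluster("A"), Cluster("B")
a.put("row1", "from-A")
b.put("row2", "from-B")
a.ship_to(b)
b.ship_to(a)
```

After one exchange in each direction both clusters hold both rows and both outboxes are empty, so replication settles instead of looping.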

Two-phase commit: This scheme guarantees strong consistency and transactions. When the server returns success to the client, the data has definitely been backed up, so no data is lost. Every server can serve reads and writes. The disadvantages are higher latency and lower overall cluster throughput.
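A minimal sketch of the two phases, assuming a single coordinator and in-memory participants. `Participant`, `two_phase_put`, and the `healthy` flag are hypothetical names; a real implementation would also need durable logs and timeout handling.

```python
class Participant:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.staged = None
        self.data = {}

    def prepare(self, key, value):
        # Phase 1: stage the write and vote yes, or vote no if unhealthy.
        if not self.healthy:
            return False
        self.staged = (key, value)
        return True

    def commit(self):
        # Phase 2 (all voted yes): make the staged write visible.
        key, value = self.staged
        self.data[key] = value
        self.staged = None

    def abort(self):
        # Phase 2 (someone voted no): discard the staged write.
        self.staged = None


def two_phase_put(participants, key, value):
    votes = [p.prepare(key, value) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True          # success means every participant has the data
    for p in participants:
        p.abort()
    return False


replicas = [Participant("c1"), Participant("c2")]
ok = two_phase_put(replicas, "row1", "v1")
replicas[1].healthy = False
failed = two_phase_put(replicas, "row2", "v2")
```

The client only sees success after every participant has voted and committed, which is where the extra latency comes from.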

Paxos algorithm: A scheme built on the Paxos strong-consistency algorithm; clients connected to the same server are guaranteed consistent data. The disadvantages are implementation complexity, and latency and throughput that vary with the number of servers in the cluster.
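To give a feel for the complexity, here is a heavily simplified single-value Paxos round. It is a sketch under strong assumptions: calls are synchronous, no messages are lost, nothing is persisted, and the names `Acceptor` and `propose` are chosen for illustration only.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0       # highest proposal number promised so far
        self.accepted = None    # (number, value) last accepted, if any

    def prepare(self, n):
        # Phase 1: promise not to accept proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, value):
    """Run one proposal round; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda acc: acc[0])[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(3)]
first = propose(acceptors, n=1, value="v1")    # chosen
second = propose(acceptors, n=2, value="v2")   # must adopt the chosen value
stale = propose(acceptors, n=1, value="v3")    # lower number is rejected
```

Every decision needs two round trips to a majority of servers, which is why latency and throughput depend on the cluster's membership.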

HBase simple backup mode: if the table can be taken offline, it is relatively easy to handle: you can use copy table, distcp, or a table snapshot. If the table is online and cannot be taken offline, the backup can only be done through an online snapshot of the table.

HBase Replication master-slave mode: once a standby cluster is specified, the primary cluster sends HLog data asynchronously to the standby cluster, with basically no performance impact on the primary cluster and a short data delay. The primary cluster serves reads and writes, and the standby cluster serves reads. If the primary cluster fails, you can quickly switch to the standby cluster. Looking back at the state of HBase backup, HBase offers online and offline backup through three modes: the simple backup mode above, master-slave, and master-master.

HBase Replication master-master mode: two clusters back each other up, both serve reads and writes, and reads and writes can be separated.

Overall, the comparison shows that HBase Replication can address cluster safety, data safety, read/write separation, operations and maintenance, and user operation errors. It is also easy to manage and configure, and it provides strong support for online applications.

To be continued...
