Support Questions

schausson · ‎04-06-2017

Hi,

I'm currently looking at the "HA" feature of HBase, but I cannot figure out exactly how it works.

I first created tables using the default Java API, without specifying any region replication value, thinking that the default HDFS replication mechanism would guarantee data availability. Indeed, when I look at the region files on HDFS, they are shown with a replication factor of 3:

Example:

[myuser@myhost ~]$ hdfs dfs -ls /apps/hbase/data/data/default/MY_TEST_TABLE/f24af874470de9b85c2e1bd0ff5f80b3/0
Found 1 items
-rw-------   3 hbase hdfs      12234 2017-03-29 15:44 /apps/hbase/data/data/default/MY_TEST_TABLE/f24af874470de9b85c2e1bd0ff5f80b3/0/125b6555b2274e64b1ba4e9a8ef42885

So why should I set a region replication value (e.g. 3) in addition to the default HDFS one? Does it mean that my data will end up being stored in 9 copies?

Thanks for any clue about this...

Sebastien

schausson · ‎04-06-2017

After more reading, it seems that region replication may be used for read high availability...

If I understand correctly, when a Region Server (RS) fails, its regions are moved to other "valid" region servers and remain available, but this may take a while. So is the purpose of region replication just to reduce this waiting period? It has nothing to do with physical data replication to guarantee that we won't lose any data, right?

pminovic · ‎04-06-2017

Yes, exactly! Data stored on HDFS is not affected in any way, so all files used by a single HBase region are still replicated only 3 times. What is additionally replicated to achieve Region Server (RS) HA are read-only secondary region replicas, hosted by other Region Servers. You can find a good explanation here. What you get in return is faster recovery for reads from HBase. For writes you still need to wait (as you would without RS HA) until the HBase master reassigns the affected regions to other Region Servers.
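
To make the arithmetic concrete, here is a toy sketch in plain Java with no HBase dependency. The class and method names are hypothetical and only model the behaviour described above; they are not part of the HBase API. It assumes an HDFS replication factor of 3 and a region replication of 3, as in the question.

```java
// Toy model only: names are hypothetical, this is NOT the HBase API.
public final class RegionReplicationModel {
    private RegionReplicationModel() {}

    /** Physical copies of each region file on HDFS. Region replicas read the same files. */
    public static int physicalCopies(int hdfsReplication, int regionReplication) {
        // regionReplication is deliberately unused: it does not multiply on-disk copies.
        return hdfsReplication;
    }

    /** Region Servers that can answer reads for one region (primary plus secondaries). */
    public static int readLocations(int regionReplication) {
        return regionReplication;
    }

    /** Reads can be served as long as at least one replica sits on a live Region Server. */
    public static boolean canRead(boolean primaryUp, int secondariesUp) {
        return primaryUp || secondariesUp > 0;
    }

    /** Writes go through the primary only, so they wait until it is reassigned. */
    public static boolean canWrite(boolean primaryUp) {
        return primaryUp;
    }

    public static void main(String[] args) {
        System.out.println("HDFS copies: " + physicalCopies(3, 3));      // 3, not 9
        System.out.println("Read locations: " + readLocations(3));       // 3
        System.out.println("Read during RS failure: " + canRead(false, 2));
        System.out.println("Write during RS failure: " + canWrite(false));
    }
}
```

In this model, losing the Region Server that hosts the primary leaves reads available through the secondaries (possibly returning slightly stale data), while writes stay blocked until the master reassigns the region.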

Cloudera Community

HBase HA vs HDFS replication...