HBase HA vs HDFS replication...

Hi,

I'm currently looking at the "HA" feature of HBase, but cannot figure out exactly how it works.

I first created tables using the default Java API, without specifying any region replication value, assuming that the default HDFS replication mechanism would guarantee data availability. Indeed, when I look at the region files on HDFS, they are listed with a replication factor of 3.

Example:

[myuser@myhost ~]$ hdfs dfs -ls /apps/hbase/data/data/default/MY_TEST_TABLE/f24af874470de9b85c2e1bd0ff5f80b3/0
Found 1 items
-rw-------   3 hbase hdfs      12234 2017-03-29 15:44 /apps/hbase/data/data/default/MY_TEST_TABLE/f24af874470de9b85c2e1bd0ff5f80b3/0/125b6555b2274e64b1ba4e9a8ef42885

So why should I set a region replication value (e.g. 3) in addition to the default HDFS one? Does it mean that my data will end up being replicated 9 times?

Thanks for any clue about this...

Sebastien

Re: HBase HA vs HDFS replication...

After more reading, it seems that region replication may be used for read high availability...

If I understand correctly, this means that when an RS fails, its regions are moved to other "valid" region servers and remain available, but that may take a while. So the purpose of region replication is just to reduce this waiting period? It has nothing to do with physically replicating data to guarantee that we won't lose any, right?

Re: HBase HA vs HDFS replication... (accepted solution)

Yes, exactly! Data stored on HDFS is not affected in any way, so all files used by a single HBase region are still replicated only 3 times. What is additionally replicated to achieve RS HA are read-only secondary replicas of the region, hosted by other Region Servers. You can find a good explanation here. What you get in return is faster recovery for reads from HBase. Writes still have to wait, just as without RS HA, until the HBase master reassigns the affected regions to other Region Servers.
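
To make the "3, not 9" point concrete, here is a small sketch in plain Java. It is only a toy model of the reasoning in this thread: the class and method names are made up for illustration, nothing in it calls HBase or HDFS, and it assumes the client has opted in to reading from secondary replicas (which may return slightly stale data).

```java
// Illustrative model only: all names are hypothetical, no HBase/HDFS calls.
final class ReplicationMath {

    // Physical copies of each store-file block on HDFS. This follows the HDFS
    // replication factor alone: region replicas open the same files rather
    // than writing their own, so regionReplication is deliberately unused.
    static int physicalCopies(int hdfsReplication, int regionReplication) {
        return hdfsReplication;
    }

    // Region Servers that can answer reads for one region: the primary plus
    // its read-only secondaries.
    static int readReplicas(int regionReplication) {
        return Math.max(1, regionReplication);
    }

    // While the primary's Region Server is down and the region has not been
    // reassigned yet, reads are possible only if a secondary exists
    // (assumption: the client accepts possibly stale results).
    static boolean readableDuringFailover(int regionReplication) {
        return readReplicas(regionReplication) > 1;
    }

    // Writes always go through the primary, so they wait for reassignment
    // whatever the region replication value is.
    static boolean writableDuringFailover(int regionReplication) {
        return false;
    }

    public static void main(String[] args) {
        int hdfsReplication = 3;   // HDFS replication factor seen in the listing
        int regionReplication = 3; // value the question proposes to set

        System.out.println("HDFS copies per block: "
                + physicalCopies(hdfsReplication, regionReplication));
        System.out.println("Region Servers able to serve reads: "
                + readReplicas(regionReplication));
        System.out.println("Reads possible during failover: "
                + readableDuringFailover(regionReplication));
        System.out.println("Writes possible during failover: "
                + writableDuringFailover(regionReplication));
    }
}
```

With both values set to 3, the model reports 3 HDFS copies per block rather than 9; the region replication value only changes how many Region Servers can serve reads.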