Support Questions

Find answers, ask questions, and share your expertise

Geographically Distributed HBase

avatar

Users from any region can query the complete data however region specific data should stay in that region.For example (UK region data in UK data centre). And also data replication should happen only in that region.

Is this possible with geographically distributed HBase cluster ?

1 ACCEPTED SOLUTION

avatar
Guru

There are two aspects to the question. First, is whether the replication can be controlled inside a region and have the data of a user to live only inside that region. This is possible in theory in a couple of different ways.

If we can partition the users by region to different tables, and setup replication of all datacenters within the region, then we have achieved the boundary requirements. Some tables can be replicated to only datacenters within the region, while some other tables will be replicated cross-regions. HBase's replication model is pretty flexible in the sense that, we can do cyclic replication, etc (please read https://hbase.apache.org/book.html#_cluster_replication). If we cannot partition by table, we can still use the same table, but partition by column family (as noted above). Otherwise, we can still respect boundaries, using a recent feature called WALEntryFilter's. The basic idea would be to implement a custom WALEntryFilter which either (a) understands the data and selects which edits (mutations) to send to the receiving side (another geo-region) or (b) tag every edit with the intended regions it should hit and have the WALEntryFilter respect the tags from mutations.

The second aspect is whether you can query the whole data set from any region. Of course, if you have some data not leaving its particular geo-region, you cannot have all the data aggregated in a single DC. So the only way to access the data in whole would be to dynamically send the query to all affected geo-regions and merge the results back.

View solution in original post

4 REPLIES 4

avatar

Hey this is a tough one

Hdfs is not big on a distributed cluster for many reasons like latency, data transport for all the nodes etc... Hbase "distributed" beeing on Hdfs gets the same issues and a couple others. Add to this HDFS replicates data around the cluster including Hbase data so respecting "region policies" would not work, in some situations Hbase data is not even local to the region server servicing it. So as you can see guaranteeing a custom specific locality of data in Hadoop cluster is no easy feat.

avatar

You could maybe do it in hbase if there were specific columns for regions, e.g. the EU-UK columns would hold uk stuff --then use the security settings to restrict access to those columns by users within the group of people.

That won't address replication: HBase data has to live in the place the HDFS Cluster is, be it EU, US or elsewhere.

avatar

This post might be interesting, although it's all about replication:

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-geo-replication

avatar
Guru

There are two aspects to the question. First, is whether the replication can be controlled inside a region and have the data of a user to live only inside that region. This is possible in theory in a couple of different ways.

If we can partition the users by region to different tables, and setup replication of all datacenters within the region, then we have achieved the boundary requirements. Some tables can be replicated to only datacenters within the region, while some other tables will be replicated cross-regions. HBase's replication model is pretty flexible in the sense that, we can do cyclic replication, etc (please read https://hbase.apache.org/book.html#_cluster_replication). If we cannot partition by table, we can still use the same table, but partition by column family (as noted above). Otherwise, we can still respect boundaries, using a recent feature called WALEntryFilter's. The basic idea would be to implement a custom WALEntryFilter which either (a) understands the data and selects which edits (mutations) to send to the receiving side (another geo-region) or (b) tag every edit with the intended regions it should hit and have the WALEntryFilter respect the tags from mutations.

The second aspect is whether you can query the whole data set from any region. Of course, if you have some data not leaving its particular geo-region, you cannot have all the data aggregated in a single DC. So the only way to access the data in whole would be to dynamically send the query to all affected geo-regions and merge the results back.