
HBase Federation

I would like to know whether federated namespaces will help us in the following situations.
  1. Sandbox for STG HBase (as well as HDFS)
  2. Sandbox for deploying a new version of our application into one (STG) namespace while the other (PROD) namespace keeps running the old version, without the two interfering with each other
  3. Limiting each environment's resource usage while maximizing use of the available resources.
Our current Hadoop environment:
We have a very small QA cluster (on VM) that is used for testing configuration changes and CDH upgrades, and for small functional tests before moving new code to the prod CDH cluster. (No performance measurement is done on the QA cluster because it can't access data on the production system.)
We have a small prod cluster with 15 data nodes. This prod cluster is used for the actual functional and performance testing as well as for production runs.
To protect the two types of runs (prod and testing) from each other, we manage namespaces programmatically ourselves through a naming convention, a prefix (STG_ or PROD_) on HBase table and HDFS directory names, instead of using HDFS Federation to manage HDFS namespaces.
Question 1:
I know we can use HDFS federation to separate namespaces for HDFS, but I am not sure whether it provides namespaces for HBase as well. In other words, when we run Spark jobs (or Sqoop) against the STG NameNode of the federation, we would like to restrict access to HBase and other systems to the STG namespace only. We don't want to manage HBase table names using a prefix (STG or PROD) because we may accidentally access the other namespace and ruin its data. With separate namespaces, we could safely use the same HBase table name for both STG and PROD jobs without colliding with HBase in the other environment.
Question 2:
Also, I wonder whether we can run a different version of our application on each namespace. For example, can we deploy a new version of the application to the STG namespace but not yet to the PROD namespace, so that we can test our changes on STG before pushing them to PROD?
Question 3:
I know YARN provides the Fair Scheduler (CDH 5 sets it as the default). When we have federated namespaces, can we apply the Fair Scheduler (or something similar) between the STG and PROD namespaces?
For example, when no job is running on the PROD namespace, we would like to use the full resources (memory and CPU) for STG jobs. But when any PROD job starts, running STG jobs must release resources to the PROD jobs, up to a preconfigured quota.



Re: HBase Federation

If the goal is proper isolation between an unstable environment and a
production one - HDFS federation wouldn't be the best idea.

For (1):

You can associate HBase with one of your federated NameNode/NameService
URIs, and that is the one it will use. HBase does not support spreading its
table load across multiple nameservices, so if you want both of your
environments to have HBase as an independent entity, you must run two HBase
services, each hosted by a different NameNode via the hbase.rootdir and
fs.defaultFS property URI configurations.
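
As a rough illustration of the two-service setup, the sketch below renders the
two properties for each environment as Hadoop-style configuration XML. The
nameservice IDs (ns-stg, ns-prod) and the /hbase root path are assumptions made
up for this example, not values from your cluster; hbase.rootdir normally goes
in hbase-site.xml and fs.defaultFS in core-site.xml.

```python
import xml.etree.ElementTree as ET


def hadoop_config_xml(properties):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def hbase_site_for(nameservice):
    # hbase.rootdir ties this HBase service to exactly one nameservice.
    return {"hbase.rootdir": f"hdfs://{nameservice}/hbase"}


def core_site_for(nameservice):
    # fs.defaultFS makes the same nameservice the default filesystem.
    return {"fs.defaultFS": f"hdfs://{nameservice}"}


# Hypothetical nameservice IDs, one per environment.
NAMESERVICES = {"stg": "ns-stg", "prod": "ns-prod"}

for env, nameservice in NAMESERVICES.items():
    print(env, "hbase-site.xml:", hadoop_config_xml(hbase_site_for(nameservice)))
    print(env, "core-site.xml: ", hadoop_config_xml(core_site_for(nameservice)))
```

Each HBase service would be started with its own copy of these files, so the
STG service never resolves paths on the PROD nameservice.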

HBase does have 'namespaces' of its own, but that's quite different than
what HDFS namespaces mean in federation context (HBase's are more similar
to Databases in Hive).
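
To make the difference concrete: an HBase namespace only qualifies a table name
in the form namespace:table inside a single HBase service. A minimal sketch of
such a naming helper follows; the namespace names "stg" and "prod" and the
table name "orders" are hypothetical.

```python
def qualified_table_name(namespace, table):
    """Return an HBase table name in 'namespace:table' form."""
    if not namespace or not table:
        raise ValueError("namespace and table must be non-empty")
    if ":" in namespace or ":" in table:
        raise ValueError("namespace and table must not contain ':'")
    return f"{namespace}:{table}"


# Same table name, two namespaces, but still one HBase service underneath.
print(qualified_table_name("stg", "orders"))
print(qualified_table_name("prod", "orders"))
```

This is a logical grouping, much like the prefix convention you already use,
so on its own it does not give you two independent HBase environments.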

For (2):

As long as you divide your configurations well in addressing specific
NameNodes, this should be doable, yes.
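
One way to keep the configurations divided is a separate client configuration
directory per environment, selected through the HADOOP_CONF_DIR and
HBASE_CONF_DIR environment variables. This is only a sketch; the directory
paths below are assumptions for illustration.

```python
import os

# Hypothetical per-environment client configuration directories.
CONF_DIRS = {
    "stg": {
        "HADOOP_CONF_DIR": "/etc/hadoop/conf.stg",
        "HBASE_CONF_DIR": "/etc/hbase/conf.stg",
    },
    "prod": {
        "HADOOP_CONF_DIR": "/etc/hadoop/conf.prod",
        "HBASE_CONF_DIR": "/etc/hbase/conf.prod",
    },
}


def client_env(environment, base_env=None):
    """Return a process environment that points clients at one environment."""
    if environment not in CONF_DIRS:
        raise ValueError(f"unknown environment: {environment!r}")
    env = dict(os.environ if base_env is None else base_env)
    env.update(CONF_DIRS[environment])
    return env


print(client_env("stg", base_env={})["HADOOP_CONF_DIR"])
```

The returned dict can be passed as the env argument of subprocess.run when
launching a job, so a given application version only ever sees one
environment's NameNode and HBase settings.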

For (3):

YARN has no support for identifying one namespace over another, much like
HBase does not. It can be bound to one namespace for the history
logging/etc. parts, but if your goal is to run a single YARN cluster
serving both sets of users/usages, you need to do so via regular queue
configurations (and then explicitly using such queues via apps, or making
the auto-queue-placement rules work to your benefit with whatever sources
of user->queue mapping data it supports).
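
For the Fair Scheduler, that queue setup lives in an allocation file. The
sketch below generates one with a "prod" and a "stg" queue; the queue names,
weights, minimum resources and timeout are made-up values, not
recommendations. Preemption also has to be switched on separately
(yarn.scheduler.fair.preemption in yarn-site.xml) before STG containers are
reclaimed for PROD.

```python
import xml.etree.ElementTree as ET

# Hypothetical queue settings: prod gets a guaranteed minimum share and may
# preempt after 60 seconds below it; stg can use whatever is idle.
QUEUES = {
    "prod": {
        "weight": 3.0,
        "minResources": "40960 mb,20vcores",
        "minSharePreemptionTimeout": 60,
    },
    "stg": {"weight": 1.0},
}


def fair_scheduler_allocations(queues, default_queue):
    """Build a Fair Scheduler allocation file as an XML string."""
    root = ET.Element("allocations")
    for name, settings in queues.items():
        queue = ET.SubElement(root, "queue", name=name)
        for key, value in settings.items():
            ET.SubElement(queue, key).text = str(value)
    # Placement: honor the queue the app asked for, else fall back to default.
    policy = ET.SubElement(root, "queuePlacementPolicy")
    ET.SubElement(policy, "rule", name="specified")
    ET.SubElement(policy, "rule", name="default", queue=default_queue)
    return ET.tostring(root, encoding="unicode")


print(fair_scheduler_allocations(QUEUES, default_queue="stg"))
```

With a layout like this, STG jobs can fill the cluster while PROD is idle,
and jobs that do not name a queue land in stg rather than prod.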