I would like to know whether federated namespaces would help us in the following situations:
Sandbox for STG HBase (as well as HDFS)
Sandbox for deploying a new version of our application into one (STG) namespace while the other (PROD) namespace still runs the old version, without the two interfering with each other
Limiting resource usage while making full use of the available resources
Our current Hadoop environment
We have a very small QA cluster (on VMs) that is used for testing configuration changes and CDH upgrades, and for small functional tests before new code moves to the prod CDH cluster. (No performance measurement is done on the QA cluster because it can’t access data on the production system.)
We have a small prod cluster with 15 data nodes. This prod cluster is used for the actual functional and performance testing as well as for production runs.
To protect the two types of runs (prod and testing) from each other, we currently manage namespaces ourselves through a naming convention: a prefix (STG_ or PROD_) on HBase table names and HDFS directory names. We do not use HDFS federation to manage the HDFS namespaces.
I know we can use HDFS federation to separate namespaces for HDFS, but I am not sure whether it provides namespaces for HBase as well. In other words, when we run Spark (or Sqoop) jobs against the STG NameNode of the federation, we would like to restrict their access to HBase and other systems to the STG namespace only. We don’t want to manage HBase table names with a prefix (STG or PROD), because we may accidentally access the other namespace and ruin its data. With proper namespaces, we could safely use the same HBase table name for both STG and PROD jobs without colliding with HBase data in the other environment.
Also, I wonder whether we can run a different version of our application in each namespace. For example, can we deploy a new version of the application to the STG namespace but not yet to the PROD namespace, so that we can test our changes in STG before pushing them to PROD?
I know YARN provides the Fair Scheduler (CDH 5 makes it the default scheduler). With federated namespaces, can we apply the Fair Scheduler (or something similar) between the STG and PROD namespaces?
For example, when no job is running in the PROD namespace, we would like STG jobs to use the full resources (memory and CPU). But when a PROD job starts, running STG jobs must release resources to the PROD jobs, up to a preconfigured quota.
If the goal is proper isolation between an unstable environment and a production one, HDFS federation wouldn't be the best idea.
You can associate HBase with one of your federated NameNode/nameservice URIs, and that is the one it will use. HBase does not support spreading its table load across multiple nameservices, so if you want each environment to have HBase as an independent entity, you must run two HBase services, each hosted by a different NameNode via the hbase.rootdir and fs.defaultFS URI configuration properties.
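To make the two-service layout concrete, here is a minimal sketch that renders an hbase-site.xml per environment. The nameservice names (stg-ns, prod-ns) and the znode paths are hypothetical placeholders. The zookeeper.znode.parent property is not something you mentioned; it is included on the assumption that both HBase services would share one ZooKeeper ensemble, in which case each needs its own parent znode.

```python
import xml.etree.ElementTree as ET


def hbase_site(nameservice: str, znode_parent: str) -> str:
    """Render a minimal hbase-site.xml for one independent HBase service."""
    props = {
        # Root directory of this HBase service on its own nameservice.
        "hbase.rootdir": f"hdfs://{nameservice}/hbase",
        # Default filesystem, pointing at the same nameservice.
        "fs.defaultFS": f"hdfs://{nameservice}",
        # Assumption: a shared ZooKeeper ensemble, so each service
        # gets a distinct parent znode.
        "zookeeper.znode.parent": znode_parent,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical nameservice names and znode paths, one set per environment.
stg_site = hbase_site("stg-ns", "/hbase-stg")
prod_site = hbase_site("prod-ns", "/hbase-prod")
print(stg_site)
```

With a layout like this, the same table name can exist in both services, because a client only sees the HBase instance its own configuration points at.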
HBase does have 'namespaces' of its own, but they are quite different from what HDFS namespaces mean in the federation context (HBase namespaces are more like databases in Hive).
As for running different application versions side by side: as long as you divide your configurations well so that each deployment addresses its specific NameNode, this should be doable, yes.
YARN has no support for distinguishing one namespace from another, much like HBase. It can be bound to one namespace for things like history logging, but if your goal is to run a single YARN cluster serving both sets of users and workloads, you need to do so via regular queue configurations. Applications then either name their queue explicitly, or you make the automatic queue placement rules work to your benefit with whatever sources of user-to-queue mapping data they support.
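As a rough illustration of the queue approach, the sketch below renders a Fair Scheduler allocation file with one queue per environment. The queue names (prod, stg), the weights, the minimum resources, and the timeout are made-up values, not recommendations. It also assumes preemption is switched on in yarn-site.xml (yarn.scheduler.fair.preemption set to true); without that, STG containers would not be reclaimed when a PROD job starts.

```python
import xml.etree.ElementTree as ET


def fair_scheduler_allocations() -> str:
    """Render a minimal Fair Scheduler allocation file with two queues."""
    root = ET.Element("allocations")

    # PROD queue: higher weight plus a guaranteed minimum share.
    # All numbers here are illustrative placeholders.
    prod = ET.SubElement(root, "queue", name="prod")
    ET.SubElement(prod, "weight").text = "3.0"
    ET.SubElement(prod, "minResources").text = "40960 mb, 20 vcores"
    # Seconds PROD may sit below its minimum share before preempting.
    ET.SubElement(prod, "minSharePreemptionTimeout").text = "60"

    # STG queue: may use idle capacity, but holds no guaranteed minimum.
    stg = ET.SubElement(root, "queue", name="stg")
    ET.SubElement(stg, "weight").text = "1.0"

    # Placement: honor an explicitly requested queue, else fall back to stg.
    policy = ET.SubElement(root, "queuePlacementPolicy")
    ET.SubElement(policy, "rule", name="specified")
    ET.SubElement(policy, "rule", name="default", queue="stg")

    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


allocations_xml = fair_scheduler_allocations()
print(allocations_xml)
```

The design choice here is that STG can fill the whole cluster while PROD is idle, and PROD reclaims its minimum share through preemption once it has work. Falling back to the stg queue by default means a job that forgets to name a queue never lands in PROD by accident.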