Support Questions

What are best practices for architecting and naming HDFS file paths and Hive database namespaces when development and test environments share one cluster and production is on another cluster?

We are moving our Oracle "landing" data into Hadoop. In Oracle we have three environments and three Oracle databases: dwdev, dwtest, and dwprod. The goal is to have three separate "landing" zones in Hadoop that will feed into each Oracle database, respectively, i.e. Hadoop dev feeds Oracle dwdev, etc.

The dev and test Hadoop environments will exist on a single physical Hadoop cluster.

How do we architect this?

For example:
database namespace (or schema_owner) = db_marketing

table name = customer_master

In DEV, select * from db_marketing.customer_master would read from /dev/data/marketing/customer_master

In TEST, select * from db_marketing.customer_master would read from /test/data/marketing/customer_master

Does this require multiple metastores?

What is best practice for multiple environments on a single Hadoop cluster?



Hi Kimberlee,

First point. Hadoop has a resource-management concept that lets you assign resources to specific groups via queues.

Looking into the Capacity Scheduler's queue configuration should help you distribute cluster resources between your Dev and Test environments.
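As a sketch of that queue setup (the queue names dev and test and the 40/60 capacity split are assumptions for illustration, not recommendations), the relevant entries in capacity-scheduler.xml could look like:

```xml
<!-- Define two top-level queues under root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>dev,test</value>
</property>
<!-- Give dev 40% and test 60% of cluster capacity -->
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.test.capacity</name>
  <value>60</value>
</property>
```

Jobs are then submitted to the matching queue (for example, set mapreduce.job.queuename=dev; in a Hive session).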

Second point.

Creating separate databases (schemas), schema owners, and schema groups for each environment lets you apply security at the HDFS level and keeps the two environments isolated from each other.
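One way to realize that separation (a sketch assuming a single shared metastore; the environment-prefixed database names are illustrative) is to create one Hive database per environment, each rooted at its own HDFS path:

```sql
-- One database per environment in a single metastore,
-- each pointing at that environment's landing zone
CREATE DATABASE dev_db_marketing
  LOCATION '/dev/data/marketing';

CREATE DATABASE test_db_marketing
  LOCATION '/test/data/marketing';

-- Tables created in each database then live under its path:
-- dev_db_marketing.customer_master -> /dev/data/marketing/customer_master
```

With distinct owners and groups on /dev and /test in HDFS, ordinary file permissions keep each environment's users out of the other's data.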

Third point.

Ranger policies can help you define separate security for the Dev and Test environments' databases and keep each environment's data safe. Hope this helps.
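As a hedged illustration of such a policy (the service name, group name, and database name here are assumptions for this example), a Hive policy created through Ranger's public REST API (POST to /service/public/v2/api/policy) might look like:

```json
{
  "service": "hadoopdev_hive",
  "name": "dev_marketing_select",
  "resources": {
    "database": { "values": ["dev_db_marketing"] },
    "table":    { "values": ["*"] },
    "column":   { "values": ["*"] }
  },
  "policyItems": [
    {
      "accesses": [ { "type": "select", "isAllowed": true } ],
      "groups": [ "dev_etl" ]
    }
  ]
}
```

The same shape, scoped to the test database and a test group, keeps each environment readable only by its own users.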