Best practices for architecture and naming of HDFS paths and Hive databases for dev and test on one cluster
Labels: Apache Hive, HDFS
Created on ‎08-01-2017 11:05 AM - edited ‎09-16-2022 05:01 AM
We are moving our Oracle "landing" data into Hadoop. In Oracle we have three environments and three Oracle databases: dwdev, dwtest, and dwprod. The goal is to have three separate "landing" zones in Hadoop that will feed into each Oracle database, respectively, i.e. Hadoop dev feeds Oracle dwdev, etc.
The dev and test Hadoop environments will exist on a single physical Hadoop cluster.
How do we architect this?
HDFS
/<env>/data/<information_area>/<table_name>
/dev/data/marketing/customer_master
/test/data/marketing/customer_master
HIVE
database namespace (or schema_owner) = db_marketing
table name = customer_master
In DEV, select * from db_marketing.customer_master would source from /dev/data/marketing/customer_master.
In TEST, select * from db_marketing.customer_master would source from /test/data/marketing/customer_master.
Does this require multiple metastores?
What is best practice for multiple environments on a single Hadoop cluster?
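For illustration, a minimal HiveQL sketch of one way to lay this out with a single shared metastore; the environment-prefixed database names and the column list are hypothetical, not part of the original question:

-- One metastore, one database per environment, each rooted at its own HDFS path
CREATE DATABASE IF NOT EXISTS dev_db_marketing LOCATION '/dev/data/marketing';
CREATE DATABASE IF NOT EXISTS test_db_marketing LOCATION '/test/data/marketing';

-- External tables keep the same table names in both environments
CREATE EXTERNAL TABLE IF NOT EXISTS dev_db_marketing.customer_master (
  customer_id BIGINT,
  customer_name STRING
)
STORED AS PARQUET
LOCATION '/dev/data/marketing/customer_master';

-- Queries then differ only in the database qualifier:
-- SELECT * FROM dev_db_marketing.customer_master;
-- SELECT * FROM test_db_marketing.customer_master;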
Created on ‎08-01-2017 07:46 PM - edited ‎08-01-2017 07:48 PM
You can have a separate database in the Hive metastore for each of the three environments, plus the HDFS directory structure you described, but I would strongly recommend a separate Hadoop cluster for production. You will have NameNode HA and ResourceManager HA configured for these environments, and maintaining them will be cumbersome if there is heavy load from all three environments on a single Hadoop cluster.
What you can do is isolate production from the other two environments by giving it a separate Hadoop cluster.
I assume you are planning to have a shared metastore for Hive/Impala.
You should also take the other ecosystem components into consideration.
Please let me know if you need more information.
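A quick way to sanity-check that layout against a single shared metastore (database names are hypothetical, matching the sketch above):

-- All per-environment databases live in the one metastore
SHOW DATABASES LIKE '*db_marketing';
-- Each database should report its own environment's HDFS location
DESCRIBE DATABASE dev_db_marketing;
DESCRIBE DATABASE test_db_marketing;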
Created ‎08-02-2017 07:23 AM
My background is in Oracle, so I need further clarification on how to architect the Hadoop environment. (It is my understanding that in Hadoop, a "database namespace" is synonymous with "schema owner" in Oracle.)
We currently have three Oracle databases: dwdev, dwtst, and dwprd, and they are each on their own hardware server.
Within each database there is a schema owner named marketing which owns a set of tables, i.e. customer_master, product_master, etc.
If we want to simulate this in Hadoop, except that dev and test will exist on the same hardware cluster, how do we do that, given that we do not want to change the existing schema owner and table names? In Oracle, you reference a table as <schema_owner>.<table_name>. Does this require having separate metastores?
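One hedged sketch of how the table names can stay untouched even though the database names are environment-prefixed (names hypothetical): the environment is selected once per session, and the unqualified queries are then identical in dev and test.

-- Choose the environment at the start of the session ...
USE dev_db_marketing;    -- or: USE test_db_marketing;
-- ... then the query text itself does not change between environments
SELECT * FROM customer_master;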
Created on ‎08-02-2017 08:41 AM - edited ‎08-02-2017 09:03 AM
Please refer to the link below; it will give you some insight into creating the HDFS directory structure (best practices):
https://www.quora.com/What-is-the-best-directory-structure-to-store-different-types-of-logs-in-HDFS-...
Also, it is recommended to have an odd number of master nodes, since you will be using HA / ZooKeeper; typically three masters.
You can create a database in Hive and create tables underneath it as you do in Oracle, and create/grant permissions to the schema owner. You don't need a separate Hive metastore; you can configure a shared metastore that both Hive and Impala use.
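As a hedged sketch of the permissioning step (this assumes Sentry-style role-based grants through HiveServer2/Impala, which is one common setup on a Cloudera cluster; the role, group, and database names are hypothetical):

-- Give the marketing schema owner's group full access to its dev database
CREATE ROLE marketing_dev_role;
GRANT ALL ON DATABASE dev_db_marketing TO ROLE marketing_dev_role;
GRANT ROLE marketing_dev_role TO GROUP marketing;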
Created ‎12-16-2017 06:13 PM
I'm looking for best practices for architecting and naming HDFS file paths (a naming taxonomy), considering that the users are analytical users who implement data preparation and data modeling processes.
I would appreciate any tips on designing a service on HDFS with an overwrite strategy that yields an easy, friendly data model for training analytical and statistical models in a modeling-as-a-service setup.
For instance, the files have 3000+ columns and store more than 48 months of history. Any tips for managing such huge volumes of data?
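A minimal sketch of one common pattern for a wide, history-heavy dataset like this, assuming monthly partitions and Parquet storage (database, table, and column names are hypothetical, and the real table would list the 3000+ feature columns):

-- Partition by month so each load can be overwritten without touching older history
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.feature_store (
  entity_id BIGINT,
  feature_1 DOUBLE,
  feature_2 DOUBLE
)
PARTITIONED BY (load_month STRING)
STORED AS PARQUET
LOCATION '/data/analytics/feature_store';

-- Overwrite strategy: re-running a load replaces only that month's partition
INSERT OVERWRITE TABLE analytics.feature_store PARTITION (load_month = '2017-12')
SELECT entity_id, feature_1, feature_2
FROM analytics.feature_store_staging;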
