
Database modeling, software development, and version control

Rising Star

Hi guys,

I am looking for best practices for using Hive, especially regarding database modeling, software development and, if possible, version control. At the moment I am struggling at the point where the logical world meets the code.

I have found tools that assist with modeling databases in Hive, e.g. Embarcadero (a Hortonworks partner?). So I could model my databases there and generate DDL scripts, I guess. To get version control I could add those scripts to Git or something similar. But what happens if many users want to work on the logical model at the same time? How do you handle such problems? And is jumping back and forth between database model versions only possible through Git, or can the tool itself do it?
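To make that concrete, a DDL script generated by such a tool and checked into Git might look roughly like this (database, table and column names are hypothetical):

```sql
-- 001_create_sales_db.sql -- a hypothetical generated DDL file, versioned in Git
CREATE DATABASE IF NOT EXISTS sales;

CREATE TABLE IF NOT EXISTS sales.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_ts    TIMESTAMP,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;
```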

All other scripts concerning the Hive databases and tables (ingestion and so on) live in a Git repository, so they are under perfect version control, but if something changes, many adaptations have to be made (at least one in the config file and maybe in INSERT statements, etc.). What I am missing in the code world is a nice view of the databases and tables, or even an entity-relationship diagram, but that is not my main interest.
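One way to cut down on those adaptations (a sketch, not a full solution) is to keep names in one place and pass them into the scripts with Hive variable substitution, e.g. invoked as hive --hivevar db=sales --hivevar dt=2016-01-01 -f ingest.sql:

```sql
-- ingest.sql -- the database name is supplied once via --hivevar, so a rename
-- only touches the invocation (or a single config file), not every script.
INSERT INTO TABLE ${hivevar:db}.orders PARTITION (order_date = '${hivevar:dt}')
SELECT order_id, customer_id, order_ts, amount
FROM ${hivevar:db}.orders_staging
WHERE to_date(order_ts) = '${hivevar:dt}';
```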

What is, in your opinion, a good way to tackle these problems? I mean, someone like Facebook surely does not manage all tables and databases via a Hive View or solely in code, or do they? How do you keep an overview in the big data world?

Any help is really appreciated! Thanks in advance!

Kind regards,

Ken

1 ACCEPTED SOLUTION

Master Guru

In my previous life I was part of a data modeling team. For Hive I used Embarcadero Data Architect for physical data modeling. Your logical model should not be impacted by the physical implementation; this is what the gods of LDM tell you, and I tend to agree. You should continue to version control your LDM just as you do today, since, again, the LDM has no bearing on the PDM. Version controlling the PDM is similar to how you do it on an RDBMS: store all your DDLs in sequential order, or invest in source control. Again, no different from what you do today for an RDBMS.

From a modeling perspective, I believe the challenge is how to apply modeling principles to Hive. Using 3NF is not the right approach; the Kimball approach (data marts) applies in this space. Those who do not want to invest time in thinking through a data model will tell you to hyper-denormalize in Hive, which generally means they don't have a clue what a data model is or how important one is in an enterprise. Hive LLAP and tools such as AtScale let us model in a more "native" format.
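To make the Kimball point concrete, a data mart in Hive might look like the following star-schema DDL sketch (all names are illustrative):

```sql
-- A small star schema: one fact table plus a dimension, kept as
-- sequentially numbered DDL files (001_..., 002_...) in source control.
CREATE TABLE IF NOT EXISTS mart.dim_customer (
  customer_key BIGINT,
  customer_id  STRING,
  name         STRING,
  segment      STRING
)
STORED AS ORC;

CREATE TABLE IF NOT EXISTS mart.fact_sales (
  customer_key BIGINT,   -- refers to dim_customer by convention only;
  product_key  BIGINT,   -- Hive does not enforce referential integrity
  quantity     INT,
  revenue      DECIMAL(12,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;
```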

Important: for NoSQL data models, it is best to use domain-driven design practices. That is the closest you will get to a PDM. NoSQL does not follow relational theory, so modeling it using core PDM values is a nugatory exercise. Apache Phoenix, a SQL skin on top of HBase, changes that: you can apply some relational-theory PDM rules there.
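As a sketch of what that looks like in Phoenix (table and column names are made up), the PRIMARY KEY constraint maps to the HBase row key, which is where classic PDM thinking about keys and access paths applies again:

```sql
-- Phoenix DDL over HBase: the composite primary key becomes the row key,
-- so key order is a deliberate physical design decision.
CREATE TABLE IF NOT EXISTS web_metrics (
  host      VARCHAR NOT NULL,
  metric_ts DATE    NOT NULL,
  cpu_usage DECIMAL,
  mem_usage DECIMAL
  CONSTRAINT pk PRIMARY KEY (host, metric_ts)
);
```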


2 REPLIES


Rising Star

Hi Sunile,

Thanks for your answer. We think we will store our initial model and then all ALTER scripts. The ALTER scripts will also be folded back into the initial model in case a complete re-deployment is wanted. To view the logical model, we will use a tool that can reverse engineer the DDL. We will try to establish a workflow along those lines and hope it works.
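A minimal sketch of that workflow (file names and columns are hypothetical): each change ships as an ALTER script for existing clusters, and the same change is folded back into the baseline CREATE script so a fresh deployment only needs the baseline.

```sql
-- 002_add_channel_to_orders.sql -- incremental change for existing clusters
ALTER TABLE sales.orders ADD COLUMNS (channel STRING);

-- 001_create_sales_db.sql (the baseline, updated in the same commit)
-- now includes the new column directly, for complete re-deployments:
--   CREATE TABLE IF NOT EXISTS sales.orders ( ..., channel STRING ) ...
```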