We planing build data warehouse on hive. With streaming data to it.
I am looking for best practices or any ideas.
I found only one article about this, but it very old.
For any experience sharing will be very appreciate
Hi @Ilya Li
Building DW in hive is little tricky. Particularly saying choosing the model , whether to build dimensional modelling, data vault or any other traditional modelling approach. Because Joins are always costly if you need to build a dimensional model then there will be a fact table which has to be joined with many dimensions to get the required fields. I'm not saying not to go with dimensional model but rather think whether do you really need a DM in Hive. It all depends upon the use case of the warehouse which you are going to build.
Based on my experience build a conformed dimensions which can be consumed not only by the warehouse but also in future if any other datamart has to be built then it can make use of these conformed dimensions. When building a fact few things has to be considered. If the warehouse purpose is only for few reporting then make sure you have all the necessary attributes are captured from dimensions into the fact , so that when reports are being generated there is no need for you to traverse other dimensions.
Hive doesn't have PK and FK as of now. So you have to make sure the data integrity is maintained manually and hive will not throw any error when duplicate PK are inserted into a table. Generating sequence is possible in hive, but when a table contains billions of records in which you have to generate sequence there there will a performance issue. You have to either split the data into chunks and generate surrogate key or a UDF has to be written in such cases.
There are multiple file native format available in hive like ORC, Avro, Textfile , sequence file , parquet .,.etc. Choose a best one based on the need. Check for other table properties, indexing w.r.t hive. To highlight a few format ORC performs better with Zlib compression as ORC has in built index. If you schema is changing rapidly then go for avro or parquet. Again choose the best based on the need.
Check for Datalake in hive for scalability model.
Just to add to @Bala Vignesh N V answer: Performance is greatly increased if you use Hive LLAP (now GA in HDP 2.6) and ORC file format.
Note that latest release of HDP has ACID/MERGE functionality for slowy changing dimensions. See below link on Hive Streaming.
Also, if you want to consider OLAP cubes on HDP, it is worth checking into AtScale: https://hortonworks.com/partner/atscale/
Some helpful links:
Datawarehousing with HIve: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_data-access/content/ch_using-hive.html
Optimizing an Apache Hive Datawarehouse: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.0/bk_hive-performance-tuning/content/ch_hive-...