Created 08-14-2024 11:50 AM
Hello Everyone,
We are developing a data lakehouse using Hive for the banking and financial sector. We would appreciate your insights on the following:
1. Which data modeling approach is recommended for this domain?
2. Are there any sample models available for reference?
3. What best practices should we follow to ensure data integrity and performance?
4. How can we efficiently manage large-scale data ingestion and processing?
5. Are there any specific challenges or pitfalls we should be aware of when implementing a lakehouse in this sector?
Your expertise and guidance would be greatly appreciated.
Created 08-14-2024 12:17 PM
@APentyala Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our CDW experts @smruti @asish who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres
Created 08-14-2024 09:08 PM
@APentyala Please find the answers below:
1. Which data modeling approach is recommended for this domain?
Ans: If you have large data, we would recommend going with partitioning or multi-level partitioning. You could also implement bucketing if the data inside a partition is large.
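For illustration, a minimal sketch of a partitioned and bucketed Hive table (the table name, column names, and bucket count are hypothetical, not from this thread):

```sql
-- Hypothetical transactions table, partitioned by date and bucketed by account
CREATE TABLE transactions (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2),
  txn_type   STRING
)
PARTITIONED BY (txn_date DATE)
CLUSTERED BY (account_id) INTO 32 BUCKETS
STORED AS ORC;
```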
2. Are there any sample models available for reference?
Ans: You could take a reference for partitioning and bucketing from https://www.linkedin.com/pulse/what-partitioning-vs-bucketing-apache-hive-shrivastava/
You could create a new table from the existing table by performing a CTAS with dynamic partitioning.
Reference: https://www.geeksforgeeks.org/overview-of-dynamic-partition-in-hive/
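Depending on your Hive version, CTAS may not accept a partition spec directly, so a common equivalent pattern is a separate CREATE followed by a dynamically partitioned INSERT...SELECT. A minimal sketch with hypothetical table names:

```sql
-- Allow Hive to create partitions from the data itself
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical partitioned target table
CREATE TABLE transactions_part (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2)
)
PARTITIONED BY (txn_date DATE)
STORED AS ORC;

-- The partition column must be the last column in the SELECT list
INSERT OVERWRITE TABLE transactions_part PARTITION (txn_date)
SELECT txn_id, account_id, amount, txn_date
FROM transactions_raw;
```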
3. What best practices should we follow to ensure data integrity and performance?
Ans: Please follow the best practices below:
a. Partition and bucket the data.
b. You could use Iceberg tables, which would significantly reduce the load on the Metastore, if you are using CDP Public Cloud or CDP Private Cloud (ECS/OpenShift).
c. Use ORC/Parquet file formats.
d. Use EXTERNAL tables if you don't perform updates/deletes, as reading external tables is faster (see the sketch after this list).
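To illustrate points b and d, a minimal sketch of an external ORC table and an Iceberg table (names, the HDFS path, and the exact Iceberg syntax are assumptions and may vary by CDP/Hive version):

```sql
-- External table over existing ORC files; the LOCATION path is hypothetical
CREATE EXTERNAL TABLE txn_history (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2)
)
PARTITIONED BY (txn_date DATE)
STORED AS ORC
LOCATION '/data/lakehouse/txn_history';

-- Iceberg variant in CDP Hive; exact syntax depends on the CDP/Hive version
CREATE EXTERNAL TABLE txn_history_ice (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2),
  txn_date   DATE
)
STORED BY ICEBERG;
```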
4. How can we efficiently manage large-scale data ingestion and processing?
Ans: A typical pipeline looks like:
Kafka/Spark Streaming: Ingestion
Spark: Data modelling/transformation
Hive: Warehousing, where you query the data (a Hive-side sketch follows)
Please be specific about the use case.
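As a Hive-side sketch of that flow, assuming the streaming layer lands raw files on HDFS (table names and the path are hypothetical): an external landing table plus a periodic load into the curated, partitioned table.

```sql
-- External landing table over files written by the ingestion layer (path is hypothetical)
CREATE EXTERNAL TABLE txn_landing (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2),
  event_ts   TIMESTAMP
)
STORED AS PARQUET
LOCATION '/data/landing/transactions';

-- Periodic load into the curated, partitioned warehouse table
-- (assumes dynamic partitioning is enabled as shown earlier)
INSERT INTO TABLE transactions_part PARTITION (txn_date)
SELECT txn_id, account_id, amount, CAST(event_ts AS DATE) AS txn_date
FROM txn_landing;
```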
5. Are there any specific challenges or pitfalls we should be aware of when implementing a lakehouse in this sector?
Ans: There should be no major challenges; we would request more details on this point.
Created on 08-18-2024 11:46 AM - edited 08-18-2024 11:54 AM
Hi @asish
Thank you for your answers. Could you please provide more details on the following:
1. Data Modeling Design: Which model is best suited for a Lakehouse implementation, star schema or snowflake schema?
2. We are using CDP (Private) and need to implement updates and deletes (SCD Type 1 & 2). Are there any limitations with Hive external tables?
3. Are there any pre-built dimension models or ER models available for reference?
Created 08-19-2024 06:43 AM
Hi @APentyala
1. Data Modeling Design: Which model is best suited for a Lakehouse implementation, star schema or snowflake schema?
Ans: We don't have those designs, nor are we aware of any.
2. We are using CDP (Private) and need to implement updates and deletes (SCD Type 1 & 2). Are there any limitations with Hive external tables?
Ans: There are no limitations for EXTERNAL tables. Are you using HDFS or Isilon for storage?
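One caveat worth noting for SCD Type 1/2: row-level UPDATE/DELETE/MERGE in Hive work on ACID (transactional, managed) tables, or on Iceberg v2 tables in recent CDP releases, not on plain external tables. A minimal SCD Type 1 sketch with a hypothetical dimension table:

```sql
-- MERGE requires an ACID (transactional, managed) target table in Hive,
-- or an Iceberg v2 table in recent CDP releases.
-- Table and column names here are hypothetical.
MERGE INTO customer_dim AS tgt
USING customer_updates AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
  UPDATE SET name = src.name, address = src.address
WHEN NOT MATCHED THEN
  INSERT VALUES (src.customer_id, src.name, src.address);
```

For SCD Type 2 you would typically add effective/expiry date (or current-flag) columns and insert a new row for each change instead of updating in place.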
3. Are there any pre-built dimension models or ER models available for reference?
Created 08-19-2024 01:17 PM
Yes, we are using HDFS for storage. Is there any limitation?
Created 08-16-2024 12:55 PM
@APentyala Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
Regards,
Diana Torres,