Created 08-14-2024 11:50 AM
Hello Everyone,
We are developing a data lakehouse using Hive for the banking and financial sector. We would appreciate your insights on the following:
1. Which data modeling approach is recommended for this domain?
2. Are there any sample models available for reference?
3. What best practices should we follow to ensure data integrity and performance?
4. How can we efficiently manage large-scale data ingestion and processing?
5. Are there any specific challenges or pitfalls we should be aware of when implementing a lakehouse in this sector?
Your expertise and guidance would be greatly appreciated.
Created 08-14-2024 12:17 PM
@APentyala Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our CDW experts @smruti @asish who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres
Created 08-14-2024 09:08 PM
@APentyala Please find the answers below:
1. Which data modeling approach is recommended for this domain?
Ans: If you have large data, we would recommend going with partitioning or multi-level partitioning. You could also implement bucketing if the data inside a partition is large.
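For illustration, a minimal sketch of a partitioned and bucketed Hive table (the table name, column names, and bucket count are hypothetical, not from this thread):

```sql
-- Hypothetical transactions table, partitioned by date and bucketed by account
CREATE TABLE transactions (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2),
  txn_type   STRING
)
PARTITIONED BY (txn_date DATE)
CLUSTERED BY (account_id) INTO 32 BUCKETS
STORED AS ORC;
```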
2. Are there any sample models available for reference?
Ans: You could take a reference for partitioning and bucketing from https://www.linkedin.com/pulse/what-partitioning-vs-bucketing-apache-hive-shrivastava/
You could create a new table from the existing table by performing a CTAS with dynamic partitioning.
Reference: https://www.geeksforgeeks.org/overview-of-dynamic-partition-in-hive/
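Depending on your Hive version, CTAS may not accept a partition spec directly, so a common equivalent pattern is a separate CREATE followed by a dynamically partitioned INSERT...SELECT. A minimal sketch with hypothetical table names:

```sql
-- Allow Hive to create partitions from the data itself
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical partitioned target table
CREATE TABLE transactions_part (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2)
)
PARTITIONED BY (txn_date DATE)
STORED AS ORC;

-- The partition column must be the last column in the SELECT list
INSERT OVERWRITE TABLE transactions_part PARTITION (txn_date)
SELECT txn_id, account_id, amount, txn_date
FROM transactions_raw;
```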
3. What best practices should we follow to ensure data integrity and performance?
Ans: Please follow the best practices below:
a. Partition and bucket the data.
b. You could use Iceberg tables, which would significantly reduce the load on the Metastore, if you are using CDP Public Cloud or CDP Private Cloud (ECS/OpenShift).
c. Use ORC/Parquet file formats.
d. Use EXTERNAL tables if you don't perform updates/deletes, as reading external tables is faster (see the sketch after this list).
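To illustrate points b and d, a minimal sketch of an external ORC table and an Iceberg table (names, the HDFS path, and the exact Iceberg syntax are assumptions and may vary by CDP/Hive version):

```sql
-- External table over existing ORC files; the LOCATION path is hypothetical
CREATE EXTERNAL TABLE txn_history (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2)
)
PARTITIONED BY (txn_date DATE)
STORED AS ORC
LOCATION '/data/lakehouse/txn_history';

-- Iceberg variant in CDP Hive; exact syntax depends on the CDP/Hive version
CREATE EXTERNAL TABLE txn_history_ice (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2),
  txn_date   DATE
)
STORED BY ICEBERG;
```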
4. How can we efficiently manage large-scale data ingestion and processing?
Ans: A typical pipeline looks like:
Kafka/Spark Streaming: Ingestion
Spark: Data modelling/transformation
Hive: Warehousing, where you query the data (a Hive-side sketch follows)
Please be specific about the use case.
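As a Hive-side sketch of that flow, assuming the streaming layer lands raw files on HDFS (table names and the path are hypothetical): an external landing table plus a periodic load into the curated, partitioned table.

```sql
-- External landing table over files written by the ingestion layer (path is hypothetical)
CREATE EXTERNAL TABLE txn_landing (
  txn_id     BIGINT,
  account_id BIGINT,
  amount     DECIMAL(18,2),
  event_ts   TIMESTAMP
)
STORED AS PARQUET
LOCATION '/data/landing/transactions';

-- Periodic load into the curated, partitioned warehouse table
-- (assumes dynamic partitioning is enabled as shown earlier)
INSERT INTO TABLE transactions_part PARTITION (txn_date)
SELECT txn_id, account_id, amount, CAST(event_ts AS DATE) AS txn_date
FROM txn_landing;
```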
5. Are there any specific challenges or pitfalls we should be aware of when implementing a lakehouse in this sector?
Ans: There should be no major challenges; we would request more details on this point.
Created on 08-18-2024 11:46 AM - edited 08-18-2024 11:54 AM
Hi @asish
Thank you for your answers. Could you please provide more details on the following:
1. Data Modeling Design: Which model is best suited for a Lakehouse implementation, star schema or snowflake schema?
2. We are using CDP (Private) and need to implement updates and deletes (SCD Type 1 & 2). Are there any limitations with Hive external tables?
3. Are there any pre-built dimension models or ER models available for reference?
Created 08-19-2024 06:43 AM
Hi @APentyala
1. Data Modeling Design: Which model is best suited for a Lakehouse implementation, star schema or snowflake schema?
Ans: We don't have those designs, nor are we aware of any.
2. We are using CDP (Private) and need to implement updates and deletes (SCD Type 1 & 2). Are there any limitations with Hive external tables?
Ans: There are no limitations for EXTERNAL tables. Are you using HDFS or Isilon for storage?
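One caveat worth noting for SCD Type 1/2: row-level UPDATE/DELETE/MERGE in Hive work on ACID (transactional, managed) tables, or on Iceberg v2 tables in recent CDP releases, not on plain external tables. A minimal SCD Type 1 sketch with a hypothetical dimension table:

```sql
-- MERGE requires an ACID (transactional, managed) target table in Hive,
-- or an Iceberg v2 table in recent CDP releases.
-- Table and column names here are hypothetical.
MERGE INTO customer_dim AS tgt
USING customer_updates AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
  UPDATE SET name = src.name, address = src.address
WHEN NOT MATCHED THEN
  INSERT VALUES (src.customer_id, src.name, src.address);
```

For SCD Type 2 you would typically add effective/expiry date (or current-flag) columns and insert a new row for each change instead of updating in place.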
3. Are there any pre-built dimension models or ER models available for reference?
Created 08-19-2024 01:17 PM
Yes, we are using HDFS for storage. Is there any limitation?
Created 08-16-2024 12:55 PM
@APentyala Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
Regards,
Diana Torres,