About DataVoyager

DataVoyager · ‎12-13-2021

I would say that the experience of doing the same has not been very pleasant. To make things worse, I inherited it from an existing team whose previous architect finally understood his blunder and left the organization. What he left behind was a poorly architected DV2.0 on Hive 1.1. The solution was extremely slow.

- We have now ended up storing data in Parquet format.
- The never-ending slow jobs have been moved to PySpark, but they still take hours to complete, even for a few million records.
- Because Hive 1.1 lacks ACID support, all dedupes are done outside Hive using Spark.
- We still keep running into duplicates in multi-satellite scenarios.
- The joins are still very time-consuming.
- Overall, it was never a good decision to implement a highly normalized data model like DV2.0 on Hive, which is optimized for denormalized data.
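To give a rough idea of the kind of dedupe that has to happen outside Hive, here is a minimal sketch in plain Python. It is not the actual Spark job: the column names `hub_key`, `hash_diff` and `load_ts` and the function `dedupe_satellite` are hypothetical, and the rule shown (keep the earliest-loaded row per hub key and hash diff) is only one assumed way to dedupe satellite rows.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def dedupe_satellite(rows: List[Row]) -> List[Row]:
    """Keep the earliest-loaded row for each (hub_key, hash_diff) pair.

    Column names are hypothetical; load_ts is assumed to be an ISO-8601
    string, so plain string comparison orders rows chronologically.
    """
    earliest: Dict[Tuple[str, str], Row] = {}
    for row in rows:
        key = (row["hub_key"], row["hash_diff"])
        kept = earliest.get(key)
        # Replace the kept row only if this one was loaded earlier.
        if kept is None or row["load_ts"] < kept["load_ts"]:
            earliest[key] = row
    # Sort for a stable, readable result.
    return sorted(earliest.values(), key=lambda r: (r["hub_key"], r["load_ts"]))


# Made-up sample data: the second row repeats the first row's payload.
rows = [
    {"hub_key": "A", "hash_diff": "h1", "load_ts": "2021-12-01"},
    {"hub_key": "A", "hash_diff": "h1", "load_ts": "2021-12-02"},
    {"hub_key": "A", "hash_diff": "h2", "load_ts": "2021-12-03"},
    {"hub_key": "B", "hash_diff": "h1", "load_ts": "2021-12-01"},
]
deduped = dedupe_satellite(rows)
```

In Spark the same idea would be expressed as a grouped or windowed operation over the satellite table; the point is only that, without ACID updates and deletes in Hive, this logic has to live in the job rather than in the table.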

Last Visited	‎12-13-2021 07:47 AM

Member Since	‎12-13-2021 04:12 AM
Posts	1

Cloudera Community

Re: Has someone implemented Data Vault 2.0 on Hado...