At my company, we are doing a lot of research on the DV 2.0 data model and building some PoCs, but there isn't much experience shared on the web. I'm concerned about data replication (keeping data history in the enterprise layer almost duplicates the data in our data lake). Even though it is not exclusively DV-related, joins are costly and time-consuming with Hive and even with Impala. We have already developed PySpark applications to reduce the time of these joins, with interesting improvements, aiming at better build times for the staging, enterprise, and data access layers. We are already using Parquet files and partitioning.
I would appreciate any experience you can share with me
I would say that my experience doing the same has not been pleasant.
To make things worse, I inherited the system from an existing team; the previous architect eventually understood his blunder and left the organization. What he left behind was a poorly architected DV 2.0 on Hive 1.1, and the solution was extremely slow.
- We have now ended up storing the data in Parquet format.
- The never-ending slow jobs have been moved to PySpark, but they still take hours to complete even for a few million records.
- With no ACID support in Hive 1.1, all dedupes are done outside Hive using Spark.
- We still keep running into duplicate issues in multi-satellite scenarios.
- The joins are still very time-consuming.
- Overall, it was never a good decision to implement a highly normalized data model like DV 2.0 on Hive, which is optimized for denormalized data.