I would say that the experience of doing the same has not been pleasant. To make things worse, I inherited it from an existing team whose previous architect finally understood his blunder and left the organization. What he left behind was a poorly architected DV2.0 on Hive 1.1. The solution was extremely slow.

- We have now ended up storing the data in Parquet format.
- The never-ending slow jobs have been moved to PySpark, but they still take hours to complete even for a few million records.
- With the lack of ACID support in Hive 1.1, all dedupes are done outside Hive using Spark.
- We still keep running into duplicate issues in multi-satellite scenarios.
- The joins are still very time-consuming.
- Overall, it was never a good decision to implement a highly normalized data model like DV2.0 on Hive, which is optimized for denormalized data.
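For what it's worth, the satellite dedupe we ended up doing outside Hive boils down to: sort each satellite's rows by hash key and load date, and drop any row whose HASHDIFF matches the most recently kept row for that key. Here is a minimal, Spark-free Python sketch of that logic (all field names are hypothetical; in practice this runs as a PySpark window/filter over much larger volumes):

```python
# Sketch of Data Vault satellite dedupe: keep a row only when its
# HASHDIFF differs from the latest kept row for the same hash key.
# Field names (hash_key, load_date, hashdiff) are illustrative.

def dedupe_satellite(rows):
    kept = []
    last_hashdiff = {}  # hash_key -> hashdiff of latest kept row
    # Process rows in key + load-date order so "latest" is well defined.
    for r in sorted(rows, key=lambda r: (r["hash_key"], r["load_date"])):
        if last_hashdiff.get(r["hash_key"]) != r["hashdiff"]:
            kept.append(r)
            last_hashdiff[r["hash_key"]] = r["hashdiff"]
        # else: unchanged payload re-delivered by the source; drop it
    return kept

rows = [
    {"hash_key": "K1", "load_date": "2020-01-01", "hashdiff": "A"},
    {"hash_key": "K1", "load_date": "2020-01-02", "hashdiff": "A"},  # dup
    {"hash_key": "K1", "load_date": "2020-01-03", "hashdiff": "B"},  # change
    {"hash_key": "K2", "load_date": "2020-01-01", "hashdiff": "C"},
]
deduped = dedupe_satellite(rows)
```

In PySpark the same effect comes from a window partitioned by the hash key, ordered by load date, comparing each row's HASHDIFF with `lag()` of the previous one; the multi-satellite duplicates we keep hitting happen when this check is done per satellite but the staging feed splits one source row across satellites inconsistently.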