I would say our experience of doing the same has not been pleasant.
To make things worse, I inherited it from an existing team whose previous architect finally realized his blunder and left the organization. What he left behind was a poorly architected DV2.0 implementation on Hive 1.1, and the solution was extremely slow.
- We have now ended up storing the data in Parquet format.
- The never-ending slow jobs have been moved to PySpark, but they still take hours to complete even for a few million records.
- With no ACID support in Hive 1.1, all dedupes are done outside Hive using Spark (see the sketch after this list).
- We still keep running into duplicate records in multi-satellite scenarios.
- The joins are still very time-consuming.
- Overall, it was never a good decision to implement a highly normalized data model like DV2.0 on Hive, which is optimized for denormalized data.
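
To give an idea of the kind of dedupe we are forced to run outside Hive, here is a minimal PySpark sketch for loading a single satellite. The table and column names (staging.sat_customer, dv.sat_customer, hub_hash_key, hash_diff, load_dts) are illustrative placeholders, not our actual schema, and this is a simplified version of the pattern rather than our production job.

```python
# Hypothetical satellite dedupe in PySpark, since Hive 1.1 has no ACID/MERGE.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("sat_dedupe").enableHiveSupport().getOrCreate()

staged = spark.table("staging.sat_customer")   # new delta to load (placeholder name)
existing = spark.table("dv.sat_customer")      # current satellite stored as Parquet

# Latest hash_diff per hub key in the existing satellite
w = Window.partitionBy("hub_hash_key").orderBy(F.col("load_dts").desc())
latest = (existing
          .withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .select("hub_hash_key", F.col("hash_diff").alias("latest_hash_diff")))

# Keep only staged rows that are new or changed, and drop intra-batch duplicates
delta = (staged
         .dropDuplicates(["hub_hash_key", "hash_diff"])
         .join(latest, "hub_hash_key", "left")
         .filter((F.col("latest_hash_diff").isNull()) |
                 (F.col("hash_diff") != F.col("latest_hash_diff")))
         .drop("latest_hash_diff"))

# Without ACID we can only append; replayed batches can still reintroduce duplicates
delta.write.mode("append").format("parquet").saveAsTable("dv.sat_customer")
```

Multiply this by every satellite (and the multi-satellite cases where several of these loads have to agree with each other) and you get a sense of why the pipeline stays slow and fragile.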