We have a brand new cluster (HDP 3.1) on which we want to use the transactional (ACID) table capability built into Hive 3. While migrating processes from our old cluster (HDP 2.3) to the new one and adopting transactional tables, we noticed a substantial performance degradation compared to the old cluster.
To rule transactions in or out, we recreated the source tables as external, non-transactional tables and ran the same query. The results were fairly comparable to the old cluster, which tells us that Hive transactions are causing at least some of the slowdown.
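For reference, this is roughly what the two variants of a source table look like. It is only a sketch: the table names, columns, and HDFS path are made up for illustration, and the explicit `transactional` property is shown for clarity (on Hive 3, managed ORC tables are typically transactional by default).

```sql
-- Hypothetical names, columns, and path; adjust to the real schema.

-- Transactional (ACID) managed table.
CREATE TABLE source_events_acid (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- External table used for the comparison run.
-- External tables in Hive 3 are never transactional.
CREATE EXTERNAL TABLE source_events_ext (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
LOCATION '/data/benchmark/source_events_ext';
```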
After some analysis, we saw that data was being written to the target table by only one reducer. Since the job we're currently benchmarking writes several trillion records, it's safe to assume this is the bottleneck, with everything being routed through that single reducer. After some experimentation, we found that adding bucketing and partitioning to the target table increases the number of reducers and thus improves performance.
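As an illustration of the bucketing and partitioning change, a target table could be declared along these lines. This is a sketch under assumptions, not our actual DDL: the table and column names, the partition column, the source table `source_events`, and the bucket count of 64 are all placeholders that would need tuning for the real data.

```sql
-- Hypothetical target table: partitioned by date and bucketed by id,
-- so the write can be spread across more than one reducer.
CREATE TABLE target_events (
  event_id BIGINT,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (event_id) INTO 64 BUCKETS  -- bucket count is a guess
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Allow fully dynamic partitions for the load
-- (may already be set this way on the cluster).
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column goes last in the SELECT list.
INSERT INTO TABLE target_events PARTITION (event_date)
SELECT event_id, payload, event_date
FROM source_events;
```

Running `EXPLAIN` on the insert before and after the change is one way to compare how the reducer stage is planned.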