Member since 11-23-2017
Posts: 2, Kudos Received: 0, Solutions: 0
12-10-2017
08:56 PM
Having to copy the data from one cluster to another just to run some validations is quite bad, IMHO. When I worked at Hortonworks we used that approach to some extent on one project, but I eventually saw that it had many drawbacks. Instead, I think a checksum approach, where the heaviest computations are done locally on each cluster, is much better. I have been developing a small Python program that performs such a comparison (and lets you view the differences in a web browser), and it works quite well at my current company: https://github.com/bolcom/hive_compared_bq Hopefully it can work for you, @Alex Raj
11-23-2017
05:16 AM
Recently I developed a new program that does the same thing (Hive comparisons) in a much more efficient way (fully scalable, better visualization of the differences, data skew taken into account, etc.). Please consider this new approach instead of the old one I published in this article 2 years ago: https://github.com/bolcom/hive_compared_bq The code is again open source, so I welcome feedback, suggestions for new features, and contributions 🙂