About luangsay

luangsay · ‎12-10-2017

Having to copy the data from one cluster to another just to run some validations is quite bad IMHO. When I worked at Hortonworks we used that approach a bit on one project, but I eventually saw that it had many drawbacks. Instead, I think a checksum approach, where all the heaviest computations are done locally on each cluster, is much better. I have been developing a small Python program that lets you do such a comparison (and eventually see the differences in a web browser), and it works quite well at my current company: https://github.com/bolcom/hive_compared_bq Hopefully it can work for you @Alex Raj
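To illustrate the checksum idea in general terms, here is a minimal sketch in plain Python. It is not the code of hive_compared_bq; the function names, the bucket count, and the in-memory lists standing in for the two tables are all assumptions made for the example. Each side computes a small per-bucket checksum locally, and only those summaries need to be exchanged and compared.

```python
import hashlib

def stable_hash(value):
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).hexdigest()
    return int(digest[:16], 16)

def bucket_of(key, num_buckets):
    """Assign a row key to a bucket (hypothetical helper)."""
    return stable_hash(key) % num_buckets

def bucket_checksums(rows, key_index=0, num_buckets=16):
    """Heavy step, run locally on each cluster: one checksum per bucket.

    Row hashes are summed modulo 2**64, so the result does not depend
    on the order in which rows are read.
    """
    sums = {}
    for row in rows:
        b = bucket_of(row[key_index], num_buckets)
        sums[b] = (sums.get(b, 0) + stable_hash(row)) % 2**64
    return sums

def differing_buckets(left, right):
    """Cheap step: compare the two small summaries."""
    return sorted(b for b in set(left) | set(right) if left.get(b) != right.get(b))

# Made-up sample data: same rows in a different order, one value changed.
cluster_a = [(1, "alice", 10), (2, "bob", 20), (3, "carol", 30)]
cluster_b = [(3, "carol", 30), (1, "alice", 10), (2, "bob", 21)]

diff = differing_buckets(bucket_checksums(cluster_a), bucket_checksums(cluster_b))
print(diff)
```

Only the rows in the buckets listed in `diff` would then need a closer look, which keeps the amount of data moved between clusters small.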

luangsay · ‎11-23-2017

Recently I developed a new program that does the same thing (Hive comparisons) in a much more efficient way (fully scalable, better visualization of the differences, skew handling, etc.). Please consider this new approach instead of the old one I published in this article 2 years ago: https://github.com/bolcom/hive_compared_bq The code is again open source, so I welcome feedback, suggestions for new features, and contributions 🙂

Last Visited	‎12-10-2017 08:56 PM

Member Since	‎11-23-2017 05:08 AM
Posts	2

Cloudera Community

Re: How to compare two hive tables that are in dif...

Re: Create a Hive Script to Validate Tables