Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to compare two hive tables that are in different clusters?

avatar
Expert Contributor

I have similar hive tables in two clusters, say my Staging and Prod. I would like to compare the tables in Staging and Prod. Is there a straight way to do it using Hive or Beeline?

13 REPLIES 13

avatar
Expert Contributor

Although not directly via beeline or Hive CLI, many database clients have tools to do comparisons from different data sources. I use AquaData Studio and it has worked very well for my needs. Obviously this is best suited for small datasets as the comparison is happening at the client and not the cluster.

avatar
Rising Star

If you had some time to spend you can use Falcon to orchestrate data replication from one cluster to another: you had to create Hive table on the second cluster but then you can compare easily the two tables directly from Beeline.

avatar
Expert Contributor

I would like to elaborate my requirement here.

We have a process that runs on a data set that cleanse it and transforms the data. This process runs in both Staging and Prod individually. We found that there was a data miss match in both the clusters after the job completes. And we wanted to find out the different records that were available in both the clusters. For this purpose I wrote a shell script that will run two beeline queries in the two clusters, generate the output locally and using the shell commands. I, then compare the data and generate a report out of it which is quite a long process. So, I wanted to know if we can compare the two tables from different cluster at the same time with any hive or beeline query.

avatar
New Contributor

Having to copy the data from 1 cluster to another just to be able to do some validations is quite bad IMHO. When I worked in Hortonworks we used a bit that approach on one project but I eventually I saw that there were many drawbacks in it.

Instead, I think that using a checksum approach so that all the heaviest computations is done locally to each cluster is much better.

I have been developing a small Python program that allows you to do such comparison (and eventually see the differences in a webbrowser) and it works quite well in my current company:

https://github.com/bolcom/hive_compared_bq

Hopefully it can work for you @Alex Raj