Support Questions

Find answers, ask questions, and share your expertise

How to compare two hive tables that are in different clusters?

avatar
Expert Contributor

I have similar hive tables in two clusters, say my Staging and Prod. I would like to compare the tables in Staging and Prod. Is there a straight way to do it using Hive or Beeline?

13 REPLIES 13

avatar
Super Collaborator

Hive and Beeline are the clients configured for a specific cluster. They can not communicate to multiple clusters.

The best way to compare data is to distcp hive table data from one cluster to another then create an external table on this data. Now both the tables will be available in the same cluster, so you can compare them easily.

avatar
Guru

A shortcut you can take without copying data across is to copy metadata. This only works if all nodes from Staging can communicate with all nodes in Prod. Have a replica table from Prod into staging but hive 'location' pointing to the full hdfs path of prod. (If this is HA, you need to setup the client configuration to get the resolve HA path). Then you see both hive tables in staging, but one of it is actually pointing to HDFS from Prod. Compute in this case happens in staging but data comes from Prod for the Prod table. There will not be any 'data-local' processing but you can avoid copy of data.

You can take it the other way as well where comparison happens on Prod, but comparison happening on staging is better since you are not adding any compute load on Prod with this.

avatar
Master Guru

The way to do it will be to run a set of queries against both tables.

Testsuite:

This should include a set of basic queries ( counts, min,max,sum,avg of each number columns perhaps count distinct for string fields ), You also normally have a couple of more complex business queries in the mix ( testing suite ). You could do a SELECT * but in that case in a bigdata environment you most likely have too much data.

Executing the queries:

You can use beeline ( in contrast what was said above you CAN connect to a remote cluster with it. If your cluster is kerberized your client would need to be part of both realms though ). But please use PAM/LDAP authentication for Hiveserver2 makes life so much easier.

If you use beeline you can output the data as CSV data to make life simpler.

https://community.hortonworks.com/questions/25789/how-to-dump-the-output-from-beeline.html

You can also use other tools like sqlline or write a custom Java client.

Comparison tool:

You then need to compare the output with each other. You can do that in Linux using diff however in this case you need to have the data sorted by the query using an order by. Or sort using any linux tool.

You can also write a little java program that does everything in one. The below does this but much more but you could perhaps use it as an inspiration. HiveConnection and HiveHelper classes. But be-warned its ugly.

https://github.com/benleon/HivePerformance/

avatar
Guru

Its ideal any comparison happens within hadoop/hive/pig/spark using yarn instead of exporting CSVs and doing a diff on a client node. Export would still work on smaller datasets but its better we do the comparison within cluster rather than by taking data out of 2 clusters and doing a compare at client level. We have done some of these kind of comparisons using Pig. With this approach any corrective measures on your dataset like upsert and merge will be easier than client based csv diff.

avatar
Master Guru

I think only Alex knows rhe requirementa. If you want to do a sanity check if the main data propagation worked correctly I woild rather run a set of aggregation queries against hive on both and compare the results. Otherwise you essentially tell him "to check if distcp worked do distcp and compare " which is a bit catch22. Also for big amounts of data this is not efficient. It is better to run a good mixed set of test queries that cover all main columns etc. This way you also catch problems in hive.

avatar
Guru

True that only he knows the requirements. And if its plain compare, its easier just running hive queries on both datasets. The solution that I gave does not require distcp. Its a way to compare and merge data between 2 datasets without copying data, but have a hive table that points to data from other HDFS. This would still involve data transfer across the cluster on the mapside but works well for upsert/merge scenarios (which is generally where you will end up once comparison shows that there are differences in data)

avatar
Master Guru

I mean I see the benefit of parallel computation but you essentially pipe the full dataset over the network. If you can ascertain data correctness using a set of test queries on both clusters ( perhaps as part of that using something like an Aggregated MD5 hash if you want to be really safe ) you save yourself from having to do that.

I think all of the approaches have advantages:

a) Copying everything over to one cluster and comparing them locally:

This assumes that the copy process is safe but you can use the full power of hadoop to join and compare the data sets locally after.

b) Your approach of pointing one cluster to two hdfs

The computation happens in parallel and if you have a fast network connection between the nodes it might be a good approach.

c) My approach of running a set of test queries on both Hiveservers ( mix of simple aggregations, some representative business queries and perhaps a hash function and comparing the small results client side.

You can do the computations on both clusters locally without transmitting much out of the cluster network and you also test equivalence of Hive on both systems ( We had a question just today of hive giving wrong results because of statistics differences ). However you cannot do a full comparison of the data sets.

Choose your poison I suppose. I still think my approach makes most sense if you essentially want to compare two Hive warehouse instances that should be identical and it requires minimal changes to the cluster setup. However you obviously cannot do more in-depth record level comparisons. If you have big datasets and you absolutely have to compare every single record you need to go with a or b.

avatar
Guru

Good summary. My original comment was not against computations on both clusters locally, but against exporting data locally to csv and doing a diff. That is something that I try to avoid unless that data is very small.

avatar
Master Guru

totally agreed, pulling out a big table into the client and doing the diff there will not scale.