Support Questions
Find answers, ask questions, and share your expertise

once Data transfer from Netezza to hive using sqoop. How to verfiy all records from both side using script?

 
5 REPLIES 5

Really hard question. There are different ways of doing it:

1) the "easy" way. Define a set of business queries like a test suite of common SQL queries plus some basic ones like count(*) min/max/avg of evert number column and make sure both return the same result. Should be good enough for most cases just make sure you not only do simple checks like count(*).

2) The hard way

Do a complete comparison. To do this you need to pull both data sets out, sort them and then diff them. Linux can do that and I created java applications that do similar things before. But for really big datasets hardly feasible.

First you need to find a way to unload both tables in an identical way. you cannot use beeline since they format data in a given way. There is also floating point exactness etc.

Using JDBC often seems to be the easiest way and do the writing yourself. You can then sort the files using Java ( load the files as lines and do Collection.sort ) or sort both files with the linux sort command. Then use the linux diff command or implement something similar in Java.

Some data analysis programs might be able to do the same. If anybody knows of any ...

Thanks for reply if possible can you please send the java application you created!

@Benjamin Leonhardi @mike pal We have a similar situation and we are planning to use 1st approach of defining custom validation queries and running checks after data transfer.

Hive has a HPL/SQL library which includes a CMP function. Might be useful.

http://www.hplsql.org/cmp

New Contributor

if there are no implied transformations... such as precision of a timestamp, or representation of a null value, you might be able to use the netezza hash8 or `hash4()` functions to create a crc check. I'm sure something like would be available in the java world with map reduce or spark to do the check on the target side also. This might be more higher performance for a table with a lot of columns.