
Best Method to Verify Hive/Impala Replication

I'd like to verify, in a repeatable, low-impact way, that the data replicated to our secondary site matches the source, ideally with some kind of checksum at the Hive/Impala level. My first idea was to compare the row counts of the Hive tables after replication has completed and stats have been computed. Rather than running select count(*) on every table and wasting a lot of processing, I thought it would be easier to query the table metadata. For instance, "show table stats <table>;" reports the number of rows. Is there a way to compare the Impala table stats between the two clusters? I'd love to grab numRows from the table stats on each side, write the values out, and run a diff on them, or something like that.
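
To make the "write them out and diff" idea concrete, here is a rough sketch of the comparison step. It assumes I have already dumped one line per table from each cluster in the form "db.table<TAB>numRows"; that file layout and the function names parse_counts and diff_counts are made up for illustration, and how the dump itself is produced is left open. It treats a negative row count as "stats not computed", since Impala reports -1 for #Rows when a table has no stats.

```python
def parse_counts(text):
    """Parse 'db.table<TAB>numRows' lines into a {table: rows} dict.

    The tab-separated layout is an assumption about how the stats were
    dumped from each cluster, not a format Hive or Impala emits by itself.
    """
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        name, rows = line.rsplit("\t", 1)
        counts[name] = int(rows)
    return counts


def diff_counts(source, target):
    """Return a sorted list of (table, problem) tuples; empty means no mismatch."""
    problems = []
    for name in sorted(set(source) | set(target)):
        src = source.get(name)
        dst = target.get(name)
        if src is None:
            problems.append((name, "missing on source"))
        elif dst is None:
            problems.append((name, "missing on target"))
        elif src < 0 or dst < 0:
            # A negative count means stats were never computed on that side.
            problems.append((name, "stats not computed"))
        elif src != dst:
            problems.append((name, f"row count differs: source={src} target={dst}"))
    return problems


if __name__ == "__main__":
    # Hypothetical sample data standing in for the two dump files.
    source = parse_counts("sales.orders\t1000\nsales.items\t250\n")
    target = parse_counts("sales.orders\t1000\nsales.items\t240\n")
    for table, problem in diff_counts(source, target):
        print(f"{table}: {problem}")
```

I realize matching row counts would only show that the same number of rows arrived, not that the contents are identical, which is partly why I'm asking whether something closer to a checksum exists.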
 
Or stepping back, is there a better way to validate that I have everything replicated at the Hive/Impala level?
 
Thank you.