Best Method to Verify Hive/Impala Replication

I'd like to verify that the data replicated to our secondary site matches our source in some repeatable, low-impact way, such as a checksum at the Hive/Impala level. My first idea was to compare the row counts of the Hive tables (after replication has completed and stats have been computed). Rather than running select count(*) on each table and wasting a lot of processing, I thought it would be easier to query the table metadata. For instance, "show table stats <table>;" shows the number of rows. Is there a way to compare the Impala table stats between the two clusters? I'd love to grab numRows from the table stats on each cluster, write the values out, and run a diff on them, or something like that.
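To make the idea concrete, here is a rough sketch of the diff step only. It assumes each cluster's numRows values have already been dumped to a text file with one "table<TAB>numRows" line per table; that file layout and the function names (parse_counts, load_counts, diff_counts) are made up for illustration and are not part of Hive, Impala, or Replication. It treats -1 as "stats not computed", since that is what Impala reports for #Rows before stats are computed.

```python
def parse_counts(text):
    """Parse 'table<TAB>numRows' lines (assumed format) into a dict."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        table, rows = line.split("\t")
        counts[table.strip()] = int(rows)
    return counts


def load_counts(path):
    """Read a dump file written on one cluster and parse it."""
    with open(path, encoding="utf-8") as handle:
        return parse_counts(handle.read())


def diff_counts(source, target):
    """Return a list of (table, problem, source_rows, target_rows) tuples."""
    problems = []
    for table in sorted(set(source) | set(target)):
        src = source.get(table)
        dst = target.get(table)
        if src is None:
            problems.append((table, "missing on source", src, dst))
        elif dst is None:
            problems.append((table, "missing on target", src, dst))
        elif src == -1 or dst == -1:
            # -1 means stats were never computed, so the counts say nothing.
            problems.append((table, "stats not computed", src, dst))
        elif src != dst:
            problems.append((table, "row count mismatch", src, dst))
    return problems
```

An empty result from diff_counts(load_counts(source_file), load_counts(target_file)) would only mean the row counts agree; it would not prove the contents are identical, which is part of why I'm asking whether there is a better approach.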
 
Or stepping back, is there a better way to validate that I have everything replicated at the Hive/Impala level?
 
Thank you.