I'd like to verify that the data replicated to our secondary site matches our source, in some repeatable, low-impact way, like some kind of checksum at the Hive/Impala level. My first idea was to compare the row counts of Hive tables (after replication has completed and stats are computed). Instead of running a select count(*) on each table and wasting a lot of processing, I thought it would be easier to query the table metadata. For instance, in the output of "show table stats <table>;" I can see the number of rows. Is there a way to compare the Impala table stats between the two clusters? I'd love to grab the numRows from the table stats on each side, write them out, and run a diff on them, or something like that.
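To make the diff idea concrete, here is a rough sketch of what I have in mind. It assumes I have already dumped one line per table on each cluster in a made-up "<db.table><TAB><numRows>" format; that file layout and the function names parse_counts and diff_counts are my own placeholders, not anything Impala produces directly. It also treats a count of -1 as "stats not computed", since that is what Impala reports before stats are gathered.

```python
from typing import Dict, List, Tuple


def parse_counts(text: str) -> Dict[str, int]:
    """Parse lines of the (hypothetical) form '<db.table>\t<numRows>'."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, rows = line.rsplit("\t", 1)
        counts[name] = int(rows)
    return counts


def diff_counts(source: Dict[str, int],
                replica: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (table, problem) pairs for every table that does not match."""
    problems: List[Tuple[str, str]] = []
    for table in sorted(set(source) | set(replica)):
        src = source.get(table)
        dst = replica.get(table)
        if src is None:
            problems.append((table, "missing on source"))
        elif dst is None:
            problems.append((table, "missing on replica"))
        elif src == -1 or dst == -1:
            # -1 means stats were never computed, so the count is unusable
            problems.append((table, "stats not computed"))
        elif src != dst:
            problems.append((table, f"row count differs: {src} vs {dst}"))
    return problems


if __name__ == "__main__":
    # Inline sample data standing in for the two dumped files.
    source = parse_counts("sales.orders\t1200\nsales.items\t560\n")
    replica = parse_counts("sales.orders\t1200\nsales.items\t558\n")
    for table, problem in diff_counts(source, replica):
        print(f"{table}: {problem}")
```

Matching counts would only tell me the row totals agree, not that the contents are identical, so this is a cheap first check rather than a real checksum.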
Or stepping back, is there a better way to validate that I have everything replicated at the Hive/Impala level?