02-08-2019 07:33 AM - last edited on 02-09-2019 06:16 AM by cjervis
I'd like to verify, in some repeatable, low-impact way, that the data replicated to our secondary site matches our source. Something like a checksum at the Hive/Impala level. My first idea was to compare row counts of the Hive tables (after replication has completed and stats have been computed). Instead of running a select count(*) on each table and wasting a lot of processing, I thought it would be easier to query the table metadata. For instance, in "show table stats <table>;" I can see the number of rows. Is there some way to compare the Impala table stats between the two clusters? I'd love to grab numRows from the table stats on each side, write the values out, and run a diff on them, or something like that.
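To make the idea concrete, here is a rough sketch of the comparison step. It assumes the stats for each table have already been dumped on both clusters as delimited text with a header row (for example from impala-shell with -B and --print_header, which I believe gives tab-delimited output by default). The function names parse_num_rows and diff_counts are made up for illustration. It also assumes that partitioned tables end with a "Total" summary row and that a #Rows value of -1 means stats were never computed.

```python
import csv
import io


def parse_num_rows(stats_text, delimiter="\t"):
    """Extract the table-level #Rows value from delimited SHOW TABLE STATS output.

    Assumes the first line is a header containing a '#Rows' column. For a
    partitioned table the last data row is the 'Total' summary; for an
    unpartitioned table there is only one data row, so the last row works
    in both cases.
    """
    rows = [r for r in csv.reader(io.StringIO(stats_text), delimiter=delimiter) if r]
    header, data = rows[0], rows[1:]
    idx = header.index("#Rows")
    return int(data[-1][idx])


def diff_counts(source, target):
    """Compare two {table_name: num_rows} mappings.

    Returns a sorted list of (table, source_rows, target_rows) for every table
    that is missing on one side, has different counts, or has no computed
    stats (-1), since a pair of -1 values verifies nothing.
    """
    problems = []
    for table in sorted(set(source) | set(target)):
        s = source.get(table)
        t = target.get(table)
        if s is None or t is None:
            problems.append((table, s, t))  # table exists on only one cluster
        elif s != t or s < 0:
            problems.append((table, s, t))  # mismatch, or stats not computed
    return problems


if __name__ == "__main__":
    # Hypothetical sample output for one partitioned table.
    sample = "year\t#Rows\t#Files\n2018\t40\t1\n2019\t60\t1\nTotal\t100\t2\n"
    source = {"db.sales": parse_num_rows(sample), "db.users": 5}
    target = {"db.sales": 100, "db.users": 6}
    for table, s, t in diff_counts(source, target):
        print(f"{table}: source={s} target={t}")
```

Matching row counts would only show that the same number of rows arrived, not that the contents are identical, so this is a cheap first check rather than a true checksum.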
Or stepping back, is there a better way to validate that I have everything replicated at the Hive/Impala level?