Hi, I am exploring ingesting data from an organization's existing data warehouse / databases into Hadoop (HDFS & Hive). One of the concerns is ensuring data integrity, e.g. row counts and correctness of the data. Are there any best practices or approaches to assure our operational users that the data imported from their source databases into Hadoop is accurate and complete? Thanks!
In my previous job we took an agile (MVP-based) approach. First we deployed a small script that counted rows per month in each fact table, and total rows in each dimension table, in both the Hive warehouse and the source system (something like the sketch below). Next, we built a web dashboard to highlight any differences. From there you can go further and add more detailed integrity checks, e.g. column-level checksums or spot comparisons of sampled rows.
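Here is a minimal sketch of such a count-comparison script, assuming PyHive for HiveServer2 and pyodbc for the source warehouse; the connection strings, the `fact_orders` table, the `order_date` column, and the SQL Server-style `FORMAT` function on the source side are all placeholders you would adapt to your environment:

```python
"""Compare per-month row counts between a source warehouse and Hive.

Sketch only: connection details, table name (fact_orders) and date
column (order_date) are illustrative, not prescribed by Hive or Hadoop.
"""
import pyodbc                 # ODBC client for the source warehouse
from pyhive import hive       # Thrift client for HiveServer2

# Each side needs its own SQL dialect's date formatting.
SOURCE_SQL = """
    SELECT FORMAT(order_date, 'yyyy-MM') AS month, COUNT(*) AS n
    FROM fact_orders GROUP BY FORMAT(order_date, 'yyyy-MM')
"""
HIVE_SQL = """
    SELECT date_format(order_date, 'yyyy-MM') AS month, COUNT(*) AS n
    FROM fact_orders GROUP BY date_format(order_date, 'yyyy-MM')
"""

def fetch_counts(cursor, sql):
    """Return {month: row_count} for one side."""
    cursor.execute(sql)
    return {month: n for month, n in cursor.fetchall()}

def main():
    src = pyodbc.connect("DSN=warehouse;UID=etl;PWD=...").cursor()
    hv = hive.connect(host="hiveserver2.example.com", port=10000).cursor()

    source_counts = fetch_counts(src, SOURCE_SQL)
    hive_counts = fetch_counts(hv, HIVE_SQL)

    # Report any month where counts differ or exist on only one side.
    for month in sorted(set(source_counts) | set(hive_counts)):
        s, h = source_counts.get(month, 0), hive_counts.get(month, 0)
        if s != h:
            print(f"{month}: source={s} hive={h} diff={h - s}")

if __name__ == "__main__":
    main()
```

Comparing counts per month rather than one grand total is deliberate: it localizes a discrepancy to a partition, which makes re-importing just the affected slice much cheaper.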
However, the first line of defense should be a monitoring and alerting system that flags inconsistencies at the time of the import job itself, so immediate action can be taken before discrepancies pile up; a sketch of such a post-import check follows.
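One simple way to wire this in is a validation step that runs right after each import and fails the job on a mismatch, so your scheduler (Oozie, Airflow, cron, etc.) surfaces the problem immediately. This is a hedged sketch under the same assumptions as above; `alert_ops` is a hypothetical hook you would point at email, Slack, PagerDuty, or whatever your team uses:

```python
"""Post-import validation: alert and fail the job if the Hive row count
drifts from the source. Sketch only; connection strings and the
alert_ops() hook are placeholders."""
import sys
import pyodbc
from pyhive import hive

def row_count(cursor, table):
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]

def alert_ops(message):
    # Placeholder: integrate with your alerting channel of choice.
    print(f"ALERT: {message}", file=sys.stderr)

def validate(table):
    src = pyodbc.connect("DSN=warehouse;UID=etl;PWD=...").cursor()
    hv = hive.connect(host="hiveserver2.example.com", port=10000).cursor()
    s, h = row_count(src, table), row_count(hv, table)
    if s != h:
        alert_ops(f"{table}: source={s} hive={h}")
        sys.exit(1)   # non-zero exit makes the scheduler mark the job failed

if __name__ == "__main__":
    validate(sys.argv[1])   # e.g. python validate_import.py fact_orders
```

Note the exact-match check assumes a static source during the load window; if the source table keeps taking writes while you import, you would compare against a snapshot or allow a small, documented tolerance instead.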