Hadoop Enterprise Data Integrity


Contributor

Hi, I am exploring ingesting data from an organization's existing data warehouse / databases into Hadoop (HDFS & Hive). One of our concerns is ensuring data integrity, e.g. row counts and the correctness of the data. Are there any best practices or approaches for assuring our operational users that the data imported from their source databases into Hadoop is accurate and complete? Thanks!

2 REPLIES

Re: Hadoop Enterprise Data Integrity

In my previous job we took an agile (MVP-based) approach. First we deployed a small script that counted the rows per month in each fact table, and the total rows in each dimension table, in both the Hive warehouse and the source system. Next, we developed a web dashboard to highlight any differences. From there you can go further and add more detailed integrity checks.
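The count-and-compare step above can be sketched roughly as follows. This is an illustrative Python sketch, not the original script: `sqlite3` stands in for the real connections (in practice you would connect to the source warehouse and to Hive, e.g. via ODBC/JDBC or `pyhive`), and the table and column names are made up for the demo.

```python
import sqlite3

def monthly_counts(conn, table, date_col):
    """Return {'YYYY-MM': row_count} for a fact table, grouped by month."""
    cur = conn.execute(
        f"SELECT strftime('%Y-%m', {date_col}) AS month, COUNT(*) "
        f"FROM {table} GROUP BY month"
    )
    return dict(cur.fetchall())

def count_mismatches(source_counts, target_counts):
    """Return the months whose counts differ, with (source, target) values."""
    months = set(source_counts) | set(target_counts)
    return {
        m: (source_counts.get(m, 0), target_counts.get(m, 0))
        for m in months
        if source_counts.get(m, 0) != target_counts.get(m, 0)
    }

# Tiny demo: one row is missing from the target side for 2019-02.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE sales (sale_date TEXT)")
source.executemany("INSERT INTO sales VALUES (?)",
                   [("2019-01-15",), ("2019-02-01",), ("2019-02-20",)])
target.executemany("INSERT INTO sales VALUES (?)",
                   [("2019-01-15",), ("2019-02-01",)])

mismatches = count_mismatches(
    monthly_counts(source, "sales", "sale_date"),
    monthly_counts(target, "sales", "sale_date"),
)
```

A dashboard can then simply render the `mismatches` dict per table, with anything non-empty highlighted for investigation.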

However, the first line of defense should be a monitoring and alerting system that flags inconsistencies at import time, so immediate action can be taken before discrepancies pile up.
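An import-time check of that kind might look like the sketch below, assuming the job can obtain a row count from the source extract and from the loaded Hive table. The function and parameter names are illustrative; in production the warning would feed an actual alerting channel rather than a log line.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("ingest")

def validate_import(table, source_rows, loaded_rows, tolerance=0):
    """Compare the source extract's row count with the rows actually loaded,
    immediately after the import job finishes.

    `tolerance` permits a small allowed drift (e.g. late-arriving rows).
    Returns True when the load is within tolerance, False otherwise.
    """
    if abs(source_rows - loaded_rows) > tolerance:
        # In production, route this to email/pager/chat instead of a log.
        log.warning("row-count mismatch on %s: source=%d loaded=%d",
                    table, source_rows, loaded_rows)
        return False
    return True
```

Wiring the job's success status to this return value means a discrepancy fails (or at least flags) the import run itself, instead of being discovered later on a dashboard.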


Re: Hadoop Enterprise Data Integrity

Contributor

Thanks for sharing! Those sound like good first steps; I'm wondering if there are any other best practices out there as well.
