A framework for systematically quality controlling big data.
TopNotch is a system for quality controlling large-scale data sets. It addresses the following three problems:
How to define and measure data quality
How to efficiently ensure data quality across many data sets
How to institutionalize existing knowledge of data sets
TopNotch uses rules to verify individual components of a data set. Each rule defines and measures one small component of data quality. Taken together, the rules provide a complete definition of quality for a data set, along with the metrics to measure it. The rules can be reused on other data sets to maximize efficiency. Finally, the clear definitions and reusability of these rules allow users to institutionalize knowledge by documenting a data set.
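The rule-based idea above can be illustrated with a minimal sketch in plain Python. This is not TopNotch's actual API or configuration format: the `Rule` class, the `measure` and `report` functions, the thresholds, and the sample `trades` and `quotes` data are all hypothetical names chosen for illustration. The sketch only shows how small, named rules can each measure one aspect of quality and then be reused across data sets.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a named row-level check plus a minimum pass rate."""
    name: str
    check: Callable[[Row], bool]
    threshold: float = 1.0  # fraction of rows that must pass


def measure(rule: Rule, rows: List[Row]) -> float:
    """Return the fraction of rows that satisfy the rule (1.0 for no rows)."""
    if not rows:
        return 1.0
    passed = sum(1 for row in rows if rule.check(row))
    return passed / len(rows)


def report(rules: List[Rule], rows: List[Row]) -> Dict[str, Tuple[float, bool]]:
    """Map each rule name to (measured pass rate, whether the threshold is met)."""
    results = {}
    for rule in rules:
        score = measure(rule, rows)
        results[rule.name] = (score, score >= rule.threshold)
    return results


# Two small rules, each covering one component of data quality.
price_is_positive = Rule(
    "price_is_positive",
    lambda r: isinstance(r.get("price"), (int, float)) and r["price"] > 0,
)
ticker_present = Rule(
    "ticker_present",
    lambda r: bool(r.get("ticker")),
    threshold=0.5,
)
rules = [price_is_positive, ticker_present]

# Made-up sample data sets; the same rules are reused on both.
trades = [
    {"ticker": "ABC", "price": 10.0},
    {"ticker": "", "price": 12.5},
    {"ticker": "XYZ", "price": -1.0},
    {"ticker": "DEF", "price": 3.0},
]
quotes = [{"ticker": "ABC", "price": 9.9}]

trade_report = report(rules, trades)
quote_report = report(rules, quotes)
```

Because each rule carries its own name and threshold, the list of rules also serves as a readable description of what "good" means for the data set, which is the documentation benefit described above.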