Repo Description

A framework for systematically quality controlling big data.

TopNotch is a system for quality controlling large-scale data sets. It addresses the following three problems:

  1. How to define and measure data quality
  2. How to efficiently ensure data quality across many data sets
  3. How to institutionalize existing knowledge of data sets

TopNotch uses rules to verify individual components of a data set. Each rule defines and measures one small component of data quality. Taken together, the rules provide a complete definition of quality for a data set, along with metrics for it. The rules can be reused on other data sets to maximize efficiency. Finally, the clear definitions and reusability of these rules allow users to institutionalize knowledge by documenting a data set.
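The rule-based idea described above can be illustrated with a minimal sketch in plain Python. This is not TopNotch's actual API or configuration format: the `Rule` class, the `evaluate` function, the rule names, and the sample rows are all hypothetical, chosen only to show how small reusable rules can each yield one quality metric.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a named per-row check plus a required pass rate."""
    name: str
    check: Callable[[Row], bool]
    min_pass_rate: float = 1.0


def evaluate(rules: Iterable[Rule], rows: List[Row]) -> Dict[str, dict]:
    """Apply every rule to every row and report one metric per rule."""
    report = {}
    for rule in rules:
        passed = sum(1 for row in rows if rule.check(row))
        # An empty data set trivially passes; this is an assumption of the sketch.
        rate = passed / len(rows) if rows else 1.0
        report[rule.name] = {"pass_rate": rate, "ok": rate >= rule.min_pass_rate}
    return report


# Each rule measures one small component of quality (names are illustrative).
price_is_positive = Rule(
    "price_is_positive",
    lambda r: isinstance(r.get("price"), (int, float)) and r["price"] > 0,
)
id_is_present = Rule(
    "id_is_present",
    lambda r: r.get("id") is not None,
    min_pass_rate=0.75,
)
rules = [price_is_positive, id_is_present]

# Made-up sample data: the same rules are reused on two different data sets.
trades = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": -1.0},
    {"id": None, "price": 5.0},
    {"id": 4, "price": 2.5},
]
quotes = [{"id": 7, "price": 1.0}]

trades_report = evaluate(rules, trades)
quotes_report = evaluate(rules, quotes)
```

Here the rule list doubles as documentation of what "good" data means, and the per-rule pass rates are the metrics. A real deployment on Spark would evaluate such checks over distributed data rather than over in-memory lists.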

Requires SBT, Spark 1.6, Scala 2.10, Java 8, YARN, Hadoop 2.6, HBase 0.98, and Spray.

Repo Info
GitHub repo URL: https://github.com/blackrock/TopNotch
GitHub account name: blackrock
Repo name: TopNotch
Version history
Last update: 12-28-2016 09:13 PM