Repo Description

A framework for systematically quality controlling big data.

TopNotch is a system for quality controlling large-scale data sets. It addresses the following three problems:

  1. How to define and measure data quality
  2. How to efficiently ensure data quality across many data sets
  3. How to institutionalize existing knowledge of data sets

TopNotch uses rules to verify individual components of a data set. Each rule defines and measures one small component of data quality. Taken together, the rules provide a complete definition of quality for a data set, along with the metrics to measure it. The rules can be reused on other data sets to maximize efficiency. Finally, because the rules are clearly defined and reusable, they also document the data set, which lets users institutionalize their knowledge of it.
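
To make the rule idea concrete, here is a minimal sketch in plain Python. It is not TopNotch's actual API or rule format (TopNotch itself runs on Spark and Scala); the names `Rule`, `evaluate`, `price_is_positive`, and `ticker_present`, as well as the sample rows, are all hypothetical and chosen only for illustration. Each rule checks one small aspect of quality, the per-rule pass rates together form the quality metrics, and the same rules are reused on a second data set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """One small, named component of data quality (hypothetical structure)."""
    name: str
    description: str  # doubles as documentation of the data set
    check: Callable[[Row], bool]


def evaluate(rules: List[Rule], rows: List[Row]) -> Dict[str, float]:
    """Return, for each rule, the fraction of rows that pass it."""
    report = {}
    for rule in rules:
        passed = sum(1 for row in rows if rule.check(row))
        # Convention assumed here: an empty data set trivially passes.
        report[rule.name] = passed / len(rows) if rows else 1.0
    return report


price_is_positive = Rule(
    name="price_is_positive",
    description="Every record has a numeric price greater than zero.",
    check=lambda row: isinstance(row.get("price"), (int, float)) and row["price"] > 0,
)

ticker_present = Rule(
    name="ticker_present",
    description="Every record has a non-empty ticker.",
    check=lambda row: bool(row.get("ticker")),
)

rules = [price_is_positive, ticker_present]

# Made-up sample data: one row has an empty ticker, one a negative price.
trades = [
    {"ticker": "ABC", "price": 10.5},
    {"ticker": "", "price": 3.0},
    {"ticker": "XYZ", "price": -1.0},
    {"ticker": "DEF", "price": 2.0},
]

# The same rules are reused, unchanged, on a second data set.
quotes = [
    {"ticker": "ABC", "price": 1.0},
    {"ticker": "XYZ", "price": 2.0},
]

trade_report = evaluate(rules, trades)
quote_report = evaluate(rules, quotes)
print(trade_report)
print(quote_report)
```

Keeping a human-readable `description` next to each check is one way the rule set can serve as documentation for the data set as well as a test of it.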

Requires SBT, Spark 1.6, Scala 2.10, Java 8, YARN, Hadoop 2.6, HBase 0.98, and Spray.

Repo Info
GitHub Repo URL
GitHub account name blackrock
Repo name TopNotch
Version history
Revision #: 1 of 1
Last update: 12-28-2016 09:13 PM