Code Repositories

Repo Description

A framework for systematically quality controlling big data.

TopNotch is a system for quality controlling large-scale data sets. It addresses the following three problems:

  1. How to define and measure data quality
  2. How to efficiently ensure data quality across many data sets
  3. How to institutionalize existing knowledge of data sets

TopNotch uses rules to verify individual components of a data set. Each rule defines and measures one small component of data quality. Taken together, the rules provide a complete definition of quality for a data set, along with the metrics to measure it. The rules can be reused on other data sets to maximize efficiency. Finally, the clear definitions and reusability of these rules allow users to institutionalize knowledge by documenting a data set.
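
The rule-based idea described above can be sketched in a few lines. This is only an illustration in plain Python, not TopNotch's actual API or configuration format, which is not shown here. The names Rule, measure, and quality_report, as well as the sample trades and quotes data, are hypothetical and chosen for this example. Each rule checks one small property, reports the fraction of rows that pass, and is applied unchanged to a second data set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: a name, a human-readable description, and a per-row check."""
    name: str
    description: str
    check: Callable[[Row], bool]


def measure(rule: Rule, rows: List[Row]) -> float:
    """Return the fraction of rows that pass the rule (1.0 for an empty data set)."""
    if not rows:
        return 1.0
    passed = sum(1 for row in rows if rule.check(row))
    return passed / len(rows)


def quality_report(rules: List[Rule], rows: List[Row]) -> Dict[str, float]:
    """Combine individual rules into one set of quality metrics for a data set."""
    return {rule.name: measure(rule, rows) for rule in rules}


# Two small rules; the descriptions double as documentation of the data.
price_is_positive = Rule(
    name="price_is_positive",
    description="Every price must be a number greater than zero.",
    check=lambda row: isinstance(row.get("price"), (int, float)) and row["price"] > 0,
)
ticker_is_present = Rule(
    name="ticker_is_present",
    description="Every row must carry a non-empty ticker.",
    check=lambda row: bool(row.get("ticker")),
)
rules = [price_is_positive, ticker_is_present]

# Made-up sample data sets (assumptions for illustration only).
trades = [
    {"ticker": "ABC", "price": 10.5},
    {"ticker": "", "price": 3.0},
    {"ticker": "XYZ", "price": -1.0},
    {"ticker": "DEF", "price": 7.25},
]
quotes = [
    {"ticker": "ABC", "price": 10.4},
    {"ticker": "XYZ", "price": 2.0},
]

# The same rules are reused on both data sets without modification.
trades_report = quality_report(rules, trades)
quotes_report = quality_report(rules, quotes)
print(trades_report)
print(quotes_report)
```

Keeping each rule small and named means the list of rules can serve both as a set of checks and as a readable description of what a valid data set looks like.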

Requires SBT, Spark 1.6, Scala 2.10, Java 8, YARN, Hadoop 2.6, HBase 0.98, and Spray.

Repo Info
GitHub Repo URL
GitHub account name: blackrock
Repo name: TopNotch