I am trying to implement a data quality framework in Spark using DataFrames. The requirement is very simple, as below:
1) Pass the column names on which the data quality rules should be applied.
2) Pass the rules (null check, date format change) and the threshold of each rule, along with the source table and database name.
3) While checking a rule, if its threshold is exceeded, generate an alert or send an email.
4) The failed records must go to a particular directory (HDFS).
5) Remember that the rules should be applied iteratively (i.e. to all the columns that are passed).
Error log table:
Rule id, erroneous fields, erroneous complete record
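A minimal sketch of the iterative rule loop described above. To keep it self-contained, plain Python dicts stand in for Spark rows; the check names, rule-config keys, and threshold semantics (failure ratio over total rows) are my assumptions, not a fixed API. In Spark, each check would become a `df.filter(...)` and the failed records would be written out to an HDFS path instead of collected in a list.

```python
# Hypothetical sketch of an iterative data quality rule loop.
# Assumed rule config shape:
#   {"rule_id": 1, "column": "dob", "check": "null_check", "threshold": 0.5}
from datetime import datetime

def null_check(row, column):
    """Return True if the value fails the check (missing or empty)."""
    return row.get(column) in (None, "")

def date_format_check(row, column, fmt="%Y-%m-%d"):
    """Return True if the value cannot be parsed with the expected format."""
    try:
        datetime.strptime(str(row.get(column, "")), fmt)
        return False
    except ValueError:
        return True

CHECKS = {"null_check": null_check, "date_format_check": date_format_check}

def apply_rules(rows, rules):
    """Apply each rule to its column in turn; collect failures and alerts.

    Returns (error_log, alert_rule_ids). Each error_log entry mirrors the
    error log table: rule id, erroneous field, erroneous complete record.
    """
    error_log, alerts = [], []
    total = len(rows)
    for rule in rules:
        check = CHECKS[rule["check"]]
        failed = [r for r in rows if check(r, rule["column"])]
        for r in failed:
            error_log.append({
                "rule_id": rule["rule_id"],
                "erroneous_field": rule["column"],
                "erroneous_record": r,  # in Spark: write to an HDFS directory
            })
        # Alert (or send an email) when the failure ratio exceeds the threshold.
        if total and len(failed) / total > rule["threshold"]:
            alerts.append(rule["rule_id"])
    return error_log, alerts
```

For example, running two rules on the same column iterates the loop twice; a rule whose failure ratio stays under its threshold still logs its failed records, but raises no alert.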