Data quality solutions to check for duplicate values at the data level


Hi All,


I am working on a solution to detect duplicate or repeated values at the data field level, for example the same value appearing in multiple columns. If the same value is present, we need to check whether its relationship with the column is valid. If there is no valid relationship, the check should raise a flag saying "check your data". The table below is an example:


ID | Name | Father's Name | Address


Bravo Street


In the above example, if 'bravo' is repeated in any other column of the table, we need to check why it is repeated and whether the relationship is correct. More generally, the overall relationships between the data and the columns should be checked.
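
To make the check concrete, here is a minimal sketch in plain Python of what I have in mind. It is only an illustration under assumptions: the row values, the `ALLOWED_PAIRS` whitelist, and the function names `tokens` and `cross_column_repeats` are all made up for this post and do not come from any existing tool. It splits each cell into lower-case tokens and reports every token shared by two columns, unless that column pair is listed as an expected relationship.

```python
import re
from itertools import combinations

# Hypothetical whitelist: column pairs where a shared value is expected,
# e.g. a person may legitimately share a name token with the father.
ALLOWED_PAIRS = {frozenset({"Name", "Fathers Name"})}


def tokens(value):
    """Lower-case a cell and split it into alphanumeric tokens."""
    if value is None:
        return set()
    return set(re.findall(r"[a-z0-9]+", str(value).lower()))


def cross_column_repeats(row, allowed_pairs=ALLOWED_PAIRS):
    """Return (token, column_a, column_b) for every token shared by two
    columns whose pairing is not listed in allowed_pairs."""
    flags = []
    for (col_a, val_a), (col_b, val_b) in combinations(row.items(), 2):
        if frozenset({col_a, col_b}) in allowed_pairs:
            continue  # this relationship is considered valid
        for token in sorted(tokens(val_a) & tokens(val_b)):
            flags.append((token, col_a, col_b))
    return flags


# Made-up row, for illustration only.
row = {
    "ID": "1",
    "Name": "Bravo",
    "Fathers Name": "Alpha Bravo",
    "Address": "12 Bravo Street",
}

for token, col_a, col_b in cross_column_repeats(row):
    print(f"check your data: '{token}' appears in both {col_a} and {col_b}")
```

For this sample row, 'bravo' is flagged for Name/Address and for Fathers Name/Address, while the Name/Fathers Name overlap is skipped because it is whitelisted. Doing this row by row is obviously naive for big data volumes, which is part of why I am asking about tools.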


The goal of this activity is to ensure good quality in the data we are ingesting. We need to apply this check while ingesting into Hive from an external DBMS.


My question:

Should I write business logic or code to enforce these constraints, or is there a data quality tool available that can perform such checks on big data during ingestion?


Any leads on tools, algorithms, or code would be really helpful. If anyone has already seen something like this, please feel free to share. It would be a great help!


Thank you!