Data quality analysis

Eukrev — Wed, 25 May 2016 11:30:56 GMT

Hi,

Could you share some details on how to analyse the quality of data that has been loaded into Hive?

I have a text file of around 250 million records, which I have loaded into Hive and stored as Parquet. My next task is to analyse the quality of the data. Since I am not from an ETL background, this is new to me. Could you share some approaches that can be used on Hive tables? I would prefer Spark or Pig.

Thanks in advance!

Re: Data quality analysis

rajkumar_singh — Wed, 25 May 2016 12:51:50 GMT

Here are some tools that can help you cleanse the data and give you insight into it.

https://www.talend.com/resource/data-quality-tools.html

https://www.trifacta.com/

Alternatively, you can write a custom script to profile the data yourself, for example by checking null counts, distinct values, and value ranges per column.
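As a rough idea of what such a custom script might compute, here is a minimal sketch in plain Python. The function name `profile_rows`, the sample rows, and the rule that blank strings count as missing are all assumptions made for illustration; they are not taken from any particular tool.

```python
from collections import Counter


def profile_rows(rows, columns):
    """Compute simple per-column quality metrics for an iterable of dict rows."""
    stats = {c: {"rows": 0, "missing": 0, "values": Counter()} for c in columns}
    for row in rows:
        for c in columns:
            s = stats[c]
            s["rows"] += 1
            value = row.get(c)
            # Assumption: None and blank strings both count as missing.
            if value is None or (isinstance(value, str) and value.strip() == ""):
                s["missing"] += 1
            else:
                s["values"][value] += 1

    report = {}
    for c, s in stats.items():
        present = s["values"]
        # min/max assume the non-missing values in a column are comparable.
        report[c] = {
            "rows": s["rows"],
            "missing": s["missing"],
            "distinct": len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "top": present.most_common(1)[0][0] if present else None,
        }
    return report


# Hypothetical sample data, only to show the shape of the report.
sample = [
    {"id": 1, "city": "Pune"},
    {"id": 2, "city": ""},
    {"id": 3, "city": None},
    {"id": 3, "city": "Pune"},
]
report = profile_rows(sample, ["id", "city"])
for column, metrics in report.items():
    print(column, metrics)
```

This keeps every distinct value in memory, so it is only suitable for a sample of the table. For the full 250 million rows, the same metrics would more likely be expressed as Spark DataFrame aggregations (for instance `count`, `countDistinct`, `min`, and `max` from `pyspark.sql.functions`) so that the work is distributed across the cluster.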

Re: Data quality analysis

Eukrev — Wed, 25 May 2016 12:55:05 GMT

Thank you. Do you know of any generic Spark scripts for data profiling and data cleaning that you could share?