Support Questions

Eukrev · ‎05-25-2016

Hi,

Could you share some details on how to analyse the quality of data that has been loaded into Hive?

I have a text file of around 250 million records, which I have loaded into Hive and stored as Parquet. My next task is to analyse the quality of the data. Since I am not from an ETL background, this is new to me. Could you share some approaches that can be used on Hive tables? I would prefer Spark or Pig.

Thanks in advance!

rajkumar_singh · ‎05-25-2016

These are some tools that can help you cleanse the data and give you insight into it.

https://www.talend.com/resource/data-quality-tools.html

https://www.trifacta.com/

Alternatively, you can write a custom script to run the quality analysis yourself.
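As a rough idea of what such a custom script might look like, here is a minimal sketch in Python that only builds a profiling query (row count, null count and distinct count per column). The function name `build_profile_query`, the table `mydb.customers` and the column names are hypothetical and not taken from this thread; the generated SQL uses only standard constructs that HiveQL and Spark SQL both accept.

```python
def build_profile_query(table, columns):
    """Build a SQL string that profiles the given columns of a table.

    For each column it reports the number of NULLs and the number of
    distinct values, alongside the total row count.
    The table and column names are supplied by the caller (hypothetical here).
    """
    if not columns:
        raise ValueError("at least one column is required")

    # Total row count, used as the baseline for null ratios.
    parts = ["COUNT(*) AS total_rows"]
    for col in columns:
        quoted = "`{}`".format(col)  # backtick quoting works in Hive and Spark SQL
        # Count rows where the column is NULL.
        parts.append(
            "SUM(CASE WHEN {0} IS NULL THEN 1 ELSE 0 END) AS `{1}_nulls`".format(quoted, col)
        )
        # Count distinct non-NULL values (cardinality).
        parts.append("COUNT(DISTINCT {0}) AS `{1}_distinct`".format(quoted, col))

    return "SELECT\n  " + ",\n  ".join(parts) + "\nFROM " + table


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    print(build_profile_query("mydb.customers", ["customer_id", "email"]))
```

The resulting string could then be run from the Hive shell or passed to a Hive-enabled Spark session (for example `spark.sql(query)`). On a table of this size, several `COUNT(DISTINCT ...)` expressions in one query can be expensive, so it may be worth profiling a few columns at a time.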

Eukrev · ‎05-25-2016

Thank you. Do you know of any generic scripts developed in Spark for data profiling and data cleaning that you could share?

Cloudera Community

Data quality analysis