Created 05-25-2016 04:30 AM
Hi,
Could you share details on how to analyse the quality of data that has been loaded into Hive?
I have a text file of around 250 million records, which I have loaded into Hive and stored as Parquet. My next task is to analyse the quality of that data. Since I am not from an ETL background, this is new to me. Could you share some approaches that can be used on Hive tables? I would prefer Spark or Pig.
Thanks in advance!
Created 05-25-2016 05:51 AM
Here are some tools that can help you cleanse the data and give you insight into it:
https://www.talend.com/resource/data-quality-tools.html
Alternatively, you can write a custom script to analyse the data quality yourself.
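As a rough idea of what such a custom script could look like, here is a minimal sketch that builds a single profiling query (row count, plus null count, distinct count, min and max per column). The helper name `build_profile_query` and the table and column names are made up for illustration; they are not from any library or from your schema.

```python
def build_profile_query(table, columns):
    """Build one HiveQL statement that profiles the given columns in a single scan."""
    parts = ["COUNT(*) AS total_rows"]
    for col in columns:
        # Rows where the column is missing
        parts.append(
            "SUM(CASE WHEN `{0}` IS NULL THEN 1 ELSE 0 END) AS `{0}_nulls`".format(col)
        )
        # Number of different non-null values (exact, so it can be slow on large tables)
        parts.append("COUNT(DISTINCT `{0}`) AS `{0}_distinct`".format(col))
        # Value range, useful for spotting out-of-range or placeholder values
        parts.append("MIN(`{0}`) AS `{0}_min`".format(col))
        parts.append("MAX(`{0}`) AS `{0}_max`".format(col))
    return "SELECT\n  " + ",\n  ".join(parts) + "\nFROM " + table


# Hypothetical table and column names, for illustration only.
query = build_profile_query("customer_records", ["customer_id", "signup_date"])
print(query)
```

The generated statement can be run in Hive as is, or from Spark by passing the string to the `sql` method of a Hive-enabled context (for example `sqlContext.sql(query)`). With 250 million records the exact distinct counts are likely to be the expensive part, so it may be worth profiling a few columns at a time.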
Created 05-25-2016 05:55 AM
Thank you. Do you know of any generic scripts developed in Spark for data profiling and data cleaning that you could share?