question Re: Big Data Analytics - Approach for Data Quality phase in Archives of Support Questions (Read Only)

Big Data Analytics - Approach for Data Quality phase

prodgers125 — Mon, 01 Aug 2016 02:23:34 GMT

I am doing a small project in Hadoop whose main goal is to create some KPIs in Hive. However, I needed to run some ETL jobs in Pig to clean my data, and I put the transformed files into a new directory in HDFS. To make sure all the files are in the correct form, I want to build some data quality checks in Java or Python. I tried to use Pig UDFs for this, but I couldn't get Pig to pick up the JAR file. Since I can't use Pig UDFs, I'm planning a new approach for the data quality phase:

1) Run the Pig scripts to clean the data and write the new files into a new directory in HDFS
2) Have a standalone Java/Python program read the new files and perform the data quality checks
3) If the data quality tests pass, load the files into Hive

In your opinion, is this a good approach for a Big Data project? I'm new to this topic... If not, what would be a good alternative for running data quality jobs in this project? Many thanks for your help!
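Step 2 of the plan above could look roughly like the sketch below. It is only an illustration: the three-column layout (`id`, `amount`, `country`), the tab delimiter, and the function names are assumptions made for the example, not details from the project.

```python
import csv


def _is_float(value):
    """Return True if the string parses as a floating-point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


# Hypothetical schema: one (column name, validator) pair per expected field.
EXPECTED_COLUMNS = [
    ("id", lambda v: v.isdigit()),
    ("amount", _is_float),
    ("country", lambda v: len(v) == 2 and v.isalpha()),
]


def check_records(lines, delimiter="\t"):
    """Return a list of (line_number, message) for every failed check."""
    problems = []
    reader = csv.reader(lines, delimiter=delimiter)
    for line_number, row in enumerate(reader, start=1):
        # A wrong field count makes per-column checks meaningless, so stop here.
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append(
                (line_number,
                 "expected %d fields, got %d" % (len(EXPECTED_COLUMNS), len(row)))
            )
            continue
        for (name, is_valid), value in zip(EXPECTED_COLUMNS, row):
            if not is_valid(value):
                problems.append(
                    (line_number, "bad value for %s: %r" % (name, value))
                )
    return problems


def exit_code_for(lines):
    """Return 0 when every record passes and 1 otherwise."""
    return 1 if check_records(lines) else 0
```

One way to wire this in, assuming the cleaned files are plain text, is to pipe the output of `hdfs dfs -cat` into a small script that passes `sys.stdin` to `exit_code_for` and exits with the result, so that the Hive load in step 3 only runs when the exit code is 0.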

Re: Big Data Analytics - Approach for Data Quality phase

bleonhardi — Mon, 01 Aug 2016 17:49:27 GMT

Running UDFs in Pig is what Pig is for. You should fix that problem. Have you registered your JARs?

http://pig.apache.org/docs/r0.16.0/udf.html#udf-java

There are other possibilities as well. Spark comes to mind, especially with Python; it can be relatively easy to set up (although it has its own problems, such as Python versions). There are also some ETL tools that can utilize Hadoop. But by and large, Pig with Java UDFs is a very straightforward way to do custom data cleaning on data in Hadoop. There is no reason you shouldn't get it to work.

Re: Big Data Analytics - Approach for Data Quality phase

prodgers125 — Mon, 01 Aug 2016 17:50:29 GMT

Hi Benjamin,

I followed those steps to include Java UDFs in Pig, but it always gives me an error... that's why I'm looking for alternatives.

Re: Big Data Analytics - Approach for Data Quality phase

aervits — Tue, 02 Aug 2016 06:53:59 GMT

Paste the error you're getting.

Re: Big Data Analytics - Approach for Data Quality phase

prodgers125 — Wed, 03 Aug 2016 15:07:24 GMT

If I put the Python code in a file.py in my HDFS, I can run Python UDFs, but with Java I'm getting an error... I think I'm not getting all the files.

Re: Big Data Analytics - Approach for Data Quality phase

bleonhardi — Wed, 03 Aug 2016 21:56:53 GMT

Yeah, but without the error we cannot really help. I suppose you mean a ClassNotFoundException? So your UDF uses a lot of exotic imports?

Re: Big Data Analytics - Approach for Data Quality phase

prodgers125 — Thu, 04 Aug 2016 17:46:24 GMT

I was missing some Jar files 🙂
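For anyone hitting the same missing-JAR problem, a quick way to find out which JAR actually ships a class is to look inside the archives, since a JAR is a ZIP file. The sketch below is a minimal illustration using Python's standard `zipfile` module; the function name and the class name in the usage note are hypothetical, and nested classes (with `$` in the file name) are not handled.

```python
import zipfile


def jars_containing(class_name, jar_paths):
    """Return the jars (paths or file objects) that contain the given class."""
    # com.example.Foo is stored inside the archive as com/example/Foo.class
    entry = class_name.replace(".", "/") + ".class"
    found = []
    for jar in jar_paths:
        with zipfile.ZipFile(jar) as archive:
            if entry in archive.namelist():
                found.append(jar)
    return found
```

Calling `jars_containing("com.example.dq.IsValidRow", list_of_jar_paths)` with the class named in the exception returns the archives that provide it; each of those then needs to be registered in the Pig script along with the UDF JAR itself.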