Support Questions

Find answers, ask questions, and share your expertise

Big Data Analytics - Approach for Data Quality phase

Rising Star

I'm doing a small project in Hadoop whose main goal is to create some KPIs in Hive. However, I needed to do some ETL jobs using Pig to clean my data, and I put the transformed files into a new directory in HDFS. To ensure that all the files are in the correct form, I want to run some data quality checks in Java or Python. I tried to use Pig UDFs to achieve this, but I couldn't get Pig to load the JAR file. Since I can't use Pig UDFs, I'm planning a new approach for the data quality phase:

1) Run the Pig scripts to clean the data and write the new files into a new directory in HDFS
2) Have an independent Java/Python job read the new files and perform the data quality checks
3) If the data quality tests pass, load the files into Hive

In your opinion, is this a good approach for a Big Data project? I'm new to this topic... If not, what is a good alternative for performing data quality jobs in this project? Many thanks for your help!
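A standalone checker like step 2 could be sketched in Python along these lines. Everything here is illustrative (the expected column count, the rules, and reading from a local directory); in practice the files would first be pulled out of HDFS, e.g. with `hdfs dfs -get`, or read through an HDFS client library.

```python
import csv
import glob
import os

# Hypothetical data-quality rule: every row has this many columns.
EXPECTED_COLUMNS = 3

def check_file(path):
    """Return a list of problems found in one delimited file."""
    problems = []
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            if len(row) != EXPECTED_COLUMNS:
                problems.append("%s:%d: expected %d columns, got %d"
                                % (path, lineno, EXPECTED_COLUMNS, len(row)))
            elif any(field.strip() == "" for field in row):
                problems.append("%s:%d: empty field" % (path, lineno))
    return problems

def check_directory(directory):
    """Run the checks over every file; return (passed, list_of_problems)."""
    problems = []
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        problems.extend(check_file(path))
    return (len(problems) == 0, problems)
```

Step 3 would then load into Hive only when `check_directory` reports a pass.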

1 ACCEPTED SOLUTION

Master Guru

Running UDFs is exactly what Pig is for, so you should fix that problem. Have you registered your JARs?

http://pig.apache.org/docs/r0.16.0/udf.html#udf-java

There are other possibilities as well. Spark comes to mind; especially with Python it can be relatively easy to set up (although it also has its problems, like Python versions). And there are some ETL tools that can utilize Hadoop. But by and large, Pig with Java UDFs is a very straightforward way to do custom data cleaning on data in Hadoop. There is no reason you shouldn't get it to work.
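For reference, wiring a Java UDF into a Pig script typically follows the pattern below; the JAR path, class name, schema, and relation names here are all illustrative.

```pig
-- Make the UDF jar (and any jars it depends on) visible to Pig
REGISTER /path/to/dq-udfs.jar;

-- Optional short alias for the fully qualified class name
DEFINE IsNonEmpty com.example.dq.IsNonEmpty();

raw   = LOAD '/data/clean' USING PigStorage(',')
        AS (name:chararray, value:int);
clean = FILTER raw BY IsNonEmpty(name);
```

A common pitfall is registering only the UDF jar and forgetting the jars it depends on, which surfaces as a ClassNotFoundException at runtime.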


6 REPLIES


Rising Star

Hi Benjamin,

I followed those steps to include Java UDFs in Pig, but it always gives me an error... that's why I'm looking for alternatives.

Master Mentor

Paste the error you're getting

Rising Star

If I use Python inside a .py file in my HDFS I can run Python UDFs, but with Java I'm getting an error... I think I'm not including all the files.
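For comparison, a Pig Python UDF is just a decorated function in a .py file, registered with something like `REGISTER 'udfs.py' USING jython AS dq;` and called as `dq.is_non_empty(name)`. The `outputSchema` decorator is injected by Pig's Jython runtime, so this sketch (function and schema names are illustrative) defines a no-op stand-in to stay runnable outside Pig as well.

```python
# udfs.py -- a Pig Python (Jython) UDF sketch
try:
    outputSchema  # provided by Pig's Jython runtime
except NameError:
    def outputSchema(schema):
        # No-op stand-in so this file also runs outside Pig.
        def wrap(func):
            return func
        return wrap

@outputSchema("is_valid:boolean")
def is_non_empty(value):
    """True when the field is a non-empty, non-blank string."""
    return value is not None and value.strip() != ""
```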

Master Guru

Yeah, but without the error we cannot really help. I suppose you mean a ClassNotFoundException? Does your UDF use a lot of exotic imports?

Rising Star

I was missing some JAR files 🙂