Big Data Analytics - Approach for Data Quality phase

Explorer

I'm doing a small project in Hadoop whose main goal is to create some KPIs in Hive. However, I needed to run some ETL jobs in Pig to clean my data, and I put the transformed files into a new directory in HDFS. To ensure that all the files are in the correct form, I want to build some data quality checks in Java or Python. I tried to use Pig UDFs for this, but I couldn't connect the JAR file with Pig. Since I can't use Pig UDFs, I'm planning a new approach for the data quality phase:

1) Run the Pig scripts to clean the data and write the new files into a new directory in HDFS.
2) Have Java/Python independently read the new files and perform the data quality checks.
3) If the data quality tests pass, load the files into Hive.

In your opinion, is this a good approach for a Big Data project? I'm new to this topic... If not, what would be a good alternative for performing the data quality jobs in this project? Many thanks for your help!
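To make step 2 concrete, a standalone data-quality check over the cleaned files could look roughly like this. This is only a sketch: the tab-delimited layout, the three-field schema, and the specific rules are assumptions for illustration, not details from the post.

```python
import csv
import io

# Assumed layout of one cleaned file: 3 tab-separated fields per row,
# no empty fields, and the third column must parse as a float.
EXPECTED_FIELDS = 3

def check_row(row):
    """Return a list of problems found in a single row (empty list = clean)."""
    problems = []
    if len(row) != EXPECTED_FIELDS:
        problems.append("wrong field count: %d" % len(row))
        return problems
    if any(field.strip() == "" for field in row):
        problems.append("empty field")
    try:
        float(row[2])  # assumption: third column is a numeric amount
    except ValueError:
        problems.append("non-numeric amount: %r" % row[2])
    return problems

def check_file(text):
    """Check every line; return (total_rows, list of (line_number, problems))."""
    bad = []
    total = 0
    for lineno, row in enumerate(csv.reader(io.StringIO(text), delimiter="\t"), start=1):
        total += 1
        problems = check_row(row)
        if problems:
            bad.append((lineno, problems))
    return total, bad
```

On a real cluster you would read the cleaned files out of HDFS (for example via `hdfs dfs -cat` or an HDFS client library) and trigger the Hive load only when the list of bad rows comes back empty; the parsing above just illustrates the shape of the checks.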

1 ACCEPTED SOLUTION

Running UDFs in Pig is what Pig is for, so you should fix that problem. Have you registered your JARs?

http://pig.apache.org/docs/r0.16.0/udf.html#udf-java

There are other possibilities as well. Spark comes to mind; especially with Python it can be relatively easy to set up (although it also has its problems, such as Python versions). And there are some ETL tools that can utilize Hadoop. But by and large, Pig with Java UDFs is a very straightforward way to do custom data cleaning on data in Hadoop. There is no reason you shouldn't get it to work.


6 REPLIES


Explorer

Hi Benjamin,

I followed those steps to include Java UDFs in Pig, but it always gives me an error... that's why I'm looking for alternatives.

Mentor

Paste the error you're getting.

Explorer

If I use Python inside a file.py in my HDFS, I can run Python UDFs, but with Java I'm getting an error... I think I'm not including all the files.
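For reference, a Pig Python UDF file of the kind described here might look like the following. This is a sketch under stated assumptions: the file name, function name, output schema, and cleaning rule are all invented for illustration.

```python
# clean_udfs.py - hypothetical Python UDF file for Pig.
# A Pig script would register and call it along these lines:
#   REGISTER 'clean_udfs.py' USING jython AS cleaners;
#   cleaned = FOREACH raw GENERATE cleaners.normalize_amount(amount) AS amount;
# Under Jython, Pig makes the @outputSchema decorator available; with CPython
# streaming it comes from pig_util. A no-op stand-in is defined here so the
# file also runs as plain Python outside Pig.
try:
    from pig_util import outputSchema  # present when run inside Pig (streaming)
except ImportError:
    def outputSchema(schema):          # no-op stand-in for local testing
        def wrap(func):
            return func
        return wrap

@outputSchema("amount:double")
def normalize_amount(value):
    """Strip a leading currency symbol and whitespace; None for bad input."""
    if value is None:
        return None
    try:
        return float(str(value).strip().lstrip("$"))
    except ValueError:
        return None
```

The same cleaning logic could equally live in a Java UDF; registering a Python file just sidesteps the JAR packaging step that is failing in this thread.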

Yeah, but without the error we cannot really help. I suppose you mean a ClassNotFoundException? Does your UDF use a lot of exotic imports?

Explorer

I was missing some JAR files 🙂
