Support Questions

Find answers, ask questions, and share your expertise

Big Data Analytics - Approach for Data Quality phase

Rising Star

I'm doing a small project in Hadoop whose main goal is to create some KPIs in Hive. However, I needed to do some ETL jobs using Pig to clean my data, and I put the transformed files into a new directory in HDFS. To ensure that all the files are in the correct form, I want to run some data quality checks in Java or Python. I tried to use Pig UDFs to achieve this, but I couldn't get Pig to load the JAR file. Since I can't use Pig UDFs, I'm planning a new approach for the data quality phase:

1) Run the Pig scripts to clean the data and write the new files into a new directory in HDFS
2) Have an independent Java/Python job read the new files and perform the data quality checks
3) If the data quality tests pass, load the files into Hive

In your opinion, is this a good approach for a Big Data project? I'm new to this topic... If not, what is a good alternative for performing data quality jobs in this project? Many thanks for your help!
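A standalone checker like step 2 could be sketched in Python along these lines. Everything here is illustrative (the expected column count, the rules, and reading from a local directory); in practice the files would first be pulled out of HDFS, e.g. with `hdfs dfs -get`, or read through an HDFS client library.

```python
import csv
import glob
import os

# Hypothetical data-quality rule: every row has this many columns.
EXPECTED_COLUMNS = 3

def check_file(path):
    """Return a list of problems found in one delimited file."""
    problems = []
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            if len(row) != EXPECTED_COLUMNS:
                problems.append("%s:%d: expected %d columns, got %d"
                                % (path, lineno, EXPECTED_COLUMNS, len(row)))
            elif any(field.strip() == "" for field in row):
                problems.append("%s:%d: empty field" % (path, lineno))
    return problems

def check_directory(directory):
    """Run the checks over every file; return (passed, list_of_problems)."""
    problems = []
    for path in sorted(glob.glob(os.path.join(directory, "*"))):
        problems.extend(check_file(path))
    return (len(problems) == 0, problems)
```

Step 3 would then load into Hive only when `check_directory` reports a pass.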

1 ACCEPTED SOLUTION

Master Guru

Running UDFs is exactly what Pig is for, so you should fix that problem. Have you registered your JARs?

http://pig.apache.org/docs/r0.16.0/udf.html#udf-java

There are other possibilities as well. Spark comes to mind; especially with Python it can be relatively easy to set up (although it also has its problems, like Python versions). And there are some ETL tools that can utilize Hadoop. But by and large, Pig with Java UDFs is a very straightforward way to do custom data cleaning on data in Hadoop. There is no reason you shouldn't get it to work.
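For reference, wiring a Java UDF into a Pig script typically follows the pattern below; the JAR path, class name, schema, and relation names here are all illustrative.

```pig
-- Make the UDF jar (and any jars it depends on) visible to Pig
REGISTER /path/to/dq-udfs.jar;

-- Optional short alias for the fully qualified class name
DEFINE IsNonEmpty com.example.dq.IsNonEmpty();

raw   = LOAD '/data/clean' USING PigStorage(',')
        AS (name:chararray, value:int);
clean = FILTER raw BY IsNonEmpty(name);
```

A common pitfall is registering only the UDF jar and forgetting the jars it depends on, which surfaces as a ClassNotFoundException at runtime.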


6 REPLIES


Rising Star

Hi Benjamin,

I followed those steps to include Java UDFs in Pig, but it always gives me an error... that's why I'm looking for alternatives.

Master Mentor

Paste the error you're getting

Rising Star

If I use Python inside a .py file in my HDFS I can run Python UDFs, but with Java I'm getting an error... I think I'm not including all the files.
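For comparison, a Pig Python UDF is just a decorated function in a .py file, registered with something like `REGISTER 'udfs.py' USING jython AS dq;` and called as `dq.is_non_empty(name)`. The `outputSchema` decorator is injected by Pig's Jython runtime, so this sketch (function and schema names are illustrative) defines a no-op stand-in to stay runnable outside Pig as well.

```python
# udfs.py -- a Pig Python (Jython) UDF sketch
try:
    outputSchema  # provided by Pig's Jython runtime
except NameError:
    def outputSchema(schema):
        # No-op stand-in so this file also runs outside Pig.
        def wrap(func):
            return func
        return wrap

@outputSchema("is_valid:boolean")
def is_non_empty(value):
    """True when the field is a non-empty, non-blank string."""
    return value is not None and value.strip() != ""
```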

Master Guru

Yeah, but without the error we cannot really help. I suppose you mean a ClassNotFoundException? Does your UDF use a lot of exotic imports?

Rising Star

I was missing some JAR files 🙂