Created 07-31-2016 07:23 PM
I'm doing a small project in Hadoop whose main goal is to create some KPIs in Hive. However, I needed to do some ETL jobs using Pig to clean my data, and I put the transformed files into a new directory in HDFS. To ensure that all the files are in the correct form, I want to create some data quality activities in Java or Python. I tried to use Pig UDFs to achieve this, but I couldn't connect the JAR file with Pig. Since I can't use Pig UDFs, I'm planning a new approach for the data quality phase:
1) Run the Pig scripts to clean the data and write the new files into a new directory in HDFS
2) Have a standalone Java/Python program read the new files and perform the data quality activities
3) If the data quality tests pass, load the files into Hive
In your opinion, is this a good approach for a Big Data project? I'm new to this topic... If not, what is a good alternative for performing data quality jobs in this project? Many thanks for your help!
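For illustration only, step 2 of the plan could look roughly like the Python sketch below. It assumes the Pig output is tab-delimited text (the default for PigStorage) with a fixed number of columns; the column count, the rule that no field may be empty, and the names `check_lines` and `run` are all hypothetical and would need to match the real data.

```python
import sys

EXPECTED_FIELDS = 3   # hypothetical column count; adjust to the real schema
DELIMITER = "\t"      # assumption: files were written with PigStorage's default tab delimiter


def check_lines(lines, expected_fields=EXPECTED_FIELDS, delimiter=DELIMITER):
    """Return (total_rows, problems), where problems is a list of (line_number, reason)."""
    problems = []
    total = 0
    for number, line in enumerate(lines, start=1):
        total += 1
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != expected_fields:
            problems.append(
                (number, "expected %d fields, got %d" % (expected_fields, len(fields)))
            )
        elif any(field.strip() == "" for field in fields):
            problems.append((number, "empty field"))
    return total, problems


def run(stream):
    """Check every line of stream, report bad rows on stderr, return an exit code."""
    total, problems = check_lines(stream)
    for number, reason in problems:
        sys.stderr.write("line %d: %s\n" % (number, reason))
    return 1 if problems else 0
```

If the script ends with `sys.exit(run(sys.stdin))`, the cleaned files can be piped into it with something like `hdfs dfs -cat <output dir>/part-* | python dq_check.py` (the script name is made up), and the Hive load in step 3 would only run when the exit code is 0. Note that this reads everything on one machine, so it only suits small outputs; for larger data the same checks belong in a Pig UDF or a Spark job.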
Created 08-01-2016 10:49 AM
Running UDFs is what Pig is for. You should fix that problem. Have you registered your JARs?
http://pig.apache.org/docs/r0.16.0/udf.html#udf-java
There are other possibilities as well. Spark comes to mind, especially with Python, where it can be relatively easy to set up (although it has its own problems, such as Python versions). There are also some ETL tools that can utilize Hadoop. But by and large, Pig with Java UDFs is a very straightforward way to do custom data cleaning on data in Hadoop. There is no reason you shouldn't get it to work.
Created 08-01-2016 10:50 AM
Hi Benjamin,
I followed those steps to include Java UDFs in Pig, but it always gives me an error... that's why I'm looking for alternatives.
Created 08-01-2016 11:53 PM
Paste the error you're getting
Created 08-03-2016 08:07 AM
If I put Python code in a file.py in my HDFS, I can run Python UDFs, but with Java I'm getting an error... I think I'm not getting all the files.
Created 08-03-2016 02:56 PM
Yeah, but without the error we cannot really help. I suppose you mean a ClassNotFoundException? So your UDF uses a lot of exotic imports?
Created 08-04-2016 10:46 AM
I was missing some Jar files 🙂
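For anyone who hits the same problem: the actual error was never posted in this thread, but if it is a ClassNotFoundException, one way to track down the missing JAR is to scan a directory of JARs for the class named in the error. The Python sketch below does that with the standard `zipfile` module; the function name `find_class_in_jars` and the class and directory you pass to it are hypothetical.

```python
import os
import zipfile


def find_class_in_jars(class_name, jar_dir):
    """Return the sorted paths of JARs under jar_dir that contain class_name."""
    # A class such as com.example.Foo is stored in a JAR as com/example/Foo.class
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for root, _dirs, files in os.walk(jar_dir):
        for name in files:
            if not name.endswith(".jar"):
                continue
            path = os.path.join(root, name)
            try:
                with zipfile.ZipFile(path) as jar:
                    if entry in jar.namelist():
                        matches.append(path)
            except zipfile.BadZipFile:
                # Skip files that have a .jar extension but are not valid archives
                continue
    return sorted(matches)
```

Each JAR it returns is a candidate to add with a `REGISTER` statement in the Pig script, next to the JAR that holds the UDF itself.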