Support Questions

Is Python Script better or Hive UDF?

Contributor

Hi,

I have a job that needs to pull JSON data from a Hive table. After reading the data, some business logic (calculations) must be applied to it. Once processing is done, the result needs to be captured in a JSON file and stored back in a Hive table. After processing (in the code), every ID taken in will generate 100 to 5,000 records, which need to be captured in the JSON file and inserted back into Hive. To accomplish this task, would writing a Python script be more beneficial, or a Hive UDF (Java code)? The business wants it done in Hive. Any help or suggestion is highly appreciated.

1 ACCEPTED SOLUTION

@Vijay Parmar

If I were solving this problem, I would look at using Pig for the job.

Use HCatLoader to load the data from the Hive table, then apply whatever operations you need, however complex. :)

Then store the result back to Hive using HCatStorer.

See: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-HCatLoader
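A minimal sketch of that load/transform/store flow in Pig Latin (the table names `default.source_table` and `default.result_table` are placeholders, and the transformation step stands in for your actual business logic):

```pig
-- Run with HCatalog support enabled: pig -useHCatalog script.pig
src = LOAD 'default.source_table'
      USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Apply the business logic here (FOREACH, FLATTEN, custom UDFs, ...);
-- this pass-through projection is just a placeholder.
result = FOREACH src GENERATE id, payload;

STORE result INTO 'default.result_table'
      USING org.apache.hive.hcatalog.pig.HCatStorer();
```

Note that on older distributions the loader/storer classes may live under `org.apache.hcatalog.pig` instead; check the HCatalog version shipped with your cluster.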

Why Pig? Three main reasons:

1. It is very easy to program, and easy to maintain for the same reason.

2. Optimized execution. This is my personal favorite: Pig will execute even a badly written series of steps (think duplicate operations, unnecessary variable allocations, etc.) in a well-optimized way.

3. You can go as complex as you want by using PiggyBank custom functions, and you can also write your own UDFs.

I'm not saying Hive or Python will not do the job, but Pig is a specialist in this kind of situation.

But remember, I suggested all this because you asked about writing UDFs, which made me assume the job has a fair bit of complexity. If the transformation is simple enough that you can somehow fit it in a single Hive query, I would close my eyes and use that.

Thanks


6 REPLIES


Contributor

@rbiswas Thank you for detailing things out. Yes, you are correct; there is a lot of complexity involved, as the JSON itself is in a very complex format. After processing (in the code), every ID taken in will generate 100 to 5,000 records, which need to be captured in a JSON file and inserted back into Hive. The situation is that I have to choose either Python or Hive. So, of these two, which one will be more helpful in terms of performance and complexity?


@Vijay Parmar

First, try to fit the transformation into one Hive query using the common built-in functions. If that is not possible, or it becomes very complicated,

go with a Hive UDF, since it will be better in terms of reusability. You can write the UDF in either Python or Java.

It is very difficult to say which one would be faster, since that depends on the implementation.

Go with the language you are more comfortable with.

Here is an example of a Python UDF:

https://github.com/Azure/azure-content/blob/master/articles/hdinsight/hdinsight-python.md
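To make the Python route concrete, here is a minimal sketch of a streaming script usable with Hive's TRANSFORM clause. The input layout (an `id` column plus a JSON payload whose `items` list drives the 1-to-N expansion described in the question) is an assumption for illustration; adapt the parsing to your real schema:

```python
import json
import sys


def expand_record(line):
    """Split one Hive row (id <TAB> json_payload) into multiple output rows.

    The payload format here is hypothetical: a JSON object whose "items"
    list produces one output row per element.
    """
    row_id, payload = line.rstrip("\n").split("\t", 1)
    doc = json.loads(payload)
    out = []
    for item in doc.get("items", []):
        # Each emitted line becomes one row in the destination table;
        # columns are tab-separated, matching Hive's TRANSFORM defaults.
        out.append("%s\t%s" % (row_id, json.dumps(item)))
    return out


if __name__ == "__main__":
    # Hive streams rows to stdin and reads transformed rows from stdout, e.g.:
    #   SELECT TRANSFORM (id, payload)
    #   USING 'python expand.py' AS (id STRING, item_json STRING)
    #   FROM source_table;
    for line in sys.stdin:
        for row in expand_record(line):
            print(row)
```

The script would be shipped to the cluster with `ADD FILE expand.py;` before running the query. Keeping the logic in a plain function like `expand_record` also makes it easy to unit-test the business logic off-cluster.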

Thanks

Contributor

@rbiswas Thank you. As it involves a lot of complexity, the best solution for now is to write a UDF.

Super Guru

You can always write the Hive UDF in Python. A Java UDF may yield better performance overall, but I prefer Python UDFs for their ease of development and maintenance.

Contributor

@Michael Young Due to the complexity, going with Python would be better than Java. Thank you for the suggestion.