Is a Python script better, or a Hive UDF?
Labels: Apache Hive
Created 07-01-2016 05:27 PM
Hi,
I have a job that requires me to pull a JSON file from a Hive table. After retrieving the file, business logic (calculations) needs to be applied to it, and the result must be captured in a JSON file and stored back in a Hive table. For every ID processed, 100 to 5,000 records are generated, which need to be captured in the JSON file and inserted back into Hive. To accomplish this task, would writing a Python script be better, or a Hive UDF (Java code)? The business wants it done in Hive. Any help or suggestions are highly appreciated.
Created 07-01-2016 06:36 PM
If I were solving this problem, I would look at using Pig for the job.
Use HCatLoader to load the data from the Hive table and perform whatever operations you need, however complex. :)
Then store the results back to Hive using HCatStorer.
See: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-HCatLoader
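To illustrate that flow, here is a minimal Pig sketch (the table names `default.source_table` and `default.result_table` and the field `id` are hypothetical placeholders, not from the thread):

```pig
-- Run with HCatalog support enabled: pig -useHCatalog process.pig

-- Load rows from the source Hive table via HCatLoader
raw = LOAD 'default.source_table' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Placeholder for the business logic; replace with the real transformations
processed = FILTER raw BY id IS NOT NULL;

-- Store the results back into a Hive table via HCatStorer
STORE processed INTO 'default.result_table' USING org.apache.hive.hcatalog.pig.HCatStorer();
```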
Why Pig? Three main reasons:
1. It is very easy to program and, for the same reason, easy to maintain.
2. Optimized execution. This is my personal favorite: Pig will execute even a badly written series of steps (think duplicate operations, unnecessary variable allocation, etc.) in a very optimized way.
3. You can go as complex as you want by using PiggyBank custom functions, and you can also write your own UDFs.
I am not saying Hive or Python will not do the job, but Pig is a specialist in this kind of situation.
Do remember, though, that I suggested all this because you asked about writing UDFs, which made me assume the transformation has a fair bit of complexity. If it is simple enough to fit in a single Hive query, I would close my eyes and use that.
Thanks
Created 07-01-2016 06:44 PM
@rbiswas Thank you for the detailed explanation. Yes, you are correct that there is a lot of complexity involved, as the JSON itself is in a very complex format. For every ID processed, 100 to 5,000 records are generated, which need to be captured in a JSON file and inserted back into Hive. The situation is that I have to choose between Python and Hive. Of these two, which would be more helpful in terms of performance and complexity?
Created 07-01-2016 06:52 PM
First, try to fit the transformation into one Hive query using the built-in functions. If that is not possible, or it becomes very complicated,
go with a Hive UDF, since it will be better in terms of reusability. You can write the UDF in either Python or Java.
It is very difficult to say which one would be faster, since that depends on the implementation.
Go with the language you are more comfortable with.
Here is an example of a Python UDF:
https://github.com/Azure/azure-content/blob/master/articles/hdinsight/hdinsight-python.md
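To make the shape of such a UDF concrete, here is a minimal sketch of a streaming Python script (the column layout and JSON structure are assumptions, not from the thread). Hive pipes each input row to the script as tab-separated text on stdin, and every tab-separated line the script writes to stdout becomes an output row, which suits the "one ID in, 100 to 5,000 records out" pattern described above:

```python
#!/usr/bin/env python
# Minimal streaming UDF sketch for Hive's TRANSFORM: one input row in,
# potentially many output rows out.
import json
import sys

for line in sys.stdin:
    # Assumed input columns: an id and a JSON document serialized as a string
    record_id, json_doc = line.rstrip('\n').split('\t')
    data = json.loads(json_doc)

    # Placeholder business logic: assume the JSON holds a list of items and
    # emit one tab-separated output row per item
    for item in data.get('items', []):
        sys.stdout.write(record_id + '\t' + json.dumps(item) + '\n')
```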
Thanks
Created 07-03-2016 07:24 AM
@rbiswas Thank you. Since this involves a lot of complexity, the best solution for now is to write a UDF.
Created 07-02-2016 09:43 PM
You can always write the Hive UDF in Python. A Java UDF may yield better performance overall, but I prefer Python UDFs for the ease of development and maintenance.
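For reference, this is roughly how a streaming Python script is wired into a Hive query (a sketch only; the script path and the table and column names are hypothetical):

```sql
-- Ship the script to the cluster, then stream rows through it with TRANSFORM
ADD FILE /path/to/process_json.py;

INSERT INTO TABLE result_table
SELECT TRANSFORM (id, json_doc)
       USING 'python process_json.py'
       AS (id STRING, result_json STRING)
FROM source_table;
```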
Created 07-03-2016 07:30 AM
@Michael Young Due to the complexity, going with Python would be better than Java. Thank you for the suggestion.
