Created 06-30-2016 06:55 AM
I have a technical question regarding the implementation of Hive storage handlers.
It's apparent to me now that the following methods in the storage handler are called multiple times in all mappers.
-- MySerDe.initialize()
-- MySerDe.getObjectInspector()
In my case, getting the metadata over and over again is expensive since it requires calls to remote systems. Therefore I wonder if there is a way to
-- somehow save or cache a previously obtained ObjectInspector (since it doesn't change)?
-- or simply pass a string representing the metadata to the MySerDe class in all mappers, so that I can construct the ObjectInspector using the string instead of making remote calls?
-- etc.?
Thanks in advance for any help.
Created 06-30-2016 05:17 PM
You should be able to simply cache your access objects in a class variable and create them only if they have not been created already. You just need to be a bit careful, since the functions are sometimes called with empty conf objects first. That's what the JDBC storage handler does.
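To illustrate the idea, here is a minimal sketch in plain Java, not the JDBC storage handler's actual code. The class name, the property key `my.handler.table`, and the placeholder schema string are all hypothetical, and `java.util.Properties` stands in for the table properties that a real SerDe receives in initialize().

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a SerDe; a real one would extend Hive's SerDe base class.
class CachingSerDeSketch {
    // Hypothetical property name that identifies the remote table.
    static final String TABLE_KEY = "my.handler.table";

    // Shared by every instance in this JVM, so repeated initialize() calls reuse the entry.
    private static final ConcurrentHashMap<String, String> SCHEMA_CACHE = new ConcurrentHashMap<>();

    // Counts simulated remote calls, only to make the caching effect visible.
    static final AtomicInteger REMOTE_CALLS = new AtomicInteger();

    // Stands in for the ObjectInspector that would be built from the metadata.
    private String schema;

    void initialize(Properties tableProps) {
        String table = tableProps.getProperty(TABLE_KEY);
        if (table == null) {
            return; // empty conf: nothing to look up yet
        }
        // The remote fetch runs only if this table has no cached entry.
        schema = SCHEMA_CACHE.computeIfAbsent(table, CachingSerDeSketch::fetchSchemaRemotely);
    }

    String getSchema() {
        return schema;
    }

    private static String fetchSchemaRemotely(String table) {
        REMOTE_CALLS.incrementAndGet();
        return "id:int,name:string"; // placeholder for the expensive remote lookup
    }
}
```

Note that a static field lives per JVM, so a cache like this only saves repeated calls inside the same process; each task JVM would still do one lookup of its own.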
Created 07-06-2016 12:51 AM
Hi @Benjamin Leonhardi, thanks kindly for the response. I was able to resolve it via the JobConf object, since it persists from job setup through to the mappers. I save the string form of the metadata in the JobConf the first time and then only need to read it back in the mappers. Of course, I can also simply check whether this string exists in the JobConf to know whether I'm doing it for the first time or whether I'm in the mappers. Does this all sound reasonable to you?
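Roughly, the approach looks like the sketch below. It is plain Java with a `Map<String, String>` standing in for the job properties; the key `my.handler.schema`, the `name:type` encoding, and the method names are made up for illustration. In a real storage handler the entry would presumably be added through a hook such as HiveStorageHandler.configureInputJobProperties(), but that wiring is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: publish the metadata as a string once, parse it in the mappers.
class MetadataInConfSketch {
    // Hypothetical job property holding the encoded schema.
    static final String SCHEMA_KEY = "my.handler.schema";

    // Counts simulated remote calls, only to make the effect visible.
    static int remoteCalls = 0;

    // Client side: fetch the metadata only if the property is not there yet.
    static void publishSchema(Map<String, String> jobProperties) {
        if (!jobProperties.containsKey(SCHEMA_KEY)) {
            jobProperties.put(SCHEMA_KEY, fetchSchemaRemotely());
        }
    }

    // Task side: rebuild the column list from the string, with no remote call.
    static Map<String, String> readSchema(Map<String, String> jobProperties) {
        String encoded = jobProperties.get(SCHEMA_KEY);
        if (encoded == null) {
            throw new IllegalStateException("schema was not published to the job properties");
        }
        Map<String, String> columns = new LinkedHashMap<>(); // keeps column order
        for (String column : encoded.split(",")) {
            String[] parts = column.split(":", 2);
            columns.put(parts[0], parts[1]);
        }
        return columns;
    }

    private static String fetchSchemaRemotely() {
        remoteCalls++;
        return "id:int,name:string"; // placeholder for the expensive remote lookup
    }
}
```

The presence of the key doubles as the "have I done this already?" check, which is the dependency on the JobConf mentioned above.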
I would still like to understand your approach. By access objects, did you mean the ObjectInspector (OI)? As far as I can tell, class variables are always null when the methods are called from the mappers.
BTW I have another critical question and I would appreciate your comments there as well -- https://community.hortonworks.com/questions/43603/hive-storage-handlers-control-thread.html.
Created 07-06-2016 09:50 PM
Hi @Benjamin Leonhardi, I still have the following questions regarding this topic and would appreciate your comments:
1) So why exactly is initialize()/getObjectInspector() called many times inside mappers? They are called after getRecordReader() which seems even more confusing to me...
2) Assuming there is nothing we can do about the above behavior, do you know of a good way for my code to tell whether I'm inside the mappers or at an earlier stage (in other words, whether I'm calling initialize() for the first time or not)? As mentioned in my previous comment, I'm currently relying on the JobConf object to tell me that, but I would like to get rid of this dependency if possible...
Thanks.
Created 07-07-2016 11:49 AM
Yeah, I'm not sure why they call it multiple times. I think the record reader classes are simply instantiated multiple times during split creation and other setup, for various reasons. In the end your code needs to be able to survive empty calls and avoid duplicating connection objects. I know of no way to fix this. In my example I could tell relatively easily whether the conf object was valid, because my config fields (the storage handler parameters) were not always in the object. I then simply initialized the connection object and made sure not to create it a second time.
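As a rough sketch of that check (plain Java, with hypothetical property names `my.handler.url` and `my.handler.table`, and a bare `Object` standing in for the real connection):

```java
import java.util.Properties;

// Hypothetical sketch: ignore empty confs and open the connection at most once.
class LazyConnectionSketch {
    // Hypothetical storage handler parameters that a usable conf must carry.
    static final String URL_KEY = "my.handler.url";
    static final String TABLE_KEY = "my.handler.table";

    private Object connection;  // placeholder for the real connection object
    int connectionsOpened = 0;  // only here to make the behavior observable

    // A conf is treated as valid only if all handler parameters are present.
    static boolean isUsable(Properties conf) {
        return conf != null
                && conf.getProperty(URL_KEY) != null
                && conf.getProperty(TABLE_KEY) != null;
    }

    void initialize(Properties conf) {
        if (!isUsable(conf) || connection != null) {
            return; // empty call, or already connected: do nothing
        }
        connection = new Object(); // placeholder for opening the real connection
        connectionsOpened++;
    }

    boolean isConnected() {
        return connection != null;
    }
}
```

Calling initialize() with an empty conf is then harmless, and a later call with the full parameters creates the connection exactly once.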
Created 07-14-2016 01:01 AM
I notice that if hive.execution.engine = tez, then SerDe.initialize() is NOT called from the mappers at all (i.e. it goes directly to deserialize()), which is causing me problems. Do you know whether this is expected and what the reasoning is? Thank you.
Created 07-14-2016 09:47 PM
Actually, I missed something: it's still called when using Tez. Never mind the question.