Hive Storage Handlers

Contributor

I have a technical question regarding the implementation of Hive storage handlers.

It's apparent to me now that the following methods in the storage handler are called multiple times in all mappers.

-- MySerDe.initialize()

-- MySerDe.getObjectInspector()

In my case, getting the metadata over and over again is expensive since it requires calls to remote systems. Therefore I wonder if there is a way to

-- somehow save or cache a previously obtained ObjectInspector (since it doesn't change)?

-- or simply pass a string representing the metadata to the MySerDe class in all mappers, so that I can construct the ObjectInspector from the string instead of making remote calls?

-- etc.?

Thanks in advance for any help.

1 ACCEPTED SOLUTION

Master Guru

You should be able to simply cache your access objects in a class variable and only create them if they have not been created already. You just need to be a bit careful, since the functions are sometimes called with empty conf objects first. That's what the JDBC storage handler does.
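A minimal sketch of that caching idea, not taken from the JDBC storage handler itself. The class name `MetadataCache`, the property key `my.handler.table`, and the `remoteLoader` callback are all hypothetical, and `java.util.Properties` stands in for the table properties a SerDe receives in `initialize()`. A call without the expected property is treated as one of the "empty" calls and skipped; otherwise the expensive lookup runs at most once per key.

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class MetadataCache {
    // Hypothetical table property that identifies the remote table.
    static final String TABLE_KEY = "my.handler.table";

    // Static, so it is shared by every SerDe instance created in this JVM.
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    /**
     * Returns the cached metadata for the table named in the properties,
     * invoking remoteLoader only on the first call for that table.
     * Returns null when the properties are empty or lack the key, which
     * is how this sketch recognises a call made before the real
     * configuration is available.
     */
    static String getOrLoad(Properties tbl, Function<String, String> remoteLoader) {
        String table = (tbl == null) ? null : tbl.getProperty(TABLE_KEY);
        if (table == null || table.isEmpty()) {
            return null; // empty call: nothing to load yet
        }
        return CACHE.computeIfAbsent(table, remoteLoader);
    }
}
```

Note that a static field only lives as long as its JVM, so a cache like this avoids repeated lookups within one process but starts out empty in each new task process.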


6 REPLIES

Contributor

Hi @Benjamin Leonhardi, thanks kindly for the response. I was able to resolve it via the JobConf object, since it persists throughout the job. I save the string form of the metadata in the JobConf the first time and then only need to read it back in the mappers. Of course, I can also simply check whether this string exists in the JobConf to know whether I'm doing it for the first time or whether I'm in the mappers. Does this all sound reasonable to you?
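For illustration, the JobConf approach described above might look roughly like the following. This is only a sketch: `java.util.Properties` stands in for the JobConf so the example has no Hadoop dependency (with a real JobConf the equivalent calls would be `get(String)` and `set(String, String)`), and the class name, the key `my.handler.cached.metadata`, and the `remoteFetch` callback are hypothetical.

```java
import java.util.Properties;
import java.util.function.Supplier;

final class ConfBackedMetadata {
    // Hypothetical key; any name unlikely to collide with other settings would do.
    static final String META_KEY = "my.handler.cached.metadata";

    /**
     * Returns the metadata string, calling remoteFetch only when the conf
     * does not carry the string yet. The presence of the key doubles as
     * the "have I been here before?" check.
     */
    static String resolve(Properties conf, Supplier<String> remoteFetch) {
        String cached = conf.getProperty(META_KEY);
        if (cached != null) {
            return cached; // later calls: no remote access needed
        }
        String fetched = remoteFetch.get(); // first call: expensive lookup
        conf.setProperty(META_KEY, fetched);
        return fetched;
    }
}
```

The ObjectInspector would then be built from the returned string rather than from a remote call; how that string is encoded and parsed is left open here.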

I would still like to understand your approach. By access objects did you mean the OI? As far as I could tell class objects will always be null when called from mappers?

BTW I have another critical question and I would appreciate your comments there as well -- https://community.hortonworks.com/questions/43603/hive-storage-handlers-control-thread.html.

Contributor

Hi @Benjamin Leonhardi, I still have the following questions regarding this topic and would appreciate your comments:

1) So why exactly is initialize()/getObjectInspector() called many times inside mappers? They are called after getRecordReader() which seems even more confusing to me...

2) Assuming there is nothing we can do about the above behavior, do you know of a good way for my code to tell whether I'm inside the mappers or prior to that (in other words, am I calling initialize() for the first time or not?)? As mentioned in my previous comment, I'm currently relying on the jobConf object to tell me that, but I would like to get rid of this dependency if possible...

Thanks.

Master Guru

Yeah, I'm not sure why they call it multiple times. I think the record reader classes are simply instantiated multiple times during split creation and other setup steps, for various reasons. In the end your code needs to be able to survive empty calls and avoid duplicating connection objects. I know of no way to fix this. In my example I could tell relatively easily whether the conf object was valid, because my config fields (the storage handler parameters) were not always in the object. I then simply initialized the connection object and made sure not to create it a second time.

Contributor

Hi @Benjamin Leonhardi,

I notice that if hive.execution.engine = tez, then SerDe.initialize() is NOT called from the mappers at all (i.e. it goes directly to deserialize()), which is causing me problems. Do you know whether this is expected and what the reasoning is? Thank you.

Contributor

@Benjamin Leonhardi

Actually, I missed something: it's still called when using tez. Never mind the question.