For a UDA (user-defined aggregate function), I understand that each Impala execution unit calls init() to set up a locally persisting variable and then calls update() on its data within its own thread. I also understand that the accumulated data are merged between threads and/or nodes before being serialized and finalized on their way to the client. In more complicated cases, multi-variable structures seem to be housed within a StringVal so that Impala knows about them.

My question is: when does serialize() actually run? Does it happen before or after merge()? The docs make me think it happens after merge(). However, if I have a complicated data structure (set, map, struct, etc.) with space allocated on the heap, it makes sense that Impala will not know about that data unless it is first serialized into a space Impala can see.

Is there a good book, online doc, or diagram that better describes the function flow from init() to finalize() for an aggregate query execution? A simple diagram would be of great help. It would be great to be able to use an object that can grow itself (map, set, etc.) within an aggregate function's execution thread, particularly within update() and merge(). However, if serialization requires translating the contents to a string before the merge, it may or may not be worth the trouble.
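To make the question concrete, here is a minimal sketch of the flow I have in mind, written in plain C++ with stand-in functions rather than the real Impala API. Every name here (State, uda_init, uda_update, uda_serialize, uda_merge, uda_finalize, run_two_node_example) is hypothetical, and the ordering it shows (serialize on each node first, merge afterwards) is an assumption about how the engine behaves, not something I have confirmed. A std::string stands in for the role StringVal would play.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <sstream>
#include <string>

// Hypothetical stand-ins for a UDA's callbacks; this is NOT the Impala API.
// The per-node intermediate state is a heap-backed, self-growing container.
using State = std::set<int64_t>;

// init(): each node/thread starts from an empty state.
State uda_init() { return State{}; }

// update(): fold one input row into the local state.
void uda_update(State& s, int64_t v) { s.insert(v); }

// serialize(): flatten the container into a flat byte string,
// the only form that could be shipped to another node.
std::string uda_serialize(const State& s) {
    std::ostringstream out;
    for (int64_t v : s) out << v << ' ';
    return out.str();
}

// merge(): ASSUMED to receive already-serialized state from another node,
// so it has to parse the string before folding it into dst.
void uda_merge(const std::string& src, State& dst) {
    std::istringstream in(src);
    int64_t v;
    while (in >> v) dst.insert(v);
}

// finalize(): produce the client-facing result (here, a distinct count).
int64_t uda_finalize(const State& s) {
    return static_cast<int64_t>(s.size());
}

// Simulates two nodes followed by a merge step on a third.
int64_t run_two_node_example() {
    State a = uda_init();
    for (int64_t v : {1, 2, 2, 3}) uda_update(a, v);

    State b = uda_init();
    for (int64_t v : {3, 4}) uda_update(b, v);

    // Assumed order: serialize on each node, then merge elsewhere.
    std::string wire_a = uda_serialize(a);
    std::string wire_b = uda_serialize(b);

    State merged = uda_init();
    uda_merge(wire_a, merged);
    uda_merge(wire_b, merged);
    return uda_finalize(merged);
}
```

Calling run_two_node_example() returns 4, the number of distinct values across both simulated nodes. If the real order were merge first and serialize afterwards, uda_merge could take a State directly and the string parsing step would disappear, which is exactly the cost I am trying to weigh.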