Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Function execution flow in UDAs and memory implications for using complex data structures

Function execution flow in UDAs and memory implications for using complex data structures

New Contributor

For a UDA (user-defined aggregate function), I understand that the Impala execution units need to update() data within their own respective threads after calling init() for a locally persisting variable. I also understand that the accumulated data are merged between threads and/or nodes before being serialized and finalized on their way to the client. In more complicated cases, multi-variable structures seem to be housed within StringVal to ensure Impala knows about them.

 

  • My question is, when does the serialize() function actually take place? Does serialize() happen before or after merge()? The docs make me think it happens after merge(); however, if I have a complicated data structure (set, map, struct, etc) with space allocated on the heap, it makes sense that Impala will not know about that data unless it is first serialized into a space Impala can see.
  • Is there a good book, online doc, or diagram that better describes the function flow from init() to finalize() for an aggregate query execution? A simple diagram would be of great help.


It would be great to be able to use an object that can grows itself (map, set, etc.) within an aggregate function's execution thread, particularly within the update() and merge(). However, if the serialization component requires translating the contents to a string before the merge, it may or may not be worth the trouble.

1 REPLY 1

Re: Function execution flow in UDAs and memory implications for using complex data structures

Master Collaborator

Your intuition is right for the common case where query execution is entirely in memory - it happens after Merge() before the data needs to be sent over the network. However, if the data is being spilled to disk, the sequence of calls changes - Serialize() may be called before merge in order to write the intermediate value to disk to free up memory.

 

We don't have a flow chart or similar for this, but I think it's a good idea. I filed https://issues.apache.org/jira/browse/IMPALA-8405