Sqoop 1 has a nice option called --skip-dist-cache that prevents Sqoop from copying its dependencies into the distributed cache every time its MapReduce2 job executes. The application jar is always copied, and a "libjars" sub-directory is created for those dependencies (hundreds of MB of files, which slows down every execution).

As far as I can tell, the jar dependencies come from here: /opt/cloudera/parcels/CDH/lib/sqoop/lib/ All nodes already have this path thanks to CM, but it would also be a small effort to push the jars once per upgrade to a fixed HDFS folder for use with the distributed cache. Oozie somehow does this very thing, but how it does so is not really documented for others to follow.

Does anyone know how to set a local or HDFS path for Sqoop's MR distributed cache? Maybe it is an unexposed setting in the MR2/YARN job that Sqoop creates on the destination side.
For a UDA (user-defined aggregate function), I understand that each Impala execution unit calls init() to set up a locally persisting variable and then calls update() on its data within its own thread. I also understand that the accumulated data are merged between threads and/or nodes before being serialized and finalized on their way to the client. In more complicated cases, multi-variable structures seem to be housed within a StringVal to ensure Impala knows about them.

My question is: when does the serialize() function actually run? Does serialize() happen before or after merge()? The docs make me think it happens after merge(); however, if I have a complicated data structure (set, map, struct, etc.) with space allocated on the heap, it makes sense that Impala will not know about that data unless it is first serialized into a space Impala can see.

Is there a good book, online doc, or diagram that better describes the function flow from init() to finalize() for an aggregate query execution? Even a simple diagram would be a great help.

It would be great to be able to use an object that can grow itself (map, set, etc.) within an aggregate function's execution thread, particularly within update() and merge(). However, if serialization requires translating the contents to a string before the merge, it may or may not be worth the trouble.
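To make the question concrete, here is a minimal, self-contained C++ sketch of the flow as I currently picture it. It does not use the Impala UDF headers at all: DistinctState, SimulateQuery, and the function signatures are hypothetical stand-ins, and the ordering (serialize() on each node's intermediate state before merge() on the receiving side) is an assumption I would like confirmed, not something I have verified against Impala.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <set>
#include <string>
#include <vector>

// Hypothetical intermediate state; NOT an Impala type. The std::set lives
// on the heap, so an engine would not see its contents directly.
struct DistinctState {
  std::set<int32_t> values;
};

// init(): create an empty state for one execution thread.
DistinctState* Init() { return new DistinctState(); }

// update(): fold one input row into the thread-local state.
void Update(DistinctState* state, int32_t v) { state->values.insert(v); }

// serialize(): flatten the heap structure into one contiguous buffer
// (standing in for a StringVal) and release the original state.
std::string Serialize(DistinctState* state) {
  std::string buf;
  buf.reserve(state->values.size() * sizeof(int32_t));
  for (int32_t v : state->values) {
    buf.append(reinterpret_cast<const char*>(&v), sizeof(v));
  }
  delete state;
  return buf;
}

// merge(): read a serialized buffer from another thread/node and fold it
// into the destination state (ASSUMED to run after serialize()).
void Merge(const std::string& buf, DistinctState* dst) {
  assert(buf.size() % sizeof(int32_t) == 0);
  for (std::size_t off = 0; off < buf.size(); off += sizeof(int32_t)) {
    int32_t v;
    std::memcpy(&v, buf.data() + off, sizeof(v));
    dst->values.insert(v);
  }
}

// finalize(): produce the result for the client and free the state.
int64_t Finalize(DistinctState* state) {
  int64_t n = static_cast<int64_t>(state->values.size());
  delete state;
  return n;
}

// Hypothetical driver: one inner vector per node, run in the assumed order
// init -> update -> serialize -> (ship) -> init -> merge -> finalize.
int64_t SimulateQuery(const std::vector<std::vector<int32_t>>& per_node_rows) {
  std::vector<std::string> shipped;
  for (const auto& rows : per_node_rows) {
    DistinctState* local = Init();
    for (int32_t v : rows) Update(local, v);
    shipped.push_back(Serialize(local));
  }
  DistinctState* merged = Init();
  for (const auto& buf : shipped) Merge(buf, merged);
  return Finalize(merged);
}
```

Calling SimulateQuery({{1, 2, 2}, {2, 3}}) returns 3 in this toy model. If the real order were merge() first, the growable set could be passed around as-is and the flattening step would only be needed once at the end, which is exactly the cost I am trying to understand.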