Created on 09-13-2017 11:06 AM - edited 09-16-2022 05:14 AM
I am writing a UDA in Impala and have gone through the documentation
I am not able to understand clearly the way memory is to be allocated in Impala UDAs.
In udf.h (links to source) , its mentioned,
For allocations that are not returned to Impala, the UDA should use the FunctionContext::Allocate()/Free() methods. For StringVal allocations returned to Impala (e.g. returned by UdaSerialize()), the UDA should allocate the result via StringVal(FunctionContext*, int) ctor or the function StringVal::CopyFrom(FunctionContext*, const uint8_t*, size_t)
I'm looking at the StringConcat example in the samples and am confused by the two variants of the StringConcat example.
At one place (links to source) , Allocate and Free are used,
void StringConcatUpdate(FunctionContext* context, const StringVal& str, const StringVal& separator, StringVal* result) { ... uint8_t* copy = context->Allocate(str.len); ... }
whereas at another place (links to source) StringVal::CopyFrom and StringVal constructor are used.
void StringConcatUpdate(FunctionContext* context, const StringVal& arg1, const StringVal& arg2, StringVal* val) { ... *val = StringVal::CopyFrom(context, arg1.ptr, arg1.len); ... StringVal new_val(context, new_len); ... }
Both of these are exactly the same example !
As per the comment in udf.h, Allocate/Free is to be used when allocations are not returned to Impala. The Update function doesnt return allocation to Impala. The second example which i pointed to uses StringVal method whereas the first one uses Allocate/Free.
Question 1) Can Allocate/Free and the constructor method be used interchangably ? Which of these 2 examples is correct from a memory allocation perspective ?
There is another function - "Avg" implemented differently w.r.t memory allocations.
In the first example, Allocate is used
void AvgInit(FunctionContext* context, StringVal* val) { val->is_null = false; val->len = sizeof(AvgStruct); val->ptr = context->Allocate(val->len); memset(val->ptr, 0, val->len); }
whereas in the second example nothing is used to allocate memory (not even constructor).
void AvgInit(FunctionContext* context, BufferVal* val) { static_assert(sizeof(AvgStruct) == 16, "AvgStruct is an unexpected size"); memset(*val, 0, sizeof(AvgStruct)); }
Question 2) How does this one work without any memory allocation ?
Created 09-14-2017 02:42 PM
Very astute questions!
The version of StringConcatUpdate() in impala-udf-samples is correct. The use of the "local" allocation in the second version of StringConcatUpdate() is incorrect. I filed a bug to correct that: https://issues.apache.org/jira/browse/IMPALA-5939. The problem is that the StringVal() constructor and StringVal::CopyFrom() use AllocateLocal() behind the scenes. Your UDA does not own the memory returned by AllocateLocal() and it will be automatically cleaned up by Impala at some point after your Update function returns.
It's a bit unfortunate that the two sets of examples have diverged. I recommend looking at https://github.com/cloudera/impala-udf-samples/ because that's intended to be the public-facing version and I think is more up to date. You may be also be interested in this PR https://github.com/cloudera/impala-udf-samples/pull/18, which improves the UDF examples to better handle failed memory allocations.
With regards to 2). That is a builtin aggregate function that uses some internal functionality that we added recently. Some builtin functions only require a fixed-size intermediate value, so there's a way to declare this and have it preallocated by the Impala runtime. That functionality isn't exposed to UDAs for now.
Created 09-14-2017 02:42 PM
Very astute questions!
The version of StringConcatUpdate() in impala-udf-samples is correct. The use of the "local" allocation in the second version of StringConcatUpdate() is incorrect. I filed a bug to correct that: https://issues.apache.org/jira/browse/IMPALA-5939. The problem is that the StringVal() constructor and StringVal::CopyFrom() use AllocateLocal() behind the scenes. Your UDA does not own the memory returned by AllocateLocal() and it will be automatically cleaned up by Impala at some point after your Update function returns.
It's a bit unfortunate that the two sets of examples have diverged. I recommend looking at https://github.com/cloudera/impala-udf-samples/ because that's intended to be the public-facing version and I think is more up to date. You may be also be interested in this PR https://github.com/cloudera/impala-udf-samples/pull/18, which improves the UDF examples to better handle failed memory allocations.
With regards to 2). That is a builtin aggregate function that uses some internal functionality that we added recently. Some builtin functions only require a fixed-size intermediate value, so there's a way to declare this and have it preallocated by the Impala runtime. That functionality isn't exposed to UDAs for now.
Created on 09-14-2017 09:35 PM - edited 09-14-2017 10:24 PM
Thanks Tim for the detailed and concise reply ! Special thanks for pointing the pull request.
Its really nice to see one of the impala committers answer my query 🙂
The source of all confusion was the header file in impala-udf-samples
https://github.com/cloudera/impala-udf-samples/blob/master/uda-sample.h
// Note: As of Impala 1.2, UDAs must have the same intermediate and result types (see the // udf.h header for the full Impala UDA specification, which can be found at // https://github.com/cloudera/impala/blob/master/be/src/udf/udf.h). ... ...
This made me believe that this repo is old.
Also, the code in .h and .cc file for impala-udf-samples still doesnt use intermediate type that is being used in the samples in https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/udf_samples/uda-sample.cc repo
It would be nice to have impala-udf-samples updated with intermediate types.
Created 09-29-2017 05:29 PM
Yeah I agree - I'd like to spend some time cleaning that up 🙂