Created on 07-06-2015 02:23 PM - edited 09-16-2022 02:33 AM
Hi,
I have been working on creating custom User Defined Aggregate Functions (UDAF), looking at the example provided here.
In that example, variance and standard deviation are calculated on an Impala column.
Since we have quite a lot of variables to track, such as sum, sum of squares and count, we use a C++ struct and serialize it as a string, so that the data is passed between the init, update, merge and finalize phases.
My questions:
1) Can we have a Double return type instead of String?
2) Where can we find the implementations of Impala built-in functions like min(), sum() and max(), since these functions return double?
Any suggestions are welcome. Thanks !!!
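For readers following along, the struct-as-string approach described in this post can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: it does not use the Impala UDF SDK types, and the names VarianceState, Serialize, Deserialize, Init, Update, Merge and Finalize are hypothetical stand-ins for the UDA phases. Finalize returns a string here because, per this thread, the UDA's return type had to match the string intermediate type.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical intermediate state: the running values named in the post.
struct VarianceState {
  double sum;
  double sum_squares;
  int64_t count;
};

// Copy the struct's raw bytes into a string so it can travel between phases.
std::string Serialize(const VarianceState& s) {
  return std::string(reinterpret_cast<const char*>(&s), sizeof(s));
}

// Rebuild the struct from the byte string produced by Serialize.
VarianceState Deserialize(const std::string& buf) {
  assert(buf.size() == sizeof(VarianceState));
  VarianceState s;
  std::memcpy(&s, buf.data(), sizeof(s));
  return s;
}

// Init phase: start from an empty state.
std::string Init() { return Serialize(VarianceState{0.0, 0.0, 0}); }

// Update phase: fold one input value into the state.
std::string Update(const std::string& buf, double x) {
  VarianceState s = Deserialize(buf);
  s.sum += x;
  s.sum_squares += x * x;
  s.count += 1;
  return Serialize(s);
}

// Merge phase: combine two partial states computed on different nodes.
std::string Merge(const std::string& a, const std::string& b) {
  VarianceState x = Deserialize(a);
  const VarianceState y = Deserialize(b);
  x.sum += y.sum;
  x.sum_squares += y.sum_squares;
  x.count += y.count;
  return Serialize(x);
}

// Finalize phase: population variance, formatted as a string.
// An empty input yields "0.000000" in this sketch (a real UDA would
// more likely return NULL).
std::string Finalize(const std::string& buf) {
  const VarianceState s = Deserialize(buf);
  if (s.count == 0) return std::to_string(0.0);
  const double mean = s.sum / s.count;
  return std::to_string(s.sum_squares / s.count - mean * mean);
}
```

With this shape, a caller that wants a number has to convert the finalized string back (for example with std::stod), which is the inconvenience question 1 is asking about.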
Created 07-07-2015 01:13 AM
I'm afraid you may have to wait until we resolve:
https://issues.cloudera.org/browse/IMPALA-1829
For the Impala built-ins, you can have a look at:
IMPALA_HOME/be/src/exprs/aggregate-functions.h
IMPALA_HOME/be/src/exprs/aggregate-functions.cc
Created on 07-07-2015 07:37 AM - edited 07-07-2015 07:41 AM
Hi Alex,
Thanks for your fast response.
I have a couple more questions 😛
1) I see that KnuthVariance returns Double, but when I try the same in my code, with the Finalize function returning a Double, I get:
Analysis Exception: Could not find function func_nameUpdate(double,double,double) returns double in 'HDFS_so_filepath' Check that function name, arguments and return types are correct.
I am curious how the built-in functions have that feature.
Please let me know if I am missing something.
2) If Cloudera needs to fix it, please let me know in which CDH and Impala version the fix might be released. Thanks !!!
Created 07-13-2015 07:03 PM
1) The Knuth variance is an Impala built-in. Internally, Impala can handle aggregate functions with different intermediate and output types.
Basically, the only reason you are not allowed to create UDAs with different intermediate/output types is that we have not enabled the feature in semantic analysis.
For us, enabling the feature is the easy part. Adding extensive testing is the hard part.
If you are curious, the check for preventing you from creating such UDAs is in:
./fe/src/main/java/com/cloudera/impala/analysis/CreateUdaStmt.java
line 137 and following
2) Like I said, enabling the feature is not hard, but does involve a non-trivial QA effort, so I cannot promise a concrete release at this point. I'd recommend keeping an eye on that JIRA for updates to the target version.
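To illustrate what "different intermediate and output types" means for a variance aggregate, here is a minimal standalone sketch of a Knuth-style (Welford) running variance, where the intermediate state is a struct and the final output is a plain double. This is not Impala's actual code from aggregate-functions.cc. The names KnuthState, KnuthUpdate, KnuthMerge and KnuthFinalize are hypothetical, and the merge step uses a commonly used pairwise-combination formula that is an assumption of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical intermediate type: running mean, sum of squared
// deviations from the mean (m2), and row count.
struct KnuthState {
  double mean;
  double m2;
  int64_t count;
};

// Update: Knuth/Welford single-pass update for one new value.
void KnuthUpdate(KnuthState* s, double x) {
  assert(s != nullptr);
  s->count += 1;
  const double delta = x - s->mean;
  s->mean += delta / s->count;
  s->m2 += delta * (x - s->mean);
}

// Merge: combine two partial states (pairwise formula, assumed here).
void KnuthMerge(KnuthState* dst, const KnuthState& src) {
  assert(dst != nullptr);
  if (src.count == 0) return;
  if (dst->count == 0) {
    *dst = src;
    return;
  }
  const int64_t n = dst->count + src.count;
  const double delta = src.mean - dst->mean;
  dst->m2 += src.m2 + delta * delta * dst->count * src.count / n;
  dst->mean += delta * src.count / n;
  dst->count = n;
}

// Finalize: the output type (double) differs from the intermediate
// type (KnuthState). Returns the population variance, or 0.0 for an
// empty input in this sketch.
double KnuthFinalize(const KnuthState& s) {
  return s.count == 0 ? 0.0 : s.m2 / s.count;
}
```

The point of the sketch is only the shape: Update and Merge operate on the struct, while Finalize maps it to a double. That struct-in, double-out combination is what the check in CreateUdaStmt.java rejects for user-defined aggregates until the feature is enabled.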