I found that Impala returns wrong results when the result of a Hive UDF is used in a GROUP BY clause. The Impala version is 2.7.0-cdh5-IMPALA_KUDU-cdh5 RELEASE. Here are the steps to reproduce the problem:
impala> create table test_escape_group_by (s string);
impala> insert into table test_escape_group_by values("longstring"), ("short");
impala> select my_escape_string(s) as es from test_escape_group_by;
longstring
short
impala> select my_escape_string(s) as es from test_escape_group_by group by es;
shorttring
short
We can see that in the GROUP BY result the first five characters of 'longstring' have been overwritten by 'short'. Here is the definition of my_escape_string:
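import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class MyEscapeString extends UDF
{
    // Escapes backslashes and double quotes; null or empty input yields "".
    public Text evaluate(Text para) throws ParseException {
        if ((null == para) || ("".equals(para.toString()))) {
            return new Text("");
        }
        // Backslashes are escaped first so the ones added for quotes are not doubled.
        return new Text(para.toString().replace("\\", "\\\\").replace("\"", "\\\""));
    }
}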
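For what it's worth, 'shorttring' is exactly what you would get if the bytes of 'short' were written over the start of a buffer that still holds 'longstring', while the first value keeps pointing at that buffer with its old length of 10. I have not confirmed that this is what Impala does internally; the sketch below only reproduces the symptom in plain Java. The names BufferReuseSketch, Slice and store are made up for illustration and are not part of Impala or Hive, and the 64-byte scratch buffer is an arbitrary choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BufferReuseSketch {
    // Hypothetical stand-in for a stored string value: a byte array plus a length.
    static final class Slice {
        final byte[] buf;
        final int len;
        Slice(byte[] buf, int len) { this.buf = buf; this.len = len; }
        String read() { return new String(buf, 0, len, StandardCharsets.UTF_8); }
    }

    // Writes each value into one shared scratch buffer, then keeps either a view
    // of that buffer (copy == false) or a private copy of the bytes (copy == true).
    // Assumes every value fits into the 64-byte scratch buffer.
    static List<String> store(String[] values, boolean copy) {
        byte[] scratch = new byte[64];
        List<Slice> kept = new ArrayList<>();
        for (String v : values) {
            byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(bytes, 0, scratch, 0, bytes.length);
            byte[] backing = copy ? Arrays.copyOf(scratch, bytes.length) : scratch;
            kept.add(new Slice(backing, bytes.length));
        }
        // Read everything back only after all values have been written.
        List<String> out = new ArrayList<>();
        for (Slice s : kept) {
            out.add(s.read());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] rows = {"longstring", "short"};
        System.out.println(store(rows, false)); // [shorttring, short]
        System.out.println(store(rows, true));  // [longstring, short]
    }
}
```

Without the copy, the first value reads back as 'shorttring', matching the GROUP BY output above; with the copy, both values survive. If something like this is happening, the overwrite would be on the side that stores the UDF result rather than in the escape logic itself, since evaluate() returns a new Text on every call.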
My question: is this a bug in Impala, or how can I rewrite the Java UDF to avoid such errors?