We're using a small (6-node) HDFS/Impala cluster to develop an analytics project. The one feature we're really missing is percentile aggregation, so that alongside the avg, min, max and standard deviation values we can also see the 95th and 99th percentiles for grouped results.
In the latest "What's next for Impala..." blog post (July 2015), the "...addition of new SQL and vendor-specific language extensions and data types based on customer feedback" is planned for later in 2016. Are percentile aggregation functions, equivalent to Hive's percentile() or percentile_approx(), intended to be included?
Alternatively, since Impala 2.3 (released in November), User-Defined Aggregation Functions (UDAFs) have been available. Has anyone written a percentile aggregation function that they would be happy to share or open source?
Does anyone use a third-party technology or library to calculate percentiles with Impala?
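To make concrete what we're after, here is a minimal sketch in plain Python of the per-group calculation. It is not Impala or UDAF code: it assumes the (group, value) rows have already been fetched to the client, and the names `percentile` and `grouped_percentiles` are made up for illustration. It uses linear interpolation between the two closest ranks, which is only one of several common percentile definitions and may not match Hive's results exactly.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Return the p-th percentile of values, with p given as a fraction in [0, 1].

    Uses linear interpolation between the two closest ranks.
    Returns None for an empty input.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if not values:
        return None
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    rank = p * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return float(ordered[lower])
    # Interpolate between the neighbouring values.
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def grouped_percentiles(rows, percentiles=(0.95, 0.99)):
    """Compute the requested percentiles for each group.

    rows is an iterable of (group_key, numeric_value) pairs.
    Returns {group_key: {p: percentile_value}}.
    """
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {
        key: {p: percentile(vals, p) for p in percentiles}
        for key, vals in groups.items()
    }
```

This holds every value of a group in memory, so it only suits small result sets; doing the same thing inside the cluster is exactly what we'd like a built-in function or UDAF for.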
Many thanks for any help you can provide.