Posts: 9
Registered: ‎05-12-2016

Impala and Hive significant difference in result while calculating correlation coefficient

Hadoop Setup :-
8 node Cluster with CDH 5.9
Hive Version :- Hive 1.1.0-cdh5.9.0
Compiled by jenkins on Fri Oct 21 00:54:46 PDT 2016
From source with checksum 9c5d0bee25fab27d28098c3080f8aedc

Impala Version :-
Impala v2.7.0-cdh5.9.0 (4b4cf19) built on Fri Oct 21 01:07:22 PDT 2016

I ran same query on Hive and Impala (through Hue, Screenshot attached)

SELECT (AVG(cost_of_liquidity_provision*risk_of_liquidity_provision)- AVG(cost_of_liquidity_provision)*AVG(risk_of_liquidity_provision))/ (1.00000000*STDDEV_POP(cost_of_liquidity_provision)*STDDEV_POP(risk_of_liquidity_provision)) AS corr_coeff
FROM liquidity

The table is in parquet format (non partitioned)


Impala and Hive.PNG

I got different output in two different runs
Hive :- 0.8465 (correct as verified by external application e.g. R)
Impala :- 0.0636

Similar thing happened with another query too.

I have created a StackoverFlow Question and Cloudera Impala Issue [IMPALA-4841] for the same.