Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Impala and Hive significant difference in result while calculating correlation coefficient

Highlighted

Impala and Hive significant difference in result while calculating correlation coefficient

Explorer

Hadoop Setup :-
8 node Cluster with CDH 5.9
Hive Version :- Hive 1.1.0-cdh5.9.0
Compiled by jenkins on Fri Oct 21 00:54:46 PDT 2016
From source with checksum 9c5d0bee25fab27d28098c3080f8aedc

Impala Version :-
Impala v2.7.0-cdh5.9.0 (4b4cf19) built on Fri Oct 21 01:07:22 PDT 2016

Issue:-
I ran same query on Hive and Impala (through Hue, Screenshot attached)

 
SELECT (AVG(cost_of_liquidity_provision*risk_of_liquidity_provision)- AVG(cost_of_liquidity_provision)*AVG(risk_of_liquidity_provision))/ (1.00000000*STDDEV_POP(cost_of_liquidity_provision)*STDDEV_POP(risk_of_liquidity_provision)) AS corr_coeff
FROM liquidity

The table is in parquet format (non partitioned)

Screenshot

Impala and Hive.PNG

I got different output in two different runs
Hive :- 0.8465 (correct as verified by external application e.g. R)
Impala :- 0.0636

Similar thing happened with another query too.

I have created a StackoverFlow Question and Cloudera Impala Issue [IMPALA-4841] for the same.

Don't have an account?
Coming from Hortonworks? Activate your account here