12-28-2018 07:23 PM
@GEORGE NASIS In the case of calculating an average, we will not get the same value. Example:

Hive table data:

select * from i;
+---------+------+--------+-------+--+
| gender  | age  | total  | name  |
+---------+------+--------+-------+--+
| M       | 10   | 12     | st    |
| F       | 10   | 8      | st    |
| M       | 12   | 15     | st    |
+---------+------+--------+-------+--+
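If you want to reproduce this, a minimal table along these lines should work (the column types are guessed from the output above, and insert ... values needs Hive 0.14 or later):

create table i (gender string, age int, total int, name string);
insert into table i values ('M', 10, 12, 'st'), ('F', 10, 8, 'st'), ('M', 12, 15, 'st');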
Calculate the average on one field:

select avg(total) from i group by name;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 11.666666666666666  |
+---------------------+--+

Calculate the average using a subquery:

select avg(tt) from (
  select name, avg(total) tt from i group by gender, name
) t group by name;
+--------+--+
|  _c0   |
+--------+--+
| 10.75  |
+--------+--+
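To see where the 10.75 comes from, you can run the inner query on its own (gender is added to the select list here only to make the groups visible):

select gender, name, avg(total) tt
from i
group by gender, name;
-- with the sample rows above this should return two groups:
-- (M, st) -> (12 + 15) / 2 = 13.5 and (F, st) -> 8.0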
Because the inner query groups by two dimensions (gender, name) while the outer query groups by only one (name), the outer average is taken over 2 rows instead of the original 3, so the results differ: (13.5 + 8) / 2 = 10.75 instead of (12 + 8 + 15) / 3 ≈ 11.67. The correct subquery would carry a count column and divide sum(total) by sum(count):

select sum(tt)/sum(cnt) from (
  select name, sum(total) tt, count(*) cnt, avg(total) av  -- add cnt, sum, and avg columns
  from i group by gender, name
) t group by name;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 11.666666666666666  |
+---------------------+--+

You can try this approach (add a count column, then divide sum(total) by sum(count) to get the correct average) and cache the most appropriate DataFrame; the Spark optimizer will then choose the most efficient plan to run these tasks.
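For example, in Spark SQL the same idea could look roughly like this (a sketch only; the view name pre_agg is made up here, and it assumes the Hive table i is visible to Spark):

cache table pre_agg as
select name, sum(total) tt, count(*) cnt
from i
group by gender, name;

select sum(tt)/sum(cnt) from pre_agg group by name;

Caching the pre-aggregated view means further roll-ups over it reuse the in-memory data instead of rescanning the base table.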