12-28-2018 07:23 PM
@GEORGE NASIS In the case of calculating an average, we will not get the same value. Example:

Hive table data:

select * from i;
+---------+------+--------+-------+--+
| gender  | age  | total  | name  |
+---------+------+--------+-------+--+
| M       | 10   | 12     | st    |
| F       | 10   | 8      | st    |
| M       | 12   | 15     | st    |
+---------+------+--------+-------+--+
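If you want to reproduce this, a minimal table along these lines should work (the column types are guessed from the output above, and insert ... values needs Hive 0.14 or later):

create table i (gender string, age int, total int, name string);
insert into table i values ('M', 10, 12, 'st'), ('F', 10, 8, 'st'), ('M', 12, 15, 'st');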
Calculate the average on one field:

select avg(total) from i group by name;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 11.666666666666666  |
+---------------------+--+

Calculate the average using a subquery:

select avg(tt) from (
  select name, avg(total) tt from i group by gender, name
) t group by name;
+--------+--+
|  _c0   |
+--------+--+
| 10.75  |
+--------+--+
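To see where the 10.75 comes from, you can run the inner query on its own (gender is added to the select list here only to make the groups visible):

select gender, name, avg(total) tt
from i
group by gender, name;
-- with the sample rows above this should return two groups:
-- (M, st) -> (12 + 15) / 2 = 13.5 and (F, st) -> 8.0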
Because the inner query groups by two dimensions (gender, name) while the outer query groups by only one (name), the outer average is taken over 2 rows instead of the original 3, so the results differ: (13.5 + 8) / 2 = 10.75 instead of (12 + 8 + 15) / 3 ≈ 11.67. The correct subquery would carry a count column and divide sum(total) by sum(count):

select sum(tt)/sum(cnt) from (
  select name, sum(total) tt, count(*) cnt, avg(total) av  -- add cnt, sum, and avg columns
  from i group by gender, name
) t group by name;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 11.666666666666666  |
+---------------------+--+

You can try this approach (add a count column, then divide sum(total) by sum(count) to get the correct average) and cache the most appropriate DataFrame; the Spark optimizer will then choose the most efficient plan to run these tasks.
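For example, in Spark SQL the same idea could look roughly like this (a sketch only; the view name pre_agg is made up here, and it assumes the Hive table i is visible to Spark):

cache table pre_agg as
select name, sum(total) tt, count(*) cnt
from i
group by gender, name;

select sum(tt)/sum(cnt) from pre_agg group by name;

Caching the pre-aggregated view means further roll-ups over it reuse the in-memory data instead of rescanning the base table.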