Created 07-12-2017 01:30 PM
Hi,
In my requirement ,reading the table from hive(Size - around 1 TB)i have to do too many aggregation operation mostly avg & sum. I tried following code.its running for long time .Is there another way to optimize or efficient wasy of handling multiple agg operatin
finalDF.groupBy($"Dseq", $"FmNum", $"yrs",$"mnt",$"FromDnsty").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),
avg($"neg"),avg($"Rd"),avg("savg"),avg("slavg"),avg($"dex"),avg("cur"),avg($"Nexp"), avg($"NExpp"),avg($"Psat"),
avg($"Pexps"),avg($"Pxn"),avg($"Pn"),avg($"AP3"),avg($"APd"),avg($"RInd"),avg($"CP"),avg($"CScr"),
avg($"Fspct7p1"), avg($"Fspts7p1"),avg($"TlpScore"),avg($"Ordrs"),avg($"Drs"),
avg("Lns"),avg("Judg"),avg("ds"),
avg("ob"),sum("Ss"),sum("dol"),sum("liens"),sum("pct"),
sum("jud"),sum("sljd"),sum("pNB"),avg("pctt"),sum($"Dolneg"),sum("Ls"),sum("sl"),sum($"PA"),sum($"DS"),
sum($"DA"),sum("dcur"),sum($"sat"),sum($"Pes"),sum($"Pn"),sum($"Pn"),sum($"Dlo"),sum($"Dol"),sum("pdol"),sum("pct"),sum("judg"))
Note - I am using Spark scala
Created 07-12-2017 07:39 PM
How long does it take? I'm wondering if you could give us the number, for examples, for your query and a simplifed your query like the following.
q1 = finalDF.groupBy($"Dseq", $"FmNum", $"yrs",$"mnt",$"FromDnsty").agg(...) // Your query q2 = finalDF.groupBy($"Dseq", $"FmNum", $"yrs",$"mnt",$"FromDnsty").agg(count($"Dseq"),avg($"Emp"),sum("Ss")) // Simplified your query.
Created 07-30-2017 11:37 AM
Hi Dongjoon,
My query is used nearly 200 aggregation functions. Ofcourse, if i reduce number of aggregations performance would improve. But if i do simplify query as mentioned above, how many query do i need to create and i am wondering that multiple times group by operation is also involving.
Regards,
Kannan