Support Questions
Find answers, ask questions, and share your expertise

how to handle too many aggregation operation in spark scala?

Contributor

Hi,

In my requirement ,reading the table from hive(Size - around 1 TB)i have to do too many aggregation operation mostly avg & sum. I tried following code.its running for long time .Is there another way to optimize or efficient wasy of handling multiple agg operatin

  1. finalDF.groupBy($"Dseq", $"FmNum", $"yrs",$"mnt",$"FromDnsty").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),
     avg($"neg"),avg($"Rd"),avg("savg"),avg("slavg"),avg($"dex"),avg("cur"),avg($"Nexp"), avg($"NExpp"),avg($"Psat"),
     avg($"Pexps"),avg($"Pxn"),avg($"Pn"),avg($"AP3"),avg($"APd"),avg($"RInd"),avg($"CP"),avg($"CScr"),
     avg($"Fspct7p1"), avg($"Fspts7p1"),avg($"TlpScore"),avg($"Ordrs"),avg($"Drs"),
     avg("Lns"),avg("Judg"),avg("ds"),
     avg("ob"),sum("Ss"),sum("dol"),sum("liens"),sum("pct"),
     sum("jud"),sum("sljd"),sum("pNB"),avg("pctt"),sum($"Dolneg"),sum("Ls"),sum("sl"),sum($"PA"),sum($"DS"),
     sum($"DA"),sum("dcur"),sum($"sat"),sum($"Pes"),sum($"Pn"),sum($"Pn"),sum($"Dlo"),sum($"Dol"),sum("pdol"),sum("pct"),sum("judg"))

Note - I am using Spark scala

2 REPLIES 2

Expert Contributor

How long does it take? I'm wondering if you could give us the number, for examples, for your query and a simplifed your query like the following.

q1 = finalDF.groupBy($"Dseq", $"FmNum", $"yrs",$"mnt",$"FromDnsty").agg(...) // Your query
q2 = finalDF.groupBy($"Dseq", $"FmNum", $"yrs",$"mnt",$"FromDnsty").agg(count($"Dseq"),avg($"Emp"),sum("Ss")) // Simplified your query.

New Contributor

Hi Dongjoon,

My query is used nearly 200 aggregation functions. Ofcourse, if i reduce number of aggregations performance would improve. But if i do simplify query as mentioned above, how many query do i need to create and i am wondering that multiple times group by operation is also involving.

Regards,

Kannan