<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Scala - Spark Boost GroupBy Computing for multiple Dimensions in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Scala-Spark-Boost-GroupBy-Computing-for-multiple-Dimensions/m-p/239536#M85827</link>
    <description>&lt;A rel="user" href="https://community.cloudera.com/users/103246/giorgionasis.html" nodeid="103246"&gt;@GEORGE NASIS&lt;/A&gt;&lt;P&gt;In case of caluculating average we are not going to get same value.&lt;BR /&gt;&lt;STRONG&gt;&lt;U&gt;&lt;EM&gt;Example:&lt;/EM&gt;&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Hive table data:&lt;/STRONG&gt; &lt;/P&gt;&lt;PRE&gt;select * from i;
+---------+------+--------+-------+--+
| gender  | age  | total  | name  |
+---------+------+--------+-------+--+
| M       | 10   | 12     | st    |
| F       | 10   | 8      | st    |
| M       | 12   | 15     | st    |
+---------+------+--------+-------+--+&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Caluculate Avg on one field:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;select avg(total) from i group by name;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 11.666666666666666  |
+---------------------+--+&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Calculate avg using subquery:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;select avg(tt) from (
        select name,avg(total)tt from i group by gender,name)t &lt;BR /&gt;group by name;
+--------+--+
|  _c0   |
+--------+--+
| 10.75  |
+--------+--+&lt;/PRE&gt;&lt;P&gt;as we are doing 2 dimensions in innerquery then 1D group by on outer query we can see the difference between averages because number of rows are 2 in outer query instead of 3.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;&lt;U&gt;Correct subquery would be:&lt;/U&gt;&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;select sum(tt)/sum(cnt) from(&lt;BR /&gt;                   select name,sum(total)tt,count(*)cnt,avg(total) from i group by gender,name //add cnt column,sum,avg)t &lt;BR /&gt;group by name;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 11.666666666666666  |
+---------------------+--+&lt;/PRE&gt;&lt;P&gt;You can try with this approach (adding&lt;STRONG&gt; count column then sum(total) divide sum(count)&lt;/STRONG&gt; to get correct avg results) and cache the most appropriate dataframe then spark optimizer will choose the most efficient plan to run these tasks.&lt;/P&gt;</description>
    <pubDate>Sat, 29 Dec 2018 03:23:44 GMT</pubDate>
    <dc:creator>Shu_ashu</dc:creator>
    <dc:date>2018-12-29T03:23:44Z</dc:date>
    <item>
      <title>Scala - Spark Boost GroupBy Computing for multiple Dimensions</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Scala-Spark-Boost-GroupBy-Computing-for-multiple-Dimensions/m-p/239533#M85824</link>
      <description>&lt;P&gt;My goal is to create a Cube of 4 Dimensions and 1 Measure. 
This means I have in total 16 GroupBy`s to compute. &lt;/P&gt;&lt;P&gt;
In my code you can see the 4 Dimensions (Gender,Age,TotalChildren,ProductCategoryName) and the Measure TotalCost. &lt;/P&gt;&lt;P&gt;I have filter all my columns to drop any row that it is null. 
After that I compute every GroupBy one by one and then I use coalesce() to bind the csv`s into one file. &lt;/P&gt;&lt;P&gt;
All the process takes about 10 minutes which I think is too much. 
Is there any way to enhance the process? Maybe by computing some groupby`s from others?
Also my data is about 5GB so if I read it 16 times as the number of groupby`s this means in total 80GB. &lt;/P&gt;&lt;P&gt;Thank you very much&lt;/P&gt;&lt;P&gt;Here is my Code &lt;/P&gt;&lt;PRE&gt;import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf


object ComputeCube {

def main(args:Array[String]):Unit= {

val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("SparkProject2018")
  .getOrCreate()

import spark.implicits._

val filePath="src/main/resources/dataspark.txt"

val df = spark.read.options(Map("inferSchema"-&amp;gt;"true","delimiter"-&amp;gt;"|","header"-&amp;gt;"true"))
  .csv(filePath).select("Gender", "BirthDate", "TotalCost", "TotalChildren", "ProductCategoryName")

val df2 = df
  .filter("Gender is not null")
  .filter("BirthDate is not null")
  .filter("TotalChildren is not null")
  .filter("ProductCategoryName is not null")

val currentDate = udf{ (dob: java.sql.Date) =&amp;gt;
  import java.time.{LocalDate, Period}
  Period.between(dob.toLocalDate, LocalDate.now).getYears
}

val df3 = df2.withColumn("Age", currentDate($"BirthDate"))


val groupByAll = df3.groupBy("Gender","Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")

val groupByGenderAndAgeAndTotalChildren = df3.groupBy("Gender","Age", "TotalChildren").avg("TotalCost")

val groupByGenderAndAgeAndProductCategoryName = df3.groupBy("Gender","Age", "ProductCategoryName" ).avg("TotalCost")

val groupByGenderAndTotalChildrenAndProductCategoryName = df3.groupBy("Gender", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")

val groupByAgeAndTotalChildrenAndProductCategoryName = df3.groupBy("Age", "TotalChildren", "ProductCategoryName" ).avg("TotalCost")

val groupByGenderAndAge = df3.groupBy("Gender","Age").avg("TotalCost")

val groupByGenderAndTotalChildren = df3.groupBy("Gender","TotalChildren").avg("TotalCost")

val groupByGenderAndProductCategoryName = df3.groupBy("Gender","ProductCategoryName" ).avg("TotalCost")

val groupByAgeAndTotalChildren = df3.groupBy("Age","TotalChildren").avg("TotalCost")

val groupByAgeAndProductCategoryName = df3.groupBy("Age","ProductCategoryName" ).avg("TotalCost")

val groupByTotalChildrenAndProductCategoryName = df3.groupBy("TotalChildren","ProductCategoryName" ).avg("TotalCost")

val groupByGender = df3.groupBy("Gender").avg("TotalCost")

val groupByAge = df3.groupBy("Age").avg("TotalCost")

val groupByTotalChildren = df3.groupBy("TotalChildren" ).avg("TotalCost")

val groupByProductCategoryName = df3.groupBy("ProductCategoryName" ).avg("TotalCost")

val groupByNone = df3.groupBy().avg("TotalCost")


groupByAll.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/All.csv")

groupByGenderAndAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Gender_Age_TotalChildren.csv")

groupByGenderAndAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Gender_Age_ProductCategoryName.csv")

groupByGenderAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Gender_TotalChildren_ProductCategoryName.csv")

groupByAgeAndTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Age_TotalChildren_ProductCategoryName.csv")

groupByGenderAndAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Gender_Age.csv")

groupByGenderAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Gender_TotalChildren.csv")

groupByGenderAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Gender_ProductCategoryName.csv")

groupByAgeAndTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Age_TotalChildren.csv")

groupByAgeAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Age_ProductCategoryName.csv")

groupByTotalChildrenAndProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/TotalChildren_ProductCategoryName.csv")

groupByGender.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Gender.csv")

groupByAge.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/Age.csv")

groupByTotalChildren.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/TotalChildren.csv")

groupByProductCategoryName.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/ProductCategoryName.csv")

groupByNone.coalesce(1).write.format("csv").option("delimiter", "|").option("header", "true")
  .mode("overwrite").save("src/main/resources/None.csv")

  }
 }
&lt;/PRE&gt;</description>
      <pubDate>Fri, 28 Dec 2018 08:40:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Scala-Spark-Boost-GroupBy-Computing-for-multiple-Dimensions/m-p/239533#M85824</guid>
      <dc:creator>giorgionasis</dc:creator>
      <dc:date>2018-12-28T08:40:05Z</dc:date>
    </item>
    <item>
      <title>Re: Scala - Spark Boost GroupBy Computing for multiple Dimensions</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Scala-Spark-Boost-GroupBy-Computing-for-multiple-Dimensions/m-p/239534#M85825</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/103246/giorgionasis.html" nodeid="103246"&gt;@GEORGE NASIS&lt;/A&gt;&lt;/P&gt;&lt;P&gt;As you are using df3 dataframe for all groupBy clauses, so &lt;A href="https://spark.apache.org/docs/latest/quick-start.html#caching" target="_blank"&gt;cache&lt;/A&gt; the &lt;STRONG&gt;"df3" dataframe&lt;/STRONG&gt; then spark will not recompute the data from file for all 16 times instead uses the cached dataframe(df3) and This will &lt;B&gt;significantly increases the performance&lt;/B&gt;.&lt;/P&gt;&lt;PRE&gt;val df3 = df2.withColumn("Age", currentDate($"BirthDate"))
df3.cache() //caching the dataframe into memory&lt;BR /&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 28 Dec 2018 11:43:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Scala-Spark-Boost-GroupBy-Computing-for-multiple-Dimensions/m-p/239534#M85825</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-12-28T11:43:24Z</dc:date>
    </item>
    <item>
      <title>Re: Scala - Spark Boost GroupBy Computing for multiple Dimensions</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Scala-Spark-Boost-GroupBy-Computing-for-multiple-Dimensions/m-p/239535#M85826</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/18929/yaswanthmuppireddy.html" nodeid="18929"&gt;@Shu&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thank you a lot!It worked like a charm!Shame on me I did not think it.&lt;/P&gt;&lt;P&gt;However, don`t you find it bad practice to compute the groubpy`s separately? I mean shouldn`t I reuse the groupby of the 4 dimensions to compute the groupby`s of the three dimensions?&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Fri, 28 Dec 2018 17:31:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Scala-Spark-Boost-GroupBy-Computing-for-multiple-Dimensions/m-p/239535#M85826</guid>
      <dc:creator>giorgionasis</dc:creator>
      <dc:date>2018-12-28T17:31:08Z</dc:date>
    </item>
    <item>
      <title>Re: Scala - Spark Boost GroupBy Computing for multiple Dimensions</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Scala-Spark-Boost-GroupBy-Computing-for-multiple-Dimensions/m-p/239536#M85827</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/103246/giorgionasis.html" nodeid="103246"&gt;@GEORGE NASIS&lt;/A&gt;&lt;P&gt;In case of caluculating average we are not going to get same value.&lt;BR /&gt;&lt;STRONG&gt;&lt;U&gt;&lt;EM&gt;Example:&lt;/EM&gt;&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Hive table data:&lt;/STRONG&gt; &lt;/P&gt;&lt;PRE&gt;select * from i;
+---------+------+--------+-------+--+
| gender  | age  | total  | name  |
+---------+------+--------+-------+--+
| M       | 10   | 12     | st    |
| F       | 10   | 8      | st    |
| M       | 12   | 15     | st    |
+---------+------+--------+-------+--+&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Caluculate Avg on one field:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;select avg(total) from i group by name;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 11.666666666666666  |
+---------------------+--+&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Calculate avg using subquery:&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;select avg(tt) from (
        select name,avg(total)tt from i group by gender,name)t &lt;BR /&gt;group by name;
+--------+--+
|  _c0   |
+--------+--+
| 10.75  |
+--------+--+&lt;/PRE&gt;&lt;P&gt;as we are doing 2 dimensions in innerquery then 1D group by on outer query we can see the difference between averages because number of rows are 2 in outer query instead of 3.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;EM&gt;&lt;U&gt;Correct subquery would be:&lt;/U&gt;&lt;/EM&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;select sum(tt)/sum(cnt) from(&lt;BR /&gt;                   select name,sum(total)tt,count(*)cnt,avg(total) from i group by gender,name //add cnt column,sum,avg)t &lt;BR /&gt;group by name;
+---------------------+--+
|         _c0         |
+---------------------+--+
| 11.666666666666666  |
+---------------------+--+&lt;/PRE&gt;&lt;P&gt;You can try with this approach (adding&lt;STRONG&gt; count column then sum(total) divide sum(count)&lt;/STRONG&gt; to get correct avg results) and cache the most appropriate dataframe then spark optimizer will choose the most efficient plan to run these tasks.&lt;/P&gt;</description>
      <pubDate>Sat, 29 Dec 2018 03:23:44 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Scala-Spark-Boost-GroupBy-Computing-for-multiple-Dimensions/m-p/239536#M85827</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-12-29T03:23:44Z</dc:date>
    </item>
  </channel>
</rss>

