04-26-2018 09:21 AM - last edited on 04-26-2018 12:18 PM by cjervis
We are currently working on a POC based on Spark and Scala.
We have to read 18 million records from a Parquet file and perform 25 user-defined aggregations based on grouping keys.
We used the Spark high-level DataFrame API for the aggregation. On a two-node cluster we can finish the end-to-end job (read + aggregation + write) in 2 minutes.
Number of nodes: 2
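For reference, a minimal sketch of the kind of job described above, assuming a Parquet source and a groupBy-then-aggregate pipeline (the path, column names, and the stand-in aggregations below are hypothetical; the actual job uses 25 user-defined aggregations):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object AggPoc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-agg-poc")
      .getOrCreate()

    // Read the ~18M-record Parquet file (path is hypothetical).
    val df = spark.read.parquet("/data/input.parquet")

    // Group by the grouping keys and aggregate. sum/avg here are
    // placeholders for the 25 user-defined aggregations in the POC.
    val result = df
      .groupBy("key1", "key2")
      .agg(
        sum("value1").as("total_value1"),
        avg("value2").as("avg_value2")
      )

    // Write the aggregated result back out as Parquet.
    result.write.mode("overwrite").parquet("/data/output.parquet")

    spark.stop()
  }
}
```

With this shape, the usual knobs are the shuffle partition count (`spark.sql.shuffle.partitions`) and executor sizing, since the groupBy triggers a shuffle across the two nodes.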
Please let us know if you have any ideas or tuning parameters we could use to finish the job in less than one minute.