Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here. Want to know more about what has changed? Check out the Community News blog.

Spark Optimization

Spark Optimization

New Contributor

We are currently working on POC based on Spark and Scala.
we have to read 18million records from parquet file and perform the 25 user defined aggregation based on grouping keys.
we have used spark high level Dataframe API for the aggregation. On cluster of two node we could finish end to end job ((Read+Aggregation+Write))in 2 min

Cluster Information:
Number of Node:2
Total Core:28Core
Total RAM:128GB

Tuning Parameter:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.default.parallelism 24
spark.sql.shuffle.partitions 24
spark.executor.extraJavaOptions -XX:+UseG1GC
spark.speculation true
spark.executor.memory 16G
spark.driver.memory 8G
spark.sql.codegen true
spark.sql.inMemoryColumnarStorage.batchSize 100000
spark.locality.wait 1s
spark.ui.showConsoleProgress false
Please let us know, If you have any ideas/tuning parameter that we can use to finish the job in less than one min.