Created 05-18-2016 06:42 PM
Hi Guys,
We have successfully configured Spark on YARN using Ambari on HDP 2.4 with the default parameters. However, I would like to know which parameters we can tune for best performance. Should we have separate queues for Spark jobs? The use cases are yet to be decided, but the primary goals are to replace old MR jobs, experiment with Spark Streaming, and probably also use DataFrames. How many Spark Thrift Server instances are recommended?
The cluster is 20 nodes, each with 256 GB RAM and 36 cores. Load from other jobs is generally around 5%.
Many thanks.
Created 05-18-2016 09:54 PM
Please see the Running Spark in Production session from Hadoop Summit, Dublin. See the section on performance tuning.
Created 05-18-2016 08:06 PM
Below is the official doc for Spark tuning on YARN:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_spark-guide/content/ch_tuning-spark.html
Generally we see people create queues to segregate resources between different department groups within a company, or on the basis of application type, such as ETL, real-time, and so on. Therefore it depends on what your use case is and how you are going to share cluster resources between groups/applications. For the Spark Thrift Server it is better to have a single instance within a cluster unless you have hundreds of Thrift clients running and submitting jobs at the same time.
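As an illustrative sketch only: assuming a dedicated Capacity Scheduler queue (hypothetically named "spark" here) has already been created in YARN, a job can be pointed at it with the spark.yarn.queue property.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: submit a Spark-on-YARN job into a dedicated Capacity Scheduler queue.
// The queue name "spark" is an assumption; replace it with whatever queue you define.
val conf = new SparkConf()
  .setAppName("etl-job")
  .set("spark.yarn.queue", "spark")

val sc = new SparkContext(conf)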
Created 05-19-2016 04:40 AM
If you have 256 GB/node, leave at least 2 GB and 1 core for the OS, more if something else is running on the node. Then start with 5 cores/executor and 30 GB/executor, which gives about 7 executors/node.
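As a sketch under the sizing above (5 cores and 30 GB per executor, roughly 7 executors per node across the 20 nodes), the equivalent SparkConf settings could look like this; all figures are assumptions to adjust for your workload.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only, using the sizing above: 5 cores and 30 GB per executor,
// about 7 executors per node on 20 nodes, leaving one slot for the YARN
// ApplicationMaster. All numbers are assumptions to tune per workload.
val conf = new SparkConf()
  .setAppName("sized-job")
  .set("spark.executor.cores", "5")
  .set("spark.executor.memory", "30g")
  .set("spark.executor.instances", "139")            // ~7/node * 20 nodes, minus one for the AM
  .set("spark.yarn.executor.memoryOverhead", "3072") // off-heap overhead in MB (~10% of executor memory)

val sc = new SparkContext(conf)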
Created 05-19-2016 11:37 AM
Thanks @vshukla, @Timothy Spann, @Jitendra Yadav, @Yuta Imai