
Having run a number of Spark jobs locally, on Spark Standalone clusters, and on HDP YARN clusters, I have found a few JVM settings that help with debugging non-production jobs and improve garbage collection. This is important even with off-heap storage and bare-metal optimizations.

spark-submit  --driver-java-options "-XX:+PrintGCDetails -XX:+UseG1GC -XX:MaxGCPauseMillis=400" 

You can also set extra options in the runtime environment (see the Spark documentation).

For Spark on HDP, you can add these settings from Ambari.
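As a sketch, the same GC flags can be applied cluster-wide through the standard `spark-defaults` properties (the flag values below simply mirror the spark-submit example above; tune them for your workload):

```
spark.driver.extraJavaOptions    -XX:+PrintGCDetails -XX:+UseG1GC -XX:MaxGCPauseMillis=400
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+UseG1GC -XX:MaxGCPauseMillis=400
```

Note that `--driver-java-options` only affects the driver JVM; executors pick up their flags from `spark.executor.extraJavaOptions`.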


In your Scala Spark Program:

import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

val sparkConf = new SparkConf()
sparkConf.set("spark.cores.max", "4")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.eventLog.enabled", "true")
sparkConf.set("spark.app.name", "MyAppIWantToFind")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "false")
sparkConf.set("spark.shuffle.compress", "true")

Make sure Tungsten is on, the KryoSerializer is set, the event log is enabled, and that you use logging.
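One caveat: with `spark.eventLog.enabled` set to true, Spark also needs an event log directory that the history server can read. A minimal sketch (the HDFS path here is only an example; use whatever directory your cluster has provisioned):

```
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-history
```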


import org.apache.log4j.Logger

val log = Logger.getLogger("com.hortonworks.myapp")
log.info("Started Logs Analysis")

Also, whenever possible, include relevant filters on your datasets, e.g. filter(!_.clientIp.equals("Empty")), so unusable records are dropped before expensive shuffles.
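To illustrate the filtering idea outside Spark, here is a minimal plain-Scala sketch; the `LogRecord` case class and the "Empty" sentinel for a missing client IP are hypothetical, but the same predicate works unchanged on an RDD or Dataset of such records:

```scala
// Hypothetical log record type; clientIp holds "Empty" when the field was missing.
case class LogRecord(clientIp: String, url: String)

object FilterExample {
  // Dropping unusable rows early shrinks every downstream shuffle and cache.
  def keepUsable(records: Seq[LogRecord]): Seq[LogRecord] =
    records.filter(!_.clientIp.equals("Empty"))

  def main(args: Array[String]): Unit = {
    val records = Seq(
      LogRecord("10.0.0.1", "/index.html"),
      LogRecord("Empty", "/favicon.ico")
    )
    println(keepUsable(records).size) // prints 1
  }
}
```

The earlier such a filter runs in the job, the less data every later stage has to serialize, compress, and shuffle.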

Version history: revision 2 of 2, last updated ‎08-17-2019 12:25 PM.