Having run a number of Spark jobs locally, on Spark Standalone clusters, and on HDP YARN clusters, I have found a few JVM settings that help with debugging non-production jobs and assist with better garbage collection. This is important even with off-heap storage and bare-metal optimizations.

spark-submit  --driver-java-options "-XX:+PrintGCDetails -XX:+UseG1GC -XX:MaxGCPauseMillis=400" 

You can also set extra options through the runtime environment configuration (see the Spark documentation).
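
For example, here is a minimal sketch using the standard runtime-environment keys (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions); the application class and jar names are hypothetical:

spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+UseG1GC -XX:MaxGCPauseMillis=400" \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+UseG1GC -XX:MaxGCPauseMillis=400" \
  --class com.hortonworks.myapp.LogsAnalysis \
  myapp.jar

The same two properties can also go into conf/spark-defaults.conf if you want them applied to every submission.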

For HDP / Spark, you can add this from Ambari.

[Screenshot: Ambari Spark configuration, spark-env template (4361-ambarisparkenv.png)]
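
As a sketch, assuming you edit the spark-env template in Ambari (Spark > Configs > Advanced spark-env) and want spark-submit launched with these options by default, a line like the following could be appended; SPARK_SUBMIT_OPTS is picked up by the JVM that spark-submit starts in client mode:

# Assumption: added to the spark-env template managed by Ambari
export SPARK_SUBMIT_OPTS="-XX:+PrintGCDetails -XX:+UseG1GC -XX:MaxGCPauseMillis=400"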

In your Scala Spark Program:

sparkConf.set("spark.cores.max", "4")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.eventLog.enabled", "true")
sparkConf.set("spark.app.id", "MyAppIWantToFind")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "false")
sparkConf.set("spark.suffle.compress", "true")

Make sure Tungsten is on, the KryoSerializer is in use, the event log is enabled, and use logging:

Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)

val log = Logger.getLogger("com.hortonworks.myapp")
log.info("Started Logs Analysis")

Also, whenever possible, apply relevant filters to your datasets as early as you can, for example: "filter(!_.clientIp.equals("Empty"))".
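
A minimal sketch of that pattern, assuming a hypothetical parseLogLine function that turns raw lines into the LogRecord class from the earlier sketch:

import org.apache.spark.rdd.RDD

// parseLogLine and the input path are hypothetical; filter early, before expensive transformations
val logs: RDD[LogRecord] = sc.textFile("hdfs:///logs/access.log").map(parseLogLine)
val filtered = logs.filter(!_.clientIp.equals("Empty"))
log.info(s"Records after filtering: ${filtered.count()}")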
