Community Articles
Find and share helpful community-sourced technical articles
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.
Labels (1)
Super Guru

Having run a bunch of Spark jobs locally, in Spark Standalone clusters and in HDP Yarn Clusters; I have found a few JVM settings that helped with debugging non-production jobs and assist with better Garbage Collection. This is important even with off-heap storage and bare metal optimizations.

spark-submit  --driver-java-options "-XX:+PrintGCDetails -XX:+UseG1GC -XX:MaxGCPauseMillis=400" 

You can also set options extra options in the runtime environment (see Spark Documentation).

For HDP / Spark, you can add this from Ambari.

4361-ambarisparkenv.png

In your Scala Spark Program:

sparkConf.set("spark.cores.max", "4")
sparkConf.set("spark.serializer", classOf[KryoSerializer].getName)
sparkConf.set("spark.sql.tungsten.enabled", "true")
sparkConf.set("spark.eventLog.enabled", "true")
sparkConf.set("spark.app.id", "MyAppIWantToFind")
sparkConf.set("spark.io.compression.codec", "snappy")
sparkConf.set("spark.rdd.compress", "false")
sparkConf.set("spark.suffle.compress", "true")

Make sure you have Tungsten on, the KryoSerializer, eventLog enabled and use Logging.

Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.ERROR)

val log = Logger.getLogger("com.hortonworks.myapp")
log.info("Started Logs Analysis")

Also, whenever possible include relevant filters on your datasets: "filter(!_.clientIp.equals("Empty"))".

1,064 Views
0 Kudos
Don't have an account?
Coming from Hortonworks? Activate your account here
Version history
Revision #:
2 of 2
Last update:
‎08-17-2019 12:25 PM
Updated by:
 
Contributors
Top Kudoed Authors