I'm running an application with spark-submit. The application uses both Scala and Java. The spark-submit command specifies the location of the jar file with --jars.
I'm seeing something strange: even though I modify my Java files and build new jar files, the cluster sometimes uses my older jar files. It is as if the cluster has a cached copy of my old jar file.
Can someone please educate me on where to look for older or cached jar files and clean them up?
PS: I'm using Cloudera 5.5.1 with Spark 1.5.0.
Which YARN mode are you using, yarn-client or yarn-cluster? Where is the jar you are trying to load: is it local to the driver or in HDFS? Are you shutting down the Spark context or trying to add the jar programmatically?
Look at the log messages. In the driver log you will see "Added JAR" followed by your jar file name, and you will see an error if there was an issue loading your jar, such as it already existing in the Spark file server. In the container logs you will find messages like "Fetching", "Copying", and "Adding file". If the jar file is cached, the "Fetching" message will be missing. There is also an overwrite option that deletes and replaces the file if it already exists.
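To make the log check a bit more concrete, here is a minimal Java sketch that scans a saved log file for the message fragments mentioned above. It assumes only that the relevant lines contain the quoted substring ("Added JAR", "Fetching") together with the jar file name; the exact log layout is not assumed. The class name JarLogScan and its methods are hypothetical helpers, not part of Spark.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark: searches log lines for jar-related messages.
final class JarLogScan {

    /** Returns the log lines that contain both the marker text and the jar name. */
    static List<String> linesMentioning(List<String> logLines, String marker, String jarName) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            if (line.contains(marker) && line.contains(jarName)) {
                hits.add(line);
            }
        }
        return hits;
    }

    /**
     * True when no "Fetching" line mentions the jar. Per the advice above,
     * this suggests (but does not prove) that a cached copy was reused.
     */
    static boolean looksCached(List<String> executorLog, String jarName) {
        return linesMentioning(executorLog, "Fetching", jarName).isEmpty();
    }

    // Usage: java JarLogScan <path-to-log-file> <jar-file-name>
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        String jarName = args[1];
        System.out.println("Added JAR lines: " + linesMentioning(lines, "Added JAR", jarName));
        System.out.println("No Fetching line (possibly cached): " + looksCached(lines, jarName));
    }
}
```

Run it once against the driver log and once against a container log. A container log with no "Fetching" line for your jar is a hint that an older local copy may have been used.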
It is sometimes useful to version your jars, which makes it easier to determine whether an older version is being used.
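One way to check which jar and version are actually in use is to ask the JVM where a class was loaded from. The sketch below uses only standard Java APIs. It assumes your build writes an Implementation-Version entry into the jar manifest, and it reports "unknown" otherwise. The class name JarOrigin is hypothetical.

```java
import java.net.URL;
import java.security.CodeSource;

// Hypothetical helper: reports where a class was loaded from and its manifest version.
final class JarOrigin {

    /** Returns the location (usually a jar URL) the class was loaded from, or "unknown". */
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null) {
            return "unknown"; // e.g. classes loaded by the bootstrap class loader
        }
        URL location = source.getLocation();
        return location == null ? "unknown" : location.toString();
    }

    /** Returns Implementation-Version from the jar manifest, or "unknown" if it is not set. */
    static String versionOf(Class<?> cls) {
        Package pkg = cls.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return version == null ? "unknown" : version;
    }

    public static void main(String[] args) {
        // Replace JarOrigin.class with a class from your own jar.
        System.out.println("loaded from: " + locationOf(JarOrigin.class));
        System.out.println("version:     " + versionOf(JarOrigin.class));
    }
}
```

Logging these two values from the driver at startup, and from inside a task on the executors, should show whether a stale jar is being picked up.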
You can use the configuration "spark.files.overwrite" to control whether files distributed through Spark are overwritten. Please see the executor configuration documentation for the default behavior; the default is currently not to overwrite files.
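As a rough illustration, the setting can be passed on the command line with spark-submit's --conf option. The Java sketch below only assembles such a command as a list of arguments and does not launch anything. The main class, jar names, and the SubmitCommand helper are all hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: assembles a spark-submit argument list with spark.files.overwrite set.
final class SubmitCommand {

    static List<String> build(String mainClass, String appJar,
                              List<String> extraJars, boolean overwrite) {
        List<String> cmd = new ArrayList<>();
        cmd.add("spark-submit");
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add("--conf");
        cmd.add("spark.files.overwrite=" + overwrite);
        if (!extraJars.isEmpty()) {
            cmd.add("--jars");
            cmd.add(String.join(",", extraJars)); // --jars takes a comma-separated list
        }
        cmd.add(appJar); // the application jar comes after the options
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder names; substitute your own class and (versioned) jar files.
        List<String> cmd = build("com.example.Main", "app-1.1.jar",
                Arrays.asList("helpers-1.1.jar"), true);
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list could be handed to a ProcessBuilder, or the printed line pasted into a shell.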