I have two Spark applications writing data to the same directory on HDFS. Whichever application finishes first deletes the shared `_temporary` working directory, which still contains temp files belonging to the other application.
Can I specify a separate `_temporary` directory for each Spark application?
Thanks @Jagadeesan A S
`_temporary` is a temp directory created under the output path of `df.write.parquet(path)` on HDFS. However, the default value of `spark.local.dir` is `/tmp`, and the documentation says:
Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system.
So it should be a directory on the local file system. I am not sure `spark.local.dir` refers to the temp directory Spark uses when writing to HDFS...
That's true, that property is for the local filesystem. For HDFS, could you try using `Append` instead of `Overwrite`? The problem with this is that we would need to delete files from the temp directory manually.
My current save mode is already append. My Spark Streaming apps run every 5 minutes, so deleting files manually is not practical... I think the better solution is to customize the temp location.
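One common workaround for this collision is to give each application its own private staging directory (so each gets its own `_temporary`), and only move the finished part files into the shared directory afterwards. Below is a minimal sketch of that idea using local-filesystem calls; the function name `write_then_publish` and the paths are hypothetical, and on HDFS you would use Hadoop's `FileSystem.rename` instead of `shutil.move`:

```python
import os
import shutil
import tempfile
import uuid

def write_then_publish(write_fn, final_dir):
    # Each application writes into its own private staging directory, so the
    # _temporary folder created there can never collide with another app's.
    staging = tempfile.mkdtemp(prefix="staging-")
    write_fn(staging)  # e.g. lambda d: df.write.mode("append").parquet(d)
    os.makedirs(final_dir, exist_ok=True)
    for name in os.listdir(staging):
        if name.startswith("_"):  # skip _SUCCESS and any _temporary leftovers
            continue
        # Prefix each file with a unique id so files from the two apps,
        # which may reuse part numbers, cannot clash in the shared directory.
        shutil.move(os.path.join(staging, name),
                    os.path.join(final_dir, uuid.uuid4().hex + "-" + name))
    shutil.rmtree(staging, ignore_errors=True)
```

Since each app commits into its own staging path, the fast app can no longer delete a `_temporary` directory the slow app is still using.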
Or can I set an offset on the scheduled run times? For example, my two apps currently both run every 5 minutes, at 0, 5, 10, 15, 20...
Can I set a schedule so that one still runs at 0, 5, 10, 15, and the other runs at 2.5, 7.5, 12.5?