How to change Spark _temporary directory when writing data?

Rising Star

I have two Spark applications writing data to the same directory on HDFS. Whichever app finishes first deletes the _temporary working directory, which still contains temp files belonging to the other app.

So can I specify a separate _temporary directory for each Spark application?

6 REPLIES

Master Collaborator

@Junfeng Chen

You can change the path to the temp folder for each Spark application with the spark.local.dir property, like below:

SparkConf conf = new SparkConf().setMaster("local").setAppName("test").set("spark.local.dir", "/tmp/spark-temp");

Please accept the answer you found most useful

Rising Star

Thanks @Jagadeesan A S

_temporary is a temp directory created under the path given to df.write.parquet(path) on HDFS. However, the default value of spark.local.dir is /tmp, and the documentation says:

Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system.

So it should be a directory on the local file system. I am not sure spark.local.dir refers to the temp directory Spark uses when writing...
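
To illustrate the two locations in question, here is a minimal, hedged sketch (all paths and the app name are hypothetical): spark.local.dir relocates local-disk scratch space, while _temporary is created by Hadoop's FileOutputCommitter under the output path itself.

import org.apache.spark.sql.SparkSession;

public class TempDirSketch {
    public static void main(String[] args) {
        // spark.local.dir points at local-disk scratch space (shuffle files,
        // RDD blocks spilled to disk) on each worker, not at anything on HDFS.
        SparkSession spark = SparkSession.builder()
                .appName("TempDirSketch")                      // hypothetical name
                .config("spark.local.dir", "/tmp/spark-temp")  // local-disk scratch
                .getOrCreate();

        // Hadoop's FileOutputCommitter stages the job's files under the output
        // path itself, i.e. hdfs:///data/out/_temporary, before the final move.
        spark.read().json("hdfs:///data/in")   // hypothetical input path
             .write()
             .parquet("hdfs:///data/out");     // hypothetical output path

        spark.stop();
    }
}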

Master Collaborator

@Junfeng Chen

That's true, the above property is for the local filesystem. For HDFS, could you try using Append instead of Overwrite? The problem with that, though, is that old files then need to be deleted manually from the directory.
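
As a hedged sketch (assuming df is a Dataset<Row> already built in your job; the output path is made up), the save mode is switched like this:

import org.apache.spark.sql.SaveMode;

df.write().mode(SaveMode.Append).parquet("hdfs:///data/out"); // Append adds new files instead of replacing the directory's contents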

Rising Star

Hi @Jagadeesan A S

My current save mode is already Append. My Spark Streaming apps run every 5 minutes, so deleting files manually is not practical... So I think the better solution is to customize the temp location.

Or can I set an offset for the scheduled run time? For example, my two apps currently run every 5 minutes, i.e. at 0, 5, 10, 15, 20.

Can I set a schedule so that one still runs at 0, 5, 10, 15, and the other runs at 2.5, 7.5, 12.5?
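
One way to customize the location (a hedged sketch of the staging-directory idea, not something from the replies above; all paths are hypothetical) is to let each app write to its own staging directory, so each job gets a private _temporary, and then move the finished files into the shared directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StagingMove {
    public static void main(String[] args) throws Exception {
        Path staging = new Path("hdfs:///data/staging/app-1"); // hypothetical per-app dir
        Path shared  = new Path("hdfs:///data/out");           // hypothetical shared dir

        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus f : fs.listStatus(staging)) {
            // Skip committer markers such as _SUCCESS; rename is a cheap,
            // metadata-only move within a single HDFS namespace.
            if (!f.getPath().getName().startsWith("_")) {
                fs.rename(f.getPath(), new Path(shared, f.getPath().getName()));
            }
        }
    }
}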

New Contributor

Did you ever figure out a solution? I am facing the same issue.

Contributor

Hi @Siddu198 

Add this config to your job (with the spark.hadoop. prefix so it actually reaches Hadoop's output committer when set on the SparkConf):

set("mapreduce.fileoutputcommitter.algorithm.version","2")