Using HDFS as local storage for yarn cluster driver

one4like — Fri, 10 Nov 2023 08:41:15 GMT

Hello, I'm new to Hadoop and just want to know: can I use HDFS as local storage for my Spark driver?

For example: I'm sending a task through Livy with "kind":"pyspark" and a "code" field containing some operations that should create a new file as a result.

When I do this in YARN cluster mode, I find that the new file is created in the local storage of a node, under a path like: /tmp/hadoop-username/nm-local-dir/usercache/root/appcache......

Is there any way to set this path to HDFS instead of local storage?

I want to save my Spark results (the newly created file) in HDFS.

When I set spark.local.dir or yarn.nodemanager.local-dirs = hdfs:///temp, the Livy session just does not start.

Mounting HDFS with dfs-fuse does not seem like the best way.

Or should I use my own fileApp.jar that will work on each node and in each session?

Re: Using HDFS as local storage for yarn cluster driver

VidyaSargur — Wed, 15 Nov 2023 09:40:55 GMT

@one4like, Welcome to our community! To help you get the best possible answer, I have tagged our Spark experts @RangaReddy @Babasaheb who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.

Re: Using HDFS as local storage for yarn cluster driver

Babasaheb — Wed, 15 Nov 2023 15:50:33 GMT

Hello @one4like ,

Pushing every local file of a job to HDFS will cause issues, especially in larger clusters. Local directories are used as scratch space: mapper spills are written there, and moving that traffic onto the network would hurt performance. Scratch files and shuffle files are kept on local storage precisely to avoid this. There is also a security impact, as the NodeManager (NM) would then push the keys for each application to a network location that could be accessible to others.

A far better solution is to use the fact that yarn.nodemanager.local-dirs can point to multiple mount points, which spreads the load across all of them.

So the answer is NO. local-dirs must contain a list of local paths. There's an explicit check in code which only allows local FS to be used.
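To make the rule concrete, here is a rough Python analogue of that kind of check. It is only a sketch and not the Hadoop source: the function name validate_local_dirs and the example mount points are made up for illustration. It splits a comma-separated local-dirs value and rejects any entry whose URI scheme is neither empty nor "file".

```python
from typing import List
from urllib.parse import urlparse


def validate_local_dirs(value: str) -> List[str]:
    """Return the local paths found in a comma-separated local-dirs value.

    Hypothetical helper for illustration only. An entry is accepted when it
    has no URI scheme or the "file" scheme; anything else raises ValueError.
    """
    valid = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty entries such as a trailing comma
        parsed = urlparse(entry)
        if parsed.scheme in ("", "file"):
            valid.append(parsed.path)
        else:
            raise ValueError(
                f"{entry} is not a valid path: expected file scheme or no scheme"
            )
    return valid


# Several local mount points (example paths) are fine:
print(validate_local_dirs("/data1/yarn/local,file:///data2/yarn/local"))

# A value like the one from the question is rejected:
try:
    validate_local_dirs("hdfs:///temp")
except ValueError as err:
    print("rejected:", err)
```

With a check of this kind in place, a value such as hdfs:///temp is refused up front, which would be consistent with the session failing to start.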

See here:

https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LocalDirsHandlerService.java#L224

Please note that an exception is thrown when a non-local file system is referenced.
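Since the local dirs have to stay on local disks, results that should end up in HDFS are usually written there explicitly from the job code instead. A minimal sketch, assuming the result is a Spark DataFrame: the helper name save_result and the target path are hypothetical, and only the standard DataFrameWriter calls (write, mode, parquet) are used. A file created with plain Python file I/O in the driver would still land in the container's local working directory.

```python
def save_result(df, target="hdfs:///tmp/livy_results/output"):
    """Write a Spark DataFrame to an explicit HDFS location.

    Hypothetical helper: the default target is an example path, not
    something taken from the original question.
    """
    if not target.startswith("hdfs://"):
        raise ValueError("target must be an hdfs:// URI")
    # The writer runs on the cluster and stores the data in HDFS,
    # so nothing depends on the node-local appcache directory.
    df.write.mode("overwrite").parquet(target)
    return target


# Possible use inside a Livy pyspark session, where `spark` already exists:
#   df = spark.range(10)
#   save_result(df, "hdfs:///tmp/livy_results/output")
```

The scheme check is only there to catch an accidental local path early; an hdfs:/// URI without a host relies on the cluster's configured default file system.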

If this response helped with your query, please take a moment to log in and click on KUDOS 🙂 and "Accept as Solution" below this post.

Thank you.

Bjagtap