
Using HDFS as local storage for yarn cluster driver

New Contributor

Hello, I'm new to Hadoop and just want to know: can I use HDFS as local storage for my Spark driver?

For example, I'm sending a task through Livy with "kind":"pyspark" and a "code" field containing some operations that should create a new file as a result.
When I run it in YARN cluster mode, I find that the new file is created in the local storage of a node, at a path like /tmp/hadoop-username/nm-local-dir/usercache/root/appcache...
Is there any way to set an HDFS path instead of the local one?
I want to save my Spark results (the newly created file) in HDFS.
When I set spark.local.dir or yarn.nodemanager.local-dirs = hdfs:///temp, the Livy session just doesn't start.
Mounting HDFS with fuse-dfs doesn't seem like the best way.
Or should I use my own fileApp.jar that would run on each node and in each session?
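
For reference, here is a minimal sketch of the Livy flow described above; the Livy host, the submitted code, and the file name are placeholders rather than details from this post:

# Sketch of the described Livy flow (host, code, and file name are assumed placeholders).
import time
import requests

LIVY_URL = "http://livy-server:8998"  # assumed Livy endpoint

# PySpark "code" that creates a file; plain file I/O like this lands in the
# container's local scratch directory (nm-local-dir), not in HDFS.
code = """
with open('result.txt', 'w') as f:
    f.write('some result')
"""

# Create an interactive PySpark session ("kind": "pyspark") ...
session = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()

# ... wait until it is idle ...
while requests.get(f"{LIVY_URL}/sessions/{session['id']}").json()["state"] != "idle":
    time.sleep(5)

# ... then submit the code as a statement.
statement = requests.post(
    f"{LIVY_URL}/sessions/{session['id']}/statements", json={"code": code}
).json()
print(statement)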
 
1 ACCEPTED SOLUTION

Super Collaborator

Hello @one4like ,

Pushing every local file of a job to HDFS will cause issues, especially in larger clusters. Local directories are used as scratch space: mapper spills are written there, and moving that traffic onto the network would hurt performance. Scratch and shuffle files are kept on local storage precisely to avoid this. It also has security implications, because the NodeManager would then push the keys for each application to a network location that could be accessible to others.

A far better approach is to use the fact that yarn.nodemanager.local-dirs can point to multiple mount points, spreading the load across all of them.
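
For illustration, this is the kind of yarn-site.xml entry meant here; the mount point paths are examples only and depend on the disks available on each node:

<!-- yarn-site.xml: spread NodeManager scratch space across several local disks (example paths) -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data/1/yarn/local,/data/2/yarn/local,/data/3/yarn/local</value>
</property>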

So the answer is no: local-dirs must contain a list of local paths. There is an explicit check in the code that only allows a local filesystem to be used.

See here:

https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoo... Please note that an exception is thrown when a non-local file system is referenced.

If this response assisted with your query, please take a moment to log in and click on KUDOS 🙂 and "Accept as Solution" below this post.

Thank you.

Bjagtap


2 REPLIES

Community Manager

@one4like, Welcome to our community! To help you get the best possible answer, I have tagged our Spark experts @RangaReddy and @Babasaheb, who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
