Created 11-10-2023 12:41 AM
Hello, I'm new to Hadoop and just want to know: can I use HDFS as local storage in my Spark driver?
For example: I'm sending a task through Livy where
Created 11-15-2023 07:50 AM
Hello @one4like ,
Pushing every local file of a job to HDFS will cause issues, especially in larger clusters. Local directories are used as scratch space: mapper spills and shuffle files are written there, and moving that traffic onto the network would hurt performance. Keeping scratch and shuffle files on local storage is done exactly to prevent this. It also has security implications, as the NodeManager would then push each application's keys to a network location that could be accessible to others.
A far better solution is to take advantage of the fact that yarn.nodemanager.local-dirs can point to multiple mount points, spreading the load across all of them.
So the answer is no: local-dirs must contain a list of local paths. There is an explicit check in the code that only allows the local file system to be used.
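As a sketch of the multi-mount-point setup described above, the yarn-site.xml entry might look like this (the /data1 and /data2 mount points are hypothetical, for illustration only):

```xml
<!-- yarn-site.xml: spread NodeManager scratch/shuffle I/O across
     multiple local disks by listing one directory per mount point.
     The paths below are hypothetical examples. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data1/yarn/local,/data2/yarn/local</value>
</property>
```

Each entry must be a path on the local file system; an hdfs:// URI here would be rejected.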
See here:
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoo... Please note that an exception is thrown when a non-local file system is referenced.
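To illustrate the kind of check involved (this is a simplified sketch, not the actual Hadoop source), the validation amounts to rejecting any configured directory whose URI scheme is not the local file system:

```java
import java.net.URI;
import java.util.List;

// Illustrative sketch only: validate that every configured local dir is on
// the local file system. Hadoop's NodeManager performs a similar check and
// throws when a non-local scheme (e.g. hdfs://) appears in
// yarn.nodemanager.local-dirs.
public class LocalDirCheck {
    public static void validate(List<String> dirs) {
        for (String dir : dirs) {
            String scheme = URI.create(dir).getScheme();
            // Plain paths (no scheme) and file:// URIs are local;
            // anything else is rejected.
            if (scheme != null && !scheme.equals("file")) {
                throw new IllegalArgumentException(
                        "Local dir must be on the local file system: " + dir);
            }
        }
    }

    public static void main(String[] args) {
        // Accepted: local paths and file:// URIs.
        validate(List.of("/data1/yarn/local", "file:///data2/yarn/local"));
        // Rejected: an HDFS location.
        try {
            validate(List.of("hdfs://namenode:8020/yarn/local"));
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```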
If you found this response helpful, please take a moment to log in and click KUDOS 🙂 and "Accept as Solution" below this post.
Thank you.
Bjagtap
Created 11-15-2023 01:40 AM
@one4like, Welcome to our community! To help you get the best possible answer, I have tagged our Spark experts @RangaReddy @Babasaheb who may be able to assist you further.
Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.
Regards,
Vidya Sargur