
Recommended size for yarn.nodemanager.resource.local-dirs?

Contributor

Folks,

What is the recommended value for "yarn.nodemanager.resource.local-dirs"?

We have only one directory configured for this property, with 200 GB of space available.

Our Hive jobs' map/reduce intermediate output fills this directory up, and YARN then places the node on the blocklist. Switching to the Tez engine and/or increasing the quota may fix this, but we'd like to know the recommended value.

1 ACCEPTED SOLUTION

Super Collaborator

If you use the same partitions for YARN intermediate data as for the HDFS blocks, you might also consider setting the dfs.datanode.du.reserved property, which reserves space on those partitions for non-HDFS use (such as YARN intermediate data).
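
For reference, a minimal hdfs-site.xml sketch; the 50 GB figure is only an illustrative assumption, not something recommended in this thread, and the value is in bytes per volume:

    <property>
      <name>dfs.datanode.du.reserved</name>
      <!-- Illustrative only: reserve ~50 GB per volume for non-HDFS data
           such as YARN local/intermediate files (50 * 1024^3 bytes). -->
      <value>53687091200</value>
    </property>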

A rule of thumb I picked up in my first Hadoop training, a long time ago, was to dedicate 25% of the "data disks" to that kind of intermediate data. The optimal answer really depends on the maximum amount of intermediate data you can have at any one time (when launching a job, do you use all the data in HDFS as input?), so size the space for yarn.nodemanager.resource.local-dirs accordingly.
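
As a rough, purely illustrative calculation (the disk count and sizes below are assumptions, not numbers from this thread): on a node with twelve 4 TB data disks, the 25% rule of thumb works out to

    12 disks * 4 TB * 0.25 = 12 TB per node (about 1 TB per disk)

available for YARN intermediate data across the local-dirs.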

I would also recommend enabling mapreduce.map.output.compress to reduce the size of the intermediate data.
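
For example, a mapred-site.xml sketch; the codec choice is my assumption (Snappy is a common low-CPU option), so substitute whatever codec your cluster ships with:

    <property>
      <name>mapreduce.map.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <!-- Codec is an assumption; any codec available on the cluster works. -->
      <name>mapreduce.map.output.compress.codec</name>
      <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>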


2 REPLIES

Rising Star

You would assign one folder to each of the DataNode disks, closely mirroring dfs.datanode.data.dir. On a 12-disk system you would have 12 YARN local-dir locations.
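
As an illustration, a yarn-site.xml sketch for a 12-disk node; the /grid/0 .. /grid/11 mount points are hypothetical, so mirror whatever paths you already use for dfs.datanode.data.dir:

    <property>
      <name>yarn.nodemanager.resource.local-dirs</name>
      <!-- One entry per physical data disk, alongside the DataNode directories. -->
      <value>/grid/0/yarn/local,/grid/1/yarn/local,/grid/2/yarn/local,/grid/3/yarn/local,/grid/4/yarn/local,/grid/5/yarn/local,/grid/6/yarn/local,/grid/7/yarn/local,/grid/8/yarn/local,/grid/9/yarn/local,/grid/10/yarn/local,/grid/11/yarn/local</value>
    </property>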
