Created 04-15-2016 06:11 PM
Created 04-15-2016 11:19 PM
All nodes:
Enough space for logs in /var/log ( 100GB? ), and also enough space for /var and /usr.
At the least, the logs should have their own logical partition, since it is annoying when they fill up.
Namenode:
Disks should be RAIDed. It is good practice to keep a separate partition for the Hadoop files ( /hadoop? ). A couple of hundred GB should be sufficient; the disk requirements are not huge.
DataNodes:
Two OS disks can be RAIDed for additional resiliency, or just one drive can be reserved for the OS to leave more data drives.
All other drives are data drives and should be non-RAIDed, simply added as simple volumes:
/grid/x
ext4 is good for the data drives; the disks should be mounted with noatime.
Some more details here:
https://community.hortonworks.com/articles/14508/best-practices-linux-file-systems-for-hdfs.html
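As a sketch of the mount layout described above (device names, mount points, and options are illustrative assumptions, not from the thread):

```shell
# Hypothetical /etc/fstab entries for unraided data drives at /grid/0..n.
# ext4 with noatime so reads don't trigger metadata writes; nofail so one
# dead disk doesn't block boot:
#
#   /dev/sdb1  /grid/0  ext4  defaults,noatime,nofail  0 0
#   /dev/sdc1  /grid/1  ext4  defaults,noatime,nofail  0 0
#
# Formatting and mounting one such drive (requires root):
#   mkfs.ext4 -m 0 /dev/sdb1       # -m 0: no reserved root blocks on a pure data disk
#   mkdir -p /grid/0
#   mount -o noatime /dev/sdb1 /grid/0
```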
Swap: in general, disable swap on the DataNodes, since swapping should REALLY not happen there and would most likely kill cluster performance. It is better to have tasks fail and someone look at them.
On master nodes it is a bit more complex. Depending on cluster size, many tasks can be running there, and OOM errors can lead to unpredictable results, so swapping may be safer here. However, make sure that you normally have enough space available.
But you may find other recommendations as well.
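A sketch of the swap advice above (the sample fstab contents and paths are made up for illustration):

```shell
# Disabling swap on a DataNode (requires root):
#   swapoff -a
# To keep it off across reboots, comment out the swap entries in /etc/fstab.
# The sed is shown here against a sample file so it can be inspected safely:
cat > /tmp/fstab.sample <<'EOF'
/dev/sda1 /     ext4 defaults 0 0
/dev/sda2 swap  swap defaults 0 0
EOF
sed '/\sswap\s/ s/^/#/' /tmp/fstab.sample

# On master nodes, where some swap may be kept, make the kernel reluctant
# to actually use it:
#   sysctl -w vm.swappiness=1
```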
Created 04-15-2016 11:27 PM
Instead of swap, can we use tmpfs?
Created 04-16-2016 10:44 AM
What would you want to speed up with tmpfs? Most components in the Hadoop environment only use the disks for persistence ( or require A LOT of space on the DataNodes ), so a tmpfs store defeats the purpose for something like an fsimage etc.
The two components that aggressively use memory-backed storage are Spark and Kafka, but both rely on OS filesystem buffers instead ( and tell the OS to write through to disk only when needed ).
Created 10-29-2017 04:56 PM
tmpfs is backed by RAM anyway, so if you already needed to swap out to a swap partition, you won't have any RAM left to spill to the tmpfs either.
tmpfs is used when you have a lot of RAM available and need to cache something fast and ephemeral; it lets you mount some amount of RAM as a filesystem, so you can use it as if it were an FS mount.
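For completeness, mounting tmpfs looks like this (the size and mount point are assumptions for illustration):

```shell
# Mount 2 GB of RAM as a filesystem (requires root; contents vanish on reboot):
#   mkdir -p /mnt/ramdisk
#   mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk
# Or persistently via an /etc/fstab entry:
#   tmpfs /mnt/ramdisk tmpfs size=2g 0 0
```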
Created 04-16-2016 04:12 AM
To add to what @Benjamin Leonhardi mentioned, a good doc to start with is the cluster planning guide (refer to page 7).
A 12-page doc with loads of information.
Regarding swap, here is the recommendation.
Hope this helps