What are the pros/cons of mounting more disks per datanode vs having more datanodes?
For example, if I have 6 HDDs available for Hadoop, what are the pros/cons of deploying 6 datanodes with one disk each vs 1 datanode with 6 disks vs 2 datanodes with 3 disks each, etc.?
Thank you for any help.
@Ricky Chen What are you after: compute or storage? If compute, it is a good idea to add more datanodes to your cluster, each with one or two HDDs for storage. If you are just using Hadoop as your data warehouse, you can use a single datanode with all 6 disks.
If you go with the second scenario, make sure the node has enough resources to process that data. Also, Hadoop 2.x doesn't provide inter-disk balancing (balancing data across the disks within a single datanode); that is only available in Hadoop 3.x.
2 datanodes with 3 disks each would be an ideal combination, giving you both compute and storage.
Thank you for your response.
The priority would be storage. The main use case is reading/writing text files, along with compressing/decompressing them.
Can you elaborate a bit on what you mean by inter-disk balancing? Are you referring to the round-robin placement that datanodes do when multiple directories are listed under the dfs.datanode.data.dir property?
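For illustration, here is a minimal Python sketch of the round-robin placement the question refers to. It is a toy model, not HDFS code: the function name place_blocks and the third mount point /grid/2 are made up for the example, and the real datanode also checks free space on each volume before writing. The point is only that round-robin spreads new blocks evenly and does nothing about data already on the disks, so a newly added disk stays far emptier than the old ones.

```python
from itertools import cycle


def place_blocks(volumes, used, n_blocks):
    """Assign n_blocks new blocks to volumes in strict round-robin order.

    volumes: data directories, in the order they would be listed
             under dfs.datanode.data.dir
    used:    dict mapping each volume to the number of blocks it already holds
    Returns a new dict with the updated block counts.
    """
    result = dict(used)        # copy so the caller's dict is not modified
    chooser = cycle(volumes)   # endless volume 0, 1, 2, 0, 1, 2, ...
    for _ in range(n_blocks):
        result[next(chooser)] += 1
    return result


# Two disks already hold 100 blocks each; a third, empty disk
# (/grid/2, hypothetical) is then added to the same datanode.
before = {"/grid/0": 100, "/grid/1": 100, "/grid/2": 0}
after = place_blocks(list(before), before, 30)

for volume, blocks in after.items():
    print(volume, blocks)
```

Each volume receives the same share of the 30 new blocks, so the gap between the old disks and the new one never closes on its own. Moving existing blocks between the disks of one node is the separate job described as inter-disk balancing above.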
Ideally, 3 datanodes with 2 disks each is a good option for both storage and compute: if one datanode goes down, you can rely on the other two. Together with an HDFS replication factor of 2, this helps avoid data loss and is a safe option. Make sure you use the same filesystem mount point names on all datanodes. For example, if you name the mount points /grid/0 and /grid/1, follow the same naming convention on all three nodes. List /grid/0 and /grid/1 under the dfs.datanode.data.dir property.
The HDFS balancer will take care of cluster-wide data balancing across datanodes.
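To put rough numbers on the layouts discussed in this thread, here is a small Python sketch. The 4 TB disk size is an assumption picked only for the arithmetic, the function name summarize is made up, and the model is deliberately simple: it assumes each datanode is a separate machine, that replicas of a block always land on different datanodes, and it ignores space reserved for non-HDFS use.

```python
def summarize(nodes, disks_per_node, disk_tb=4.0, replication=2):
    """Rough capacity and failure numbers for one cluster layout.

    disk_tb is an assumed disk size; replication is the HDFS
    replication factor (2, as suggested in the answer above).
    """
    raw_tb = nodes * disks_per_node * disk_tb
    # Replicas of a block go to different datanodes, so a block cannot
    # have more replicas than there are datanodes.
    effective_replication = min(replication, nodes)
    usable_tb = raw_tb / effective_replication
    return {
        "raw_tb": raw_tb,
        "usable_tb": usable_tb,
        # With at least 2 replicas on different nodes, losing one node
        # still leaves a copy of every block.
        "survives_node_loss": effective_replication >= 2,
        # Share of raw capacity that goes offline when one node fails.
        "capacity_lost_per_node_failure": 1 / nodes,
    }


# The layouts mentioned in the thread, as (datanodes, disks per datanode).
layouts = [(6, 1), (3, 2), (2, 3), (1, 6)]

for nodes, disks in layouts:
    print(f"{nodes} node(s) x {disks} disk(s):", summarize(nodes, disks))
```

Under these assumptions, every multi-node layout gives the same usable space, and what changes is how much capacity disappears when a single node fails. The single-node layout shows more usable space only because it cannot hold a second replica anywhere, so it has no protection against losing that node.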