
DataNodes to disk ratio


New Contributor

What are the pros and cons of mounting more disks per DataNode vs. running more DataNodes?


For example, if I have 6 HDDs available for Hadoop, what are the pros and cons of deploying 6 DataNodes with one disk each vs. 1 DataNode with 6 disks vs. 2 DataNodes with 3 disks each, and so on?


Thank you for any help

3 REPLIES

Re: DataNodes to disk ratio

Contributor

@Ricky Chen Are you after compute or storage? If compute, it is a good idea to add more DataNodes to your cluster, each with one or two HDDs for storage. If you are just using Hadoop as your data warehouse, you can use a single DataNode with all 6 disks.

If you go with the second scenario, make sure the node has enough resources to process that data. Also, Hadoop 2.x does not provide inter-disk balancing (evening out data across the disks inside one DataNode); that is only available in Hadoop 3.x.

2 DataNodes with 3 disks each would be an ideal combination, giving you both compute and storage.
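To make the trade-off between these layouts concrete, here is a rough Python sketch that enumerates the ways to split 6 disks evenly across DataNodes. It relies on one HDFS fact: each replica of a block goes to a different DataNode, so the number of replicas that can actually exist is capped by the DataNode count. Everything else is an assumption for illustration: the `Layout` class and `compare` function are hypothetical names, disks are treated as equal in size, and reserved space and compute capacity are ignored.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    datanodes: int
    disks_per_node: int

    def effective_replicas(self, replication: int) -> int:
        # Each replica of a block lives on a different DataNode, so the
        # replica count cannot exceed the number of DataNodes.
        return min(replication, self.datanodes)

    def usable_disks(self, replication: int) -> float:
        # Raw capacity (counted in disks) divided by the replicas stored.
        total = self.datanodes * self.disks_per_node
        return total / self.effective_replicas(replication)

    def survives_node_loss(self, replication: int) -> bool:
        # Data survives one DataNode failure only if a second replica
        # exists on another DataNode.
        return self.effective_replicas(replication) >= 2


def compare(total_disks: int, replication: int):
    """List every even split of total_disks across DataNodes."""
    rows = []
    for nodes in range(1, total_disks + 1):
        if total_disks % nodes == 0:
            layout = Layout(nodes, total_disks // nodes)
            rows.append((
                layout.datanodes,
                layout.disks_per_node,
                layout.effective_replicas(replication),
                layout.usable_disks(replication),
                layout.survives_node_loss(replication),
            ))
    return rows


# Columns: DataNodes, disks per node, replicas, usable disks, survives node loss
for row in compare(total_disks=6, replication=2):
    print(row)
```

With a replication factor of 2, the single-DataNode layout keeps all 6 disks' worth of usable space but holds only one copy of each block, while every multi-node layout gives up half the raw space in exchange for surviving the loss of one DataNode.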

Re: DataNodes to disk ratio

New Contributor

@Sandeep Kumar

Thank you for your response.

The priority would be storage. The main use case is reading and writing text files, along with compressing and decompressing them.

Can you elaborate on what you mean by inter-disk balancing? Are you referring to the round-robin placement that DataNodes do when multiple directories are listed under the dfs.datanode.data.dir property?

Re: DataNodes to disk ratio

New Contributor

@Ricky Chen

Ideally, 3 DataNodes with 2 disks each is a good option for both storage and compute: if one DataNode goes down, you can rely on the other two. Combined with an HDFS replication factor of 2, this helps avoid data loss and is the safe option. Make sure you use the same filesystem mount point names on all DataNodes. For example, if you name the mount points /grid/0 and /grid/1, follow the same naming convention on all three nodes, and list /grid/0 and /grid/1 under the dfs.datanode.data.dir property.

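The HDFS Balancer takes care of cluster-wide data balancing across DataNodes; the HDFS Disk Balancer (Hadoop 3.x) evens out data across the disks within a single DataNode.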
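On the round-robin question raised earlier in the thread: round-robin placement only spreads new blocks across the directories in dfs.datanode.data.dir; it does not move existing blocks. The toy Python sketch below illustrates why that differs from disk balancing. It is a simplified model, not DataNode code: `write_round_robin` is a hypothetical helper, every block is assumed to be the same size, and the real policy's free-space checks are left out.

```python
from itertools import cycle


def write_round_robin(used, blocks):
    """Return per-volume block counts after writing `blocks` new blocks,
    rotating through the volumes one block at a time."""
    result = list(used)
    volumes = cycle(range(len(result)))
    for _ in range(blocks):
        result[next(volumes)] += 1
    return result


# Two volumes already hold 900 blocks each; a third, empty disk was just added.
before = [900, 900, 0]
after = write_round_robin(before, 300)

print(after)                                      # [1000, 1000, 100]
print(max(before) - min(before), max(after) - min(after))  # 900 900
```

The gap between the fullest and emptiest disk is unchanged after the writes, which is the situation a disk balancer is meant to address by moving existing blocks between the disks of one DataNode.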