Size of HDFS

Contributor

Let's assume I have a 10-node Hadoop cluster with 1 namenode and 9 datanodes (each with different-sized disks). What is the limiting factor on the size of the HDFS filesystem? Is it limited to the size of the smallest disk on any of the 9 nodes? I am assuming that this would be the case for the default replication factor of three.

1 ACCEPTED SOLUTION

Master Guru

@Kartik Vashishta

I think you should read an HDFS book :-). Replication is not tied to specific disks.

HDFS puts the 3 replicas of each block on different nodes. It doesn't have to choose a specific disk or node. The only rules are that:

- All three replicas will be on different nodes

- If you have rack topology enabled, the second and third replicas will be on a different rack from the first

It does not have to be a specific drive or node. HDFS will look for free space on any node that fits the requirements. The only issue I could imagine would be one huge node and some very small nodes whose combined capacity cannot match that of the big node. (I have seen this with a mix of physical nodes and VM nodes.)
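
To make the placement rules above concrete, here is a minimal Python sketch. It is not HDFS code: it assumes node capacities are measured in whole blocks, ignores racks, and always places replicas on the nodes with the most free space, which is a simplification of the real placement policy. The function name `max_blocks` and the example capacities are made up for illustration.

```python
def max_blocks(node_capacities, replication=3):
    """Count how many blocks fit when every block needs
    `replication` replicas, each on a different node.

    node_capacities: free space per node, in blocks (hypothetical units).
    """
    free = list(node_capacities)
    stored = 0
    while True:
        # Choose the `replication` nodes with the most free space.
        candidates = sorted(
            range(len(free)), key=lambda i: free[i], reverse=True
        )[:replication]
        # Stop when there are not enough distinct nodes with room left.
        if len(candidates) < replication or any(free[i] == 0 for i in candidates):
            return stored
        for i in candidates:
            free[i] -= 1
        stored += 1


# Mixed node sizes: all 18 slots are used (6 blocks x 3 replicas).
print(max_blocks([2, 4, 6, 6]))   # 6
# One huge node and two tiny ones: the tiny nodes run out first.
print(max_blocks([100, 2, 2]))    # 2
```

In the first case the smallest node holds only 2 blocks, yet 6 blocks fit, so the smallest disk is not the limit. In the second case almost all of the big node stays unused, because every block still needs three distinct nodes with free space.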

Super Guru
@Kartik Vashishta

1. You can create different "datanode configuration groups" for datanodes that differ in configuration, and all datanodes will contribute to the HDFS size.

2. If you are worried about the replication factor, HDFS takes care of it by default, placing replicas randomly on datanodes. If a datanode has less space, HDFS will give a warning as it fills up, and once it is almost full you will not be able to use that datanode for storage.

Let me know if I understood you correctly.

Contributor

Thanks for helping me; the information you provided is helpful. However, I need to know: in a cluster with assorted disk sizes, what is the limiting factor? Is it the size of the smallest disk?

Master Guru

@Kartik Vashishta

Again, you don't understand HDFS. There is no limiting factor apart from the total disk capacity of the cluster. HDFS puts blocks (simply files on the local filesystem) on the disks of your cluster and fills them up. It also makes sure that you have 3 copies of each block. There is no limit other than the total amount of space. Now, it's not very smart to have differently sized disks in your cluster, because not all spindles will be utilized equally, but there is no limit per se. The performance problem is that the small drives fill up first, and then all write activity happens on the bigger drives. So equal drive sizes are recommended but, again, not required.
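
As a rough illustration of "no limit other than the total amount of space", the arithmetic can be sketched in Python. The disk sizes below are hypothetical, and `usable_hdfs_capacity_tb` is a made-up helper, not part of any Hadoop API. The sketch assumes every file uses the same replication factor and ignores space reserved for non-HDFS use, so a real cluster will report somewhat less.

```python
def usable_hdfs_capacity_tb(disks_per_node_tb, replication=3):
    """Estimate how much user data fits: raw capacity divided by the
    replication factor, since each block is stored `replication` times."""
    raw_tb = sum(sum(disks) for disks in disks_per_node_tb)
    return raw_tb / replication


# Nine datanodes with assorted disk sizes in TB (hypothetical values).
datanodes = [[1, 2], [3], [4], [2, 2], [1], [3, 3], [2], [2], [2]]
print(usable_hdfs_capacity_tb(datanodes))  # 9.0 (27 TB raw / 3 replicas)
```

The smallest disk here is 1 TB, but the estimate depends only on the 27 TB total. On a running cluster, `hdfs dfsadmin -report` shows the configured and remaining capacity that HDFS itself reports.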

So it's recommended to have equally sized disks, but it's not a requirement. The other disks will not be empty. It's also not a requirement to have the same number of drives in each node, but you need to configure each node with the correct number of drives using config groups, as Sagar said.
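
The uneven-fill effect described in this reply can be sketched with a small Python simulation. It assumes a datanode hands out blocks to its disks in simple round-robin order and skips disks that are full; the capacities are in blocks and are hypothetical, and `write_sequence` is an illustrative name rather than anything from Hadoop.

```python
def write_sequence(disk_capacities, blocks_to_write):
    """Return the index of the disk that receives each written block,
    using round-robin selection and skipping full disks."""
    free = list(disk_capacities)
    sequence = []
    disk = 0
    for _ in range(blocks_to_write):
        if sum(free) == 0:
            break  # every disk is full, nothing more can be written
        while free[disk] == 0:
            disk = (disk + 1) % len(free)  # skip disks with no room
        free[disk] -= 1
        sequence.append(disk)
        disk = (disk + 1) % len(free)
    return sequence


# Two small disks (2 blocks each) and one large disk (8 blocks).
print(write_sequence([2, 2, 8], 12))
# [0, 1, 2, 0, 1, 2, 2, 2, 2, 2, 2, 2]
```

The first six writes are spread over all three disks. After that the small disks are full and every remaining write lands on the large one, so a single spindle ends up carrying all the write load.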

Master Guru

What Sagar says. It's not like RAID, where whole disks are mirrored across nodes. Blocks are put on different nodes, and HDFS will try to fill up the available space. It's pretty flexible.

Contributor

Thanks, team. My question has more to do with the scenario in which replication is not possible on a particular node of the cluster because that node's disk has filled up.

Contributor

That answered my question. Thank you very much.