Created 07-25-2016 09:49 AM
Let's assume I have a 10-node Hadoop cluster with 1 namenode and 9 datanodes, each with different-sized disks. What is the limiting factor for the size of the HDFS filesystem? Is it limited to the size of the smallest disk on any of the 9 nodes? I am assuming that this would be the case with the default replication factor of three.
Created 07-25-2016 10:12 AM
1. You can create different "datanode configuration groups" for datanodes that differ in hardware, and all datanodes will contribute to the HDFS size.
2. If you are worried about the replication factor, by default HDFS takes care of it by placing replicas randomly across datanodes. If a datanode is low on space HDFS will give a warning, and once it is almost full you will no longer be able to use that datanode for storage.
Let me know if I understood you correctly.
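If it helps, here is a quick way to check what each datanode is contributing and how full it is (a minimal sketch; the exact report fields vary a bit between Hadoop versions):

```
# Per-datanode capacity, DFS used and remaining space, plus any dead nodes
hdfs dfsadmin -report

# The local directories (disks) the datanodes are configured to use
hdfs getconf -confKey dfs.datanode.data.dir
```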
Created 07-25-2016 10:44 AM
Thanks for helping me, the information you provided is helpful. However, I still need to know: in a cluster with assorted/different disk sizes, what is the limiting factor? Is it the size of the smallest disk?
Created 07-25-2016 12:08 PM
Again, you don't understand HDFS. There is no limiting factor apart from the total disk capacity of the cluster. HDFS puts blocks (simply files on the local file system) on the disks of your cluster and fills them up, and it also makes sure that you have 3 copies of each block. There is no limit but the total amount of space. Now, it's not very smart to have differently sized disks in your cluster, because it means not all spindles will be utilized equally, but there is no limit per se. The performance problem is that the small drives will fill up and then all write activity will happen on the bigger drives. So equal drive sizes are recommended, but again not required.
So it's recommended to have equally sized disks, but it's not a requirement, and the other disks will not stay empty. It's also not a requirement to have the same number of drives in each node, but you need to configure each node with the correct set of drives using config groups, as Sagar said.
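To put rough numbers on the mixed-disk case (a back-of-the-envelope sketch; the sizes below are made up, not from your cluster), plus the commands to read the real totals:

```
# Hypothetical example: 8 datanodes with 2 TB each plus 1 datanode with 10 TB
#   raw capacity               = 8*2 TB + 10 TB = 26 TB
#   theoretical ceiling (RF=3) = 26 TB / 3     ~  8.6 TB of data
# But every block needs replicas on 3 *different* nodes, so the 8 small
# nodes (16 TB) must hold 2 of the 3 copies: 2*D <= 16 TB, i.e. usable
# data tops out around 8 TB and ~2 TB of the big node is never used.

# Actual capacity / used / remaining for your cluster:
hdfs dfs -df -h /

# Current default replication factor (3 unless you changed it):
hdfs getconf -confKey dfs.replication
```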
Created 07-25-2016 10:44 AM
As Sagar says, it's not like RAID, where whole disks are mirrored across nodes. Blocks are placed on different nodes and HDFS will try to fill up the available space. It's pretty flexible.
Created 07-25-2016 12:21 PM
Thanks team. My question has more to do with the scenario in which replication is not possible on a particular node of the cluster because that disk has filled up.
Created 07-25-2016 01:02 PM
I think you should read an HDFS book :-). Replication is not tied to specific disks.
HDFS will put the 3 replica copies on different nodes. It doesn't have to choose a specific disk or node. The only rules are that:
- all three replicas will be on different nodes, and
- if you have rack topology enabled, the second and third copies will be on a different rack from the first copy.
It does not have to be a specific drive or node: HDFS will look for free space on any node that fits these requirements. The only issue I could imagine is one huge node and some very small nodes that together cannot match the size of the big node. (I have seen this with physical nodes and VM nodes.)
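If you want to verify where the replicas of a particular file actually landed (a small sketch; /user/test/file.txt is just a placeholder path):

```
# List every block of the file and the datanodes that hold each replica
hdfs fsck /user/test/file.txt -files -blocks -locations
```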
Created 07-25-2016 01:22 PM
That answered my question. Thank you very much.