Created 08-14-2016 03:31 PM
In Hadoop 1, horizontal scaling of the NameNode was not possible. Because of this, there was a limit beyond which one could not horizontally scale the DataNodes in the cluster.
Hadoop 2 allows horizontal scaling of NameNodes using a technique called HDFS federation. Without HDFS federation, one cannot add more DataNodes to the Hadoop cluster beyond a certain limit. In other words, a single NameNode's RAM eventually becomes fully occupied with the metadata/namespace information used to manage all the DataNodes in the cluster. At that point, adding new DataNodes, i.e. horizontal scaling of DataNodes, becomes impossible, because the NameNode's RAM is already full and cannot hold the new metadata/namespace information generated by managing additional DataNodes in the cluster. So the limiting factor is the memory capacity of the NameNode's RAM. HDFS federation comes to the rescue by dividing the metadata/namespace across multiple NameNodes, thereby offloading part of one application's metadata/namespace information that has grown beyond the capacity of a single NameNode's RAM onto a second NameNode's RAM.
Is my understanding correct ??
If YES, then as per the documentation, NameNodes are independent and do not communicate with each other. In that case, who is managing an application's metadata/namespace information that is spread across two NameNodes? Some sort of daemon should exist that communicates with all the NameNodes and finds the portions of metadata split between them.
If NO, then one application's metadata/namespace information does not get split; rather, it stays in the same NameNode's RAM. The next NameNode will be used only when the system encounters a situation where an application's metadata/namespace information cannot be fully accommodated in that RAM for successfully executing all of its MapReduce jobs. If it does not fit, the system has to try a NameNode whose RAM capacity is sufficient to load the job. So my take here is that HDFS federation is just a workaround and has not actually solved the issue.
Please confirm.
Created 08-14-2016 06:36 PM
Please see my replies below:
1. So the limiting factor is the memory capacity of the NameNode's RAM. HDFS federation comes to the rescue by dividing the metadata/namespace across multiple NameNodes, thereby offloading part of one application's metadata/namespace information that has grown beyond the capacity of a single NameNode's RAM onto a second NameNode's RAM.
Is my understanding correct ??
Answer: Yes
2. If YES, then as per the documentation, NameNodes are independent and do not communicate with each other. In that case, who is managing an application's metadata/namespace information that is spread across two NameNodes?
Answer: No one. The client needs to know the namespace it is connecting to, and hence the NameNode. There are multiple nameservices in HDFS federation. Even without HDFS federation, your client applications today connect to the NameNode using a nameservice. They will do the same thing when HDFS federation is enabled; they just need to know which nameservice they are connecting to.
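For illustration, here is a minimal client-side sketch of that idea, assuming a nameservice named ns1 is already defined in hdfs-site.xml and the cluster is reachable (the nameservice name and the path are hypothetical):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameserviceClientSketch {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath, where the
        // hypothetical nameservice "ns1" and its NameNode(s) are defined.
        Configuration conf = new Configuration();

        // The client addresses the namespace by nameservice name, not by a
        // specific NameNode host:port. With federation, a second namespace is
        // reached the same way through a different nameservice, e.g. hdfs://ns2/.
        FileSystem fs = FileSystem.get(URI.create("hdfs://ns1/"), conf);
        System.out.println(fs.exists(new Path("/user/someuser/data.txt")));
    }
}
```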
Now, the question is: how would Hive or other client tools know where the Hive metastore data is? What if there are external tables? When you enable HDFS federation, you will use what's called ViewFS, which enables you to manage multiple namespaces by mounting file system locations that live in different NameNodes onto logical mount points, similar to Linux mount tables. I would highly recommend reading the following two links (I like the first link better).
http://thriveschool.blogspot.com/2014/07/hdfs-federation.html
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/ViewFs.html
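To make the mount-table idea concrete, here is a minimal sketch of a ViewFS mount table, set programmatically for readability (in practice these keys usually live in core-site.xml). The mount table name cluster, the nameservices ns1/ns2, and the paths are all hypothetical, and resolving a path assumes those namespaces actually exist and are running:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ViewFsMountTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Clients see one logical file system...
        conf.set("fs.defaultFS", "viewfs://cluster/");

        // ...whose top-level directories are mounted onto different namespaces
        // (and hence different NameNodes), much like a Linux mount table.
        conf.set("fs.viewfs.mounttable.cluster.link./user", "hdfs://ns1/user");
        conf.set("fs.viewfs.mounttable.cluster.link./data", "hdfs://ns1/data");
        conf.set("fs.viewfs.mounttable.cluster.link./logs", "hdfs://ns2/logs");

        FileSystem fs = FileSystem.get(URI.create("viewfs://cluster/"), conf);

        // A client path such as /logs/2016/08 is routed transparently to the
        // NameNode that owns the ns2 namespace (this call contacts that NameNode).
        System.out.println(fs.resolvePath(new Path("/logs/2016/08")));
    }
}
```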
Created 08-15-2016 12:24 AM
Hi @mqureshi. How are the clients divided up between the namenodes? Can the whole cluster still interact fully? E.g. if pig and hive connect to different nameservices/namenodes can they still operate on the same data on HDFS?
Created 08-15-2016 01:28 AM
That's a great question. I have updated the answer. You would basically use ViewFS, which uses mount tables similar to Linux to solve the problem of using relative paths across different namespaces without needing to specify the NameNode URI.
I must admit that it gets ugly. That's why, unless you have thousands of nodes, this should ideally be avoided. Please share your motivation for considering federation; maybe there is a better and cleaner solution.
Created 08-18-2016 05:32 AM
Hi @mqureshi. Thanks for your response. Personally I have no motivation to use federation; I am just curious about it, as I see it mentioned occasionally and I hadn't really come across a concrete example of its practical application and how it would work.
Created 08-16-2016 05:55 AM
Hi @mqureshi
Understood that the client needs to know upfront the name of the namespace it wants to work with, in other words, the namespace containing the data blocks needed to distribute the map tasks in parallel for a particular use case. So the naming of the namespace comes into the Hadoop picture at the time of storing the data into the common HDFS storage cluster. The input data will be split into input splits and stored as a number of blocks on the DataNodes. All the metadata generated as a result of storing those input splits as blocks across the DataNode cluster will be added to the namespace in the corresponding NameNode.

If, say, the input file we want to process is so immensely huge that the input splits generated for it are accordingly huge in number, and the resulting metadata for the storage references of all the input splits in HDFS grows beyond the RAM capacity, do we have any option other than the one below?

Divide the input file so that the metadata size in each NameNode grows only within its RAM capacity. For example, we have two NameNodes (NN1, NN2) in our cluster, federated as follows: NN1 has 8 GB RAM and NN2 has 6 GB RAM. Our original input file is so huge that, let's say, its metadata is going to occupy 10 GB of RAM during HDFS storage. As we know, we cannot store this file in HDFS under a single namespace without partitioning the input file manually. So we divide the input file into two equal parts, such that the first part's metadata references (created during HDFS storage in NN1) occupy less than 8 GB of RAM, i.e. 5 GB, and the second part's metadata references (created during HDFS storage in NN2) occupy less than 6 GB of RAM, i.e. 5 GB.

Do we have auto-detection of such use cases, to avoid the manual intervention of partitioning the input data here?
Created 08-16-2016 07:05 AM
"If, say for example, the kind of input file that we want to process is immensely huge such that the inputs splits generated for that file are accordingly huge in number, and the resulting meatadata information for the storage references of the entire input splits in HDFS grows beyond the RAM capacity."
Dude. Calm down :)... seriously, that is some imagination, I must give you that. (At this point I am assuming your question is more academic than work-related; I could of course be wrong.)
One namenode object uses about 150 bytes to store metadata information. Assume a 128 MB block size - you should increase the block size for the case you describe.
Assume a file size of 150 MB. The file will be split into two blocks: the first block of 128 MB and the second of 22 MB. For each block, the following information will be stored by the NameNode.
1 file inode and 2 blocks.
That is 3 NameNode objects. They will take about 450 bytes on the NameNode. By contrast, at a 1 MB block size we would have 150 blocks for the same file: one inode plus 150 block entries in the NameNode, i.e. 151 NameNode objects for the same data, or 151 x 150 bytes = 22,650 bytes. Even worse would be 150 files of 1 MB each: that would require 150 inodes and 150 blocks = 300 x 150 bytes = 45,000 bytes. See how this all changes. That's why we don't recommend small files for Hadoop.
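A small sketch of that arithmetic, using the rough figure of 150 bytes per NameNode object quoted above (these are back-of-the-envelope estimates, not exact heap measurements):

```java
public class NamenodeObjectEstimate {
    // Rough rule of thumb from the post: each NameNode object
    // (file inode or block entry) costs about 150 bytes of heap.
    static final long BYTES_PER_OBJECT = 150;

    // Objects for one file: 1 inode plus one block entry per block.
    static long objectsForFile(long fileSizeMb, long blockSizeMb) {
        long blocks = (fileSizeMb + blockSizeMb - 1) / blockSizeMb; // ceiling division
        return 1 + blocks;
    }

    public static void main(String[] args) {
        // 150 MB file, 128 MB blocks: 1 inode + 2 blocks = 3 objects -> 450 bytes
        System.out.println(objectsForFile(150, 128) * BYTES_PER_OBJECT);

        // Same 150 MB file, 1 MB blocks: 1 inode + 150 blocks = 151 objects -> 22,650 bytes
        System.out.println(objectsForFile(150, 1) * BYTES_PER_OBJECT);

        // 150 separate 1 MB files: 150 inodes + 150 blocks = 300 objects -> 45,000 bytes
        System.out.println(150 * objectsForFile(1, 1) * BYTES_PER_OBJECT);
    }
}
```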
Now assuming 128 MB file blocks, on average 1GB of memory is required for 1 million blocks.
Now let's do this calculation at PB scale.
Assume 6000 TB of data. That's a lot. I am sure your large file is less than that.
Imagine 30 TB of capacity for each node. This will require 200 nodes.
At 128 MB block size, and replication factor of 3.
Cluster capacity in MB = 30 TB x 1,000 (convert to GB) x 1,000 (convert to MB) x 200 nodes = 6,000,000,000 MB (6,000 TB)
How many blocks can we store in this cluster?
6,000,000,000 MB / 128 MB = 46,875,000 blocks (that's about 47 million blocks)
Assuming 1 GB of memory is required per million blocks, you need a mere 46,875,000 blocks / 1,000,000 blocks per GB ≈ 47 GB of memory.
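The same back-of-the-envelope math at cluster scale, mirroring the numbers above (decimal TB-to-MB conversions and raw capacity, as in the post; ballpark only):

```java
public class NamenodeHeapEstimate {
    public static void main(String[] args) {
        long nodes = 200;                 // cluster size assumed above
        long capacityPerNodeTb = 30;      // raw capacity per node
        long blockSizeMb = 128;           // HDFS block size

        // 30 TB x 1,000 (to GB) x 1,000 (to MB) x 200 nodes = 6,000,000,000 MB
        long clusterCapacityMb = capacityPerNodeTb * 1000L * 1000L * nodes;

        // Number of 128 MB blocks that raw capacity can hold: 46,875,000 (~47 million)
        long blocks = clusterCapacityMb / blockSizeMb;

        // Rule of thumb: about 1 GB of NameNode heap per million blocks -> ~47 GB
        double heapGb = blocks / 1_000_000.0;

        System.out.printf("capacity: %,d MB, blocks: %,d, heap: ~%.1f GB%n",
                clusterCapacityMb, blocks, heapGb);
    }
}
```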
Namenodes with 64-128 GB memory are quite common. You can do a few things here.
1. Increase the block size to 256 MB; that will save you quite a bit of NameNode space. At the scale you are talking about, you should do that regardless. Maybe even 384-512 MB.
2. Get more memory for the NameNode. Probably 256 GB or even 512 GB servers.
Finally, read the following.
https://issues.apache.org/jira/browse/HADOOP-1687
and the following (notice that for 40-50 million files only 24 GB is recommended, about half of our calculation, probably because the block size assumed at that scale is 256 MB rather than 128 MB)
Created 08-16-2016 11:06 AM
Yes, it's my imagination, the result of looking at the bigger picture theoretically rather than anything work-related. I was also trying to foresee a situation where it becomes necessary for NameNodes to talk to each other, when an application's huge namespace data gets partitioned (due to RAM constraints) and resides on other NameNodes. From your explanation and the links provided, it's clear that such a situation does not exist and is most likely not going to occur in real-world scenarios. Thus there is no need at all for NameNodes to talk to each other.