Created 02-11-2016 09:08 PM
I am trying to put together a hardware specification for NameNodes running in HA mode, which forced me to think about disk allocation for the NameNodes. I pretty much get it with non-HA: use one RAID volume plus an NFS mount for redundancy, the SNN incrementally applies changes from the edit log to the fsimage, and so on. But I want to run HA, and I want to use JournalNodes (JNs) and the Quorum Journal Manager (QJM) approach. That made me think about this scenario, and I was not sure I was getting it right, so I wanted to ask some gurus for input. Here's what I think... Can you please confirm or correct? I think a scenario-type question will help me ask the questions more easily, so here goes.
Assume a clean install. Primary and failover NNs both have empty fsimage files. The primary starts running and writing changes to all three JNs. As I understand it, the failover NN will be reading all those changes, via the JNs, and applying them to its empty fsimage to prepare it to be 100% complete should it be called to take over (faster startup time).
Now the primary fails. The failover NN starts up, reads in the fsimage file, and starts accepting client requests as normal. It now starts to write edits to the JNs. But the formerly primary NN is still down, so it is NOT reading updates from the JNs. So its fsimage remains empty, essentially.
Next, I fix the formerly primary NN and start it up. It now becomes the failover NN. At this point, I guess it starts reading changes from the JNs and building up its empty fsimage with all changes to date, in hopes that it will once again rule the world and become active should the other NN fail some day.
Q1 - Is it true that the failover NN will NEVER have to apply any edit log changes at start up but simply loads its fsimage and starts running because it assumes fsimage is already 100% up to date via recent JN reads?
Q2 - In a setup with 3 JNs as a quorum, what should the disk layout look like on the three servers hosting those JNs? Because the edits are now distributed x3, should I just have a single disk per JN host dedicated to the JNs? No need for the one RAID and second NFS type arrangement used in non-HA mode? Specifically, the disk resources typically used for non-HA NN, where the NN writes edit log changes, now become disk resources used exclusively by the JNs, right? Meaning, the NNs never read/write anything directly to disk (except for configuration, I assume) but rather ALL goes through the JNs.
Q3 - I believe I should still have one dedicated disk for each JN on each host, to isolate the unique workload of the JN from other processes. So, for example, there might be one disk for the OS, one for the JN, and another for the ZK instance sharing the same server to support the ZKFC. Correct?
Q4 - Because JNs are distributed, it makes me think I should treat these disks like I do disks on the DNs, meaning no RAID, just plain old JBOD. Does that sound right?
Q5 - Is it the NN on the failover server that actually does the JN reads and fsimage updates now in HA mode given that there is no SNN in such a configuration?
Thanks in advance for confirmation or any insight on this...
Created 02-12-2016 01:43 AM
"As I understand it, the failover NN will be reading all those changes, via the JNs,
That is true for file system changes. For block reports etc., the DataNodes communicate directly with both NameNodes; they essentially duplicate every message to both instances. That's the reason the two NameNodes have an almost identical in-memory image.
"Now the primary fails. The failover NN starts up and reads in the fsimage file and starts accepting client requests as normal. It now starts to write edits to the JNs. But the formally primary NN is still down so it is NOT reading updates from the JNs. So, it's fsimage remains empty, essentially."
The failover NN continuously reads the JournalNode changes. So it has an almost-current instance of the fsimage in memory, just like the formerly active NameNode.
"Q1 - Is it true that the failover NN will NEVER have to apply any edit log changes at start up but simply loads its fsimage and starts running because it assumes fsimage is already 100% up to date via recent JN reads?"
As written above, the failover NN does not have to start up. It is running in parallel and has an almost identical in-memory image to the active NameNode, so when it takes over it's practically instantaneous. It just has to apply whatever recent changes from the JournalNodes it hasn't applied yet.
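For reference, how frequently the standby checks the JournalNodes for newly finalized edit segments is configurable in hdfs-site.xml; the snippet below is only an illustration (60 seconds is, as far as I know, the default):

```xml
<!-- hdfs-site.xml: how often (in seconds) the standby NN checks the
     JournalNodes for new finalized edit log segments to apply.
     60 is the usual default; shown here purely as an illustration. -->
<property>
  <name>dfs.ha.tail-edits.period</name>
  <value>60</value>
</property>
```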
"Q2 - In a setup with 3 JNs as a quorum, what should the disk layout look like on the three servers hosting those JNs? Because the edits are now distributed x3, should I just have a single disk per JN host dedicated to the JNs? No need for the one RAID and second NFS type arrangement used in non-HA mode? Specifically, the disk resources typically used for non-HA NN, where the NN writes edit log changes, now become disk resources used exclusively by the JNs, right?
If possible, the JournalNodes, like the NameNodes, should have RAIDed data disks. It just reduces the chance that a JournalNode will die. In contrast to HDFS data disks, the volumes are not huge and the cost is low. You can, however, colocate them with the NameNodes, since they are pretty lightweight. No need for NFS, though.
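As a rough sketch of what that layout looks like in hdfs-site.xml (the hostnames, nameservice ID, and paths below are made-up examples, not a recommendation):

```xml
<!-- hdfs-site.xml (sketch; hostnames, nameservice ID, and paths are examples only) -->

<!-- Both NameNodes write and read edits through the JournalNode quorum -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://master1:8485;master2:8485;master3:8485/mycluster</value>
</property>

<!-- Local directory on each JournalNode host where it stores its edits;
     point this at the RAIDed data volume discussed above -->
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/hadoop/journal</value>
</property>
```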
"Meaning, the NNs never read/write anything directly to disk (except for configuration, I assume) but rather ALL goes through the JNs."
The NameNodes still checkpoint. The JournalNodes only write an edit log (similar to a transaction log in a database). The fsimage (which is essentially a replica of the in-memory store) is still written to disk regularly by the failover NameNode, which takes over the checkpointing job that the Secondary NameNode does in a non-HA setup.
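In other words, each NameNode still needs a local metadata directory where the checkpointed fsimage lands; only the shared edit log lives on the JournalNodes. A minimal sketch, with the path purely as an example:

```xml
<!-- hdfs-site.xml: local directory on each NameNode host where the fsimage
     (the checkpoint) is written; the path is only an example -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/hadoop/hdfs/namenode</value>
</property>
```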
"Q3 - I believe I still should have one dedicated disk for each JN on each host to isolate the unique work load of the NN for other processes. So, for example, there might be one disk for the OS, one for JNs, and another for the ZK instances that are sharing the same server to support the ZKFC. Correct?
Hmmm, good question. I have actually never heard of performance problems caused by JournalNode I/O. Not that it can hurt to separate them. But even assuming a huge cluster, the number of transactions per second should be well below the write speed of a modern disk or SSD. Perhaps someone else has some numbers.
"Q4 - Because JNs are distributed, it makes me think I should treat these disks like I do disks on the DNs, meaning no RAID, just plain old JBOD. Does that sound right?
As said, I would use RAID. It significantly reduces the chance of a JournalNode dying (which would then put you in danger of a second one dying before the first JN is fixed). It also doesn't seem to be a high cost. You do not use RAID for HDFS data disks because of the high cost (thousands of disks) and because HDFS fixes failed disks automatically by recreating block replicas on different nodes. You have to fix a JournalNode yourself. So RAID seems to be worth it.
"Q5 - Is it the NN on the failover server that actually does the JN reads and fsimage updates now in HA mode given that there is no SNN in such a configuration?"
Yes. The failover NameNode doesn't need to read any fsimage anymore; it already has a carbon copy in memory. So it writes a checkpoint regularly and distributes it to the active NameNode.
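How often the standby writes such a checkpoint is governed by the usual checkpoint settings in hdfs-site.xml; the values below are the commonly cited defaults and are shown only as an illustration:

```xml
<!-- hdfs-site.xml: checkpoint triggers (values shown are the usual defaults) -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>    <!-- checkpoint at least every hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>1000000</value> <!-- ...or after this many uncheckpointed transactions -->
</property>
```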
Architecture:
Some transaction numbers for huge clusters:
https://developer.yahoo.com/blogs/hadoop/scalability-hadoop-distributed-file-system-452.html
Created 02-12-2016 06:50 PM
Great answer! I'd just like to add a shameless plug for a blog I wrote about how the metadata files are managed on disk by the NameNodes and JournalNodes: http://hortonworks.com/blog/hdfs-metadata-directories-explained/ This might be interesting for anyone who'd like a deeper dive on the on-disk metadata and the configuration settings that control the particulars of the checkpointing process.
Created 02-12-2016 08:23 PM
+1 for shameless plugs. Nice article and thank you for it!
Created 02-12-2016 11:16 PM
That is great! Thanks Chris
Created 01-18-2018 04:24 AM
Can you please answer this post:
https://community.hortonworks.com/questions/162789/how-actually-namenode-ha-qjm-works.html
Created 02-12-2016 08:22 PM
@Benjamin Leonhardi, Nice writeup. Thank you for taking the time to be so thorough!!! All the links were very helpful, too. You read my mind on the Yahoo performance link - that was the next topic I was going to research. 🙂
A couple of follow-up clarification questions/comments...
Q1 - Chris' blog (thanks @Chris Nauroth) answered the remaining point: only the "edits in progress" changes need to be applied to the fsimage by the failover NN when it takes over; all the completed edits on the JNs should have already been applied.
Q2 - So, my focus was on the JNs writing to each of their own disks, and I completely missed the point that the NN needs some place to build the fsimage file. So, would you just point that to the same disk used by the JNs (assuming that I am going to colocate the JNs on the same hosts as the NNs)?
Q3 - I was thinking more about a disk failure, where a failed disk means a failed JN. So separate disks for each JN means more reliability. Do you recommend some other arrangement?
Q5 - "So he writes a checkpoint regularly and distributes it to the active namenode." You mean "distributes it" through the JN's in the normal HA manner of publishing edits, right, or am I missing something here?
Created 02-12-2016 11:30 PM
Q5: So the JournalNodes are for the transaction log. But there is still the fsimage, which is the completely built filesystem image; this is the "checkpoint". If a NameNode restarts, it would take forever rebuilding this image from the JournalNode transaction log. Instead it reads the last version of the fsimage it has on disk and then applies any transactions it is still missing. Writing this checkpoint takes time, so it is not done by the active NameNode. In the old non-HA setup, the Secondary NameNode would take the transaction log of the active NameNode, copy it over (scp or whatever), merge it with the last fsimage, write it out again, and then copy it back to the active NameNode. In the HA setting this is similar, but the failover NameNode already has a current version of the image in memory; it just needs to save it to disk and copy it over to the fsimage folder of the active NameNode once in a while.
Q3/Q2: Chris might have a better idea. I think it's clear that RAID is better for master nodes, to reduce any likelihood of failure in the first place. Or in other words: assuming you have only three master nodes and have to colocate a JN and a NameNode, I would rather have the NameNode and JournalNode point to the same RAIDed disk than to two unRAIDed ones. Regarding performance, I have seen problems with big NameNodes during the rebuild of the fsimage after a failure, but this was not due to disk performance; the bottleneck was the NameNode building up the in-memory hash map.
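To make that concrete, on a master that hosts both a NameNode and a JournalNode you could point both metadata directories at the same RAIDed mount; the mount point below is purely an example:

```xml
<!-- hdfs-site.xml on a colocated master (the mount point /data/raid1 is an example) -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/raid1/hdfs/namenode</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/raid1/hdfs/journal</value>
</property>
```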
Created 02-13-2016 06:53 PM
Thanks for the clarifications. All makes sense now.