
Understanding checkpointing with NameNode HA

Expert Contributor

I am trying to put together a hardware specification for NameNodes running in HA mode, which forced me to think about disk allocation for the NameNodes. I pretty much get it for non-HA: use one RAID volume plus an NFS mount for redundancy, the SNN incrementally applies changes from the edit log to the fsimage, etc. But I want to run HA, and I want to use Journal Nodes (JNs) and the Quorum Journal Manager (QJM) approach. That made me think about the following scenario, and I was not sure I was getting it right, so I wanted to ask some gurus for input. Here's what I think... can you please confirm or correct? I think a scenario-type question will make it easier to ask, so here goes.

Assume a clean install. The primary and failover NNs both have empty fsimage files. The primary starts running and writes changes to all three JNs. As I understand it, the failover NN will be reading all those changes, via the JNs, and applying them to its empty fsimage to keep it 100% complete should it be called on to take over (faster startup time).

Now the primary fails. The failover NN starts up, reads in the fsimage file, and starts accepting client requests as normal. It now starts to write edits to the JNs. But the formerly primary NN is still down, so it is NOT reading updates from the JNs. So its fsimage remains empty, essentially.

Next, I fix the formerly primary NN and start it up. It now becomes the failover NN. At this point, I guess it starts reading changes from the JNs and building up its empty fsimage with all changes to date, in hopes that it will once again rule the world and become active should the other NN fail some day.

Q1 - Is it true that the failover NN will NEVER have to apply any edit log changes at start up but simply loads its fsimage and starts running because it assumes fsimage is already 100% up to date via recent JN reads?

Q2 - In a setup with 3 JNs as a quorum, what should the disk layout look like on the three servers hosting those JNs? Because the edits are now distributed x3, should I just have a single disk per JN host dedicated to the JNs? No need for the one RAID and second NFS type arrangement used in non-HA mode? Specifically, the disk resources typically used for non-HA NN, where the NN writes edit log changes, now become disk resources used exclusively by the JNs, right? Meaning, the NNs never read/write anything directly to disk (except for configuration, I assume) but rather ALL goes through the JNs.

Q3 - I believe I should still have one dedicated disk for each JN on each host, to isolate that unique workload from other processes. So, for example, there might be one disk for the OS, one for the JN, and another for the ZK instance that is sharing the same server to support the ZKFC. Correct?

Q4 - Because JNs are distributed, it makes me think I should treat these disks like I do disks on the DNs, meaning no RAID, just plain old JBOD. Does that sound right?

Q5 - Is it the NN on the failover server that actually does the JN reads and fsimage updates now in HA mode given that there is no SNN in such a configuration?

Thanks in advance for confirmation or any insight on this...

1 ACCEPTED SOLUTION

Master Guru

"As I understand it, the failover NN will be reading all those changes, via the JNs,

That is true for file system changes. For block reports etc., the DataNodes communicate directly with both NameNodes; they essentially duplicate every message to both instances. That's the reason the two NameNodes have an almost identical in-memory image.
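
For reference, the DataNodes know about both NameNodes because of the HA nameservice definition in hdfs-site.xml. A minimal sketch; the nameservice name, NameNode IDs, and hostnames below are placeholders, not values from this thread:

```xml
<configuration>
  <!-- Logical nameservice and the two NameNode IDs ("mycluster", "nn1", "nn2"
       and the hostnames are illustrative placeholders). -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <!-- Because both NameNodes are listed, every DataNode heartbeats and sends
       block reports to both the active and the standby. -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
</configuration>
```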

"Now the primary fails. The failover NN starts up and reads in the fsimage file and starts accepting client requests as normal. It now starts to write edits to the JNs. But the formally primary NN is still down so it is NOT reading updates from the JNs. So, it's fsimage remains empty, essentially."

The failover NN continuously reads the JournalNode changes, so it has an almost-current copy of the fsimage in memory, just like the formerly active NameNode.
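
Both NameNodes point at the same JournalNode quorum through a single qjournal URI, which the active writes to and the standby tails. A sketch with placeholder hostnames (8485 is the default JournalNode port):

```xml
<configuration>
  <!-- One logical edits directory backed by all three JournalNodes. The active
       NN writes each edit to a quorum (2 of 3); the standby continuously tails
       the same stream to keep its in-memory image current. -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
  </property>
</configuration>
```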

"Q1 - Is it true that the failover NN will NEVER have to apply any edit log changes at start up but simply loads its fsimage and starts running because it assumes fsimage is already 100% up to date via recent JN reads?"

As written above, the failover NN does not start up: it is already running in parallel and has an almost identical in-memory image to the active NameNode. So when it takes over, it's practically instantaneous; it just has to read the few changes from the JournalNodes it hasn't yet applied.
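
For the takeover to happen automatically (via the ZKFCs the original question mentions), automatic failover and the ZooKeeper quorum have to be configured as well. A sketch with placeholder hostnames; note the quorum property lives in core-site.xml:

```xml
<!-- hdfs-site.xml: let the ZKFCs drive failover instead of manual haadmin commands. -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: the ZooKeeper ensemble the ZKFCs use for leader election. -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```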

"Q2 - In a setup with 3 JNs as a quorum, what should the disk layout look like on the three servers hosting those JNs? Because the edits are now distributed x3, should I just have a single disk per JN host dedicated to the JNs? No need for the one RAID and second NFS type arrangement used in non-HA mode? Specifically, the disk resources typically used for non-HA NN, where the NN writes edit log changes, now become disk resources used exclusively by the JNs, right?

If possible, the JournalNodes, like the NameNodes, should have RAIDed data disks. It just reduces the chance that a JournalNode will die. In contrast to HDFS data disks, the volumes are not huge and the cost is low. You can, however, colocate them with the NameNodes, since they are pretty lightweight. No need for NFS, though.
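
Each JournalNode writes its copy of the edits to a single local directory, so the per-host layout boils down to where you point that directory; the path below is just an example:

```xml
<configuration>
  <!-- Local directory on each JournalNode host for its copy of the edits.
       Point it at the RAIDed mount; /data/hadoop/journal is a placeholder path. -->
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/data/hadoop/journal</value>
  </property>
</configuration>
```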

"Meaning, the NNs never read/write anything directly to disk (except for configuration, I assume) but rather ALL goes through the JNs."

The NameNodes still checkpoint. The JournalNodes only store the edit log (similar to a transaction log in a database). The fsimage (which is essentially a replica of the in-memory store) is still written to disk regularly by the failover NameNode, which takes over the checkpointing job that the SNN does in a non-HA setup.
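
So each NameNode still needs a local metadata directory of its own for the fsimage, separate from the JournalNode directory. A sketch with a placeholder path:

```xml
<configuration>
  <!-- Local fsimage/edits directory on each NameNode host; this is where the
       standby's periodic checkpoints land. A comma-separated list of mounts can
       be given for extra redundancy. /data/hadoop/namenode is a placeholder path. -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hadoop/namenode</value>
  </property>
</configuration>
```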

"Q3 - I believe I still should have one dedicated disk for each JN on each host to isolate the unique work load of the NN for other processes. So, for example, there might be one disk for the OS, one for JNs, and another for the ZK instances that are sharing the same server to support the ZKFC. Correct?

Hmmm, good question. I have actually never heard of performance problems caused by JournalNode IO. Not that it would hurt to separate them. But even assuming a huge cluster, the number of transactions per second should be well below the write speed of a modern disk or SSD. Perhaps someone else has some numbers.

"Q4 - Because JNs are distributed, it makes me think I should treat these disks like I do disks on the DNs, meaning no RAID, just plain old JBOD. Does that sound right?

As said, I would use RAID. It significantly reduces the chance of a JournalNode dying (which would put you in danger should a second one die before the first JN is fixed). It also doesn't seem to be a high cost. You do not use RAID for HDFS data disks because of the high cost (thousands of disks) and because HDFS fixes failed disks automatically by re-creating block replicas on different nodes. You have to fix a JournalNode yourself, so RAID seems to be worth it.

"Q5 - Is it the NN on the failover server that actually does the JN reads and fsimage updates now in HA mode given that there is no SNN in such a configuration?"

Yes. The failover NameNode doesn't need to read any fsimage anymore; it already has a carbon copy in memory. So it writes a checkpoint regularly and distributes it to the active NameNode.
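
That checkpoint is shipped to the active NameNode over HTTP, and the transfer can be throttled if it ever competes with other traffic. An optional tuning sketch; as far as I know the default of 0 means unthrottled, and the 10 MB/s value below is purely illustrative:

```xml
<configuration>
  <!-- Cap (in bytes/sec) on the bandwidth used when the standby ships a new
       fsimage checkpoint to the active NameNode. 0 disables throttling;
       10485760 (10 MB/s) is just an example value. -->
  <property>
    <name>dfs.image.transfer.bandwidthPerSec</name>
    <value>10485760</value>
  </property>
</configuration>
```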

Architecture:

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.ht...

Some transaction numbers for huge clusters:

https://developer.yahoo.com/blogs/hadoop/scalability-hadoop-distributed-file-system-452.html





@Benjamin Leonhardi, great answer! I'd just like to add a shameless plug for a blog I wrote about how the metadata files are managed on disk by the NameNodes and JournalNodes: http://hortonworks.com/blog/hdfs-metadata-directories-explained/ This might be interesting for anyone who'd like a deeper dive on the on-disk metadata and the configuration settings that control the particulars of the checkpointing process.

Expert Contributor

+1 for shameless plugs. Nice article and thank you for it!

Master Guru

That is great! Thanks Chris

Expert Contributor

@Benjamin Leonhardi, Nice writeup. Thank you for taking the time to be so thorough!!! All the links were very helpful, too. You read my mind on the Yahoo performance link - that was the next topic I was going to research. 🙂

A couple of follow-up clarification questions/comments...

Q1 - Chris' blog (thanks @Chris Nauroth) answered the remaining point. Only the "edits in progress" changes need to be applied to the fsimage by the failover NN when it takes over, all the completed edits on the JNs should have already been applied.

Q2 - So, my focus was on the JNs writing to their own disks, and I completely missed the point that the NN needs some place to build the fsimage file. So, would you just point that at the same disk used by the JNs (assuming that I am going to colocate the JNs on the same hosts as the NNs)?

Q3 - I was thinking more about a disk failure, where a failed disk means a failed JN. So separate disks for each JN mean more reliability. Do you recommend some other arrangement?

Q5 - "So he writes a checkpoint regularly and distributes it to the active namenode." You mean "distributes it" through the JN's in the normal HA manner of publishing edits, right, or am I missing something here?

Master Guru

Q5: The JournalNodes only hold the transaction log. But there is still the fsimage, which is the fully built filesystem image; this is the "checkpoint". If a NameNode restarted and had to rebuild that image purely from the JournalNode transaction log, it would take forever. Instead it reads the last version of the fsimage it has on disk and then applies whatever transactions it is still missing. Writing the fsimage takes time, so it is not done by the active NameNode. In the old non-HA setup, the secondary NameNode would take the transaction log of the active NameNode, copy it over (scp or whatever), merge it with the last fsimage, write it out again, and then copy it back to the active NameNode. In the HA setup this is similar, but the failover NameNode already has a current version of the image in memory; it just needs to save it to disk and copy it over to the fsimage folder of the active NameNode once in a while.
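
For reference, how often the failover NameNode writes that fsimage is controlled by the usual checkpoint triggers in hdfs-site.xml; the values below are, as far as I know, the stock defaults (one hour, or one million uncheckpointed transactions, whichever comes first):

```xml
<configuration>
  <!-- The standby/failover NameNode checkpoints when either threshold is hit. -->
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>      <!-- seconds between checkpoints -->
  </property>
  <property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>   <!-- or after this many uncheckpointed transactions -->
  </property>
</configuration>
```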

Q3/Q2: Chris might have a better idea. I think it's clear that RAID is better for master nodes, to reduce the likelihood of failure in the first place. Or in other words, assuming you have only three master nodes and have to colocate a JN and a NameNode, I would rather have the NameNode and JournalNode point to the same RAIDed disk than to two un-RAIDed ones. Regarding performance, I have seen problems with big NameNodes during the rebuild of the fsimage after a failure, but that was not due to disk performance; the bottleneck was in the NameNode memory, building up the hashmap.
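
Concretely, "point them to the same RAIDed disk" just means both directories live on that mount; a minimal sketch, with the /raid1/... locations as placeholder paths:

```xml
<configuration>
  <!-- Both metadata directories on the same RAID-1 mount (placeholder paths). -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/raid1/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/raid1/hadoop/journal</value>
  </property>
</configuration>
```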

Expert Contributor

Thanks for the clarifications. All makes sense now.