Contributor
Posts: 27
Registered: ‎07-19-2016

Should Journal nodes and Zookeeper nodes be on same host as the Namenodes in HA setup?


 

TL;DR:

Should Journal nodes and Zookeeper nodes be on same host as the Namenodes in HA setup?

The point is that losing a NN+ZK+JN node will leave only two ZKs and two JNs in the cluster.

Are the remaining two ZKs and JNs enough for the promotion of the standby NN to active?

 

 

Long version:

We have a HA cluster that simplified looks like this:

master1: NN+JN+ZK

master2: NN+JN+ZK

mgmt1: CM+JN+ZK

 

Due to maintenance, all three nodes lost connectivity to each other, which caused the following to happen:

 

Both Failover Controllers got timeouts from 2 of the 3 Zookeepers (a majority) and shut down.

The active Namenode shut down because it timed out while waiting for a quorum of Journal nodes to respond (only the local one did).

Since the Failover Controllers were down, the standby NN never became active (it also got timeouts from a majority of the JNs, by the way).

The Zookeepers threw a generic error which seems to mean that there is only one ZK, that there is an even number of ZKs, or that it can't communicate with the other ZKs.
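
To make the quorum arithmetic behind these timeouts concrete, here is a rough sketch of the strict-majority rule that both the ZK ensemble and the JN set rely on. The function name `has_quorum` and the constant are made up for illustration; this is not Hadoop or ZooKeeper code, just the counting logic.

```python
def has_quorum(reachable: int, ensemble_size: int) -> bool:
    """Strict-majority rule: more than half of the ensemble must be reachable."""
    return reachable > ensemble_size // 2


# Assumption for this sketch: master1, master2 and mgmt1 each run one JN and one ZK.
ENSEMBLE_SIZE = 3

# Full partition: every host can reach only its own local JN/ZK.
fully_partitioned = has_quorum(reachable=1, ensemble_size=ENSEMBLE_SIZE)

# One host lost: the two surviving hosts can still reach each other.
one_host_lost = has_quorum(reachable=2, ensemble_size=ENSEMBLE_SIZE)

print(fully_partitioned, one_host_lost)
```

With all three hosts isolated, each side counts 1 of 3 and no quorum exists anywhere, which matches the shutdowns described above. With only one host gone, the remaining 2 of 3 still form a majority.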

 

1. Would it be correct to say that having all three nodes lose connection to each other is not a scenario in which HA failover can occur?

 

2. Does the Failover Controller need to be able to reach all three JNs in order to trigger a failover? I am trying to figure out whether moving the JNs and the ZKs to different hosts than the ones running the NNs would have helped.

 

3. Would it make sense to spread the two masters and the mgmt node across different datacenters in order to mitigate the possibility of losing all of them at once if a datacenter goes down?

Expert Contributor
Posts: 152
Registered: ‎07-01-2015

Re: Should Journal nodes and Zookeeper nodes be on same host as the Namenodes in HA setup?

I am not an expert, but I am fairly sure of the answer to the 1st question.
HA does NOT handle a loss of connectivity between all of the nodes in the cluster. That of course brings down the services. HA only handles the outage (loss of connectivity) of ONE server against the rest.

This is my opinion..
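
For what it's worth, the "ONE server" limit follows from majority quorums in general. A small sketch, assuming the usual strict-majority rule; the helper name `tolerated_failures` is hypothetical and not part of any Hadoop or ZooKeeper API:

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Largest number of members that can be lost while a strict majority remains."""
    return (ensemble_size - 1) // 2


# Compare a few ensemble sizes; 3 matches the cluster in the question.
for size in (3, 4, 5):
    print(f"{size} members -> survives {tolerated_failures(size)} failure(s)")
```

A 3-member ensemble survives one loss, a 4-member ensemble still survives only one, and it takes 5 members to survive two, which is why odd ensemble sizes are normally used.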