I know that the MapReduce JobHistory server does not support HA by itself (this has been asked before here and a related question here), but I would like comments on a hackish way to make it HA-like. We are looking for both replication of state and automatic failover. The background is that we are trying to make every service running on our cluster highly available, with the goal of being completely tolerant of single-machine failures. At the moment, we know that at least Oozie needs the JobHistory server to be responding in order to run jobs.
Our idea is to run two instances in parallel and put a load balancer in front. Clients would access the service using the load balancer address. The load balancer would only let traffic through to one of the nodes at any given time. When it determines that the server it currently forwards to does not respond, it switches to the other server. The recovery state would be kept in either the leveldb store or the filesystem store, with the underlying files on an NFS disk so that they are accessible from both servers.
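To make the intended load balancer behaviour concrete, here is a minimal sketch of the "sticky" active/passive selection described above. It is only an illustration, not the configuration of any real load balancer: the function names `tcp_alive` and `choose_active` and the host names are hypothetical, and the health check assumes the JobHistory IPC port is the default 10020 (`mapreduce.jobhistory.address`).

```python
import socket
from typing import Callable, Optional, Sequence


def tcp_alive(host: str, port: int = 10020, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 10020 is assumed to be the JobHistory IPC port (the default).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_active(
    backends: Sequence[str],
    current: Optional[str],
    is_alive: Callable[[str], bool] = tcp_alive,
) -> Optional[str]:
    """Pick the single backend that should receive all traffic.

    The current backend is kept for as long as it is healthy, so traffic
    never flips back and forth between two healthy servers. Only when the
    current backend fails the check is another healthy one selected.
    Returns None if no backend is healthy.
    """
    if current is not None and is_alive(current):
        return current
    for host in backends:
        if host != current and is_alive(host):
            return host
    return None
```

A monitoring loop would call `choose_active` periodically, for example `active = choose_active(["jhs1.example.com", "jhs2.example.com"], active)`, and repoint the load balancer whenever the result changes.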
However, the history server doesn't only respond to requests; it also does scheduled housekeeping work on shared state (in HDFS, moving jobhistory files). So it may not be a good idea to have two servers running at the same time, even if requests are only directed at one of them at a time.
A slight modification of the idea would be to not even have a second server running until the first one stops responding. When the load balancer detects that the current server has stopped responding, a trigger would deploy and start a new instance (perhaps via scripted calls to the Ambari REST API). The load balancer would then direct traffic to this new instance.
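As a rough sketch of what such a trigger might send, the snippet below builds (but does not send) a request asking Ambari to start the history server component on a standby host. It rests on several assumptions: that the component is already installed on that host (installing it on a fresh host would need additional calls), that the Ambari component name is `HISTORYSERVER`, and that basic authentication is in use. The function name `build_start_request` and all host, cluster and credential values are hypothetical.

```python
import base64
import json
import urllib.request


def build_start_request(
    ambari_url: str,
    cluster: str,
    host: str,
    user: str,
    password: str,
    component: str = "HISTORYSERVER",  # assumed Ambari component name
) -> urllib.request.Request:
    """Build a PUT request that asks Ambari to move a host component to STARTED."""
    url = (
        f"{ambari_url}/api/v1/clusters/{cluster}"
        f"/hosts/{host}/host_components/{component}"
    )
    body = {
        "RequestInfo": {"context": "Start JobHistory server after failover"},
        "Body": {"HostRoles": {"state": "STARTED"}},
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            # Ambari rejects modifying requests that lack this header.
            "X-Requested-By": "ambari",
        },
    )
```

The trigger would then pass the result to `urllib.request.urlopen(...)`. Ambari handles such a request asynchronously, so the script would still need to wait until the new instance actually responds before the load balancer switches over.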
Does anyone have any comments about this idea or know if others have solved this problem?
Perhaps not that relevant to this question, but we are running HDP 2.4.3 at the moment.