I'm currently thinking about how to design a recovery strategy for my Hadoop cluster in case entire (bare-metal) nodes go down. The idea is to install a new node using the old hostname and IP and use that node to replace the failed one. The missing piece is how to tell Ambari to re-install all Hadoop components on a replaced node (with the same hostname and IP as the old node).
For slave nodes, the approach seems to be to remove and re-add all components and trigger an install thereafter (Source).
Is there any recommendation or guideline on how to re-install all components on a failed MASTER node (i.e. a node hosting NameNode, ResourceManager, HBase Master, ...)?
I would think that every Hadoop admin faces these problems, but I can't find any docs on this.
Best regards, Roland
I did not find any documentation for your requirement. From what I understand, though, you need an API that returns all host components installed on the host. If that is what you want, you may use the API:
This will give you a JSON document listing the host components installed on the host. You may parse it using Python or, if you are using a shell script, run:
grep -o '"component_name" : [^, }]*' | sed 's/^.*: //' | tr -d '"'
on the output.
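For the Python route mentioned above, a minimal sketch could look like the following. It mirrors the grep pipeline by collecting every "component_name" value wherever it appears in the response, so it does not depend on the exact nesting of the JSON. The function name component_names and the sample payload are made up for illustration; the sample only assumes that the response contains "component_name" keys, as the grep pattern implies.

```python
import json


def component_names(payload):
    """Return all "component_name" values found anywhere in parsed JSON.

    Hypothetical helper: it walks the whole structure instead of relying
    on a specific response layout. Duplicates are dropped, order is kept.
    """
    found = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "component_name" and isinstance(value, str):
                    found.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(payload)
    # dict.fromkeys de-duplicates while preserving first-seen order
    return list(dict.fromkeys(found))


if __name__ == "__main__":
    # Assumed sample response, shortened for illustration only
    sample = """
    {"items": [
        {"HostRoles": {"component_name": "NAMENODE"}},
        {"HostRoles": {"component_name": "RESOURCEMANAGER"}}
    ]}
    """
    for name in component_names(json.loads(sample)):
        print(name)
```

Unlike the grep pipeline, this drops repeated names, which is usually what you want when the list is fed into follow-up API calls.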
Hope this helps!
Hi @sbhat, thanks for your answer and the command to parse host components.
I believe the problem is to transition components into the "INSTALLED" state when they are already installed from the viewpoint of Ambari and therefore in a "STOPPED" state. This transition is not possible as far as I know.
In Ambari, INSTALLED and STOPPED mean the same thing. Can you please elaborate on which transition you are referring to?
So, I have a master node that was installed by Ambari and has all components in the INSTALLED/STOPPED state.
This node has a hardware failure and I re-install it from scratch, so it has no Hadoop components (i.e. packages, ...) on it. Because I installed it with the same hostname as the failed node, the host, however, is still in the INSTALLED state from the viewpoint of Ambari.
To re-install all components, I would now need to transition them to INSTALLED, but they are all already in exactly this state. Therefore, Ambari does not install the components.
@Roland Simonis and @Gerd Koenig Just a question related to this topic: if we have an HA setup for the NameNode (one active and one standby NameNode) and one of the NameNodes is gone (deleted or corrupted), and I recreate the machine with the same IP and FQDN, is Ambari intelligent enough to recognize that the newly created node is the other highly available node? Or do I still have to do some configuration in Ambari? Any ideas are welcome. Thanks.
Hi @Roland Simonis ,
usually all the master components you mentioned are set up in an HA fashion, hence you can just install a new server (no need for it to have the same name/IP), add it in Ambari and assign roles to it.
In your situation, was it your only master node that failed, or do you still have other master nodes up and running and just want to get one additional master node back in business? I guess the easiest approach would be to re-install your new node with a different name/IP and add it to the cluster, followed by assigning roles to it.
Hi @Gerd Koenig, you are right: all services (except for the History Server) are set up as HA, so most services are still running if a master goes down.
I see that adding a new node to Ambari and moving / assigning the roles to it might be a solution. This would, however, require adjusting the hostname in all connected systems. I doubt that the other departments would be okay with that.
The same approach does not work with the same hostnames, does it?
Hi @Roland Simonis ,
there shouldn't be too much to re-configure, e.g. communication with HDFS in NameNode HA goes through the logical "nameservice", not by talking to a NameNode server name directly.
To get rid of your failed server in Ambari, you can try the following by using Ambari REST API:
* get a list of the components assigned to your failed host => see first reply for the Ambari REST API call
* delete component by component, e.g. to delete the RESOURCEMANAGER component
curl -u admin:<admin-pw> -X DELETE -ik -H 'X-Requested-By: ambari' https://<ambari-server>:<port>/api/v1/clusters/<clustername>/hosts/<FQDN-of-failed-host>/host_compon...
Once your failed node no longer has any components assigned, you can delete that host in Ambari (under the "Hosts" tab) and add it again later on.
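The component-by-component deletion described above can be sketched in Python as follows. This is only a sketch under assumptions: the curl URL in this post is cut off, so the trailing path segment .../host_components/<COMPONENT> used here is my reading of the Ambari REST API layout and not something taken from the truncated command. The function name build_delete_requests, the example host names and the credentials are hypothetical. The function only builds the requests; nothing is sent until you call urlopen yourself.

```python
import base64
import urllib.parse
import urllib.request


def build_delete_requests(base_url, cluster, host, components, user, password):
    """Build one DELETE request per host component (hypothetical helper).

    Assumption: the resource path ends in host_components/<COMPONENT>;
    the original curl command is truncated, so verify this against your
    Ambari version before sending anything.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    requests = []
    for component in components:
        url = "{}/api/v1/clusters/{}/hosts/{}/host_components/{}".format(
            base_url.rstrip("/"),
            urllib.parse.quote(cluster, safe=""),
            urllib.parse.quote(host, safe=""),
            urllib.parse.quote(component, safe=""),
        )
        req = urllib.request.Request(url, method="DELETE")
        # Same headers as in the curl call: X-Requested-By plus basic auth
        req.add_header("X-Requested-By", "ambari")
        req.add_header("Authorization", "Basic " + token)
        requests.append(req)
    return requests


if __name__ == "__main__":
    # Placeholder values, for illustration only
    reqs = build_delete_requests(
        "https://ambari.example.com:8443", "mycluster",
        "master1.example.com", ["RESOURCEMANAGER"], "admin", "secret",
    )
    for req in reqs:
        # Dry run: print what would be sent instead of sending it.
        # To execute for real, call urllib.request.urlopen(req).
        print(req.get_method(), req.full_url)
```

Printing the requests first gives you a dry run to review before anything is removed from Ambari. Note that the -k flag of the curl call (skip TLS certificate verification) is deliberately not reproduced here.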