Created 10-15-2016 08:39 AM
What's your cluster setup like? Are you using YARN? Is it set up to run in HA? http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
If so, you can rely on YARN to keep your application running in HA and to handle all the resource allocation for you. That's the beauty of using YARN to run your Hadoop applications.
All you need to do is tell Spark to run your applications against YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
Thanks Matthieu for the response. I am running Spark applications in YARN deploy mode, on a NameNode that is already configured for HA with ZooKeeper. What I am trying to understand is what happens to Spark jobs when the NameNode goes down or switches to standby. Do I need to tell ZooKeeper to take care of the Spark applications on failover (NameNode active, secondary standby)? Whenever I search, the results cover HA in standalone mode, but I am running in YARN deploy mode. Is my understanding right, or am I missing something? Do I need to configure any Spark parameters for ZooKeeper?
Do you mean that all applications submitted on the active NameNode (machine IP1, configured for HA through ZooKeeper with IP2) will, on failover to the standby (machine IP2), automatically have their Spark jobs/processes/applications migrated or restarted on IP2, with no Spark-specific configuration?
Having your NameNode running in HA is not enough; your Resource Manager (which handles YARN management) also needs to be configured in HA (cf. the first link in my answer above).
Having your NameNode in HA lets you keep access to HDFS if the active NameNode fails. However, it doesn't handle resource allocation for applications; that's what the Resource Manager (YARN) is for.
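To make the Resource Manager HA part a bit more concrete, here is a minimal sketch in Python that builds the yarn-site.xml properties described in the ResourceManagerHA page linked above. The helper names (`rm_ha_properties`, `to_configuration_xml`), the cluster id and the ZooKeeper quorum string are assumptions for illustration only; IP1 and IP2 stand in for your two machines. Check the property names against the documentation for your Hadoop version before using them.

```python
import xml.etree.ElementTree as ET


def rm_ha_properties(cluster_id, rm_hosts, zk_quorum):
    """Build a dict of yarn-site.xml properties for Resource Manager HA.

    Hypothetical helper: it only assembles property names and values,
    it does not talk to a cluster.
    """
    # Logical ids for the Resource Managers: rm1, rm2, ...
    rm_ids = [f"rm{i + 1}" for i in range(len(rm_hosts))]
    props = {
        "yarn.resourcemanager.ha.enabled": "true",
        "yarn.resourcemanager.cluster-id": cluster_id,
        "yarn.resourcemanager.ha.rm-ids": ",".join(rm_ids),
        # Hadoop 2.x name; newer releases prefer hadoop.zk.address.
        "yarn.resourcemanager.zk-address": zk_quorum,
    }
    # One hostname entry per logical Resource Manager id.
    for rm_id, host in zip(rm_ids, rm_hosts):
        props[f"yarn.resourcemanager.hostname.{rm_id}"] = host
    return props


def to_configuration_xml(props):
    """Render the properties as a <configuration> fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed values: replace with your own hosts and ZooKeeper quorum.
    props = rm_ha_properties("my-cluster", ["IP1", "IP2"], "zk1:2181,zk2:2181,zk3:2181")
    print(to_configuration_xml(props))
```

Note that nothing here is Spark-specific: the failover is configured on the YARN side, and Spark only needs to be submitted with YARN as its master.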
Let's go through a few failure scenarios:
1. The Resource Manager fails, but the containers (application master + slaves + driver) linked to that particular application are unaffected.
=> Your application continues to run as if nothing happened. You won't be able to submit new apps until the Resource Manager is back up or the standby has been promoted to active (in case of HA).
2. One of the slave containers fails.
=> The Application Master spawns a new container to take over. The Spark task it was handling might fail, but it will be replayed.
3. The Application Master container fails.
=> The application fails but will be re-spawned by the Resource Manager, as long as the configured maximum number of application attempts has not been reached.
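Scenarios 2 and 3 are bounded by retry settings you can pass at submit time. Below is a minimal sketch, assuming cluster deploy mode, that assembles a `spark-submit` command line with `spark.yarn.maxAppAttempts` (how many times YARN will re-attempt the application after an Application Master failure) and `spark.task.maxFailures` (how many times a single task may fail before the job is given up). The function name, jar name and main class are hypothetical, and the values shown are examples, not recommendations.

```python
import shlex


def spark_submit_command(app_jar, main_class, max_app_attempts=2, task_max_failures=4):
    """Return the argv list for a spark-submit call against YARN.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    if max_app_attempts < 1 or task_max_failures < 1:
        raise ValueError("retry limits must be at least 1")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
        # Scenario 3: upper bound on application (Application Master) attempts.
        "--conf", f"spark.yarn.maxAppAttempts={max_app_attempts}",
        # Scenario 2: upper bound on failures of any single task.
        "--conf", f"spark.task.maxFailures={task_max_failures}",
        app_jar,
    ]


if __name__ == "__main__":
    # Assumed jar and class names, for illustration only.
    cmd = spark_submit_command("my-app.jar", "com.example.MyApp", max_app_attempts=3)
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keep in mind that `spark.yarn.maxAppAttempts` cannot exceed the cluster-wide `yarn.resourcemanager.am.max-attempts` setting in YARN, and that in cluster mode the driver runs inside the Application Master, so a new attempt restarts the application from the beginning.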