
Hadoop cluster failover

Rising Star

Hi All,

I'm interested in any papers or experience reports regarding Hadoop cluster failover design.

From my understanding, the two main components concerned are the NameNode and the ResourceManager.

Thanks

Regards

Farhad

1 ACCEPTED SOLUTION

Super Guru
@Farhad Heybati

There are a number of components in Hadoop and its ecosystem. Each of them has its own high-availability/failover strategy, and a failure of each has different implications. You already mentioned the NameNode and the YARN ResourceManager, but there are others: HiveServer2, the Hive Metastore, the HMaster if you are using HBase, and more. Each has its own documentation available on the Hortonworks website.

1. The NameNode, also known as the master node, is the linchpin of Hadoop. If the NameNode fails, your cluster is effectively lost. To avoid this scenario, you must configure a standby NameNode. Instructions to set up NameNode HA can be found here.
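As a rough sketch, NameNode HA with the Quorum Journal Manager boils down to a handful of hdfs-site.xml properties like the ones below. The nameservice ID mycluster, the NameNode IDs nn1/nn2, and all hostnames are placeholders, not values specific to your cluster:

```xml
<!-- hdfs-site.xml: minimal NameNode HA sketch using the Quorum Journal Manager.
     "mycluster", "nn1"/"nn2", and all hostnames are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<!-- Both NameNodes share their edit log through a quorum of JournalNodes -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<!-- Clients find the active NameNode through this proxy provider -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Automatic failover via the ZKFailoverController; a fencing method is
     mandatory, and a no-op shell fence is common when the edits live in QJM -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>
```

You also point ha.zookeeper.quorum in core-site.xml at your ZooKeeper ensemble so the ZKFailoverControllers can elect the active NameNode.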

2. YARN ResourceManager. YARN manages your cluster's resources: how much memory and CPU each job/application gets is allocated through YARN, so it's pretty important. While YARN also has the concepts of ApplicationMaster, NodeManager, and container, the real single point of failure is the ResourceManager, so you need HA for the ResourceManager. Check these two links: link 1 and link 2.
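The ResourceManager side looks similar in spirit. A minimal yarn-site.xml sketch, where the cluster ID yarn-cluster, the RM IDs rm1/rm2, and the hostnames are again placeholders:

```xml
<!-- yarn-site.xml: minimal ResourceManager HA sketch.
     "yarn-cluster", "rm1"/"rm2", and hostnames are placeholders. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>rm1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>rm2.example.com</value>
</property>
<!-- ZooKeeper ensemble used for leader election between the two RMs -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```

With state-store recovery enabled on top of this (yarn.resourcemanager.recovery.enabled plus the ZooKeeper-based RM state store), applications can survive an RM failover instead of having to be resubmitted.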

3. HiveServer2. What if you are using Hive (SQL) to query structured data in Hadoop? Assume you have multiple concurrent jobs running, or ad hoc users running their queries, all connecting to Hive through HiveServer2. What if HiveServer2 goes down? You need redundancy for that too. Here is how you do it.
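The usual pattern is to run two or more HiveServer2 instances and have them register in ZooKeeper via dynamic service discovery. Roughly, in hive-site.xml (the ZooKeeper hostnames are placeholders):

```xml
<!-- hive-site.xml: HiveServer2 dynamic service discovery sketch.
     ZooKeeper hostnames are placeholders. -->
<property>
  <name>hive.server2.support.dynamic.service.discovery</name>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
<property>
  <name>hive.server2.zookeeper.namespace</name>
  <value>hiveserver2</value>
</property>
```

Clients then connect through ZooKeeper instead of a fixed host, e.g. jdbc:hive2://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2, and get routed to a live HiveServer2 instance.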

4. Are you using HBase? HBase has a component called the HMaster. While not quite as crucial as HiveServer2 or the ResourceManager, if the HMaster goes down you may see an impact, especially if a region server also goes down before you are able to bring the HMaster back up. So you should set up HA for the HMaster too. Check this link.
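HMaster HA is simpler than the others: all masters coordinate through the same ZooKeeper ensemble, so it mostly amounts to starting extra masters. A minimal sketch (hostnames are placeholders):

```xml
<!-- hbase-site.xml: all HMasters (active and backup) must point at the
     same ZooKeeper ensemble. Hostnames are placeholders. -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
```

If you list the standby hosts, one per line, in conf/backup-masters, the HBase start scripts launch backup HMasters on them; whichever master wins the ZooKeeper election becomes active, and a backup takes over automatically if it dies.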

I hope this helps. If you have any follow-up questions, please feel free to ask.


3 REPLIES


Explorer

Hi,

I have some questions about Hadoop cluster data node failover:

  • What happens if the link between the NameNode and a data node (or between two data nodes) goes down while the Hadoop cluster is processing data? Does the Hadoop cluster have anything out of the box (OOTB) to recover from this problem?
  • What happens if a data node goes down while the Hadoop cluster is processing data?

Also, another question about Hadoop cluster hardware configuration: let's say we will use our Hadoop cluster to process 100 GB of log files each day. How many data nodes do we need to set up, and what hardware configuration (e.g. CPU, RAM, hard disk) should each data node have?

Thank You

Hari