When the FSImage file is large (30 GB or more), contributing factors such as limited RPC bandwidth, network congestion, and long request queues can make the upload/download take a long time. This in turn can lead the ZooKeeper Failover Controller (ZKFC) to believe that the NameNode is not responding: it reports the SERVICE_NOT_RESPONDING state and then triggers a failover transition.
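
The SERVICE_NOT_RESPONDING transition comes from the ZKFC's HealthMonitor, whose health-check RPC to the NameNode has timed out. That timeout is governed by ha.health-monitor.rpc-timeout.ms in core-site.xml. The snippet below is only a sketch showing the stock default of 45000 ms (45 seconds), included for context rather than as a tuning recommendation:

<property>
   <name>ha.health-monitor.rpc-timeout.ms</name>
   <value>45000</value> <!-- ZKFC health-check RPC timeout; 45s is the default -->
</property>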


The logs display the following statements:
2017-09-04 05:02:26,017 INFO namenode.TransferFsImage (TransferFsImage.java:receiveFile(575)) - 
"Combined time for fsimage download and fsync to all disks took 237.14s. 
The fsimage download took 237.14s at 141130.21 KB/s. 
Synchronous (fsync) write to disk of /opt/hadoop/hdfs/namenode/image/current/fsimage.ckpt_0000000012106114957 
took 0.00s. Synchronous (fsync) write to disk of 
/var/hadoop/hdfs/namenode/image/current/fsimage.ckpt_0000000012106114957 took 0.00s..

2017-09-04 05:02:26,018 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - 
Remote journal 192.168.1.1:8485 failed to write txns 12106579989-12106579989. 
Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): 
IPC's epoch 778 is less than the last promised epoch 779
2017-09-04 05:02:26,019 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) - 
Transport-level exception trying to monitor health of namenode at nn1.test.com/192.168.1.2:8023: 
java.io.EOFException End of File Exception between local host is: "nn1.test.com/192.168.1.2"; 
destination host is: "nn1.test.com/192.168.1.2":8023; : java.io.EOFException; 
For more details see: http://wiki.apache.org/hadoop/EOFException 

2017-09-04 05:02:26,020 INFO ha.HealthMonitor (HealthMonitor.java:enterState(249)) - 
Entering state SERVICE_NOT_RESPONDING 

2017-09-04 05:02:26,021 INFO ha.ZKFailoverController (
ZKFailoverController.java:setLastHealthState(852)) - 
Local service NameNode at nn1.test.com/192.168.1.2:8023 Entered state: SERVICE_NOT_RESPONDING

If the contributing factors are not addressed and the FSImage continues to grow, such failovers become very frequent (three or more times a week).

Root Cause

This issue occurs in the following scenarios:

  1. The FSImage upload/download makes the disk or network so busy that request queues build up and the NameNode appears unresponsive.
  2. In overloaded clusters, the NameNode is too busy to process heartbeats and spuriously marks DataNodes as dead. This scenario also leads to spurious failovers.

Solution

To resolve this issue, do the following:

  1. Add image transfer throttling. Throttling uses less bandwidth for image transfers, so although a transfer takes longer, the NameNode remains more responsive throughout. Enable throttling by setting dfs.image.transfer.bandwidthPerSec in hdfs-site.xml; the value is in bytes per second (0, the default, disables throttling). The following example limits the transfer bandwidth to 50 MB/s (see also the timeout note after this list).
    <property>
       <name>dfs.image.transfer.bandwidthPerSec</name>
       <value>50000000</value>
    </property>
  2. Enable the DataNode lifeline protocol. This will reduce spurious failovers.
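
Related to step 1: if throttling makes a checkpoint transfer slow enough to approach the image transfer timeout, the timeout can be raised alongside the bandwidth limit. It is controlled by dfs.image.transfer.timeout in hdfs-site.xml (milliseconds; the default is 60000). The snippet below is only a sketch; the 600000 value is an illustrative assumption, and the timeout should be sized so that a throttled transfer of your FSImage can complete.

<property>
   <name>dfs.image.transfer.timeout</name>
   <value>600000</value> <!-- 10 minutes; illustrative value only -->
</property>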

The Lifeline protocol is a feature recently added by the Apache Hadoop Community (see Apache HDFS Jira HDFS-9239). It introduces a new lightweight RPC message that is used by the DataNodes to report their health to the NameNode. It was developed in response to problems seen in some overloaded clusters where the NameNode was too busy to process heartbeats and spuriously marked DataNodes as dead.

For a non-HA cluster, the feature can be enabled with the following configuration in hdfs-site.xml:

<property>
   <name>dfs.namenode.lifeline.rpc-address</name>
   <value>mynamenode.example.com:8050</value>
</property>

(Replace mynamenode.example.com with the hostname or IP address of your NameNode; a different port number can be used as well.)

For an HA cluster, the lifeline RPC address is configured per NameNode with the following setup, replacing mycluster, nn1, and nn2 with your nameservice and NameNode IDs.

<property>
   <name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
   <value>mynamenode1.example.com:8050</value>
</property>

<property>
   <name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
   <value>mynamenode2.example.com:8050</value>
</property>

Additional lifeline protocol settings are documented in the HDFS-9239 release note. However, these can be left at their default values for most clusters.
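
For example, the release note covers sizing of the dedicated lifeline RPC handler pool. The sketch below assumes the dfs.namenode.lifeline.handler.ratio property from HDFS-9239, which sets the lifeline handler pool as a fraction of dfs.namenode.handler.count (an absolute dfs.namenode.lifeline.handler.count override is also described); the 0.10 value shown is the documented default, not a recommendation:

<property>
   <name>dfs.namenode.lifeline.handler.ratio</name>
   <value>0.10</value> <!-- lifeline handlers as a fraction of dfs.namenode.handler.count -->
</property>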

Note: Changing the lifeline protocol settings requires a restart of the NameNodes, DataNodes, and ZooKeeper Failover Controllers to take full effect. If you have a NameNode HA setup, you can restart the NameNodes one at a time, followed by a rolling restart of the remaining components, to avoid cluster downtime.

For some amazing tips on scaling HDFS, refer to this 4-part guide.
