When the FSImage file is large (around 30 GB or more), other contributing factors such as limited RPC bandwidth, network congestion, and long request queues can make the image upload/download take a very long time. This in turn can lead the ZooKeeper Failover Controller to conclude that the NameNode is not responding: it reports the SERVICE_NOT_RESPONDING state and then triggers a failover transition.
2017-09-04 05:02:26,017 INFO namenode.TransferFsImage (TransferFsImage.java:receiveFile(575)) - Combined time for fsimage download and fsync to all disks took 237.14s. The fsimage download took 237.14s at 141130.21 KB/s. Synchronous (fsync) write to disk of /opt/hadoop/hdfs/namenode/image/current/fsimage.ckpt_0000000012106114957 took 0.00s. Synchronous (fsync) write to disk of /var/hadoop/hdfs/namenode/image/current/fsimage.ckpt_0000000012106114957 took 0.00s.
2017-09-04 05:02:26,018 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(388)) - Remote journal 192.168.1.1:8485 failed to write txns 12106579989-12106579989. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 778 is less than the last promised epoch 779
2017-09-04 05:02:26,019 WARN ha.HealthMonitor (HealthMonitor.java:doHealthChecks(211)) - Transport-level exception trying to monitor health of namenode at nn1.test.com/192.168.1.2:8023: java.io.EOFException: End of File Exception between local host is: "nn1.test.com/192.168.1.2"; destination host is: "nn1.test.com/192.168.1.2":8023; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
2017-09-04 05:02:26,020 INFO ha.HealthMonitor (HealthMonitor.java:enterState(249)) - Entering state SERVICE_NOT_RESPONDING
2017-09-04 05:02:26,021 INFO ha.ZKFailoverController (ZKFailoverController.java:setLastHealthState(852)) - Local service NameNode at nn1.test.com/192.168.1.2:8023 Entered state: SERVICE_NOT_RESPONDING
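As a rough sanity check, the figures reported in the download line above (and only those figures; nothing here is measured from another cluster) imply an image size of about

237.14 s x 141,130.21 KB/s ≈ 33.5 million KB, i.e. a little over 30 GB,

which confirms this is exactly the large-FSImage scenario described above: a single checkpoint transfer keeps the NameNode occupied for roughly four minutes.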
If the contributing factors are not addressed and the FSImage remains large, such failovers will become very frequent (three or more times a week).
Root Cause
This issue occurs when a large FSImage upload/download, combined with contributing factors such as limited RPC bandwidth, network congestion, or a long request queue, keeps the NameNode busy for so long that the ZooKeeper Failover Controller's health check fails, the NameNode is marked SERVICE_NOT_RESPONDING, and a failover is triggered.
Solution
To resolve this issue, first throttle the FSImage transfer so that it cannot saturate the NameNode's network and disks. Set dfs.image.transfer.bandwidthPerSec in hdfs-site.xml; the value is in bytes per second (the default of 0 means the transfer is not throttled), so 50000000 caps the transfer at roughly 50 MB/s:
<property>
  <name>dfs.image.transfer.bandwidthPerSec</name>
  <value>50000000</value>
</property>
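One trade-off to keep in mind (the figures below simply reuse the numbers from the log excerpt above, not measurements from your own cluster): throttling makes the checkpoint transfer itself slower. At a 50,000,000 bytes/s cap, an image of roughly 33 GB takes about

33,500,000,000 bytes / 50,000,000 bytes/s ≈ 670 s ≈ 11 minutes

per transfer, so choose a value that protects the NameNode without making checkpoints unreasonably slow.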
In addition, enable the Lifeline protocol, a feature added by the Apache Hadoop community (see Apache HDFS Jira HDFS-9239). It introduces a new lightweight RPC message that DataNodes use to report their health to the NameNode on a separate, dedicated RPC address. It was developed in response to problems seen in overloaded clusters where the NameNode was too busy to process regular heartbeats and spuriously marked DataNodes as dead.
For a non-HA cluster, the feature can be enabled with the following configuration in hdfs-site.xml:
<property>
  <name>dfs.namenode.lifeline.rpc-address</name>
  <value>mynamenode.example.com:8050</value>
</property>
(Replace mynamenode.example.com with the hostname or IP address of your NameNode; a different port number can be used as well.)
For an HA cluster, configure a lifeline RPC address for each NameNode, replacing mycluster, nn1, and nn2 with your nameservice ID and NameNode IDs:
<property>
  <name>dfs.namenode.lifeline.rpc-address.mycluster.nn1</name>
  <value>mynamenode1.example.com:8050</value>
</property>
<property>
  <name>dfs.namenode.lifeline.rpc-address.mycluster.nn2</name>
  <value>mynamenode2.example.com:8050</value>
</property>
Additional lifeline protocol settings are documented in the HDFS-9239 release note. However, these can be left at their default values for most clusters.
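For illustration only, here is what overriding one of those optional settings would look like. The property name and default below are taken from the HDFS-9239 release note; verify them against the documentation for your Hadoop version before changing anything. dfs.namenode.lifeline.handler.ratio controls the fraction of the NameNode's RPC handler threads that are dedicated to lifeline messages:

<property>
  <name>dfs.namenode.lifeline.handler.ratio</name>
  <!-- Fraction of dfs.namenode.handler.count reserved for lifeline RPCs.
       0.10 is the default listed in the HDFS-9239 release note;
       most clusters can leave it unchanged. -->
  <value>0.10</value>
</property>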
Note: Changing the lifeline protocol settings requires a restart of the NameNodes, DataNodes, and ZooKeeper Failover Controllers to take full effect. If you have a NameNode HA setup, you can restart the NameNodes one at a time, followed by a rolling restart of the remaining components, to avoid cluster downtime.
For some great tips on scaling HDFS, refer to this four-part guide.