Support Questions

Find answers, ask questions, and share your expertise

Namenode and secondary namenode are flapping and checkpoint error afterward

New Contributor

HDP: 2.6.2.14

HDFS: 2.7.3

We are seeing that our primary namenode and secondary namenode are flopping back and forth. The secondary namenode did not start back up and we had to do a manual force checkpoint to put the namenode back in a stable condition. I have extracted portions of the alerts log showing the flopping. A couple of things that concerns me is the "Upgrade Finalized State" and "Namenode RPC latency" error. Has anyone seen these errors before or know the cause of it.

Thanks

John

-------

/var/log/ambari-alerts.log

2018-01-12 05:23:09,014 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-05 20:39:02,025 Active['srv-02-01-b07.XXX.YYY.io:50070'], Standby[],

2018-02-05 20:39:25,999 Active['srv-02-01-b07.XXX.YYY.io:50070'], Standby[],

2018-02-07 16:38:09,017 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-09 12:02:25,028 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[],

2018-02-09 12:03:03,008 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[],

2018-02-14 19:14:29,001 Active[], Standby['srv-02-01-b07.XXX.YYY.io:50070'],

2018-02-14 19:15:06,012 Active['srv-02-01-b07.XXX.YYY.io:50070'], Standby[],

2018-02-19 17:23:12,016 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-21 17:54:08,003 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-24 05:21:07,026 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-25 03:46:11,083 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-03-08 16:18:10,995 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-03-09 20:18:10,003 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-03-12 04:35:03,009 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[],

2018-03-12 04:35:26,013 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[],

2018-03-12 04:35:26,013 [UNKNOWN] [SOFT] [HDFS] [namenode_last_checkpoint] (NameNode Last Checkpoint) Traceback (most recent call last):

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 203, in execute

"LastCheckpointTime", connection_timeout))

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 246, in get_value_from_jmx

response = urllib2.urlopen(query, timeout=connection_timeout)

File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen

return opener.open(url, data, timeout)

File "/usr/lib64/python2.7/urllib2.py", line 431, in open

response = self._open(req, data)

File "/usr/lib64/python2.7/urllib2.py", line 449, in _open

'_open', req)

File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain

result = func(*args)

File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open

return self.do_open(httplib.HTTPConnection, req)

File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open

raise URLError(err)

URLError: <urlopen error [Errno 111] Connection refused>

2018-03-12 04:35:26,013 [CRITICAL] [SOFT] [HDFS] [namenode_webui] (NameNode Web UI) Connection failed to http://srv-02-01-b07.XXX.YYY.io:50070 (<urlopen error [Errno 111] Connection refused>)

2018-03-12 04:35:26,013 [UNKNOWN] [SOFT] [HDFS] [upgrade_finalized_state] (HDFS Upgrade Finalized State) Traceback (most recent call last):

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 140, in execute

"UpgradeFinalized"))

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 169, in get_value_from_jmx

response = urllib2.urlopen(query, timeout=int(CONNECTION_TIMEOUT_DEFAULT))

File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen

return opener.open(url, data, timeout)

File "/usr/lib64/python2.7/urllib2.py", line 431, in open

response = self._open(req, data)

File "/usr/lib64/python2.7/urllib2.py", line 449, in _open

'_open', req)

File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain

result = func(*args)

File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open

return self.do_open(httplib.HTTPConnection, req)

File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open

raise URLError(err)

URLError: <urlopen error [Errno 111] Connection refused>

2018-03-12 04:35:26,013 [CRITICAL] [HARD] [HDFS] [namenode_ha_health] (NameNode High Availability Health) Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[], Unknown['srv-02-01-b07.XXX.YYY.io:50070']

2018-03-12 04:35:26,013 [UNKNOWN] [SOFT] [HDFS] [namenode_directory_status] (NameNode Directory Status) [Alert][namenode_directory_status] Unable to extract JSON from JMX response

2018-03-12 04:35:29,992 [UNKNOWN] [SOFT] [HDFS] [datanode_health_summary] (DataNode Health Summary) [Alert][datanode_health_summary] Unable to extract JSON from JMX response

2018-03-12 04:36:26,019 [UNKNOWN] [HARD] [HDFS] [namenode_last_checkpoint] (NameNode Last Checkpoint) Traceback (most recent call last):

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 203, in execute

"LastCheckpointTime", connection_timeout))

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 246, in get_value_from_jmx

response = urllib2.urlopen(query, timeout=connection_timeout)

File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen

return opener.open(url, data, timeout)

File "/usr/lib64/python2.7/urllib2.py", line 431, in open

response = self._open(req, data)

File "/usr/lib64/python2.7/urllib2.py", line 449, in _open

'_open', req)

File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain

result = func(*args)

File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open

return self.do_open(httplib.HTTPConnection, req)

File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open

raise URLError(err)

URLError: <urlopen error [Errno 111] Connection refused>

2018-03-12 04:36:26,019 [CRITICAL] [HARD] [HDFS] [namenode_webui] (NameNode Web UI) Connection failed to http://srv-02-01-b07.XXX.YYY.io:50070 (<urlopen error [Errno 111] Connection refused>)

2018-03-12 04:36:26,069 [UNKNOWN] [HARD] [HDFS] [upgrade_finalized_state] (HDFS Upgrade Finalized State) Traceback (most recent call last):

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 140, in execute

"UpgradeFinalized"))

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 169, in get_value_from_jmx

response = urllib2.urlopen(query, timeout=int(CONNECTION_TIMEOUT_DEFAULT))

File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen

return opener.open(url, data, timeout)

File "/usr/lib64/python2.7/urllib2.py", line 431, in open

response = self._open(req, data)

File "/usr/lib64/python2.7/urllib2.py", line 449, in _open

'_open', req)

File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain

result = func(*args)

File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open

return self.do_open(httplib.HTTPConnection, req)

File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open

raise URLError(err)

URLError: <urlopen error [Errno 111] Connection refused>

2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_hdfs_pending_deletion_blocks] (HDFS Pending Deletion Blocks) [Alert][namenode_hdfs_pending_deletion_blocks] Unable to extract JSON from JMX response

2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_hdfs_blocks_health] (NameNode Blocks Health) [Alert][namenode_hdfs_blocks_health] Unable to extract JSON from JMX response

2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_hdfs_capacity_utilization] (HDFS Capacity Utilization) [Alert][namenode_hdfs_capacity_utilization] Unable to extract JSON from JMX response

2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_rpc_latency] (NameNode RPC Latency) [Alert][namenode_rpc_latency] Unable to extract JSON from JMX response

2018-03-12 04:36:26,077 [UNKNOWN] [HARD] [HDFS] [namenode_directory_status] (NameNode Directory Status) [Alert][namenode_directory_status] Unable to extract JSON from JMX response

2018-03-12 04:36:29,998 [UNKNOWN] [HARD] [HDFS] [datanode_health_summary] (DataNode Health Summary) [Alert][datanode_health_summary] Unable to extract JSON from JMX response

2018-03-12 04:38:25,024 [UNKNOWN] [HARD] [HDFS] [namenode_hdfs_pending_deletion_blocks] (HDFS Pending Deletion Blocks) [Alert][namenode_hdfs_pending_deletion_blocks] Unable to extract JSON from JMX response

2018-03-12 04:38:25,024 [UNKNOWN] [HARD] [HDFS] [namenode_hdfs_blocks_health] (NameNode Blocks Health) [Alert][namenode_hdfs_blocks_health] Unable to extract JSON from JMX response

2018-03-12 04:38:25,146 [UNKNOWN] [HARD] [HDFS] [namenode_hdfs_capacity_utilization] (HDFS Capacity Utilization) [Alert][namenode_hdfs_capacity_utilization] Unable to extract JSON from JMX response

2018-03-12 04:38:25,158 [UNKNOWN] [HARD] [HDFS] [namenode_rpc_latency] (NameNode RPC Latency) [Alert][namenode_rpc_latency] Unable to extract JSON from JMX response

2018-03-12 16:49:31,994 Active[], Standby['srv-02-01-b07.XXX.YYY.io:50070'],

2018-03-17 04:17:29,023 Active[], Standby['srv-02-01-b07.XXX.YYY.io:50070'],

/var/log/ambari-server.log


12 Mar 2018 04:04:35,129INFO [pool-17-thread-1] MetricSinkWriteShardHostnameHashingStrategy:42 - Calculated collector shard srv-02-05-b07.XXX.YYY.io based on hostname: srv-02-01-b07.XXX.YYY.io

12 Mar 2018 04:35:50,996INFO [ambari-heartbeat-processor-0] HeartbeatProcessor:609 - State of service component NAMENODE of service HDFS of cluster pipeline has changed from STARTED to INSTALLED at host srv-02-01-b07.XXX.YYY.io according to STATUS_COMMAND report

12 Mar 2018 04:35:53,857INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 1 pending alert notices about to be dispatched...

12 Mar 2018 04:35:53,873INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[1], Unknown[0]}

12 Mar 2018 04:37:53,875INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 5 pending alert notices about to be dispatched...

12 Mar 2018 04:37:53,927INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[1], Unknown[4]}

12 Mar 2018 04:39:53,928INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 4 pending alert notices about to be dispatched...

12 Mar 2018 04:39:53,967INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[0], Unknown[4]}

12 Mar 2018 04:43:53,968INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 1 pending alert notices about to be dispatched...

12 Mar 2018 04:43:53,989INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[0], Unknown[1]}

12 Mar 2018 05:14:38,691INFO [pool-17-thread-1] MetricSinkWriteShardHostnameHashingStrategy:42 - Calculated collector shard srv-02-05-b07.XXX.YYY.io based on hostname: srv-02-01-b07.XXX.YYY.io

1 REPLY 1

Rising Star

@John Liao , thanks for reporting the issue. Have you checked the hdfs audit log during that period of time when the NN flipped?

It is likely you have a spike in the number of RPC requests (e.g., large recursive delete, getContentSummary call, etc) toward the active NN that overwhelm the call queue. As a result of that, ZKFC cannot get any response within the configured timeout (default=45s) and triggered failover.

Please check @Arpit Agarwal's previous articles on best practice to scale your namenode to handle similar situations:

https://community.hortonworks.com/articles/43838/scaling-the-hdfs-namenode-part-1.html

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.