Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Namenode and secondary namenode are flapping and checkpoint error afterward

Namenode and secondary namenode are flapping and checkpoint error afterward

New Contributor

HDP: 2.6.2.14

HDFS: 2.7.3

We are seeing that our primary namenode and secondary namenode are flopping back and forth. The secondary namenode did not start back up and we had to do a manual force checkpoint to put the namenode back in a stable condition. I have extracted portions of the alerts log showing the flopping. A couple of things that concerns me is the "Upgrade Finalized State" and "Namenode RPC latency" error. Has anyone seen these errors before or know the cause of it.

Thanks

John

-------

/var/log/ambari-alerts.log

2018-01-12 05:23:09,014 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-05 20:39:02,025 Active['srv-02-01-b07.XXX.YYY.io:50070'], Standby[],

2018-02-05 20:39:25,999 Active['srv-02-01-b07.XXX.YYY.io:50070'], Standby[],

2018-02-07 16:38:09,017 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-09 12:02:25,028 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[],

2018-02-09 12:03:03,008 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[],

2018-02-14 19:14:29,001 Active[], Standby['srv-02-01-b07.XXX.YYY.io:50070'],

2018-02-14 19:15:06,012 Active['srv-02-01-b07.XXX.YYY.io:50070'], Standby[],

2018-02-19 17:23:12,016 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-21 17:54:08,003 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-24 05:21:07,026 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-02-25 03:46:11,083 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-03-08 16:18:10,995 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-03-09 20:18:10,003 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'],

2018-03-12 04:35:03,009 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[],

2018-03-12 04:35:26,013 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[],

2018-03-12 04:35:26,013 [UNKNOWN] [SOFT] [HDFS] [namenode_last_checkpoint] (NameNode Last Checkpoint) Traceback (most recent call last):

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 203, in execute

"LastCheckpointTime", connection_timeout))

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 246, in get_value_from_jmx

response = urllib2.urlopen(query, timeout=connection_timeout)

File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen

return opener.open(url, data, timeout)

File "/usr/lib64/python2.7/urllib2.py", line 431, in open

response = self._open(req, data)

File "/usr/lib64/python2.7/urllib2.py", line 449, in _open

'_open', req)

File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain

result = func(*args)

File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open

return self.do_open(httplib.HTTPConnection, req)

File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open

raise URLError(err)

URLError: <urlopen error [Errno 111] Connection refused>

2018-03-12 04:35:26,013 [CRITICAL] [SOFT] [HDFS] [namenode_webui] (NameNode Web UI) Connection failed to http://srv-02-01-b07.XXX.YYY.io:50070 (<urlopen error [Errno 111] Connection refused>)

2018-03-12 04:35:26,013 [UNKNOWN] [SOFT] [HDFS] [upgrade_finalized_state] (HDFS Upgrade Finalized State) Traceback (most recent call last):

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 140, in execute

"UpgradeFinalized"))

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 169, in get_value_from_jmx

response = urllib2.urlopen(query, timeout=int(CONNECTION_TIMEOUT_DEFAULT))

File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen

return opener.open(url, data, timeout)

File "/usr/lib64/python2.7/urllib2.py", line 431, in open

response = self._open(req, data)

File "/usr/lib64/python2.7/urllib2.py", line 449, in _open

'_open', req)

File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain

result = func(*args)

File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open

return self.do_open(httplib.HTTPConnection, req)

File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open

raise URLError(err)

URLError: <urlopen error [Errno 111] Connection refused>

2018-03-12 04:35:26,013 [CRITICAL] [HARD] [HDFS] [namenode_ha_health] (NameNode High Availability Health) Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[], Unknown['srv-02-01-b07.XXX.YYY.io:50070']

2018-03-12 04:35:26,013 [UNKNOWN] [SOFT] [HDFS] [namenode_directory_status] (NameNode Directory Status) [Alert][namenode_directory_status] Unable to extract JSON from JMX response

2018-03-12 04:35:29,992 [UNKNOWN] [SOFT] [HDFS] [datanode_health_summary] (DataNode Health Summary) [Alert][datanode_health_summary] Unable to extract JSON from JMX response

2018-03-12 04:36:26,019 [UNKNOWN] [HARD] [HDFS] [namenode_last_checkpoint] (NameNode Last Checkpoint) Traceback (most recent call last):

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 203, in execute

"LastCheckpointTime", connection_timeout))

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 246, in get_value_from_jmx

response = urllib2.urlopen(query, timeout=connection_timeout)

File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen

return opener.open(url, data, timeout)

File "/usr/lib64/python2.7/urllib2.py", line 431, in open

response = self._open(req, data)

File "/usr/lib64/python2.7/urllib2.py", line 449, in _open

'_open', req)

File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain

result = func(*args)

File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open

return self.do_open(httplib.HTTPConnection, req)

File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open

raise URLError(err)

URLError: <urlopen error [Errno 111] Connection refused>

2018-03-12 04:36:26,019 [CRITICAL] [HARD] [HDFS] [namenode_webui] (NameNode Web UI) Connection failed to http://srv-02-01-b07.XXX.YYY.io:50070 (<urlopen error [Errno 111] Connection refused>)

2018-03-12 04:36:26,069 [UNKNOWN] [HARD] [HDFS] [upgrade_finalized_state] (HDFS Upgrade Finalized State) Traceback (most recent call last):

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 140, in execute

"UpgradeFinalized"))

File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 169, in get_value_from_jmx

response = urllib2.urlopen(query, timeout=int(CONNECTION_TIMEOUT_DEFAULT))

File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen

return opener.open(url, data, timeout)

File "/usr/lib64/python2.7/urllib2.py", line 431, in open

response = self._open(req, data)

File "/usr/lib64/python2.7/urllib2.py", line 449, in _open

'_open', req)

File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain

result = func(*args)

File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open

return self.do_open(httplib.HTTPConnection, req)

File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open

raise URLError(err)

URLError: <urlopen error [Errno 111] Connection refused>

2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_hdfs_pending_deletion_blocks] (HDFS Pending Deletion Blocks) [Alert][namenode_hdfs_pending_deletion_blocks] Unable to extract JSON from JMX response

2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_hdfs_blocks_health] (NameNode Blocks Health) [Alert][namenode_hdfs_blocks_health] Unable to extract JSON from JMX response

2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_hdfs_capacity_utilization] (HDFS Capacity Utilization) [Alert][namenode_hdfs_capacity_utilization] Unable to extract JSON from JMX response

2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_rpc_latency] (NameNode RPC Latency) [Alert][namenode_rpc_latency] Unable to extract JSON from JMX response

2018-03-12 04:36:26,077 [UNKNOWN] [HARD] [HDFS] [namenode_directory_status] (NameNode Directory Status) [Alert][namenode_directory_status] Unable to extract JSON from JMX response

2018-03-12 04:36:29,998 [UNKNOWN] [HARD] [HDFS] [datanode_health_summary] (DataNode Health Summary) [Alert][datanode_health_summary] Unable to extract JSON from JMX response

2018-03-12 04:38:25,024 [UNKNOWN] [HARD] [HDFS] [namenode_hdfs_pending_deletion_blocks] (HDFS Pending Deletion Blocks) [Alert][namenode_hdfs_pending_deletion_blocks] Unable to extract JSON from JMX response

2018-03-12 04:38:25,024 [UNKNOWN] [HARD] [HDFS] [namenode_hdfs_blocks_health] (NameNode Blocks Health) [Alert][namenode_hdfs_blocks_health] Unable to extract JSON from JMX response

2018-03-12 04:38:25,146 [UNKNOWN] [HARD] [HDFS] [namenode_hdfs_capacity_utilization] (HDFS Capacity Utilization) [Alert][namenode_hdfs_capacity_utilization] Unable to extract JSON from JMX response

2018-03-12 04:38:25,158 [UNKNOWN] [HARD] [HDFS] [namenode_rpc_latency] (NameNode RPC Latency) [Alert][namenode_rpc_latency] Unable to extract JSON from JMX response

2018-03-12 16:49:31,994 Active[], Standby['srv-02-01-b07.XXX.YYY.io:50070'],

2018-03-17 04:17:29,023 Active[], Standby['srv-02-01-b07.XXX.YYY.io:50070'],

/var/log/ambari-server.log


12 Mar 2018 04:04:35,129INFO [pool-17-thread-1] MetricSinkWriteShardHostnameHashingStrategy:42 - Calculated collector shard srv-02-05-b07.XXX.YYY.io based on hostname: srv-02-01-b07.XXX.YYY.io

12 Mar 2018 04:35:50,996INFO [ambari-heartbeat-processor-0] HeartbeatProcessor:609 - State of service component NAMENODE of service HDFS of cluster pipeline has changed from STARTED to INSTALLED at host srv-02-01-b07.XXX.YYY.io according to STATUS_COMMAND report

12 Mar 2018 04:35:53,857INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 1 pending alert notices about to be dispatched...

12 Mar 2018 04:35:53,873INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[1], Unknown[0]}

12 Mar 2018 04:37:53,875INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 5 pending alert notices about to be dispatched...

12 Mar 2018 04:37:53,927INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[1], Unknown[4]}

12 Mar 2018 04:39:53,928INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 4 pending alert notices about to be dispatched...

12 Mar 2018 04:39:53,967INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[0], Unknown[4]}

12 Mar 2018 04:43:53,968INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 1 pending alert notices about to be dispatched...

12 Mar 2018 04:43:53,989INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[0], Unknown[1]}

12 Mar 2018 05:14:38,691INFO [pool-17-thread-1] MetricSinkWriteShardHostnameHashingStrategy:42 - Calculated collector shard srv-02-05-b07.XXX.YYY.io based on hostname: srv-02-01-b07.XXX.YYY.io

1 REPLY 1
Highlighted

Re: Namenode and secondary namenode are flapping and checkpoint error afterward

Rising Star

@John Liao , thanks for reporting the issue. Have you checked the hdfs audit log during that period of time when the NN flipped?

It is likely you have a spike in the number of RPC requests (e.g., large recursive delete, getContentSummary call, etc) toward the active NN that overwhelm the call queue. As a result of that, ZKFC cannot get any response within the configured timeout (default=45s) and triggered failover.

Please check @Arpit Agarwal's previous articles on best practice to scale your namenode to handle similar situations:

https://community.hortonworks.com/articles/43838/scaling-the-hdfs-namenode-part-1.html

Don't have an account?
Coming from Hortonworks? Activate your account here