Member since
03-20-2018
1
Post
0
Kudos Received
0
Solutions
03-20-2018
05:12 PM
HDP: 2.6.2.14 HDFS: 2.7.3 We are seeing that our primary namenode and secondary namenode are flopping back and forth. The secondary namenode did not start back up and we had to do a manual force checkpoint to put the namenode back in a stable condition. I have extracted portions of the alerts log showing the flopping. A couple of things that concerns me is the "Upgrade Finalized State" and "Namenode RPC latency" error. Has anyone seen these errors before or know the cause of it. Thanks John ------- /var/log/ambari-alerts.log 2018-01-12 05:23:09,014 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'], 2018-02-05 20:39:02,025 Active['srv-02-01-b07.XXX.YYY.io:50070'], Standby[], 2018-02-05 20:39:25,999 Active['srv-02-01-b07.XXX.YYY.io:50070'], Standby[], 2018-02-07 16:38:09,017 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'], 2018-02-09 12:02:25,028 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[], 2018-02-09 12:03:03,008 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[], 2018-02-14 19:14:29,001 Active[], Standby['srv-02-01-b07.XXX.YYY.io:50070'], 2018-02-14 19:15:06,012 Active['srv-02-01-b07.XXX.YYY.io:50070'], Standby[], 2018-02-19 17:23:12,016 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'], 2018-02-21 17:54:08,003 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'], 2018-02-24 05:21:07,026 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'], 2018-02-25 03:46:11,083 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'], 2018-03-08 16:18:10,995 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'], 2018-03-09 20:18:10,003 Active[], Standby['srv-02-01-b08.XXX.YYY.io:50070'], 2018-03-12 04:35:03,009 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[], 2018-03-12 04:35:26,013 Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[], 2018-03-12 04:35:26,013 [UNKNOWN] [SOFT] [HDFS] [namenode_last_checkpoint] (NameNode Last Checkpoint) Traceback (most recent call last): File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 203, in execute "LastCheckpointTime", connection_timeout)) File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 246, in get_value_from_jmx response = urllib2.urlopen(query, timeout=connection_timeout) File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen return opener.open(url, data, timeout) File "/usr/lib64/python2.7/urllib2.py", line 431, in open response = self._open(req, data) File "/usr/lib64/python2.7/urllib2.py", line 449, in _open '_open', req) File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain result = func(*args) File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open return self.do_open(httplib.HTTPConnection, req) File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open raise URLError(err) URLError: <urlopen error [Errno 111] Connection refused> 2018-03-12 04:35:26,013 [CRITICAL] [SOFT] [HDFS] [namenode_webui] (NameNode Web UI) Connection failed to http://srv-02-01-b07.XXX.YYY.io:50070 (<urlopen error [Errno 111] Connection refused>) 2018-03-12 04:35:26,013 [UNKNOWN] [SOFT] [HDFS] [upgrade_finalized_state] (HDFS Upgrade Finalized State) Traceback (most recent call last): File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 140, in execute "UpgradeFinalized")) File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 169, in get_value_from_jmx response = urllib2.urlopen(query, timeout=int(CONNECTION_TIMEOUT_DEFAULT)) File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen return opener.open(url, data, timeout) File "/usr/lib64/python2.7/urllib2.py", line 431, in open response = self._open(req, data) File "/usr/lib64/python2.7/urllib2.py", line 449, in _open '_open', req) File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain result = func(*args) File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open return self.do_open(httplib.HTTPConnection, req) File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open raise URLError(err) URLError: <urlopen error [Errno 111] Connection refused> 2018-03-12 04:35:26,013 [CRITICAL] [HARD] [HDFS] [namenode_ha_health] (NameNode High Availability Health) Active['srv-02-01-b08.XXX.YYY.io:50070'], Standby[], Unknown['srv-02-01-b07.XXX.YYY.io:50070'] 2018-03-12 04:35:26,013 [UNKNOWN] [SOFT] [HDFS] [namenode_directory_status] (NameNode Directory Status) [Alert][namenode_directory_status] Unable to extract JSON from JMX response 2018-03-12 04:35:29,992 [UNKNOWN] [SOFT] [HDFS] [datanode_health_summary] (DataNode Health Summary) [Alert][datanode_health_summary] Unable to extract JSON from JMX response 2018-03-12 04:36:26,019 [UNKNOWN] [HARD] [HDFS] [namenode_last_checkpoint] (NameNode Last Checkpoint) Traceback (most recent call last): File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 203, in execute "LastCheckpointTime", connection_timeout)) File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_checkpoint_time.py", line 246, in get_value_from_jmx response = urllib2.urlopen(query, timeout=connection_timeout) File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen return opener.open(url, data, timeout) File "/usr/lib64/python2.7/urllib2.py", line 431, in open response = self._open(req, data) File "/usr/lib64/python2.7/urllib2.py", line 449, in _open '_open', req) File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain result = func(*args) File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open return self.do_open(httplib.HTTPConnection, req) File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open raise URLError(err) URLError: <urlopen error [Errno 111] Connection refused> 2018-03-12 04:36:26,019 [CRITICAL] [HARD] [HDFS] [namenode_webui] (NameNode Web UI) Connection failed to http://srv-02-01-b07.XXX.YYY.io:50070 (<urlopen error [Errno 111] Connection refused>) 2018-03-12 04:36:26,069 [UNKNOWN] [HARD] [HDFS] [upgrade_finalized_state] (HDFS Upgrade Finalized State) Traceback (most recent call last): File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 140, in execute "UpgradeFinalized")) File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/alerts/alert_upgrade_finalized.py", line 169, in get_value_from_jmx response = urllib2.urlopen(query, timeout=int(CONNECTION_TIMEOUT_DEFAULT)) File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen return opener.open(url, data, timeout) File "/usr/lib64/python2.7/urllib2.py", line 431, in open response = self._open(req, data) File "/usr/lib64/python2.7/urllib2.py", line 449, in _open '_open', req) File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain result = func(*args) File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open return self.do_open(httplib.HTTPConnection, req) File "/usr/lib64/python2.7/urllib2.py", line 1214, in do_open raise URLError(err) URLError: <urlopen error [Errno 111] Connection refused> 2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_hdfs_pending_deletion_blocks] (HDFS Pending Deletion Blocks) [Alert][namenode_hdfs_pending_deletion_blocks] Unable to extract JSON from JMX response 2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_hdfs_blocks_health] (NameNode Blocks Health) [Alert][namenode_hdfs_blocks_health] Unable to extract JSON from JMX response 2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_hdfs_capacity_utilization] (HDFS Capacity Utilization) [Alert][namenode_hdfs_capacity_utilization] Unable to extract JSON from JMX response 2018-03-12 04:36:26,077 [UNKNOWN] [SOFT] [HDFS] [namenode_rpc_latency] (NameNode RPC Latency) [Alert][namenode_rpc_latency] Unable to extract JSON from JMX response 2018-03-12 04:36:26,077 [UNKNOWN] [HARD] [HDFS] [namenode_directory_status] (NameNode Directory Status) [Alert][namenode_directory_status] Unable to extract JSON from JMX response 2018-03-12 04:36:29,998 [UNKNOWN] [HARD] [HDFS] [datanode_health_summary] (DataNode Health Summary) [Alert][datanode_health_summary] Unable to extract JSON from JMX response 2018-03-12 04:38:25,024 [UNKNOWN] [HARD] [HDFS] [namenode_hdfs_pending_deletion_blocks] (HDFS Pending Deletion Blocks) [Alert][namenode_hdfs_pending_deletion_blocks] Unable to extract JSON from JMX response 2018-03-12 04:38:25,024 [UNKNOWN] [HARD] [HDFS] [namenode_hdfs_blocks_health] (NameNode Blocks Health) [Alert][namenode_hdfs_blocks_health] Unable to extract JSON from JMX response 2018-03-12 04:38:25,146 [UNKNOWN] [HARD] [HDFS] [namenode_hdfs_capacity_utilization] (HDFS Capacity Utilization) [Alert][namenode_hdfs_capacity_utilization] Unable to extract JSON from JMX response 2018-03-12 04:38:25,158 [UNKNOWN] [HARD] [HDFS] [namenode_rpc_latency] (NameNode RPC Latency) [Alert][namenode_rpc_latency] Unable to extract JSON from JMX response 2018-03-12 16:49:31,994 Active[], Standby['srv-02-01-b07.XXX.YYY.io:50070'], 2018-03-17 04:17:29,023 Active[], Standby['srv-02-01-b07.XXX.YYY.io:50070'], /var/log/ambari-server.log 12 Mar 2018 04:04:35,129INFO [pool-17-thread-1] MetricSinkWriteShardHostnameHashingStrategy:42 - Calculated collector shard srv-02-05-b07.XXX.YYY.io based on hostname: srv-02-01-b07.XXX.YYY.io 12 Mar 2018 04:35:50,996INFO [ambari-heartbeat-processor-0] HeartbeatProcessor:609 - State of service component NAMENODE of service HDFS of cluster pipeline has changed from STARTED to INSTALLED at host srv-02-01-b07.XXX.YYY.io according to STATUS_COMMAND report 12 Mar 2018 04:35:53,857INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 1 pending alert notices about to be dispatched... 12 Mar 2018 04:35:53,873INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[1], Unknown[0]} 12 Mar 2018 04:37:53,875INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 5 pending alert notices about to be dispatched... 12 Mar 2018 04:37:53,927INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[1], Unknown[4]} 12 Mar 2018 04:39:53,928INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 4 pending alert notices about to be dispatched... 12 Mar 2018 04:39:53,967INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[0], Unknown[4]} 12 Mar 2018 04:43:53,968INFO [AlertNoticeDispatchService RUNNING] AlertNoticeDispatchService:279 - There are 1 pending alert notices about to be dispatched... 12 Mar 2018 04:43:53,989INFO [alert-dispatch-58] EmailDispatcher:94 - Sending email: Notification{ type=ALERT, subject=Alert Summary: OK[0], Warning[0], Critical[0], Unknown[1]} 12 Mar 2018 05:14:38,691INFO [pool-17-thread-1] MetricSinkWriteShardHostnameHashingStrategy:42 - Calculated collector shard srv-02-05-b07.XXX.YYY.io based on hostname: srv-02-01-b07.XXX.YYY.io
... View more
Labels:
- Labels:
-
Apache Hadoop