
namenode1 frequently goes into a stale state (not responding). HDP 2.2.8

namenode1 frequently goes into a stale state (not responding) on HDP 2.2.8. The NameNode log shows the following while this is happening:

2016-03-12 11:13:15,876 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.104:1019
2016-03-12 11:13:16,437 INFO  blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30001 milliseconds
2016-03-12 11:13:24,006 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-03/137.201.94.104:1019
2016-03-12 11:13:24,007 WARN  hdfs.StateChange (FSDirectory.java:validateRenameSource(885)) - DIR* FSDirectory.unprotectedRenameTo: rename source /user/hdfsprod/2016-03-12T10-54-19-288UTC_F4_MAM_7e8617ce-9ee3-4fc4-92b7-bd26a52a6f99.avro is not found.
2016-03-12 11:13:24,008 WARN  hdfs.StateChange (FSDirectory.java:validateRenameSource(885)) - DIR* FSDirectory.unprotectedRenameTo: rename source /user/hdfsprod/2016-03-12T10-54-19-285UTC_F4_MAM_106c6413-a1a5-4e25-96aa-16fd3bedaef1.avro is not found.
2016-03-12 11:13:24,055 INFO  blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 7617 millisecond(s).
2016-03-12 11:13:24,056 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.129:1019
2016-03-12 11:13:31,251 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-06/137.201.94.129:1019
2016-03-12 11:13:31,253 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.232:1019
2016-03-12 11:13:35,652 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-10/137.201.94.232:1019
2016-03-12 11:13:35,653 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.154:1019
2016-03-12 11:13:43,419 INFO  util.JvmPauseMonitor (JvmPauseMonitor.java:run(195)) - Detected pause in JVM or host machine (eg GC): pause of approximately 4337ms
GC pool 'ParNew' had collection(s): count=1 time=4432ms
2016-03-12 11:13:44,557 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-07/137.201.94.154:1019
2016-03-12 11:13:44,558 INFO  delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:createPassword(385)) - Creating password for identifier: HDFS_DELEGATION_TOKEN token 8298353 for hdfsprod, currentKey: 375
2016-03-12 11:13:44,559 INFO  delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:createPassword(385)) - Creating password for identifier: HDFS_DELEGATION_TOKEN token 8298354 for hdfsprod, currentKey: 375
2016-03-12 11:13:44,560 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.199:1019
2016-03-12 11:13:46,438 INFO  blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30001 milliseconds
2016-03-12 11:13:48,734 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-09/137.201.94.199:1019
2016-03-12 11:13:48,782 INFO  blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 2343 millisecond(s).
2016-03-12 11:13:48,784 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.159:1019
2016-03-12 11:13:53,581 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-07/137.201.94.159:1019
2016-03-12 11:13:53,582 INFO  delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:createPassword(385)) - Creating password for identifier: HDFS_DELEGATION_TOKEN token 8298355 for hdfsprod, currentKey: 375
2016-03-12 11:13:53,583 INFO  namenode.FSEditLog (FSEditLog.java:printStatistics(695)) - Number of transactions: 1055 Total time for transactions(ms): 22 Number of transactions batched in Syncs: 34 Number of syncs: 0 SyncTimes(ms):
2016-03-12 11:13:53,583 INFO  delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:createPassword(385)) - Creating password for identifier: HDFS_DELEGATION_TOKEN token 8298356 for hdfsprod, currentKey: 375
2016-03-12 11:13:53,584 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.195:1019
2016-03-12 11:13:57,198 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-09/137.201.94.195:1019
2016-03-12 11:13:57,203 INFO  delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:createPassword(385)) - Creating password for identifier: HDFS_DELEGATION_TOKEN token 8298357 for hdfsprod, currentKey: 375
2016-03-12 11:13:57,205 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.242:1019
2016-03-12 11:14:00,182 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-10/137.201.94.242:1019
2016-03-12 11:14:16,439 INFO  blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30001 milliseconds
2016-03-12 11:15:19,298 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.219:1019
2016-03-12 11:15:25,384 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-09/137.201.94.219:1019
2016-03-12 11:15:25,433 INFO  blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 68994 millisecond(s).
2016-03-12 11:15:25,433 INFO  blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 68994 milliseconds
2016-03-12 11:15:25,434 WARN  hdfs.StateChange (FSDirectory.java:validateRenameSource(885)) - DIR* FSDirectory.unprotectedRenameTo: rename source /user/hdfsprod/fd_ees_run_data#IMFS.GDW.E3.EesRunDataPkg.20160312191242.20160312190608x996_20160312190722x997.tsv.gz.md is not found.
2016-03-12 11:15:25,434 INFO  namenode.FSEditLog (FSEditLog.java:printStatistics(695)) - Number of transactions: 1057 Total time for transactions(ms): 22 Number of transactions batched in Syncs: 34 Number of syncs: 0 SyncTimes(ms):
2016-03-12 11:15:25,435 WARN  hdfs.StateChange (FSDirectory.java:validateRenameSource(885)) - DIR* FSDirectory.unprotectedRenameTo: rename source /user/hdfsprod/fd_ees_run_data#IMFS.GDW.E3.EesRunDataPkg.20160312191243.20160312190722x997_20160312190836x998.tsv.gz.md is not found.
2016-03-12 11:15:25,436 INFO  delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:createPassword(385)) - Creating password for identifier: HDFS_DELEGATION_TOKEN token 8298358 for hdfsprod, currentKey: 375
2016-03-12 11:15:25,437 INFO  delegation.AbstractDelegationTokenSecretManager (AbstractDelegationTokenSecretManager.java:createPassword(385)) - Creating password for identifier: HDFS_DELEGATION_TOKEN token 8298359 for hdfsprod, currentKey: 375
2016-03-12 11:15:25,438 INFO  hdfs.StateChange (DatanodeManager.java:removeDeadDatanode(574)) - BLOCK* removeDeadDatanode: lost heartbeat from 137.201.94.152:1019
2016-03-12 11:15:30,791 INFO  net.NetworkTopology (NetworkTopology.java:remove(488)) - Removing a node: /rack-07/137.201.94.152:1019
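
The JvmPauseMonitor entry above points at a likely trigger: a ParNew garbage-collection pause of roughly 4.4 seconds, during which the NameNode cannot answer RPCs or process DataNode heartbeats. To see how often such pauses occur, one can grep the NameNode log; the path below assumes the default HDP log location and should be adjusted for your install:

# Show recent JVM pauses detected by the NameNode (default HDP log path assumed)
grep 'JvmPauseMonitor' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | tail -20
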
3 REPLIES

Re: namenode1 frequently goes into a stale state (not responding). HDP 2.2.8


Can you check whether you have NTP set up? Also check the health of your DataNodes; some of them are reported as dead.
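
For example (service names vary by OS; these assume a RHEL/CentOS 6-era host, which matches the HDP 2.2 timeframe):

# Verify the NameNode host is syncing with its NTP peers
ntpq -p
service ntpd status

# List the DataNodes the NameNode considers live and dead
hdfs dfsadmin -report | grep -i -A 2 'dead datanodes'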

Re: namenode1 frequently goes into a stale state (not responding). HDP 2.2.8

Everything is good, but we are still getting the same error and the NameNode went into a stale state again.

Re: namenode1 frequently goes into a stale state (not responding). HDP 2.2.8

@mallikarjunarao

There can be many reasons. I have seen this happen when the NameNode ran out of memory because multiple heavy operations were running at the same time.

Are you running the HDFS balancer while this is happening?
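
If ParNew pauses like the 4.4-second one in the log are the trigger, the usual fix is to give the NameNode more heap (and young generation) and to keep GC logs as evidence. A sketch for hadoop-env.sh; the sizes are placeholders and must be matched to your namespace (a common rule of thumb is on the order of 1 GB of heap per million blocks):

# hadoop-env.sh - illustrative values only; size to your block/file count
export HADOOP_NAMENODE_OPTS="-Xms8g -Xmx8g -XX:NewSize=1g -XX:MaxNewSize=1g \
  -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hadoop/hdfs/gc.log ${HADOOP_NAMENODE_OPTS}"

You can also quickly check whether a balancer is active on the cluster:

# The [B] keeps grep from matching its own command line
ps -ef | grep '[B]alancer'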