
Standby Namenode in HA setup keeps going down


Hi,

I have an HA Cloudera setup. The primary NameNode is up, but the standby NameNode goes down within a few seconds of restarting.

I was facing job failures in production, and the error below was showing in the job error logs:

The directory item limit of /user/spark/applicationHistory is exceeded: limit=1048576 items=1048576

To get past that, I moved some files that were about five years old out of /user/spark/applicationHistory to another location and did a rolling restart of the HDFS service from Cloudera Manager, and jobs started running again. A few days later, though, the standby NameNode failures started. Please let me know how to resolve this.
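For reference, the cleanup I did was roughly the following sketch. The archive path and the 2017 date cutoff here are illustrative, not the exact values I used; column 6 of the hdfs dfs -ls output is the modification date.

```shell
# Illustrative archive location; adjust for your cluster.
SRC=/user/spark/applicationHistory
DEST=/user/spark/applicationHistoryArchive

# Build a list of move commands for entries last modified in 2017 or
# earlier. Review the generated commands first, then pipe them to `sh`
# to actually execute the moves.
make_moves() {
  awk -v dest="$DEST" '$6 <= "2017-12-31" { print "hdfs dfs -mv", $8, dest }'
}

# Only meaningful on a live cluster; guarded so the sketch is harmless
# on a machine without the hdfs client installed.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -count "$SRC"            # compare items against the 1,048,576 limit
  hdfs dfs -ls "$SRC" | make_moves  # print the would-be move commands
fi
```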

 

I have tried the steps below but am still facing the same issue:

1. Put the active NameNode in safe mode
2. Run a save-namespace operation on the active NameNode
3. Leave safe mode
4. Log in to the standby NameNode
5. Run hdfs namenode -bootstrapStandby -force
6. Start the failed standby NameNode
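In script form, the recovery attempt above looks like this. The DRY_RUN guard is something I've added here so the sequence can be reviewed before it touches a live cluster; it is not part of what I originally ran.

```shell
# Sketch of the standby-recovery sequence above. DRY_RUN=1 (the default
# here) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

# Steps 1-3: checkpoint the namespace on the active NameNode.
run hdfs dfsadmin -safemode enter
run hdfs dfsadmin -saveNamespace
run hdfs dfsadmin -safemode leave

# Steps 4-6: on the standby host, rebuild its storage directories from
# the fresh fsimage, then restart the standby NameNode role (e.g. via
# Cloudera Manager).
run hdfs namenode -bootstrapStandby -force
```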

Logs from the failed standby NameNode server:

DataNode .out log:

failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

NameNode .out log:

FATAL org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN.
java.io.IOException: java.lang.IllegalStateException: Cannot skip to less than the current value (=346057041), where newValue=346057040
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.resetLastInodeId(FSNamesystem.java:657)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:280)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:140)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:848)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:829)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:360)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1900)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:442)
at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
Caused by: java.lang.IllegalStateException: Cannot skip to less than the current value (=346057041), where newValue=346057040
at org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.resetLastInodeId(FSNamesystem.java:655)
... 13 more
2022-03-08 02:11:50,893 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2022-03-08 02:11:50,895 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************

 

JournalNode .out log:

 

2022-03-07 16:42:32,655 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /opt/hadoop/dfs/jn/bbda1-prod-cdh-01-ns/current/edits_inprogress_0000000015143419542 -> /opt/hadoop/dfs/jn/bbda1-prod-cdh-01-ns/current/edits_0000000015143419542-0000000015145668341
2022-03-07 17:11:46,618 INFO org.apache.hadoop.hdfs.server.common.Storage: Purging no-longer needed file 15140407066
2022-03-07 17:11:46,630 INFO org.apache.hadoop.hdfs.server.common.Storage: Purging no-longer needed file 15139990404
2022-03-07 17:12:37,716 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /opt/hadoop/dfs/jn/bbda1-prod-cdh-01-ns/current/edits_inprogress_0000000015145668342 -> /opt/hadoop/dfs/jn/bbda1-prod-cdh-01-ns/current/edits_0000000015145668342-0000000015145759436
2022-03-07 19:43:48,992 WARN org.apache.hadoop.hdfs.qjournal.server.Journal: Sync of transaction range 15146089648-15146089648 took 1311ms
2022-03-07 22:40:51,859 WARN org.apache.hadoop.hdfs.qjournal.server.Journal: Sync of transaction range 15146467897-15146467897 took 1119ms
2022-03-08 02:11:48,661 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /opt/hadoop/dfs/jn/bbda1-prod-cdh-01-ns/current/edits_inprogress_0000000015145759437 -> /opt/hadoop/dfs/jn/bbda1-prod-cdh-01-ns/current/edits_0000000015145759437-0000000015146939052
2022-03-08 02:39:00,995 WARN org.apache.hadoop.hdfs.qjournal.server.Journal: Sync of transaction range 15148810390-15148810519 took 1044ms
2022-03-08 02:42:32,734 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /opt/hadoop/dfs/jn/bbda1-prod-cdh-01-ns/current/edits_inprogress_0000000015146939053 -> /opt/hadoop/dfs/jn/bbda1-prod-cdh-01-ns/current/edits_0000000015146939053-0000000015149060700

 

Thanks
