Encountered exception loading fsimage, NameNode is not formatted

Contributor

The NameNode crashes when I try to restart it via CM, with this error:

INFO Lock on /data/disk1/dfs/nn/in_use.lock acquired by nodename 26396@ip-10-2-0-224.ec2.internal
Warning Encountered exception loading fsimage
java.io.IOException: NameNode is not formatted.
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:251)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1166)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:757)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:642)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:713)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:956)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:931)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1666)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1733)
Failed to start namenode.
java.io.IOException: NameNode is not formatted.
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:251)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1166)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:757)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:642)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:713)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:956)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:931)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1666)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1733)

When I run sudo -u hdfs hdfs namenode -format on the NN, I get this error:

23/11/30 12:21:35 ERROR namenode.NameNode: Failed to start namenode.
java.io.IOException: Running in secure mode, but config doesn't have a keytab
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:306)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1136)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1623)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1733)
23/11/30 12:21:35 INFO util.ExitUtil: Exiting with status 1: java.io.IOException: Running in secure mode, but config doesn't have a keytab
23/11/30 12:21:35 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-10-2-0-224.ec2.internal/10.2.0.224

Please help me get the HDFS NN running.


Contributor

Hi @George-Megre, if the NameNode itself is down, who is writing to edits_inprogress_0000000000011353525 as of Dec 20th?

Contributor

Hey @Majeti,

We have a Hive Metastore server that is probably the one writing, but it runs on a different machine that is connected to our master NN.

Both master machines report Bad health:

Bad : NameNode summary: ip-10-2-X-X.ec2.internal (Availability: Active, Health: Bad), ip-10-2-X-X.ec2.internal (Availability: Unknown, Health: Bad). This health test reflects the health of the active NameNode.

Contributor

Hi @George-Megre, I don't think HMS can write to the edit log. Only NameNodes are allowed to write to it, unless you are using some plugin for that, and I am not sure about that use case. Can you please run commands like the ones below to find out who is writing to the edits file?

[root@c4265-node3 current]#  lsof /data/dfs/nn/current/edits_inprogress_0000000000003796588
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
java    1133 hdfs  324u   REG   0,87  1048576 21715133 /data/dfs/nn/current/edits_inprogress_0000000000003796588
[root@c4265-node3 current]# ps -h  1133
   1133 ?        Sl     5:18 /usr/java/jdk1.8.0_232-cloudera/bin/java -Dproc_namenode -Dhdfs.audit.logger=INFO,RFAAUDIT -Dsecurity.audit.logger=INFO,RFAS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.conf=/etc/krb5.conf -Xms1073741824 -Xm
[root@c4265-node3 current]# 

 

Contributor

Hey @Majeti,

/data/disk1/dfs$ sudo lsof jn/hdfs-cdp7/current/edits_inprogress_0000000000011353525
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 26394 hdfs 295u REG 259,1 25165824 14417971 jn/hdfs-cdp7/current/edits_inprogress_0000000000011353525

 

/data/disk1/dfs$ sudo ps -h 26394
26394 ? Sl 41:01 /usr/lib/jvm/java-8-openjdk/bin/java -Dproc_journalnode -Djava.net.preferIPv4Stack=true -Xms536870912 -Xmx536870912 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hdfs_hdfs-JOURNALNODE-4bdfe38e85923ad06e3472fcb152cc39_pid26394.hprof -XX:O

Thanks. 

Contributor
Then some other NameNode is talking to this JournalNode process? Is there any way
you can find the other NameNode that is active and also configured to talk to it?
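
One way to answer that from the command line is to ask HDFS for the HA state of every configured NameNode. The snippet below is only a sketch: it assumes hdfs haadmin -getAllServiceState is available in your Hadoop version and prints one "host:port state" line per NameNode, and the helper name nn_hosts_in_state is made up for illustration.

```shell
# Print the host of every NameNode whose reported HA state matches $1.
# Expects "host:port  state" lines on stdin (the assumed output shape of
# `hdfs haadmin -getAllServiceState`).
nn_hosts_in_state() {
  awk -v want="$1" '$2 == want { split($1, hp, ":"); print hp[1] }'
}

# Usage on a cluster node, as an HDFS superuser (kinit first if Kerberos is on):
#   hdfs haadmin -getAllServiceState | nn_hosts_in_state active
#   hdfs getconf -namenodes    # lists the NameNode hosts from the client config
```

A NameNode that cannot be reached will not show up as active or standby, so comparing the result against the hdfs getconf -namenodes list shows which one is down.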

Contributor

Hey @Majeti,

So we have 2 NNs running on different machines.

One of them is probably writing to the other.

How can we fix the issue? 🙁

Contributor
Then you need to run bootstrapStandby on the NameNode that is not running, or sync the
NN data directory from the running NN to the one that is down.
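
For reference, a rough sketch of the first option. The "NameNode is not formatted" error is raised when the storage directory has no current/VERSION file, so the helper checks for that before bootstrapping. The default path is taken from the lock line in the first post, and both function names are hypothetical. On a Kerberized, CM-managed cluster the plain CLI may fail with the same keytab error shown earlier; as far as I know, Cloudera Manager offers an equivalent "Bootstrap Standby NameNode" action on the NameNode instance.

```shell
# Return 0 when a NameNode storage directory looks formatted,
# i.e. it contains current/VERSION. Run as root or hdfs so the
# directory is readable.
nn_is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Bootstrap the stopped NameNode only if its directory holds no metadata.
# The default directory is an assumption based on the lock path in the log.
bootstrap_if_unformatted() {
  nn_dir="${1:-/data/disk1/dfs/nn}"
  if nn_is_formatted "$nn_dir"; then
    echo "$nn_dir already has current/VERSION; not bootstrapping" >&2
    return 1
  fi
  # Copies the latest fsimage from the running NameNode to this host.
  sudo -u hdfs hdfs namenode -bootstrapStandby
}

# Usage, on the host of the NameNode that is down (with that NameNode stopped):
#   bootstrap_if_unformatted /data/disk1/dfs/nn
```

The guard exists so that a directory which already holds metadata is never touched by accident.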

Contributor

Hello @Majeti,

1. The previous error no longer appears, but we are now seeing a new error in the logs:

Request to collection [ranger_audits] failed due to (0) java.net.ConnectException: Connection refused (Connection refused), retry=5 commError=true errorCode=0 
failed to log audit event: {"repoType":1,"repo":"cm_hdfs","reqUser":"mapred","evtTime":"2023-07-11 18:09:40.656","access":"listStatus","resource":"/user/history/done/2022/08","resType":"path","action":"read","result":1,"agent":"hdfs","policy":-1,"reason":"/user/history/done/2022/08","enforcer":"hadoop-acl","cliIP":"10.2.0.49","agentHost":"ip-10-2-0-49.ec2.internal","logType":"RangerAudit","id":"792dc152-89cf-425d-978d-b50af625b1fe-0","seq_num":1,"event_count":1,"event_dur_ms":1,"tags":[],"additional_info":"{\"remote-ip-address\":10.2.0.49, \"forwarded-ip-addresses\":[], \"accessTypes\":[execute, read]","cluster_name":"BigID_cdp7"}
org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: No live SolrServers available to handle this request:[http://ip-10-2-0-224.ec2.internal:8993/solr/ranger_audits_shard1_replica_n1]
	at org.apache.solr.client.solrj.impl.CloudSolrClient.getRouteException(CloudSolrClient.java:125)
	at org.apache.solr.client.solrj.impl.CloudSolrClient.getRouteException(CloudSolrClient.java:46)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.directUpdate(BaseCloudSolrClient.java:559)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1046)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:906)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:838)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:213)
	at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106)
	at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71)
	at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85)
	at org.apache.ranger.audit.destination.SolrAuditDestination.lambda$addDocsToSolr$0(SolrAuditDestination.java:493)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.ranger.audit.utils.AbstractKerberosUser.doAs(AbstractKerberosUser.java:151)
	at org.apache.ranger.audit.utils.KerberosAction.execute(KerberosAction.java:70)
	at org.apache.ranger.audit.destination.SolrAuditDestination.addDocsToSolr(SolrAuditDestination.java:499)
	at org.apache.ranger.audit.destination.SolrAuditDestination.log(SolrAuditDestination.java:282)
	at org.apache.ranger.audit.provider.BaseAuditHandler.logJSON(BaseAuditHandler.java:192)
	at org.apache.ranger.audit.queue.AuditFileSpool.sendEvent(AuditFileSpool.java:880)
	at org.apache.ranger.audit.queue.AuditFileSpool.runLogAudit(AuditFileSpool.java:827)
	at org.apache.ranger.audit.queue.AuditFileSpool.run(AuditFileSpool.java:757)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://ip-10-2-0-224.ec2.internal:8993/solr/ranger_audits_shard1_replica_n1]
	at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:345)
	at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.lambda$directUpdate$0(BaseCloudSolrClient.java:533)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://ip-10-2-0-224.ec2.internal:8993/solr/ranger_audits_shard1_replica_n1
	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
	at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
	at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:319)
	... 6 more
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to ip-10-2-0-224.ec2.internal:8993 [ip-10-2-0-224.ec2.internal/10.2.0.224] failed: Connection refused (Connection refused)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
	... 10 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:607)
	at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
	... 20 more

2. In addition, we have a Health Test issue, which might be related to issue #1 (?):

[Screenshot: GeorgeMegre_0-1704720817353.png]

We are kind of lost: our cluster is not working and we need it working 🙁🙏

Contributor

Hi @George-Megre, it looks like your SBNN (standby NameNode) is not performing checkpoints. Can you run one manual checkpoint now and see what happens on the SBNN side?
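
A manual checkpoint is usually done with the dfsadmin sequence below. This is a sketch, not a CM-specific procedure: the wrapper name manual_checkpoint and the HDFS_CMD override are hypothetical, and it assumes you run it as an HDFS superuser (after kinit on a Kerberized cluster). Note that HDFS is read-only while safe mode is on.

```shell
# Enter safe mode, save the namespace to a new fsimage, then leave safe mode.
# HDFS_CMD is a hypothetical override for the hdfs binary (useful for dry runs).
manual_checkpoint() {
  hdfs_cmd="${HDFS_CMD:-hdfs}"
  "$hdfs_cmd" dfsadmin -safemode enter || return 1
  "$hdfs_cmd" dfsadmin -saveNamespace
  rc=$?
  # Always try to leave safe mode, even if saveNamespace failed.
  "$hdfs_cmd" dfsadmin -safemode leave
  return "$rc"
}

# Dry run that only prints the three dfsadmin invocations:
#   HDFS_CMD=echo manual_checkpoint
```

After it completes, a fresh fsimage file should appear under the NameNode's current directory.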

Also verify that the Solr server and port are reachable: ip-10-2-0-224.ec2.internal:8993.
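
A quick reachability check can be done from the NameNode host without installing anything. The sketch below relies on bash's /dev/tcp redirection and the coreutils timeout command; the function name port_open is made up for illustration. "Connection refused" in the stack trace usually means nothing is listening on that port, so it is also worth checking that the Solr server process on that host is running.

```shell
# Return 0 if a TCP connection to host $1, port $2 succeeds within 3 seconds.
# Uses bash's /dev/tcp redirection, so no extra tools are needed.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage, with the host and port from the stack trace:
#   port_open ip-10-2-0-224.ec2.internal 8993 && echo reachable || echo "not reachable"
```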

Contributor

@Majeti, can you tell me where the SBNN is and what I should verify there?

Thanks!