Created 11-30-2023 04:41 AM
The NameNode crashes when I try to restart it via CM, with the following errors:
INFO Lock on /data/disk1/dfs/nn/in_use.lock acquired by nodename 26396@ip-10-2-0-224.ec2.internal
Warning Encountered exception loading fsimage
java.io.IOException: NameNode is not formatted.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:251)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1166)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:757)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:642)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:713)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:931)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1666)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1733)
Failed to start namenode.
java.io.IOException: NameNode is not formatted.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:251)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1166)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:757)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:642)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:713)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:931)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1666)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1733)
And when I try to run sudo -u hdfs hdfs namenode -format on the NN, I get this error:
23/11/30 12:21:35 ERROR namenode.NameNode: Failed to start namenode.
java.io.IOException: Running in secure mode, but config doesn't have a keytab
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:306)
at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1136)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1623)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1733)
23/11/30 12:21:35 INFO util.ExitUtil: Exiting with status 1: java.io.IOException: Running in secure mode, but config doesn't have a keytab
23/11/30 12:21:35 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ip-10-2-0-224.ec2.internal/10.2.0.224
Please help me get the HDFS NN running
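For context on the first error: as far as I know, the NameNode reports "NameNode is not formatted" when none of its configured name directories contains a current/VERSION file. Formatting would wipe any existing metadata, so a read-only check of the directory is a safer first step. A minimal sketch, assuming /data/disk1/dfs/nn (taken from the in_use.lock message above) is the configured name directory; nn_dir_state is a hypothetical helper name:

```shell
# Hypothetical helper: reports whether a NameNode storage directory
# looks formatted, i.e. whether current/VERSION exists under it.
nn_dir_state() {
  dir="$1"
  if [ -f "$dir/current/VERSION" ]; then
    echo "formatted"
  else
    echo "not formatted"
  fi
}

# Assumed name directory, taken from the in_use.lock path in the log.
nn_dir_state /data/disk1/dfs/nn

# List any fsimage and edits files that are still present (read-only).
ls -l /data/disk1/dfs/nn/current 2>/dev/null || echo "no current/ directory"
```

If this prints "not formatted" on a NameNode that used to work, the directory may be empty, unmounted, or pointing at the wrong disk, which is worth ruling out before any format.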
Created 12-21-2023 08:02 AM
Hi @George-Megre, when the NameNode itself is down, who is writing to edits_inprogress_0000000000011353525 as of Dec 20th?
Created on 12-22-2023 03:47 AM - edited 12-22-2023 03:51 AM
Hey @Majeti ,
We have a Hive Metastore server that is probably the one writing, but this Hive instance sits on another machine that is connected to our master NN.
Both master machines have bad health:
Bad : NameNode summary: ip-10-2-X-X.ec2.internal (Availability: Active, Health: Bad), ip-10-2-X-X.ec2.internal (Availability: Unknown, Health: Bad). This health test reflects the health of the active NameNode.
Created 12-22-2023 04:23 AM
Hi @George-Megre, I don't think HMS can write to the edit log. Only NameNodes are allowed to write to it, unless you are using some plugin to do that, and I am not too sure about that use case. Can you please run commands like the ones below to find out who is writing to the edits file?
[root@c4265-node3 current]# lsof /data/dfs/nn/current/edits_inprogress_0000000000003796588
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 1133 hdfs 324u REG 0,87 1048576 21715133 /data/dfs/nn/current/edits_inprogress_0000000000003796588
[root@c4265-node3 current]# ps -h 1133
1133 ? Sl 5:18 /usr/java/jdk1.8.0_232-cloudera/bin/java -Dproc_namenode -Dhdfs.audit.logger=INFO,RFAAUDIT -Dsecurity.audit.logger=INFO,RFAS -Djava.net.preferIPv4Stack=true -Djava.security.krb5.conf=/etc/krb5.conf -Xms1073741824 -Xm
[root@c4265-node3 current]#
Created 12-24-2023 12:33 AM
Hey @Majeti ,
/data/disk1/dfs$ sudo lsof jn/hdfs-cdp7/current/edits_inprogress_0000000000011353525
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 26394 hdfs 295u REG 259,1 25165824 14417971 jn/hdfs-cdp7/current/edits_inprogress_0000000000011353525
/data/disk1/dfs$ sudo ps -h 26394
26394 ? Sl 41:01 /usr/lib/jvm/java-8-openjdk/bin/java -Dproc_journalnode -Djava.net.preferIPv4Stack=true -Xms536870912 -Xmx536870912 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hdfs_hdfs-JOURNALNODE-4bdfe38e85923ad06e3472fcb152cc39_pid26394.hprof -XX:O
Thanks.
Created 12-24-2023 03:09 AM
Created on 12-24-2023 05:35 AM - edited 12-24-2023 07:14 AM
Hey @Majeti ,
So we have 2 NNs sitting on different machines.
One of them is probably writing to the other.
How can we fix the issue? 🙁
Created 12-24-2023 08:23 AM
Created 01-08-2024 05:34 AM
Hello @Majeti ,
1. The previous error no longer appears, but we are now encountering a new error that is showing up in the logs, such as:
Request to collection [ranger_audits] failed due to (0) java.net.ConnectException: Connection refused (Connection refused), retry=5 commError=true errorCode=0
failed to log audit event: {"repoType":1,"repo":"cm_hdfs","reqUser":"mapred","evtTime":"2023-07-11 18:09:40.656","access":"listStatus","resource":"/user/history/done/2022/08","resType":"path","action":"read","result":1,"agent":"hdfs","policy":-1,"reason":"/user/history/done/2022/08","enforcer":"hadoop-acl","cliIP":"10.2.0.49","agentHost":"ip-10-2-0-49.ec2.internal","logType":"RangerAudit","id":"792dc152-89cf-425d-978d-b50af625b1fe-0","seq_num":1,"event_count":1,"event_dur_ms":1,"tags":[],"additional_info":"{\"remote-ip-address\":10.2.0.49, \"forwarded-ip-addresses\":[], \"accessTypes\":[execute, read]","cluster_name":"BigID_cdp7"}
org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: No live SolrServers available to handle this request:[http://ip-10-2-0-224.ec2.internal:8993/solr/ranger_audits_shard1_replica_n1]
at org.apache.solr.client.solrj.impl.CloudSolrClient.getRouteException(CloudSolrClient.java:125)
at org.apache.solr.client.solrj.impl.CloudSolrClient.getRouteException(CloudSolrClient.java:46)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.directUpdate(BaseCloudSolrClient.java:559)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1046)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:906)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:971)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:838)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:213)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:106)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:71)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:85)
at org.apache.ranger.audit.destination.SolrAuditDestination.lambda$addDocsToSolr$0(SolrAuditDestination.java:493)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.ranger.audit.utils.AbstractKerberosUser.doAs(AbstractKerberosUser.java:151)
at org.apache.ranger.audit.utils.KerberosAction.execute(KerberosAction.java:70)
at org.apache.ranger.audit.destination.SolrAuditDestination.addDocsToSolr(SolrAuditDestination.java:499)
at org.apache.ranger.audit.destination.SolrAuditDestination.log(SolrAuditDestination.java:282)
at org.apache.ranger.audit.provider.BaseAuditHandler.logJSON(BaseAuditHandler.java:192)
at org.apache.ranger.audit.queue.AuditFileSpool.sendEvent(AuditFileSpool.java:880)
at org.apache.ranger.audit.queue.AuditFileSpool.runLogAudit(AuditFileSpool.java:827)
at org.apache.ranger.audit.queue.AuditFileSpool.run(AuditFileSpool.java:757)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request:[http://ip-10-2-0-224.ec2.internal:8993/solr/ranger_audits_shard1_replica_n1]
at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:345)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.lambda$directUpdate$0(BaseCloudSolrClient.java:533)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://ip-10-2-0-224.ec2.internal:8993/solr/ranger_audits_shard1_replica_n1
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:319)
... 6 more
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to ip-10-2-0-224.ec2.internal:8993 [ip-10-2-0-224.ec2.internal/10.2.0.224] failed: Connection refused (Connection refused)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
... 10 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
... 20 more
2. In addition, we have a Health Test issue, which might be connected to issue #1 (?):
We are kind of lost; our cluster is not working and we need to get it working 🙁🙏
Created 01-08-2024 07:32 AM
Hi @George-Megre, it looks like your standby NameNode (SBNN) is not performing checkpoints. Can you perform a manual checkpoint now and see what happens on the SBNN side?
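A manual checkpoint is usually triggered with hdfs dfsadmin. The following is only a minimal sketch, assuming a shell where the hdfs client configuration is in place and, on a Kerberized cluster, a valid ticket for the hdfs principal is already held. The helper names run, show_ha_state and manual_checkpoint are hypothetical, and DRY_RUN=1 only prints the commands instead of running them:

```shell
# Hypothetical wrapper: with DRY_RUN=1 it prints the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Show which NameNode is currently active and which is standby.
show_ha_state() {
  run hdfs haadmin -getAllServiceState
}

# Force a checkpoint: saveNamespace requires the NameNode to be in safe mode.
manual_checkpoint() {
  run hdfs dfsadmin -safemode enter || return 1
  run hdfs dfsadmin -saveNamespace
  rc=$?
  # Leave safe mode even if saveNamespace failed.
  run hdfs dfsadmin -safemode leave
  return "$rc"
}
```

Running show_ha_state first should tell you which host is the standby; the block itself only defines the helpers, so nothing touches the cluster until you call manual_checkpoint without DRY_RUN set.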
Also verify that the Solr server and port are reachable: ip-10-2-0-224.ec2.internal:8993.
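One way to check reachability from the NameNode host is a plain TCP connect test. This is a rough sketch, assuming bash and the coreutils timeout command are available; port_open is a hypothetical helper name, and the host and port are the ones from the "Connection refused" message in the log:

```shell
# Returns 0 if a TCP connection to host:port succeeds within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Host and port taken from the Ranger audit error in the log.
if port_open ip-10-2-0-224.ec2.internal 8993; then
  echo "Solr port reachable"
else
  echo "Solr port NOT reachable"
fi
```

"Connection refused" usually means the host answered but nothing is listening on that port, so if the check fails it is worth confirming that the Solr server role is actually running.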
Created 01-08-2024 08:00 AM
@Majeti can you point me to where the SBNN is and what I should verify there?
Thanks!