Created on 12-23-2022 12:40 PM - edited 12-23-2022 04:10 PM
We have a Hadoop cluster based on Hortonworks HDP, version HDP 3.1.0.0-78.
The cluster includes 2 NameNode services, where one is the standby NameNode and the other is the active NameNode. All machines in the cluster run CentOS 7.9, and we don't see any problem at the OS level.
The cluster also includes 87 DataNode machines (plus 9 admin nodes with the various master services running on them). All are physical machines, with around 7 PB of data volume, about 75% full.
The story begins with NN1 and NN2 no longer working at the same time, I mean as active and standby. They had been working for more than 2-3 years without issue, but for the last 2-3 months they don't stay running at the same time. When I look at the NN logs, after one NN becomes active and the second one starts running, at some point all 3 JournalNodes throw an exception and NN2 goes down.
P.Q.161.12 : lvs-hdadm-102 (NN1, JN1).
P.Q.161.13 : lvs-hdadm-103 (NN2, JN2)
P.Q.161.14 : lvs-hdadm-104 (JN3)
2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR
2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR
2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR
After all 3 JournalNodes (JNs) returned write errors, the NameNode hit a FATAL error:
2022-12-06 10:38:11,080 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2196111639))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
P.Q.161.13:8485: IPC's epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR
and shut down the NN:
2022-12-06 10:38:11,082 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2196111639
2022-12-06 10:38:11,095 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2196111639))
2022-12-06 10:38:11,132 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-103.corp.ebay.com/P.Q.161.13
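The "IPC's epoch ... is less than the last promised epoch" message is the QJM fencing check: once the JournalNodes have promised a newer epoch to one NameNode, writes from the other NameNode (still on the older epoch) are rejected, and because no quorum can be reached that writer shuts itself down. A minimal sketch of how one might check which NameNode currently holds which role and what epoch each JournalNode last promised; the nameservice ID GISHortonDR comes from the log above, while the nn1/nn2 IDs and the journal edits path are assumptions to be replaced with your own values:

# List the configured NameNode IDs for the nameservice, then ask each one for its HA state.
hdfs getconf -confKey dfs.ha.namenodes.GISHortonDR
hdfs haadmin -getServiceState nn1   # expect exactly one "active"
hdfs haadmin -getServiceState nn2   # and one "standby"

# On each JournalNode host, the promised/writer epochs live under dfs.journalnode.edits.dir
# (shown here with the usual HDP default path).
cat /hadoop/hdfs/journal/GISHortonDR/current/last-promised-epoch
cat /hadoop/hdfs/journal/GISHortonDR/current/last-writer-epoch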
Created 12-29-2022 12:04 AM
@mabilgen you have 143 million blocks in the cluster and the NN heap is 95 GB. This is why the NN is not holding up. You would need to bring the total block count down to 90 million for the NN to start working properly, as the NN expects at least 150 GB of heap for 143 million blocks to work smoothly.
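As a quick sanity check on both numbers in that advice, the block total and the heap actually granted to the NameNode JVM can be read back from the cluster itself. A rough sketch, assuming the active NameNode web UI is on port 50070 (it may be 9870 depending on dfs.namenode.http-address) and ACTIVE_NN_HOST is a placeholder; note that fsck can take a while on a cluster of this size:

# Total block count: fsck prints "Total blocks (validated)" in its summary,
# and the NameNode JMX exposes the same figure as BlocksTotal.
hdfs fsck / 2>/dev/null | grep -i 'Total blocks'
curl -s 'http://ACTIVE_NN_HOST:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep BlocksTotal

# Heap actually granted to the running NameNode process (look at the -Xmx flag).
ps -ef | grep '[N]ameNode' | grep -o 'Xmx[0-9a-zA-Z]*'

If the heap needs to be raised, it is normally set via the NameNode Java heap size field in Ambari (HDFS > Configs, hadoop-env) so the change is not overwritten on the next config push.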
Created 01-25-2023 01:29 PM
It looks like the same node is going down every time. I have tried a total of 4-5 times, but NN1 keeps going down. I have compared the JVM heap memory size between NN1 and NN2, and both are the same (both get the config from Ambari correctly).
Not sure what to try next?
Created 01-25-2023 09:26 PM
If the same node is going down every time, it's worth checking the memory utilization at the OS level. You can check /var/log/messages on the NN host around the time the NN went down and see whether the process was killed by the OOM killer.
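A minimal sketch of that check, run on the NameNode host that keeps going down (the paths and patterns are the usual CentOS 7 kernel OOM-killer messages):

# Kernel OOM-killer events land in the syslog and in the kernel ring buffer.
grep -iE 'out of memory|oom-killer' /var/log/messages
dmesg -T | grep -iE 'out of memory|oom-killer'

If the NameNode was the victim, you will typically see a line like "Out of memory: Kill process <pid> (java)" close to the time the NN disappeared.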
Created 01-25-2023 08:34 PM
Hello,
Check your NameNode address in core-site.xml. Change it to 50070 or 9000 and try again.
The default address of the NameNode web UI is http://localhost:50070/. You can open this address in your browser and check the NameNode information. The default address of the NameNode server is hdfs://localhost:8020/. You can connect to it to access HDFS via the HDFS API; this is the real service address.
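Since this is an HA cluster, a hedged sketch of how to read the effective addresses back rather than guessing ports: fs.defaultFS should resolve to the nameservice ID (GISHortonDR in the logs above) rather than a single host:port, and the nn1 suffix below is an example to be replaced with your own NameNode ID.

# Client-side NameNode address from core-site.xml.
hdfs getconf -confKey fs.defaultFS

# Hostnames of the configured NameNodes.
hdfs getconf -namenodes

# With HA, the per-NameNode RPC and web UI addresses carry the nameservice and NameNode ID
# as suffixes (example IDs shown).
hdfs getconf -confKey dfs.namenode.rpc-address.GISHortonDR.nn1
hdfs getconf -confKey dfs.namenode.http-address.GISHortonDR.nn1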