<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: HA - Name Nodes don't get started in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362306#M238747</link>
    <description>&lt;P&gt;If the same node is going down every time, it's worth checking memory utilization at the OS level. Check /var/log/messages on the NN host for the time when the NN went down and see whether the process was killed by the OOM killer.&lt;/P&gt;</description>
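The OOM check described above can be sketched as a quick grep. This is a minimal sketch: /var/log/messages is the standard CentOS 7 syslog location, and the pattern covers the usual kernel OOM-killer message forms.

```shell
# Look for kernel OOM-killer activity around the time the NameNode went down.
# The kernel logs lines such as "Out of memory: Kill process <pid> (java) ..."
# when it kills a process.
oom_pattern='out of memory|oom-killer|killed process'
grep -iE "$oom_pattern" /var/log/messages 2>/dev/null || true
```

If the NameNode's Java PID shows up in a matching line, the fix is to reduce memory pressure on the host rather than to tune HDFS itself.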
    <pubDate>Thu, 26 Jan 2023 05:26:12 GMT</pubDate>
    <dc:creator>rki_</dc:creator>
    <dc:date>2023-01-26T05:26:12Z</dc:date>
    <item>
      <title>HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360106#M238285</link>
      <description>&lt;P&gt;We have a Hadoop cluster based on Hortonworks HDP, version HDP 3.1.0.0-78.&lt;/P&gt;&lt;P&gt;The cluster includes 2 NameNode services, one standby NameNode and one active NameNode. All machines in the cluster run CentOS 7.9, and we don't see any problem at the OS level.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The cluster also includes 87 DataNode machines (plus 9 admin nodes with various master services running on them). All are physical machines, with around 7 PB of data volume, 75% full.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The story began when NN1 and NN2 stopped running at the same time, i.e. as active and standby. They had worked for more than 2-3 years without issue; in the last 2-3 months they no longer stay up at the same time. Looking at the NN logs: after one NN becomes active the second one starts running, and at some point all 3 JournalNodes throw an exception and NN2 goes down.&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;/&lt;/SPAN&gt;&lt;SPAN&gt;************************************************************&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;I have seen NN failover occur after the NameNode got an exception from all 3 JNs.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;See the logs below. {&lt;EM&gt;Truncated unnecessary logs&lt;/EM&gt;}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;PRE&gt;P.Q.161.12 : lvs-hdadm-102 (NN1, JN1). &lt;BR /&gt;P.Q.161.13 : lvs-hdadm-103 (NN2, JN2) &lt;BR /&gt;P.Q.161.14 : lvs-hdadm-104 (JN3)&lt;BR /&gt;&lt;BR /&gt;2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2196111640-2196111640. 
Will try to write to this JN again after the next log roll.&lt;BR /&gt;org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC s epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR&lt;BR /&gt;&lt;BR /&gt;2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.&lt;BR /&gt;org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPCs epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR&lt;BR /&gt;&lt;BR /&gt;2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.&lt;BR /&gt;org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC s epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR&lt;BR /&gt;After 3 Journal Node (JN) return write error it got FATAL error&lt;BR /&gt;2022-12-06 10:38:11,080 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2196111639))&lt;BR /&gt;org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 
3 exceptions thrown:&lt;BR /&gt;P.Q.161.13:8485: IPC s epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR&lt;BR /&gt;and shutdown the NN&lt;BR /&gt;2022-12-06 10:38:11,082 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2196111639&lt;BR /&gt;2022-12-06 10:38:11,095 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2196111639))&lt;BR /&gt;2022-12-06 10:38:11,132 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;BR /&gt;/************************************************************&lt;BR /&gt;SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-103.corp.ebay.com/P.Q.161.13&lt;/PRE&gt;&lt;/DIV&gt;</description>
      <pubDate>Sat, 24 Dec 2022 00:10:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360106#M238285</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-24T00:10:25Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360110#M238288</link>
      <description>&lt;DIV class="cause"&gt;&lt;H2&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/99416"&gt;@mabilgen&lt;/a&gt;&amp;nbsp;Cause&lt;/H2&gt;&lt;P&gt;The root cause is a condition where the JournalNode sees that the RPC request from the NameNode carries a lower epoch value than the locally stored promised epoch, i.e. the NameNode's epoch is not the newest. It therefore logs the warning "IPC's epoch 155 is less than the last promised epoch 156".&lt;BR /&gt;The JournalNode rejects such an RPC request from the NameNode to avoid split-brain; it only accepts RPC requests from the NameNode that sends the newest epoch.&lt;/P&gt;&lt;P&gt;This can be caused by various conditions in the environment:&lt;/P&gt;&lt;P&gt;1.) a big job&lt;BR /&gt;2.) a network issue&lt;BR /&gt;3.) not enough resources on the node&lt;/P&gt;&lt;/DIV&gt;&lt;DIV class="instruction"&gt;&lt;H2&gt;Instructions&lt;/H2&gt;&lt;P&gt;To resolve the issue, first identify whether the network in the cluster is stable. It is also worth checking whether the issue regularly hits one specific NameNode; if so, there could be a network hardware or resource problem on that NameNode.&lt;BR /&gt;A few tunables:&lt;/P&gt;&lt;P&gt;a.) Raise &lt;STRONG&gt;dfs.datanode.max.transfer.threads&lt;/STRONG&gt; from 4K to 16K and observe the performance of the cluster.&lt;BR /&gt;b.) Raise the NameNode heap size if there is too much GC activity or overly long full GCs.&lt;/P&gt;&lt;/DIV&gt;</description>
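For reference, tunable a.) can be sketched as an hdfs-site.xml fragment. The 16K value is the one suggested in this thread; on an HDP cluster this is normally set through Ambari (HDFS service configs) rather than by editing the file by hand, and the NameNode heap from b.) is likewise adjusted via the NameNode Java heap setting there.

```xml
<!-- hdfs-site.xml fragment (a sketch; apply via Ambari on HDP):
     raise the DataNode transfer thread cap from the 4096 default
     to 16384, as suggested above. -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>16384</value>
</property>
```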
      <pubDate>Sat, 24 Dec 2022 04:56:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360110#M238288</guid>
      <dc:creator>Kartik_Agarwal</dc:creator>
      <dc:date>2022-12-24T04:56:33Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360145#M238305</link>
      <description>&lt;P&gt;It is happening on both NNs; this last time it happened on the standby (NN1). In the NN logs I have seen a JVM pause warning message at roughly 15-minute intervals, each reporting about an ~80 sec pause.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;The other main issue: the NN doesn't respond to &lt;STRONG&gt;hdfs dfs cli&lt;/STRONG&gt; commands, or takes too long to respond.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;############ P.Q.161.12 : lvs-hdadm-102 (NN1, JN1) NameNode Logs&lt;BR /&gt;&lt;BR /&gt;2022-12-24 18:58:16,804 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 28062ms&lt;BR /&gt;No GCs detected&lt;BR /&gt;2022-12-24 19:12:37,133 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 854108ms&lt;BR /&gt;GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=854065ms&lt;BR /&gt;2022-12-24 19:13:06,386 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 27250ms&lt;BR /&gt;No GCs detected&lt;BR /&gt;2022-12-24 19:27:28,115 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 828717ms&lt;BR /&gt;GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=829081ms&lt;BR /&gt;2022-12-24 19:27:57,400 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 26782ms&lt;BR /&gt;No GCs detected&lt;BR /&gt;&lt;BR /&gt;############&lt;BR /&gt;2022-12-24 19:28:12,812 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: starting log segment 2205245876 failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=null))&lt;BR /&gt;org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many 
exceptions to achieve quorum size 2/3. 3 exceptions thrown:&lt;BR /&gt;P.Q.161.12:8485: IPC's epoch 205 is less than the last promised epoch 206 ; journal id: GISHortonDR&lt;BR /&gt;P.Q.161.13:8485: IPC's epoch 205 is less than the last promised epoch 206 ; journal id: GISHortonDR&lt;BR /&gt;P.Q.161.14:8485: IPC's epoch 205 is less than the last promised epoch 206 ; journal id: GISHortonDR&lt;BR /&gt;2022-12-24 19:28:12,841 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: Error: starting log segment 2205245876 failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=null))&lt;BR /&gt;2022-12-24 19:28:12,845 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;/PRE&gt;&lt;P&gt;Dont see any error on network stats;&lt;/P&gt;&lt;PRE&gt;[root@lvs-hdadm-102 ~]# ifconfig -a&lt;BR /&gt;bond0: flags=5187&amp;lt;UP,BROADCAST,RUNNING,MASTER,MULTICAST&amp;gt; mtu 9000&lt;BR /&gt;inet 10.229.161.12 netmask 255.255.254.0 broadcast 10.229.161.255&lt;BR /&gt;ether 5c:b9:01:89:1b:5c txqueuelen 1000 (Ethernet)&lt;BR /&gt;RX packets 1777351730 bytes 2383624637851 (2.1 TiB)&lt;BR /&gt;RX errors 0 dropped 0 overruns 0 frame 0&lt;BR /&gt;TX packets 2093695795 bytes 1376445489747 (1.2 TiB)&lt;BR /&gt;TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;[root@lvs-hdadm-102 ~]# netstat -i&lt;BR /&gt;Kernel Interface table&lt;BR /&gt;Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg&lt;BR /&gt;bond0 9000 1777439746 0 0 0 2093764662 0 0 0 BMmRU&lt;BR /&gt;eno1 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;eno2 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;eno3 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;eno4 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;eno49 9000 1717336519 0 0 0 2093764662 0 0 0 BMsRU&lt;BR /&gt;eno50 9000 60103257 0 0 0 0 0 0 0 BMsRU&lt;BR /&gt;ens1f0 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;ens1f1 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;lo 65536 
41174312 0 0 0 41174312 0 0 0 LRU&lt;/PRE&gt;&lt;P&gt;&lt;BR /&gt;And both NNs are running with very high CPU utilization, mostly at 100%.&lt;/P&gt;&lt;PRE&gt;## NN2:&lt;BR /&gt;[root@lvs-hdadm-103 ~]# top -p 37939&lt;BR /&gt;top - 10:35:17 up 33 days, 21:59, 1 user, load average: 1.33, 1.31, 1.41&lt;BR /&gt;Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie&lt;BR /&gt;%Cpu(s): 1.8 us, 0.8 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st&lt;BR /&gt;KiB Mem : 13173120+total, 5608700 free, 11516615+used, 10956348 buff/cache&lt;BR /&gt;KiB Swap: 0 total, 0 free, 0 used. 13160080 avail Mem&lt;BR /&gt;&lt;BR /&gt;PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND&lt;BR /&gt;37939 hdfs 20 0 100.0g 98.2g 15820 S 100.3 78.2 6848:11 java&lt;BR /&gt;&lt;BR /&gt;&lt;/PRE&gt;&lt;P&gt;Mem config:&lt;/P&gt;&lt;PRE&gt;-Xms98304m -Xmx98304m&lt;BR /&gt;Total Mem on each NN : 128 GB&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have changed&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;dfs.datanode.max.transfer.threads&lt;/STRONG&gt; from 4K to &lt;STRONG&gt;16K&lt;/STRONG&gt; and restarted all the required services (NN, etc.), which takes some time. I will let you know if the same issue happens again.&lt;/SPAN&gt;&lt;/P&gt;</description>
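The GC pauses in the log above run to roughly 850 seconds each, far beyond the QJM write timeout (dfs.qjournal.write-txns.timeout.ms, 20 s by default), which is consistent with the epoch errors: while one NN is frozen in GC, the other wins the epoch and the frozen NN is fenced on wake-up. A minimal sketch (assuming GNU grep and awk) to total the pauses reported by JvmPauseMonitor in such a log:

```shell
# Sum every "pause of approximately NNNms" reported by JvmPauseMonitor,
# to see how much wall-clock time the NameNode spent frozen.
sum_pauses() {
  grep -oE 'approximately [0-9]+ms' "$1" | grep -oE '[0-9]+' | awk '{s += $1} END {print s + 0}'
}

# Example (hypothetical path, modeled on the one shown later in this thread):
# sum_pauses /var/log/hadoop/hdfs/hadoop-hdfs-namenode-lvs-hdadm-102.domain.com.log
```

With a 98 GB CMS heap, pauses of this length suggest the heap size or collector choice, not the JournalNodes, is the thing to tune.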
      <pubDate>Tue, 27 Dec 2022 00:21:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360145#M238305</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-27T00:21:07Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360149#M238308</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/99416"&gt;@mabilgen&lt;/a&gt;&amp;nbsp;Thanks for the update; keep us posted if this issue occurs again.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;If you found that the provided solution(s) assisted you with your query, please take a moment to log in and click &lt;FONT face="arial black,avant garde" size="5"&gt;&lt;EM&gt;&lt;STRONG&gt;&lt;FONT color="#FF6600"&gt;Accept as Solution&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/EM&gt;&lt;/FONT&gt; below each response that helped.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Dec 2022 05:54:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360149#M238308</guid>
      <dc:creator>Kartik_Agarwal</dc:creator>
      <dc:date>2022-12-27T05:54:54Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360181#M238316</link>
      <description>&lt;P&gt;A few findings:&lt;/P&gt;&lt;P&gt;After changing&amp;nbsp;&lt;STRONG&gt;dfs.datanode.max.transfer.threads&lt;/STRONG&gt;&amp;nbsp;from 4K to&amp;nbsp;&lt;STRONG&gt;16K&lt;/STRONG&gt;, I restarted all the required services, and one day later I started the second NameNode (NN2). But once NN2 started, NN1 got shut down (see logs). The same issue happened again: a client.QuorumJournalManager exception, and the NN shut down.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have noticed that the NameNode process runs at 100% CPU and most of the time cannot respond to hdfs CLI commands. I am wondering: is there any way to spread the NameNode work across more threads?&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;### NN2 lvs-hdadm-103 started on 12/27 at 11:09 am&lt;BR /&gt;2022-12-26 17:06:02,973 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;BR /&gt;SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-103.domain.com/P.Q.161.13&lt;BR /&gt;2022-12-27 11:09:41,059 INFO namenode.NameNode (LogAdapter.java:info(51)) - STARTUP_MSG:&lt;BR /&gt;STARTUP_MSG: Starting NameNode&lt;BR /&gt;STARTUP_MSG: host = lvs-hdadm-103.domain.com/P.Q.161.13&lt;BR /&gt;STARTUP_MSG: args = []&lt;BR /&gt;STARTUP_MSG: version = 3.1.1.3.1.0.0-78&lt;/PRE&gt;&lt;P&gt;And these are the NN1 
logs;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;BR /&gt;[root@lvs-hdadm-102 ~]# grep -Ei "SHUTDOWN_MSG|WARN client|FATAL" /var/log/hadoop/hdfs/hadoop-hdfs-namenode-lvs-hdadm-102.domain.com.log&lt;BR /&gt;2022-12-27 11:24:16,681 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 4836ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.14:8485&lt;BR /&gt;2022-12-27 11:24:16,889 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5045ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.12:8485&lt;BR /&gt;2022-12-27 11:24:16,964 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5119ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.13:8485&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2206645777-2206645778. 
Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,323 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2206626947))&lt;BR /&gt;2022-12-27 12:56:45,325 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2206626947&lt;BR /&gt;2022-12-27 12:56:45,340 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;BR /&gt;SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-102.domain.com/P.Q.161.12&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;[root@lvs-hdadm-102 ~]# grep -Ei "SHUTDOWN_MSG|WARN|FATAL" /var/log/hadoop/hdfs/hadoop-hdfs-namenode-lvs-hdadm-102.domain.com.log&lt;BR /&gt;&lt;BR /&gt;2022-12-27 09:50:37,381 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5d2 is not valid for DN d233a1ba-de3e-448d-8151-daf51eb7f287, because the DN is not in the pending set.&lt;BR /&gt;## above message got x 88 times&lt;BR /&gt;&lt;BR /&gt;2022-12-27 09:50:37,447 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5ec is not valid for DN 025adeaa-bce6-4a61-be73-8c66707084ba, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 09:50:39,308 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/app_logs_archive/archive_v3/app_logs/4D69DE10-C9C0-4A5C-BB68-039B1B1F7FCC/1614643200_1603584000/1614643200_1603584000/db_1609183032_1609057428_9023_4D69DE10-C9C0-4A5C-BB68-039B1B1F7FCC/journal.gz has not been closed. Lease recovery is in progress. 
RecoveryId = 230287624 for block blk_-9223372035367162256_128679723&lt;BR /&gt;2022-12-27 09:50:41,098 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 4 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 09:50:41,357 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 6 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 09:50:41,411 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;&lt;BR /&gt;2022-12-27 10:05:13,822 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 867425ms&lt;BR /&gt;2022-12-27 10:05:36,935 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 21110ms&lt;BR /&gt;2022-12-27 
10:07:37,448 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5ec is not valid for DN 025adeaa-bce6-4a61-be73-8c66707084ba, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 10:07:37,897 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/app_logs_archive/archive_v3/app_logs/2F4DDF15-6E57-4584-AB79-F34C6479E3F8/1609113600_1607731200/1608249600_1608076800/db_1608204596_1608125285_2372_2F4DDF15-6E57-4584-AB79-F34C6479E3F8/journal.gz has not been closed. Lease recovery is in progress. RecoveryId = 230287627 for block blk_-9223372035389569520_127278003&lt;BR /&gt;2022-12-27 10:07:38,298 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 6 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,311 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,328 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 6 to reach 9 (unavailableStorages=[], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,338 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,344 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 7 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,353 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:21:11,867 WARN blockmanagement.HeartbeatManager (HeartbeatManager.java:run(462)) - Skipping next heartbeat scan due to excessive 
pause&lt;BR /&gt;2022-12-27 10:21:12,090 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 813036ms&lt;BR /&gt;2022-12-27 10:21:37,394 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23302ms&lt;BR /&gt;2022-12-27 10:36:35,325 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 859903ms&lt;BR /&gt;2022-12-27 10:37:00,238 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22905ms&lt;BR /&gt;2022-12-27 10:52:33,328 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 893538ms&lt;BR /&gt;2022-12-27 10:53:00,215 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 24865ms&lt;BR /&gt;2022-12-27 11:07:58,912 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 865649ms&lt;BR /&gt;2022-12-27 11:08:24,345 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23432ms&lt;BR /&gt;## from this point NN2 got started up at 11:09&lt;BR /&gt;2022-12-27 11:23:43,406 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 890551ms&lt;BR /&gt;2022-12-27 11:24:10,580 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 25173ms&lt;BR /&gt;2022-12-27 11:24:11,813 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:requestLease(230)) - DN 165757d4-6293-4808-a136-24d3a4d3c676 (P.Q.161.62:50010) requested 
a lease even though it wasn't yet registered. Registering now.&lt;BR /&gt;2022-12-27 11:24:11,886 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5d5 is not valid for DN b581acf8-f4d3-4c9b-9ef4-43faace4e7be, because the DN is not in the pending set.&lt;BR /&gt;## above message got x 116 times&lt;BR /&gt;&lt;BR /&gt;2022-12-27 11:24:11,958 WARN hdfs.StateChange (FSDirRenameOp.java:validateRenameSource(560)) - DIR* FSDirectory.unprotectedRenameTo: rename source /ingest/splunk/hdr/network_archive/tmp/com.splunk.roll.Transactor-e4ff7a87-8b84-40c1-a1a2-d361eccb830f.tmp is not found.&lt;BR /&gt;2022-12-27 11:24:12,020 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/os_archive/archive_v3/os/E6527EDD-8E52-4B86-9ABD-D205AD24E2E8/1606348800_1603584000/1606348800_1603584000/db_1605485341_1604591209_1936_E6527EDD-8E52-4B86-9ABD-D205AD24E2E8/journal.gz has not been closed. Lease recovery is in progress. 
RecoveryId = 230287636 for block blk_-9223372035573309296_115793229&lt;BR /&gt;2022-12-27 11:24:16,681 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 4836ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.14:8485&lt;BR /&gt;2022-12-27 11:24:16,889 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5045ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.12:8485&lt;BR /&gt;2022-12-27 11:24:16,964 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5119ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.13:8485&lt;BR /&gt;2022-12-27 11:39:36,321 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 842594ms&lt;BR /&gt;2022-12-27 11:40:03,309 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 24986ms&lt;BR /&gt;2022-12-27 11:40:34,887 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d607 is not valid for DN b8a82844-d2eb-4654-b365-a003212fc883, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 11:40:34,961 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5ec is not valid for DN 025adeaa-bce6-4a61-be73-8c66707084ba, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 11:40:34,965 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(317)) - BR lease 0x9c43be53a3d0d608 is not valid for DN 165757d4-6293-4808-a136-24d3a4d3c676, because the lease has expired.&lt;BR /&gt;2022-12-27 11:40:34,965 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d607 is not valid for DN b8a82844-d2eb-4654-b365-a003212fc883, because the DN is not in the pending 
set.&lt;BR /&gt;2022-12-27 11:40:34,965 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5fc is not valid for DN 4638701d-0001-4086-9de6-dfc9fe1c67d7, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 11:40:34,966 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d607 is not valid for DN b8a82844-d2eb-4654-b365-a003212fc883, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 11:40:37,235 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/main_archive/archive_v3/main/441D58DC-A942-428A-958A-4D3DFC583008/1600819200_1599436800/1600819200_1600128000/db_1600627689_1600336233_1710_441D58DC-A942-428A-958A-4D3DFC583008/journal.gz has not been closed. Lease recovery is in progress. RecoveryId = 230287641 for block blk_-9223372035529627248_118523521&lt;BR /&gt;2022-12-27 11:40:38,717 WARN blockmanagement.HeartbeatManager (HeartbeatManager.java:run(462)) - Skipping next heartbeat scan due to excessive pause&lt;BR /&gt;2022-12-27 11:55:45,850 WARN blockmanagement.HeartbeatManager (HeartbeatManager.java:run(462)) - Skipping next heartbeat scan due to excessive pause&lt;BR /&gt;2022-12-27 11:55:46,047 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 881892ms&lt;BR /&gt;2022-12-27 11:56:10,972 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22918ms&lt;BR /&gt;2022-12-27 12:11:18,074 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 871503ms&lt;BR /&gt;2022-12-27 12:11:43,010 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of 
approximately 22935ms&lt;BR /&gt;2022-12-27 12:26:24,884 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 854865ms&lt;BR /&gt;2022-12-27 12:26:49,182 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22296ms&lt;BR /&gt;2022-12-27 12:41:11,944 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 836753ms&lt;BR /&gt;2022-12-27 12:41:36,208 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22261ms&lt;BR /&gt;2022-12-27 12:56:13,594 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 854379ms&lt;BR /&gt;2022-12-27 12:56:37,767 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22172ms&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2206645777-2206645778. 
Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,323 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2206626947))&lt;BR /&gt;2022-12-27 12:56:45,325 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2206626947&lt;BR /&gt;2022-12-27 12:56:45,340 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;BR /&gt;SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-102.domain.com/P.Q.161.12&lt;BR /&gt;[root@lvs-hdadm-102 ~]#&lt;/PRE&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 27 Dec 2022 23:31:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360181#M238316</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-27T23:31:22Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360182#M238317</link>
      <description>&lt;P&gt;Here are the NameNode JVM settings,&lt;/P&gt;&lt;P&gt;based on the best practices described &lt;A href="https://community.cloudera.com/t5/Community-Articles/NameNode-Garbage-Collection-Configuration-Best-Practices-and/ta-p/245276" target="_self"&gt;on this page&lt;/A&gt;:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;/usr/jdk64/jdk1.8.0_112/bin/java&lt;BR /&gt;-Dproc_namenode -Dhdp.version=3.1.0.0-78&lt;BR /&gt;-Djava.net.preferIPv4Stack=true&lt;BR /&gt;-Dhdp.version=3.1.0.0-78&lt;BR /&gt;-Dhdfs.audit.logger=INFO,NullAppender&lt;BR /&gt;-server&lt;BR /&gt;-XX:ParallelGCThreads=8&lt;BR /&gt;-XX:+UseConcMarkSweepGC&lt;BR /&gt;-XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log&lt;BR /&gt;-XX:NewSize=12288m&lt;BR /&gt;-XX:MaxNewSize=12288m&lt;BR /&gt;## ---- This is missing on my config&lt;BR /&gt;-XX:PermSize=12288m -XX:MaxPermSize=24576m&lt;BR /&gt;## -----&lt;BR /&gt;-Xloggc:/var/log/hadoop/hdfs/gc.log-202212222050&lt;BR /&gt;-verbose:gc&lt;BR /&gt;-XX:+PrintGCDetails&lt;BR /&gt;-XX:+PrintGCTimeStamps&lt;BR /&gt;-XX:+PrintGCDateStamps&lt;BR /&gt;-XX:CMSInitiatingOccupancyFraction=70&lt;BR /&gt;-XX:+UseCMSInitiatingOccupancyOnly&lt;BR /&gt;-Xms98304m -Xmx98304m&lt;BR /&gt;-Dhadoop.security.logger=INFO,DRFAS&lt;BR /&gt;-Dhdfs.audit.logger=INFO,DRFAAUDIT&lt;BR /&gt;-XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node"&lt;BR /&gt;-Dorg.mortbay.jetty.Request.maxFormContentSize=-1&lt;BR /&gt;-Dyarn.log.dir=/var/log/hadoop/hdfs&lt;BR /&gt;-Dyarn.log.file=hadoop-hdfs-namenode-lvs-hdadm-103.domain.com.log&lt;BR /&gt;-Dyarn.home.dir=/usr/hdp/3.1.0.0-78/hadoop-yarn&lt;BR /&gt;-Dyarn.root.logger=INFO,console&lt;BR /&gt;-Djava.library.path=:/usr/hdp/3.1.0.0-78/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.0.0-78/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.0.0-78/hadoop/lib/native&lt;BR /&gt;-Dhadoop.log.dir=/var/log/hadoop/hdfs&lt;BR /&gt;-Dhadoop.log.file=hadoop-hdfs-namenode-lvs-hdadm-103.domain.com.log&lt;BR 
/&gt;-Dhadoop.home.dir=/usr/hdp/3.1.0.0-78/hadoop&lt;BR /&gt;-Dhadoop.id.str=hdfs&lt;BR /&gt;-Dhadoop.root.logger=INFO,RFA&lt;BR /&gt;-Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.namenode.NameNode&lt;/PRE&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 28 Dec 2022 00:52:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360182#M238317</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-28T00:52:14Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360204#M238330</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/99416"&gt;@mabilgen&lt;/a&gt;, the main problem on this cluster is a lack of RAM on the host, which is limited to 128GB. On startup, the NameNode will consume its allocated heap of 98GB, leaving only 30GB of memory for all other processes running on the host.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When other processes also use that remaining 30GB, you see huge JVM pauses: the garbage collector tries to de-reference objects to free up memory, but this takes so long that the NameNode gives up on the JournalNode quorum and fails over.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As a rule of thumb, you should allocate 1GB of heap per 1 million blocks. So if there are more than 98 million blocks on this cluster, the current NN heap is not sufficient.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1) Try to lower the total block count on the cluster by deleting any unwanted files or old snapshots.&lt;/P&gt;&lt;P&gt;2) If feasible, add more physical RAM to the host.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;No amount of tuning will help in this situation, as the JVM pauses are too large to be managed by tuning alone. You would need to either clean up HDFS, add more RAM to the NN hosts, or move the NameNode to a node with more RAM.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Dec 2022 18:04:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360204#M238330</guid>
      <dc:creator>rki_</dc:creator>
      <dc:date>2022-12-28T18:04:14Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360210#M238332</link>
      <description>&lt;P&gt;Below is some info from the cluster. So I need to either increase memory or delete some data from HDFS.&lt;/P&gt;&lt;P&gt;I am not sure whether I can delete any data, since the NN does not respond to CLI commands most of the time. Anyway, I will try both options.&lt;/P&gt;&lt;P&gt;I will let you know the results.&lt;/P&gt;&lt;P&gt;Thanks for the reply,&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/80393"&gt;@rki_&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/75395"&gt;@Kartik_Agarwal&lt;/a&gt;.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;248,193,147 files and directories, 143,527,877 blocks (7,155 replicated blocks, 143,520,722 erasure coded block groups) = 391,721,024 total filesystem object(s).

Heap Memory used 86.63 GB of 94.8 GB Heap Memory. Max Heap Memory is 94.8 GB.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 29 Dec 2022 00:23:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360210#M238332</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-29T00:23:04Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360221#M238335</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/99416"&gt;@mabilgen&lt;/a&gt;&amp;nbsp;you have 143 million blocks in the cluster and the NN heap is 95GB. This is why the NN is not holding up. You would need to bring the total block count down to about 90 million for the NN to work properly, as the NN needs at least 150GB of heap to handle 143 million blocks smoothly.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Dec 2022 08:04:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360221#M238335</guid>
      <dc:creator>rki_</dc:creator>
      <dc:date>2022-12-29T08:04:10Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362072#M238707</link>
      <description>&lt;P&gt;It looks like the memory upgrade from 128 GB to 256 GB, with 192 GB given to the JVM heap, did not resolve my issue. The same NN stopped twice, and I started it one more time; it hit the same "FATAL namenode.FSEditLog" journal error. Let's see whether it happens a third time on the same node or on the other one.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Jan 2023 23:12:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362072#M238707</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2023-01-23T23:12:29Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362292#M238742</link>
      <description>&lt;P&gt;It looks like the same node goes down every time. I have tried a total of 4-5 times, and it is always NN1 that goes down. I compared the JVM heap sizes between NN1 and NN2 and they are identical (both pick up the config from Ambari correctly).&lt;/P&gt;&lt;P&gt;Not sure what to try next?&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jan 2023 21:29:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362292#M238742</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2023-01-25T21:29:50Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362304#M238745</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Check your NameNode address in core-site.xml. Change it to 50070 or 9000 and try again.&lt;/P&gt;&lt;P&gt;The default address of the NameNode web UI is &lt;A href="http://localhost:50070/" target="_blank"&gt;http://localhost:50070/&lt;/A&gt;. You can open this address in your browser to check the NameNode information. The default address of the NameNode server is hdfs://localhost:8020/; you can connect to it to access HDFS through the HDFS API. This is the real service address.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jan 2023 04:34:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362304#M238745</guid>
      <dc:creator>atonal</dc:creator>
      <dc:date>2023-01-26T04:34:57Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362306#M238747</link>
      <description>&lt;P&gt;If the same node is going down every time, it's worth checking the memory utilization at the OS level. Check /var/log/messages on the NN host from the time the NN went down and see whether the process was killed by the OOM killer.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jan 2023 05:26:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362306#M238747</guid>
      <dc:creator>rki_</dc:creator>
      <dc:date>2023-01-26T05:26:12Z</dc:date>
    </item>
  </channel>
</rss>

