Member since 07-25-2022 · 8 Posts · 1 Kudos Received · 0 Solutions
01-25-2023 01:29 PM
It looks like the same node goes down every time. I have tried a total of 4-5 restarts, and it is always NN1 that goes down. I compared the JVM heap sizes between NN1 and NN2 and they are the same (both pick up the config from Ambari correctly). Not sure what to try next.
01-23-2023 03:12 PM
It looks like the memory upgrade from 128 GB to 256 GB, with 192 GB given to the JVM heap, is not going to resolve my issue. The same NN stopped twice more and I started it one more time; it hit the same "FATAL namenode.FSEditLog" journal error. Let's see whether it happens a third time on the same node or on the other one.
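One thing worth double-checking after a heap change like this is that the running process actually picked up the new -Xms/-Xmx values. A small sketch, assuming the NameNode was launched with the -Dproc_namenode marker visible in the command line further down this thread:

# Print the heap flags of the live NameNode JVM.
ps -o args= -p "$(pgrep -f proc_namenode)" | tr ' ' '\n' | grep -E '^-Xm[sx]'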
12-28-2022 04:23 PM
Below is some info from the cluster. So I need to either increase memory or delete some data from HDFS. I am not sure I can delete data, since the NN does not respond to CLI commands most of the time. Anyway, I will try both options and let you know the results. Thanks for the replies @rki_ and @Kartik_Agarwal.

248,193,147 files and directories, 143,527,877 blocks (7,155 replicated blocks, 143,520,722 erasure coded block groups) = 391,721,024 total filesystem object(s).
Heap Memory used 86.63 GB of 94.8 GB Heap Memory. Max Heap Memory is 94.8 GB.
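If the NN web UI is too slow to load, the same heap numbers can be pulled from the JMX servlet. A minimal sketch; it assumes the NameNode web UI answers on port 9870 (the Hadoop 3 default; some HDP installs keep the old 50070):

# HeapMemoryUsage.used vs HeapMemoryUsage.max should match the 86.63 GB /
# 94.8 GB figures above; ~91% occupancy leaves CMS very little headroom.
curl -s 'http://lvs-hdadm-102.domain.com:9870/jmx?qry=java.lang:type=Memory' | python -m json.tool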
12-27-2022 04:52 PM
Here are the NameNode JVM settings, based on the best practices on this page:

/usr/jdk64/jdk1.8.0_112/bin/java -Dproc_namenode -Dhdp.version=3.1.0.0-78 -Djava.net.preferIPv4Stack=true -Dhdp.version=3.1.0.0-78 -Dhdfs.audit.logger=INFO,NullAppender -server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log -XX:NewSize=12288m -XX:MaxNewSize=12288m -Xloggc:/var/log/hadoop/hdfs/gc.log-202212222050 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms98304m -Xmx98304m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT -XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node" -Dorg.mortbay.jetty.Request.maxFormContentSize=-1 -Dyarn.log.dir=/var/log/hadoop/hdfs -Dyarn.log.file=hadoop-hdfs-namenode-lvs-hdadm-103.domain.com.log -Dyarn.home.dir=/usr/hdp/3.1.0.0-78/hadoop-yarn -Dyarn.root.logger=INFO,console -Djava.library.path=:/usr/hdp/3.1.0.0-78/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.0.0-78/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.0.0-78/hadoop/lib/native -Dhadoop.log.dir=/var/log/hadoop/hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-lvs-hdadm-103.domain.com.log -Dhadoop.home.dir=/usr/hdp/3.1.0.0-78/hadoop -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.namenode.NameNode

## Note: compared with the best-practice example, -XX:PermSize=12288m -XX:MaxPermSize=24576m are missing from my config.
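A side note on the missing flags: on jdk1.8.0_112 the absence of -XX:PermSize/-XX:MaxPermSize should not matter, because Java 8 removed the permanent generation (those flags are ignored with a warning if set). Given the multi-minute CMS collections later in this thread, a G1GC profile is a commonly suggested alternative for heaps this large. A hedged sketch only, not an HDP-blessed setting; the values are illustrative assumptions to trial on the standby via Ambari's hadoop-env template:

# hadoop-env.sh (Ambari hadoop-env template) -- illustrative assumption.
export HADOOP_NAMENODE_OPTS="-Xms98304m -Xmx98304m -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=400 -XX:ParallelGCThreads=8 \
  -XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/var/log/hadoop/hdfs/gc.log"
# Remove -XX:+UseConcMarkSweepGC and the -XX:NewSize/-XX:MaxNewSize pair when
# switching: G1 sizes its generations itself, and conflicting collector flags
# will prevent the JVM from starting.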
12-27-2022 03:31 PM
A few findings: after changing dfs.datanode.max.transfer.threads from 4K to 16K, I restarted all the required services, and then one day later I started the second NameNode (NN2). But once NN2 started, NN1 got shut down (see logs). The same issue happened again: a client.QuorumJournalManager exception, then NN shutdown. I have noticed the namenode process runs at 100% CPU load and most of the time cannot respond to hdfs CLI commands. I am wondering, is there any way to spread the namenode work across a few more threads?

### NN2 lvs-hdadm-103 started on 12/27 at 11:09 am
2022-12-26 17:06:02,973 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-103.domain.com/P.Q.161.13
2022-12-27 11:09:41,059 INFO namenode.NameNode (LogAdapter.java:info(51)) - STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = lvs-hdadm-103.domain.com/P.Q.161.13
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 3.1.1.3.1.0.0-78

And these are the NN1 logs:

[root@lvs-hdadm-102 ~]# grep -Ei "SHUTDOWN_MSG|WARN client|FATAL" /var/log/hadoop/hdfs/hadoop-hdfs-namenode-lvs-hdadm-102.domain.com.log
2022-12-27 11:24:16,681 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 4836ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.14:8485
2022-12-27 11:24:16,889 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5045ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.12:8485
2022-12-27 11:24:16,964 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5119ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.13:8485
2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.
2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.
2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.
2022-12-27 12:56:45,323 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2206626947))
2022-12-27 12:56:45,325 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2206626947
2022-12-27 12:56:45,340 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-102.domain.com/P.Q.161.12

[root@lvs-hdadm-102 ~]# grep -Ei "SHUTDOWN_MSG|WARN|FATAL" /var/log/hadoop/hdfs/hadoop-hdfs-namenode-lvs-hdadm-102.domain.com.log
2022-12-27 09:50:37,381 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5d2 is not valid for DN d233a1ba-de3e-448d-8151-daf51eb7f287, because the DN is not in the pending set.
## the above message repeated 88 times
2022-12-27 09:50:37,447 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5ec is not valid for DN 025adeaa-bce6-4a61-be73-8c66707084ba, because the DN is not in the pending set.
2022-12-27 09:50:39,308 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/app_logs_archive/archive_v3/app_logs/4D69DE10-C9C0-4A5C-BB68-039B1B1F7FCC/1614643200_1603584000/1614643200_1603584000/db_1609183032_1609057428_9023_4D69DE10-C9C0-4A5C-BB68-039B1B1F7FCC/journal.gz has not been closed. Lease recovery is in progress. RecoveryId = 230287624 for block blk_-9223372035367162256_128679723
2022-12-27 09:50:41,098 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 4 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-12-27 09:50:41,357 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 6 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-12-27 09:50:41,411 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-12-27 10:05:13,822 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 867425ms
2022-12-27 10:05:36,935 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 21110ms
2022-12-27 10:07:37,448 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5ec is not valid for DN 025adeaa-bce6-4a61-be73-8c66707084ba, because the DN is not in the pending set.
2022-12-27 10:07:37,897 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/app_logs_archive/archive_v3/app_logs/2F4DDF15-6E57-4584-AB79-F34C6479E3F8/1609113600_1607731200/1608249600_1608076800/db_1608204596_1608125285_2372_2F4DDF15-6E57-4584-AB79-F34C6479E3F8/journal.gz has not been closed. Lease recovery is in progress. RecoveryId = 230287627 for block blk_-9223372035389569520_127278003
2022-12-27 10:07:38,298 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 6 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-12-27 10:07:38,311 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-12-27 10:07:38,328 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 6 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-12-27 10:07:38,338 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-12-27 10:07:38,344 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 7 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-12-27 10:07:38,353 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology
2022-12-27 10:21:11,867 WARN blockmanagement.HeartbeatManager (HeartbeatManager.java:run(462)) - Skipping next heartbeat scan due to excessive pause
2022-12-27 10:21:12,090 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 813036ms
2022-12-27 10:21:37,394 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23302ms
2022-12-27 10:36:35,325 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 859903ms
2022-12-27 10:37:00,238 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22905ms
2022-12-27 10:52:33,328 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 893538ms
2022-12-27 10:53:00,215 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 24865ms
2022-12-27 11:07:58,912 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 865649ms
2022-12-27 11:08:24,345 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23432ms
## NN2 started up at 11:09; from this point on:
2022-12-27 11:23:43,406 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 890551ms
2022-12-27 11:24:10,580 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 25173ms
2022-12-27 11:24:11,813 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:requestLease(230)) - DN 165757d4-6293-4808-a136-24d3a4d3c676 (P.Q.161.62:50010) requested a lease even though it wasn't yet registered. Registering now.
2022-12-27 11:24:11,886 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5d5 is not valid for DN b581acf8-f4d3-4c9b-9ef4-43faace4e7be, because the DN is not in the pending set.
## the above message repeated 116 times
2022-12-27 11:24:11,958 WARN hdfs.StateChange (FSDirRenameOp.java:validateRenameSource(560)) - DIR* FSDirectory.unprotectedRenameTo: rename source /ingest/splunk/hdr/network_archive/tmp/com.splunk.roll.Transactor-e4ff7a87-8b84-40c1-a1a2-d361eccb830f.tmp is not found.
2022-12-27 11:24:12,020 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/os_archive/archive_v3/os/E6527EDD-8E52-4B86-9ABD-D205AD24E2E8/1606348800_1603584000/1606348800_1603584000/db_1605485341_1604591209_1936_E6527EDD-8E52-4B86-9ABD-D205AD24E2E8/journal.gz has not been closed. Lease recovery is in progress. RecoveryId = 230287636 for block blk_-9223372035573309296_115793229
2022-12-27 11:24:16,681 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 4836ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.14:8485
2022-12-27 11:24:16,889 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5045ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.12:8485
2022-12-27 11:24:16,964 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5119ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.13:8485
2022-12-27 11:39:36,321 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 842594ms
2022-12-27 11:40:03,309 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 24986ms
2022-12-27 11:40:34,887 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d607 is not valid for DN b8a82844-d2eb-4654-b365-a003212fc883, because the DN is not in the pending set.
2022-12-27 11:40:34,961 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5ec is not valid for DN 025adeaa-bce6-4a61-be73-8c66707084ba, because the DN is not in the pending set.
2022-12-27 11:40:34,965 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(317)) - BR lease 0x9c43be53a3d0d608 is not valid for DN 165757d4-6293-4808-a136-24d3a4d3c676, because the lease has expired.
2022-12-27 11:40:34,965 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d607 is not valid for DN b8a82844-d2eb-4654-b365-a003212fc883, because the DN is not in the pending set.
2022-12-27 11:40:34,965 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5fc is not valid for DN 4638701d-0001-4086-9de6-dfc9fe1c67d7, because the DN is not in the pending set.
2022-12-27 11:40:34,966 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d607 is not valid for DN b8a82844-d2eb-4654-b365-a003212fc883, because the DN is not in the pending set.
2022-12-27 11:40:37,235 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/main_archive/archive_v3/main/441D58DC-A942-428A-958A-4D3DFC583008/1600819200_1599436800/1600819200_1600128000/db_1600627689_1600336233_1710_441D58DC-A942-428A-958A-4D3DFC583008/journal.gz has not been closed. Lease recovery is in progress. RecoveryId = 230287641 for block blk_-9223372035529627248_118523521
2022-12-27 11:40:38,717 WARN blockmanagement.HeartbeatManager (HeartbeatManager.java:run(462)) - Skipping next heartbeat scan due to excessive pause
2022-12-27 11:55:45,850 WARN blockmanagement.HeartbeatManager (HeartbeatManager.java:run(462)) - Skipping next heartbeat scan due to excessive pause
2022-12-27 11:55:46,047 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 881892ms
2022-12-27 11:56:10,972 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22918ms
2022-12-27 12:11:18,074 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 871503ms
2022-12-27 12:11:43,010 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22935ms
2022-12-27 12:26:24,884 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 854865ms
2022-12-27 12:26:49,182 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22296ms
2022-12-27 12:41:11,944 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 836753ms
2022-12-27 12:41:36,208 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22261ms
2022-12-27 12:56:13,594 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 854379ms
2022-12-27 12:56:37,767 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22172ms
2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.
2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.
2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.
2022-12-27 12:56:45,323 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2206626947))
2022-12-27 12:56:45,325 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2206626947
2022-12-27 12:56:45,340 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-102.domain.com/P.Q.161.12
[root@lvs-hdadm-102 ~]#
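For context on the shutdown sequence above: the 800+ second JVM pauses far exceed the QJM client's write timeout (20 seconds by default), so all three JournalNode writes "fail" at once and FSEditLog aborts. A sketch for checking the timeouts in effect; raising them (say, to 60000 ms via a custom hdfs-site property in Ambari) is only a stopgap while the GC problem is addressed, and the 60000 figure is an illustrative assumption:

# Defaults are 20000 ms; compare against the pause lengths in the log above.
hdfs getconf -confKey dfs.qjournal.write-txns.timeout.ms
hdfs getconf -confKey dfs.qjournal.start-segment.timeout.ms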
12-26-2022 04:21 PM
It is happening on both NNs; this last time it happened on the standby (NN1). In the NN logs I see a JVM pause warning roughly every 15 minutes, each reporting a pause of around 800-850 seconds. The other main issue is that the NN does not respond to hdfs dfs CLI commands, or takes too long to respond.

############ P.Q.161.12 : lvs-hdadm-102 (NN1, JN1) NameNode logs
2022-12-24 18:58:16,804 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 28062ms
No GCs detected
2022-12-24 19:12:37,133 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 854108ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=854065ms
2022-12-24 19:13:06,386 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 27250ms
No GCs detected
2022-12-24 19:27:28,115 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 828717ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=829081ms
2022-12-24 19:27:57,400 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 26782ms
No GCs detected
############
2022-12-24 19:28:12,812 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: starting log segment 2205245876 failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=null))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
P.Q.161.12:8485: IPC's epoch 205 is less than the last promised epoch 206 ; journal id: GISHortonDR
P.Q.161.13:8485: IPC's epoch 205 is less than the last promised epoch 206 ; journal id: GISHortonDR
P.Q.161.14:8485: IPC's epoch 205 is less than the last promised epoch 206 ; journal id: GISHortonDR
2022-12-24 19:28:12,841 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: Error: starting log segment 2205245876 failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=null))
2022-12-24 19:28:12,845 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:

I don't see any errors in the network stats:

[root@lvs-hdadm-102 ~]# ifconfig -a
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 9000
        inet 10.229.161.12 netmask 255.255.254.0 broadcast 10.229.161.255
        ether 5c:b9:01:89:1b:5c txqueuelen 1000 (Ethernet)
        RX packets 1777351730 bytes 2383624637851 (2.1 TiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 2093695795 bytes 1376445489747 (1.2 TiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

[root@lvs-hdadm-102 ~]# netstat -i
Kernel Interface table
Iface    MTU        RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
bond0   9000   1777439746      0      0      0 2093764662      0      0      0 BMmRU
eno1    1500            0      0      0      0          0      0      0      0 BMU
eno2    1500            0      0      0      0          0      0      0      0 BMU
eno3    1500            0      0      0      0          0      0      0      0 BMU
eno4    1500            0      0      0      0          0      0      0      0 BMU
eno49   9000   1717336519      0      0      0 2093764662      0      0      0 BMsRU
eno50   9000     60103257      0      0      0          0      0      0      0 BMsRU
ens1f0  1500            0      0      0      0          0      0      0      0 BMU
ens1f1  1500            0      0      0      0          0      0      0      0 BMU
lo     65536     41174312      0      0      0   41174312      0      0      0 LRU

And both NNs are running at very high CPU utilization, mostly 100%.

## NN2:
[root@lvs-hdadm-103 ~]# top -p 37939
top - 10:35:17 up 33 days, 21:59, 1 user, load average: 1.33, 1.31, 1.41
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
%Cpu(s): 1.8 us, 0.8 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13173120+total, 5608700 free, 11516615+used, 10956348 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 13160080 avail Mem

  PID USER  PR NI   VIRT   RES   SHR S  %CPU %MEM   TIME+  COMMAND
37939 hdfs  20  0 100.0g 98.2g 15820 S 100.3 78.2 6848:11  java

Heap config: -Xms98304m -Xmx98304m
Total memory on each NN: 128 GB

I have changed dfs.datanode.max.transfer.threads from 4K to 16K and restarted all the required services (NN, etc.), which takes some time. I will let you know if the same issue happens again.
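To see whether these pauses line up with full collections in real time, the running NameNode can be sampled with jstat. A minimal sketch; 37939 is the NameNode PID from the top output above:

# Sample GC counters every 10 s (run as the hdfs user that owns the JVM).
sudo -u hdfs jstat -gcutil 37939 10000
# An old gen (O column) pinned near 100 with FGC/FGCT climbing in ~850 s
# jumps would confirm the pauses are CMS full collections, matching the
# "ConcurrentMarkSweep ... time=854065ms" lines above.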
12-23-2022 12:40 PM
We have a Hadoop cluster based on HDP from Hortonworks, version HDP 3.1.0.0-78. The cluster includes two NameNode services, one standby and one active. All machines in the cluster run CentOS 7.9, and we don't see any problems at the OS level. The cluster also includes 87 DataNode machines (plus 9 admin nodes with various master services running on them). All physical machines, around 7 PB of data volume, 75% full.

The story begins with NN1 and NN2 no longer working at the same time, i.e. as active and standby. They had been working for more than 2-3 years without issue; over the last 2-3 months they don't stay running at the same time. When I look at the NN logs, after one NN becomes active the second one starts running, and at some point all 3 JournalNodes return an exception and NN2 goes down.

I have seen the NN failover occur after the exception from the 3 JournalNodes (JN); see the logs below. {Truncated unnecessary logs}

P.Q.161.12 : lvs-hdadm-102 (NN1, JN1)
P.Q.161.13 : lvs-hdadm-103 (NN2, JN2)
P.Q.161.14 : lvs-hdadm-104 (JN3)

2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR
2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR
2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR

After all 3 JournalNodes returned write errors, it got a FATAL error:

2022-12-06 10:38:11,080 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2196111639))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
P.Q.161.13:8485: IPC's epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR

and shut down the NN:

2022-12-06 10:38:11,082 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2196111639
2022-12-06 10:38:11,095 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2196111639))
2022-12-06 10:38:11,132 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-103.corp.ebay.com/P.Q.161.13
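A note on the "IPC's epoch 176 is less than the last promised epoch 177" lines: they mean the JournalNodes had already promised a newer epoch to the other NameNode, so the old writer is fenced, which is why all three JNs reject it at the same instant. A sketch for finding when each JN made that promise; the JournalNode log file name is an assumption patterned on the HDP log paths elsewhere in this thread:

# Run on each JN host (lvs-hdadm-102, -103, -104).
grep "lastPromisedEpoch" /var/log/hadoop/hdfs/hadoop-hdfs-journalnode-$(hostname -f).log | tail -5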
Labels: Hortonworks Data Platform (HDP)
12-23-2022 11:10 AM
I have the same problem, but in my case the JVM pause detections happen every 15 minutes, with pauses ranging from 28337 ms to 853466 ms.
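A sketch for pulling the pause durations out of the log to see that distribution at a glance (log path/name assumed from the posts above):

# Count each distinct pause length reported by the JvmPauseMonitor.
grep -ohE 'pause of approximately [0-9]+ms' /var/log/hadoop/hdfs/hadoop-hdfs-namenode-*.log | awk '{print $4}' | sort -n | uniq -c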