<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: HA - Name Nodes don't get started in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362306#M238747</link>
    <description>&lt;P&gt;If the same node is going down every time, it's worth checking memory utilization at the OS level. Check /var/log/messages on the NN host for the time when the NN went down and see whether the process was killed by the OOM killer.&lt;/P&gt;</description>
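The OOM check described above can be sketched as a quick grep. This is a minimal sketch: /var/log/messages is the standard CentOS 7 syslog location, and the pattern covers the usual kernel OOM-killer message forms.

```shell
# Look for kernel OOM-killer activity around the time the NameNode went down.
# The kernel logs lines such as "Out of memory: Kill process <pid> (java) ..."
# when it kills a process.
oom_pattern='out of memory|oom-killer|killed process'
grep -iE "$oom_pattern" /var/log/messages 2>/dev/null || true
```

If the NameNode's Java PID shows up in a matching line, the fix is to reduce memory pressure on the host rather than to tune HDFS itself.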
    <pubDate>Thu, 26 Jan 2023 05:26:12 GMT</pubDate>
    <dc:creator>rki_</dc:creator>
    <dc:date>2023-01-26T05:26:12Z</dc:date>
    <item>
      <title>HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360106#M238285</link>
      <description>&lt;P&gt;We have a Hadoop cluster based on Hortonworks HDP, version HDP 3.1.0.0-78.&lt;/P&gt;&lt;P&gt;The cluster includes 2 NameNode services, one standby NameNode and one active NameNode. All machines in the cluster run CentOS 7.9, and we don't see any problem at the OS level.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The cluster also includes 87 DataNode machines (plus 9 admin nodes with various master services running on them). All are physical machines, with around 7 PB of data volume, 75% full.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The story began when NN1 and NN2 stopped running at the same time, i.e. as active and standby. They had worked for more than 2-3 years without issue; in the last 2-3 months they no longer stay up at the same time. Looking at the NN logs: after one NN becomes active the second one starts running, and at some point all 3 JournalNodes throw an exception and NN2 goes down.&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;/&lt;/SPAN&gt;&lt;SPAN&gt;************************************************************&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;I have seen NN failover occur after the NameNode got an exception from all 3 JNs.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;See the logs below. {&lt;EM&gt;Truncated unnecessary logs&lt;/EM&gt;}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;PRE&gt;P.Q.161.12 : lvs-hdadm-102 (NN1, JN1). &lt;BR /&gt;P.Q.161.13 : lvs-hdadm-103 (NN2, JN2) &lt;BR /&gt;P.Q.161.14 : lvs-hdadm-104 (JN3)&lt;BR /&gt;&lt;BR /&gt;2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2196111640-2196111640. 
Will try to write to this JN again after the next log roll.&lt;BR /&gt;org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC s epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR&lt;BR /&gt;&lt;BR /&gt;2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.&lt;BR /&gt;org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPCs epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR&lt;BR /&gt;&lt;BR /&gt;2022-12-06 10:38:11,071 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2196111640-2196111640. Will try to write to this JN again after the next log roll.&lt;BR /&gt;org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC s epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR&lt;BR /&gt;After 3 Journal Node (JN) return write error it got FATAL error&lt;BR /&gt;2022-12-06 10:38:11,080 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2196111639))&lt;BR /&gt;org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 
3 exceptions thrown:&lt;BR /&gt;P.Q.161.13:8485: IPC s epoch 176 is less than the last promised epoch 177 ; journal id: GISHortonDR&lt;BR /&gt;and shutdown the NN&lt;BR /&gt;2022-12-06 10:38:11,082 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2196111639&lt;BR /&gt;2022-12-06 10:38:11,095 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2196111639))&lt;BR /&gt;2022-12-06 10:38:11,132 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;BR /&gt;/************************************************************&lt;BR /&gt;SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-103.corp.ebay.com/P.Q.161.13&lt;/PRE&gt;&lt;/DIV&gt;</description>
      <pubDate>Sat, 24 Dec 2022 00:10:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360106#M238285</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-24T00:10:25Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360110#M238288</link>
      <description>&lt;DIV class="cause"&gt;&lt;H2&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/99416"&gt;@mabilgen&lt;/a&gt;&amp;nbsp;Cause&lt;/H2&gt;&lt;P&gt;The root cause is a condition where the JournalNode sees that the RPC request from the NameNode carries a lower epoch value than the locally stored promised epoch, i.e. the NameNode's epoch is not the newest. It therefore logs the warning "IPC's epoch 155 is less than the last promised epoch 156".&lt;BR /&gt;The JournalNode rejects such an RPC request from the NameNode to avoid split-brain; it only accepts RPC requests from the NameNode that sends the newest epoch.&lt;/P&gt;&lt;P&gt;This can be caused by various conditions in the environment:&lt;/P&gt;&lt;P&gt;1.) a big job&lt;BR /&gt;2.) a network issue&lt;BR /&gt;3.) not enough resources on the node&lt;/P&gt;&lt;/DIV&gt;&lt;DIV class="instruction"&gt;&lt;H2&gt;Instructions&lt;/H2&gt;&lt;P&gt;To resolve the issue, first identify whether the network in the cluster is stable. It is also worth checking whether the issue regularly hits one specific NameNode; if so, there could be a network hardware or resource problem on that NameNode.&lt;BR /&gt;A few tunables:&lt;/P&gt;&lt;P&gt;a.) Raise &lt;STRONG&gt;dfs.datanode.max.transfer.threads&lt;/STRONG&gt; from 4K to 16K and observe the performance of the cluster.&lt;BR /&gt;b.) Raise the NameNode heap size if there is too much GC activity or overly long full GCs.&lt;/P&gt;&lt;/DIV&gt;</description>
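For reference, tunable a.) can be sketched as an hdfs-site.xml fragment. The 16K value is the one suggested in this thread; on an HDP cluster this is normally set through Ambari (HDFS service configs) rather than by editing the file by hand, and the NameNode heap from b.) is likewise adjusted via the NameNode Java heap setting there.

```xml
<!-- hdfs-site.xml fragment (a sketch; apply via Ambari on HDP):
     raise the DataNode transfer thread cap from the 4096 default
     to 16384, as suggested above. -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>16384</value>
</property>
```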
      <pubDate>Sat, 24 Dec 2022 04:56:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360110#M238288</guid>
      <dc:creator>Kartik_Agarwal</dc:creator>
      <dc:date>2022-12-24T04:56:33Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360145#M238305</link>
      <description>&lt;P&gt;It is happening on both NNs; this last time it happened on the standby (NN1). In the NN logs I have seen a JVM pause warning message at roughly 15-minute intervals, each reporting about an ~80 sec pause.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;The other main issue: the NN doesn't respond to &lt;STRONG&gt;hdfs dfs cli&lt;/STRONG&gt; commands, or takes too long to respond.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;############ P.Q.161.12 : lvs-hdadm-102 (NN1, JN1) NameNode Logs&lt;BR /&gt;&lt;BR /&gt;2022-12-24 18:58:16,804 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 28062ms&lt;BR /&gt;No GCs detected&lt;BR /&gt;2022-12-24 19:12:37,133 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 854108ms&lt;BR /&gt;GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=854065ms&lt;BR /&gt;2022-12-24 19:13:06,386 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 27250ms&lt;BR /&gt;No GCs detected&lt;BR /&gt;2022-12-24 19:27:28,115 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 828717ms&lt;BR /&gt;GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=829081ms&lt;BR /&gt;2022-12-24 19:27:57,400 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 26782ms&lt;BR /&gt;No GCs detected&lt;BR /&gt;&lt;BR /&gt;############&lt;BR /&gt;2022-12-24 19:28:12,812 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: starting log segment 2205245876 failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=null))&lt;BR /&gt;org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many 
exceptions to achieve quorum size 2/3. 3 exceptions thrown:&lt;BR /&gt;P.Q.161.12:8485: IPC's epoch 205 is less than the last promised epoch 206 ; journal id: GISHortonDR&lt;BR /&gt;P.Q.161.13:8485: IPC's epoch 205 is less than the last promised epoch 206 ; journal id: GISHortonDR&lt;BR /&gt;P.Q.161.14:8485: IPC's epoch 205 is less than the last promised epoch 206 ; journal id: GISHortonDR&lt;BR /&gt;2022-12-24 19:28:12,841 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: Error: starting log segment 2205245876 failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=null))&lt;BR /&gt;2022-12-24 19:28:12,845 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;/PRE&gt;&lt;P&gt;Dont see any error on network stats;&lt;/P&gt;&lt;PRE&gt;[root@lvs-hdadm-102 ~]# ifconfig -a&lt;BR /&gt;bond0: flags=5187&amp;lt;UP,BROADCAST,RUNNING,MASTER,MULTICAST&amp;gt; mtu 9000&lt;BR /&gt;inet 10.229.161.12 netmask 255.255.254.0 broadcast 10.229.161.255&lt;BR /&gt;ether 5c:b9:01:89:1b:5c txqueuelen 1000 (Ethernet)&lt;BR /&gt;RX packets 1777351730 bytes 2383624637851 (2.1 TiB)&lt;BR /&gt;RX errors 0 dropped 0 overruns 0 frame 0&lt;BR /&gt;TX packets 2093695795 bytes 1376445489747 (1.2 TiB)&lt;BR /&gt;TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;[root@lvs-hdadm-102 ~]# netstat -i&lt;BR /&gt;Kernel Interface table&lt;BR /&gt;Iface MTU RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg&lt;BR /&gt;bond0 9000 1777439746 0 0 0 2093764662 0 0 0 BMmRU&lt;BR /&gt;eno1 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;eno2 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;eno3 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;eno4 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;eno49 9000 1717336519 0 0 0 2093764662 0 0 0 BMsRU&lt;BR /&gt;eno50 9000 60103257 0 0 0 0 0 0 0 BMsRU&lt;BR /&gt;ens1f0 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;ens1f1 1500 0 0 0 0 0 0 0 0 BMU&lt;BR /&gt;lo 65536 
41174312 0 0 0 41174312 0 0 0 LRU&lt;/PRE&gt;&lt;P&gt;&lt;BR /&gt;And both NNs are running with very high CPU utilization, mostly at 100%.&lt;/P&gt;&lt;PRE&gt;## NN2:&lt;BR /&gt;[root@lvs-hdadm-103 ~]# top -p 37939&lt;BR /&gt;top - 10:35:17 up 33 days, 21:59, 1 user, load average: 1.33, 1.31, 1.41&lt;BR /&gt;Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie&lt;BR /&gt;%Cpu(s): 1.8 us, 0.8 sy, 0.0 ni, 97.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st&lt;BR /&gt;KiB Mem : 13173120+total, 5608700 free, 11516615+used, 10956348 buff/cache&lt;BR /&gt;KiB Swap: 0 total, 0 free, 0 used. 13160080 avail Mem&lt;BR /&gt;&lt;BR /&gt;PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND&lt;BR /&gt;37939 hdfs 20 0 100.0g 98.2g 15820 S 100.3 78.2 6848:11 java&lt;BR /&gt;&lt;BR /&gt;&lt;/PRE&gt;&lt;P&gt;Mem config:&lt;/P&gt;&lt;PRE&gt;-Xms98304m -Xmx98304m&lt;BR /&gt;Total Mem on each NN : 128 GB&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have changed&amp;nbsp;&lt;SPAN&gt;&lt;STRONG&gt;dfs.datanode.max.transfer.threads&lt;/STRONG&gt; from 4K to &lt;STRONG&gt;16K&lt;/STRONG&gt; and restarted all the required services (NN, etc.), which takes some time. I will let you know if the same issue happens again.&lt;/SPAN&gt;&lt;/P&gt;</description>
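The GC pauses in the log above run to roughly 850 seconds each, far beyond the QJM write timeout (dfs.qjournal.write-txns.timeout.ms, 20 s by default), which is consistent with the epoch errors: while one NN is frozen in GC, the other wins the epoch and the frozen NN is fenced on wake-up. A minimal sketch (assuming GNU grep and awk) to total the pauses reported by JvmPauseMonitor in such a log:

```shell
# Sum every "pause of approximately NNNms" reported by JvmPauseMonitor,
# to see how much wall-clock time the NameNode spent frozen.
sum_pauses() {
  grep -oE 'approximately [0-9]+ms' "$1" | grep -oE '[0-9]+' | awk '{s += $1} END {print s + 0}'
}

# Example (hypothetical path, modeled on the one shown later in this thread):
# sum_pauses /var/log/hadoop/hdfs/hadoop-hdfs-namenode-lvs-hdadm-102.domain.com.log
```

With a 98 GB CMS heap, pauses of this length suggest the heap size or collector choice, not the JournalNodes, is the thing to tune.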
      <pubDate>Tue, 27 Dec 2022 00:21:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360145#M238305</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-27T00:21:07Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360149#M238308</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/99416"&gt;@mabilgen&lt;/a&gt;&amp;nbsp;Thanks for the update; keep us posted if this issue occurs again.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;If you found that the provided solution(s) assisted you with your query, please take a moment to log in and click &lt;FONT face="arial black,avant garde" size="5"&gt;&lt;EM&gt;&lt;STRONG&gt;&lt;FONT color="#FF6600"&gt;Accept as Solution&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/EM&gt;&lt;/FONT&gt; below each response that helped.&lt;/P&gt;</description>
      <pubDate>Tue, 27 Dec 2022 05:54:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360149#M238308</guid>
      <dc:creator>Kartik_Agarwal</dc:creator>
      <dc:date>2022-12-27T05:54:54Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360181#M238316</link>
      <description>&lt;P&gt;A few findings:&lt;/P&gt;&lt;P&gt;After changing&amp;nbsp;&lt;STRONG&gt;dfs.datanode.max.transfer.threads&lt;/STRONG&gt;&amp;nbsp;from 4K to&amp;nbsp;&lt;STRONG&gt;16K&lt;/STRONG&gt;, I restarted all the required services, and one day later I started the second NameNode (NN2). But once NN2 started, NN1 got shut down (see logs). The same issue happened again: a client.QuorumJournalManager exception, and the NN shut down.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have noticed that the NameNode process runs at 100% CPU and most of the time cannot respond to hdfs CLI commands. I am wondering: is there any way to spread the NameNode work across more threads?&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;### NN2 lvs-hdadm-103 started on 12/27 at 11:09 am&lt;BR /&gt;2022-12-26 17:06:02,973 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;BR /&gt;SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-103.domain.com/P.Q.161.13&lt;BR /&gt;2022-12-27 11:09:41,059 INFO namenode.NameNode (LogAdapter.java:info(51)) - STARTUP_MSG:&lt;BR /&gt;STARTUP_MSG: Starting NameNode&lt;BR /&gt;STARTUP_MSG: host = lvs-hdadm-103.domain.com/P.Q.161.13&lt;BR /&gt;STARTUP_MSG: args = []&lt;BR /&gt;STARTUP_MSG: version = 3.1.1.3.1.0.0-78&lt;/PRE&gt;&lt;P&gt;And these are the NN1 
logs;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;BR /&gt;[root@lvs-hdadm-102 ~]# grep -Ei "SHUTDOWN_MSG|WARN client|FATAL" /var/log/hadoop/hdfs/hadoop-hdfs-namenode-lvs-hdadm-102.domain.com.log&lt;BR /&gt;2022-12-27 11:24:16,681 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 4836ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.14:8485&lt;BR /&gt;2022-12-27 11:24:16,889 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5045ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.12:8485&lt;BR /&gt;2022-12-27 11:24:16,964 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5119ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.13:8485&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2206645777-2206645778. 
Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,323 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2206626947))&lt;BR /&gt;2022-12-27 12:56:45,325 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2206626947&lt;BR /&gt;2022-12-27 12:56:45,340 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;BR /&gt;SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-102.domain.com/P.Q.161.12&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;[root@lvs-hdadm-102 ~]# grep -Ei "SHUTDOWN_MSG|WARN|FATAL" /var/log/hadoop/hdfs/hadoop-hdfs-namenode-lvs-hdadm-102.domain.com.log&lt;BR /&gt;&lt;BR /&gt;2022-12-27 09:50:37,381 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5d2 is not valid for DN d233a1ba-de3e-448d-8151-daf51eb7f287, because the DN is not in the pending set.&lt;BR /&gt;## above message got x 88 times&lt;BR /&gt;&lt;BR /&gt;2022-12-27 09:50:37,447 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5ec is not valid for DN 025adeaa-bce6-4a61-be73-8c66707084ba, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 09:50:39,308 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/app_logs_archive/archive_v3/app_logs/4D69DE10-C9C0-4A5C-BB68-039B1B1F7FCC/1614643200_1603584000/1614643200_1603584000/db_1609183032_1609057428_9023_4D69DE10-C9C0-4A5C-BB68-039B1B1F7FCC/journal.gz has not been closed. Lease recovery is in progress. 
RecoveryId = 230287624 for block blk_-9223372035367162256_128679723&lt;BR /&gt;2022-12-27 09:50:41,098 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 4 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 09:50:41,357 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 6 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 09:50:41,411 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;&lt;BR /&gt;2022-12-27 10:05:13,822 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 867425ms&lt;BR /&gt;2022-12-27 10:05:36,935 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 21110ms&lt;BR /&gt;2022-12-27 
10:07:37,448 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5ec is not valid for DN 025adeaa-bce6-4a61-be73-8c66707084ba, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 10:07:37,897 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/app_logs_archive/archive_v3/app_logs/2F4DDF15-6E57-4584-AB79-F34C6479E3F8/1609113600_1607731200/1608249600_1608076800/db_1608204596_1608125285_2372_2F4DDF15-6E57-4584-AB79-F34C6479E3F8/journal.gz has not been closed. Lease recovery is in progress. RecoveryId = 230287627 for block blk_-9223372035389569520_127278003&lt;BR /&gt;2022-12-27 10:07:38,298 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 6 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,311 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,328 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 6 to reach 9 (unavailableStorages=[], 
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,338 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,344 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 7 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:07:38,353 WARN blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(432)) - Failed to place enough replicas, still in need of 5 to reach 9 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology&lt;BR /&gt;2022-12-27 10:21:11,867 WARN blockmanagement.HeartbeatManager (HeartbeatManager.java:run(462)) - Skipping next heartbeat scan due to excessive 
pause&lt;BR /&gt;2022-12-27 10:21:12,090 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 813036ms&lt;BR /&gt;2022-12-27 10:21:37,394 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23302ms&lt;BR /&gt;2022-12-27 10:36:35,325 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 859903ms&lt;BR /&gt;2022-12-27 10:37:00,238 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22905ms&lt;BR /&gt;2022-12-27 10:52:33,328 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 893538ms&lt;BR /&gt;2022-12-27 10:53:00,215 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 24865ms&lt;BR /&gt;2022-12-27 11:07:58,912 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 865649ms&lt;BR /&gt;2022-12-27 11:08:24,345 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 23432ms&lt;BR /&gt;## from this point NN2 got started up at 11:09&lt;BR /&gt;2022-12-27 11:23:43,406 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 890551ms&lt;BR /&gt;2022-12-27 11:24:10,580 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 25173ms&lt;BR /&gt;2022-12-27 11:24:11,813 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:requestLease(230)) - DN 165757d4-6293-4808-a136-24d3a4d3c676 (P.Q.161.62:50010) requested 
a lease even though it wasn't yet registered. Registering now.&lt;BR /&gt;2022-12-27 11:24:11,886 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5d5 is not valid for DN b581acf8-f4d3-4c9b-9ef4-43faace4e7be, because the DN is not in the pending set.&lt;BR /&gt;## above message got x 116 times&lt;BR /&gt;&lt;BR /&gt;2022-12-27 11:24:11,958 WARN hdfs.StateChange (FSDirRenameOp.java:validateRenameSource(560)) - DIR* FSDirectory.unprotectedRenameTo: rename source /ingest/splunk/hdr/network_archive/tmp/com.splunk.roll.Transactor-e4ff7a87-8b84-40c1-a1a2-d361eccb830f.tmp is not found.&lt;BR /&gt;2022-12-27 11:24:12,020 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/os_archive/archive_v3/os/E6527EDD-8E52-4B86-9ABD-D205AD24E2E8/1606348800_1603584000/1606348800_1603584000/db_1605485341_1604591209_1936_E6527EDD-8E52-4B86-9ABD-D205AD24E2E8/journal.gz has not been closed. Lease recovery is in progress. 
RecoveryId = 230287636 for block blk_-9223372035573309296_115793229&lt;BR /&gt;2022-12-27 11:24:16,681 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 4836ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.14:8485&lt;BR /&gt;2022-12-27 11:24:16,889 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5045ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.12:8485&lt;BR /&gt;2022-12-27 11:24:16,964 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(417)) - Took 5119ms to send a batch of 1 edits (226 bytes) to remote journal P.Q.161.13:8485&lt;BR /&gt;2022-12-27 11:39:36,321 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 842594ms&lt;BR /&gt;2022-12-27 11:40:03,309 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 24986ms&lt;BR /&gt;2022-12-27 11:40:34,887 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d607 is not valid for DN b8a82844-d2eb-4654-b365-a003212fc883, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 11:40:34,961 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5ec is not valid for DN 025adeaa-bce6-4a61-be73-8c66707084ba, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 11:40:34,965 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(317)) - BR lease 0x9c43be53a3d0d608 is not valid for DN 165757d4-6293-4808-a136-24d3a4d3c676, because the lease has expired.&lt;BR /&gt;2022-12-27 11:40:34,965 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d607 is not valid for DN b8a82844-d2eb-4654-b365-a003212fc883, because the DN is not in the pending 
set.&lt;BR /&gt;2022-12-27 11:40:34,965 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d5fc is not valid for DN 4638701d-0001-4086-9de6-dfc9fe1c67d7, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 11:40:34,966 WARN blockmanagement.BlockReportLeaseManager (BlockReportLeaseManager.java:checkLease(311)) - BR lease 0x9c43be53a3d0d607 is not valid for DN b8a82844-d2eb-4654-b365-a003212fc883, because the DN is not in the pending set.&lt;BR /&gt;2022-12-27 11:40:37,235 WARN hdfs.StateChange (FSNamesystem.java:internalReleaseLease(3441)) - DIR* NameSystem.internalReleaseLease: File /ingest/splunk/hdr/main_archive/archive_v3/main/441D58DC-A942-428A-958A-4D3DFC583008/1600819200_1599436800/1600819200_1600128000/db_1600627689_1600336233_1710_441D58DC-A942-428A-958A-4D3DFC583008/journal.gz has not been closed. Lease recovery is in progress. RecoveryId = 230287641 for block blk_-9223372035529627248_118523521&lt;BR /&gt;2022-12-27 11:40:38,717 WARN blockmanagement.HeartbeatManager (HeartbeatManager.java:run(462)) - Skipping next heartbeat scan due to excessive pause&lt;BR /&gt;2022-12-27 11:55:45,850 WARN blockmanagement.HeartbeatManager (HeartbeatManager.java:run(462)) - Skipping next heartbeat scan due to excessive pause&lt;BR /&gt;2022-12-27 11:55:46,047 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 881892ms&lt;BR /&gt;2022-12-27 11:56:10,972 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22918ms&lt;BR /&gt;2022-12-27 12:11:18,074 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 871503ms&lt;BR /&gt;2022-12-27 12:11:43,010 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of 
approximately 22935ms&lt;BR /&gt;2022-12-27 12:26:24,884 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 854865ms&lt;BR /&gt;2022-12-27 12:26:49,182 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22296ms&lt;BR /&gt;2022-12-27 12:41:11,944 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 836753ms&lt;BR /&gt;2022-12-27 12:41:36,208 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22261ms&lt;BR /&gt;2022-12-27 12:56:13,594 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 854379ms&lt;BR /&gt;2022-12-27 12:56:37,767 WARN util.JvmPauseMonitor (JvmPauseMonitor.java:run(201)) - Detected pause in JVM or host machine (eg GC): pause of approximately 22172ms&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.13:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.12:8485 failed to write txns 2206645777-2206645778. Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,318 WARN client.QuorumJournalManager (IPCLoggerChannel.java:call(399)) - Remote journal P.Q.161.14:8485 failed to write txns 2206645777-2206645778. 
Will try to write to this JN again after the next log roll.&lt;BR /&gt;2022-12-27 12:56:45,323 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [P.Q.161.12:8485, P.Q.161.13:8485, P.Q.161.14:8485], stream=QuorumOutputStream starting at txid 2206626947))&lt;BR /&gt;2022-12-27 12:56:45,325 WARN client.QuorumJournalManager (QuorumOutputStream.java:abort(74)) - Aborting QuorumOutputStream starting at txid 2206626947&lt;BR /&gt;2022-12-27 12:56:45,340 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:&lt;BR /&gt;SHUTDOWN_MSG: Shutting down NameNode at lvs-hdadm-102.domain.com/P.Q.161.12&lt;BR /&gt;[root@lvs-hdadm-102 ~]#&lt;/PRE&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 27 Dec 2022 23:31:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360181#M238316</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-27T23:31:22Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360182#M238317</link>
      <description>&lt;P&gt;Here are the NameNode JVM settings,&lt;/P&gt;&lt;P&gt;based on the best practices described &lt;A href="https://community.cloudera.com/t5/Community-Articles/NameNode-Garbage-Collection-Configuration-Best-Practices-and/ta-p/245276" target="_self"&gt;on this page&lt;/A&gt;:&lt;/P&gt;&lt;DIV&gt;&lt;PRE&gt;/usr/jdk64/jdk1.8.0_112/bin/java&lt;BR /&gt;-Dproc_namenode -Dhdp.version=3.1.0.0-78&lt;BR /&gt;-Djava.net.preferIPv4Stack=true&lt;BR /&gt;-Dhdp.version=3.1.0.0-78&lt;BR /&gt;-Dhdfs.audit.logger=INFO,NullAppender&lt;BR /&gt;-server&lt;BR /&gt;-XX:ParallelGCThreads=8&lt;BR /&gt;-XX:+UseConcMarkSweepGC&lt;BR /&gt;-XX:ErrorFile=/var/log/hadoop/hdfs/hs_err_pid%p.log&lt;BR /&gt;-XX:NewSize=12288m&lt;BR /&gt;-XX:MaxNewSize=12288m&lt;BR /&gt;## ---- This is missing on my config&lt;BR /&gt;-XX:PermSize=12288m -XX:MaxPermSize=24576m&lt;BR /&gt;## -----&lt;BR /&gt;-Xloggc:/var/log/hadoop/hdfs/gc.log-202212222050&lt;BR /&gt;-verbose:gc&lt;BR /&gt;-XX:+PrintGCDetails&lt;BR /&gt;-XX:+PrintGCTimeStamps&lt;BR /&gt;-XX:+PrintGCDateStamps&lt;BR /&gt;-XX:CMSInitiatingOccupancyFraction=70&lt;BR /&gt;-XX:+UseCMSInitiatingOccupancyOnly&lt;BR /&gt;-Xms98304m -Xmx98304m&lt;BR /&gt;-Dhadoop.security.logger=INFO,DRFAS&lt;BR /&gt;-Dhdfs.audit.logger=INFO,DRFAAUDIT&lt;BR /&gt;-XX:OnOutOfMemoryError="/usr/hdp/current/hadoop-hdfs-namenode/bin/kill-name-node"&lt;BR /&gt;-Dorg.mortbay.jetty.Request.maxFormContentSize=-1&lt;BR /&gt;-Dyarn.log.dir=/var/log/hadoop/hdfs&lt;BR /&gt;-Dyarn.log.file=hadoop-hdfs-namenode-lvs-hdadm-103.domain.com.log&lt;BR /&gt;-Dyarn.home.dir=/usr/hdp/3.1.0.0-78/hadoop-yarn&lt;BR /&gt;-Dyarn.root.logger=INFO,console&lt;BR /&gt;-Djava.library.path=:/usr/hdp/3.1.0.0-78/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.0.0-78/hadoop/lib/native/Linux-amd64-64:/usr/hdp/3.1.0.0-78/hadoop/lib/native&lt;BR /&gt;-Dhadoop.log.dir=/var/log/hadoop/hdfs&lt;BR /&gt;-Dhadoop.log.file=hadoop-hdfs-namenode-lvs-hdadm-103.domain.com.log&lt;BR 
/&gt;-Dhadoop.home.dir=/usr/hdp/3.1.0.0-78/hadoop&lt;BR /&gt;-Dhadoop.id.str=hdfs&lt;BR /&gt;-Dhadoop.root.logger=INFO,RFA&lt;BR /&gt;-Dhadoop.policy.file=hadoop-policy.xml org.apache.hadoop.hdfs.server.namenode.NameNode&lt;/PRE&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 28 Dec 2022 00:52:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360182#M238317</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-28T00:52:14Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360204#M238330</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/99416"&gt;@mabilgen&lt;/a&gt;, the main problem on this cluster is a lack of RAM on the host, which is limited to 128GB. On startup, the NameNode will consume its allocated heap of 98GB, leaving only 30GB of memory for all other processes running on the host.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When other processes also use that remaining 30GB, you see huge JVM pauses: the garbage collector tries to de-reference objects to free up memory, but this takes so long that the NameNode gives up on the JournalNode quorum and fails over.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As a rule of thumb, you should allocate 1GB of heap per 1 million blocks. So if there are more than 98 million blocks on this cluster, the current NN heap is not sufficient.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1) Try to lower the total block count on the cluster by deleting any unwanted files or old snapshots.&lt;/P&gt;&lt;P&gt;2) If feasible, add more physical RAM to the host.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;No amount of tuning will help in this situation, as the JVM pauses are too large to be managed by tuning alone. You would need to either clean up HDFS, add more RAM to the NN hosts, or move the NameNode to a node with more RAM.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Dec 2022 18:04:14 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360204#M238330</guid>
      <dc:creator>rki_</dc:creator>
      <dc:date>2022-12-28T18:04:14Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360210#M238332</link>
      <description>&lt;P&gt;Below is some info from the cluster. So I need to either increase memory or delete some data from HDFS.&lt;/P&gt;&lt;P&gt;I am not sure whether I can delete any data, since the NN does not respond to CLI commands most of the time. Anyway, I will try both options.&lt;/P&gt;&lt;P&gt;I will let you know the results.&lt;/P&gt;&lt;P&gt;Thanks for the reply,&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/80393"&gt;@rki_&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/75395"&gt;@Kartik_Agarwal&lt;/a&gt;.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;248,193,147 files and directories, 143,527,877 blocks (7,155 replicated blocks, 143,520,722 erasure coded block groups) = 391,721,024 total filesystem object(s).

Heap Memory used 86.63 GB of 94.8 GB Heap Memory. Max Heap Memory is 94.8 GB.&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 29 Dec 2022 00:23:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360210#M238332</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2022-12-29T00:23:04Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360221#M238335</link>
      <description>&lt;P&gt;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/99416"&gt;@mabilgen&lt;/a&gt;&amp;nbsp;you have 143 million blocks in the cluster and the NN heap is 95GB. This is why the NN is not holding up. You would need to bring the total block count down to about 90 million for the NN to work properly, as the NN needs at least 150GB of heap to handle 143 million blocks smoothly.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Dec 2022 08:04:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/360221#M238335</guid>
      <dc:creator>rki_</dc:creator>
      <dc:date>2022-12-29T08:04:10Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362072#M238707</link>
      <description>&lt;P&gt;It looks like the memory upgrade from 128 GB to 256 GB, with 192 GB given to the JVM heap, did not resolve my issue. The same NN stopped twice, and I started it one more time; it hit the same "FATAL namenode.FSEditLog" journal error. Let's see whether it happens a third time on the same node or on the other one.&lt;/P&gt;</description>
      <pubDate>Mon, 23 Jan 2023 23:12:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362072#M238707</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2023-01-23T23:12:29Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362292#M238742</link>
      <description>&lt;P&gt;It looks like the same node goes down every time. I have tried a total of 4-5 times, and it is always NN1 that goes down. I compared the JVM heap sizes between NN1 and NN2 and they are identical (both pick up the config from Ambari correctly).&lt;/P&gt;&lt;P&gt;Not sure what to try next?&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jan 2023 21:29:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362292#M238742</guid>
      <dc:creator>mabilgen</dc:creator>
      <dc:date>2023-01-25T21:29:50Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362304#M238745</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Check your NameNode address in core-site.xml. Change it to 50070 or 9000 and try again.&lt;/P&gt;&lt;P&gt;The default address of the NameNode web UI is &lt;A href="http://localhost:50070/" target="_blank"&gt;http://localhost:50070/&lt;/A&gt;. You can open this address in your browser to check the NameNode information. The default address of the NameNode server is hdfs://localhost:8020/; you can connect to it to access HDFS through the HDFS API. This is the real service address.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jan 2023 04:34:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362304#M238745</guid>
      <dc:creator>atonal</dc:creator>
      <dc:date>2023-01-26T04:34:57Z</dc:date>
    </item>
    <item>
      <title>Re: HA - Name Nodes don't get started</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362306#M238747</link>
      <description>&lt;P&gt;If the same node is going down every time, it's worth checking the memory utilization at the OS level. Check /var/log/messages on the NN host from the time the NN went down and see whether the process was killed by the OOM killer.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jan 2023 05:26:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HA-Name-Nodes-don-t-get-started/m-p/362306#M238747</guid>
      <dc:creator>rki_</dc:creator>
      <dc:date>2023-01-26T05:26:12Z</dc:date>
    </item>
  </channel>
</rss>

