<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Name nodes crash due to journal timeouts in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47386#M45463</link>
    <description>Eventually I discovered two things:&lt;BR /&gt;&lt;BR /&gt;* The NPE in GssKrb5Base seems to be just a sign that the connection was closed. I suspect it is a consequence of the NN shutting down without closing the connection.&lt;BR /&gt;* Essentially, the disk the journals were on was simply too busy at certain times. Rearranging the disk storage resolved the problem. Ultimately, I needed to put the YARN resource manager on a different disk from the name node and journal processes.</description>
    <pubDate>Tue, 15 Nov 2016 15:06:07 GMT</pubDate>
    <dc:creator>martinserrano</dc:creator>
    <dc:date>2016-11-15T15:06:07Z</dc:date>
    <item>
      <title>Name nodes crash due to journal timeouts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47107#M45461</link>
      <description>&lt;P&gt;We are experiencing an issue with the following configuration combination:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;HA HDFS&lt;/LI&gt;
&lt;LI&gt;Kerberos&lt;/LI&gt;
&lt;LI&gt;TLS (HDFS DataNode data transfer protection and RPC protection set to privacy)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;In this situation, the name nodes eventually shut down due to journal timeouts such as the following (there are numerous examples in our logs):&lt;/P&gt;
&lt;PRE&gt;2016-11-07 11:41:58,556 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 119125 ms (timeout=120000 ms) for a response for getJournalState(). No responses yet.
2016-11-07 11:41:59,433 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.17.49.75:8485, 10.17.49.76:8485, 10.17.49.77:8485], stream=null))
java.io.IOException: Timed out waiting 120000ms for a quorum of nodes to respond.
	at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.createNewUniqueEpoch(QuorumJournalManager.java:183)
	at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.recoverUnfinalizedSegments(QuorumJournalManager.java:441)
	at org.apache.hadoop.hdfs.server.namenode.JournalSet$8.apply(JournalSet.java:624)&lt;/PRE&gt;
&lt;P&gt;Around the same time on the journal nodes, we see:&lt;/P&gt;
&lt;PRE&gt;2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8485 caught an exception
java.lang.NullPointerException
        at com.sun.security.sasl.gsskerb.GssKrb5Base.wrap(GssKrb5Base.java:103)
        at org.apache.hadoop.ipc.Server.wrapWithSasl(Server.java:2436)
        at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2392)
        at org.apache.hadoop.ipc.Server.access$2500(Server.java:134)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2131)
2016-11-07 11:43:30,370 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 8485 caught an exception&lt;/PRE&gt;
&lt;P&gt;I don't know if these are related. We've tried increasing the journal-related timeouts, but that just seems to shift the problem around. We are running CDH 5.8.2 on an 8-node test cluster. The three journal processes run on different machines from the two name nodes. Any pointers on how to debug this would be appreciated.&lt;/P&gt;
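&lt;P&gt;For anyone hitting the same thing, the journal timeouts in question are the QJM client settings in hdfs-site.xml; the list below is just a sketch with illustrative values (in milliseconds), not a recommendation:&lt;/P&gt;
&lt;PRE&gt;dfs.qjournal.start-segment.timeout.ms        = 120000
dfs.qjournal.select-input-streams.timeout.ms = 120000
dfs.qjournal.get-journal-state.timeout.ms    = 120000
dfs.qjournal.new-epoch.timeout.ms            = 120000
dfs.qjournal.prepare-recovery.timeout.ms     = 120000
dfs.qjournal.accept-recovery.timeout.ms      = 120000
dfs.qjournal.finalize-segment.timeout.ms     = 120000
dfs.qjournal.write-txns.timeout.ms           = 120000&lt;/PRE&gt;</description>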
      <pubDate>Mon, 07 Nov 2016 21:04:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47107#M45461</guid>
      <dc:creator>martinserrano</dc:creator>
      <dc:date>2016-11-07T21:04:46Z</dc:date>
    </item>
    <item>
      <title>Re: Name nodes crash due to journal timeouts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47117#M45462</link>
      <description>From the stack trace, this implies that the security context for the SASL setup is null. I am turning up logging on com.sun.security.sasl to see if it illuminates anything.
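&lt;BR /&gt;&lt;BR /&gt;In case it helps anyone else, the JDK SASL classes log through java.util.logging under the "javax.security.sasl" logger, so my plan is roughly the following (file paths are just examples):&lt;BR /&gt;
&lt;PRE&gt;# logging.properties (example)
javax.security.sasl.level = FINEST
handlers = java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = FINEST

# then point the journal node / name node JVMs at it, e.g.:
-Djava.util.logging.config.file=/path/to/logging.properties&lt;/PRE&gt;</description>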
      <pubDate>Mon, 07 Nov 2016 20:49:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47117#M45462</guid>
      <dc:creator>martinserrano</dc:creator>
      <dc:date>2016-11-07T20:49:15Z</dc:date>
    </item>
    <item>
      <title>Re: Name nodes crash due to journal timeouts</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47386#M45463</link>
      <description>Eventually I discovered two things:&lt;BR /&gt;&lt;BR /&gt;* The NPE in GssKrb5Base seems to be just a sign that the connection was closed. I suspect it is a consequence of the NN shutting down without closing the connection.&lt;BR /&gt;* Essentially, the disk the journals were on was simply too busy at certain times. Rearranging the disk storage resolved the problem. Ultimately, I needed to put the YARN resource manager on a different disk from the name node and journal processes.
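&lt;BR /&gt;&lt;BR /&gt;Roughly, the kind of separation I mean looks like the sketch below. The property names are the standard HDFS/YARN ones, the mount points are just examples, and which YARN directories matter will depend on which roles share the host; the point is that the NameNode metadata and JournalNode edits sit on a device that nothing YARN-related writes to:&lt;BR /&gt;
&lt;PRE&gt;# hdfs-site.xml (dedicated disk for NN metadata and JN edits; paths are examples)
dfs.namenode.name.dir     = /data/0/dfs/nn
dfs.journalnode.edits.dir = /data/0/dfs/jn

# yarn-site.xml (YARN scratch and log dirs kept on a different device; paths are examples)
yarn.nodemanager.local-dirs = /data/1/yarn/local
yarn.nodemanager.log-dirs   = /data/1/yarn/logs&lt;/PRE&gt;</description>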
      <pubDate>Tue, 15 Nov 2016 15:06:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Name-nodes-crash-due-to-journal-timeouts/m-p/47386#M45463</guid>
      <dc:creator>martinserrano</dc:creator>
      <dc:date>2016-11-15T15:06:07Z</dc:date>
    </item>
  </channel>
</rss>

