<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark failure detection - why datanode not send heartbeat to the master machine ( driver ) in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182281#M83183</link>
    <description>&lt;P&gt;So if we check the GC logs with &lt;A href="http://gceasy.io/"&gt;http://gceasy.io/&lt;/A&gt; and see that the driver isn't doing full garbage collection, what are the next steps we need to take?&lt;/P&gt;</description>
    <pubDate>Wed, 12 Sep 2018 23:16:27 GMT</pubDate>
    <dc:creator>mike_bronson7</dc:creator>
    <dc:date>2018-09-12T23:16:27Z</dc:date>
    <item>
      <title>Spark failure detection - why datanode not send heartbeat to the master machine ( driver )</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182277#M83179</link>
      <description>&lt;P&gt;As we all know, a &lt;STRONG&gt;heartbeat&lt;/STRONG&gt; is a signal sent periodically to indicate that a node is operating normally, or to synchronize it with other parts of the system.&lt;/P&gt;&lt;P&gt;Our system has 5 worker machines, and executors run on 3 of them.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;Our system includes 5 datanode machines (workers) and 3 master machines; the Hadoop version is 2.6.4.&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The Thrift server is installed on the first master machine, master1, and the driver also runs on master1.&lt;/P&gt;&lt;P&gt;In Spark, heartbeats are the messages sent by the executors (on the worker machines) to the driver (on master1). The message is represented by the case class org.apache.spark.Heartbeat.&lt;/P&gt;&lt;P&gt;The driver receives the message through the org.apache.spark.HeartbeatReceiver#receiveAndReply(context: RpcCallContext) method. For the driver, the main purpose of heartbeats is to check whether a given node (here, a worker machine) is still alive.&lt;/P&gt;&lt;P&gt;The driver performs this check at a fixed interval (defined by the &lt;EM&gt;spark.network.timeoutInterval&lt;/EM&gt; entry) by sending an ExpireDeadHosts message to itself. When this message is handled, the driver looks for executors with no recent heartbeats.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;That covers the concept.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;We notice that the heartbeat messages sent by the executors cannot be delivered to the driver, and the YARN logs show this warning:&lt;/P&gt;&lt;PRE&gt;WARN executor.Executor: Issue communicating with driver in heartbeater&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;My question is: what could be the reasons that the driver (master1 machine) does not receive the heartbeats from the worker machines?&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 06 Sep 2018 01:34:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182277#M83179</guid>
      <dc:creator>mike_bronson7</dc:creator>
      <dc:date>2018-09-06T01:34:22Z</dc:date>
    </item>
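<!--
The driver-side check described in the question above (remember each executor's last heartbeat, then periodically expire executors that have gone quiet) can be modeled in a few lines. This is a minimal, hypothetical sketch, not Spark's HeartbeatReceiver code: the class and method names are invented for illustration, and the 120 s timeout is an assumption chosen to mirror the documented default of spark.network.timeout. Executors send their heartbeats every spark.executor.heartbeatInterval (documented default 10s).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatTracker:
    """Hypothetical, simplified model of the driver's heartbeat bookkeeping.

    Not Spark's HeartbeatReceiver: it only mirrors the idea of remembering
    each executor's last heartbeat and periodically expiring silent ones.
    """

    # Assumed to mirror the documented default of spark.network.timeout (120s).
    executor_timeout_s: float = 120.0
    last_seen: Dict[str, float] = field(default_factory=dict)

    def on_heartbeat(self, executor_id: str, now: float) -> None:
        # Analogous to the driver handling a Heartbeat message from an executor.
        self.last_seen[executor_id] = now

    def expire_dead_hosts(self, now: float) -> List[str]:
        # Analogous to the periodic ExpireDeadHosts check: find executors whose
        # last heartbeat is older than the timeout, then forget them.
        expired = sorted(
            eid for eid, ts in self.last_seen.items()
            if now - ts > self.executor_timeout_s
        )
        for eid in expired:
            del self.last_seen[eid]
        return expired


if __name__ == "__main__":
    tracker = HeartbeatTracker()
    tracker.on_heartbeat("exec-1", now=0.0)
    tracker.on_heartbeat("exec-2", now=0.0)
    # exec-1 keeps sending heartbeats every 10 s; exec-2 goes silent after t = 0.
    for t in range(10, 140, 10):
        tracker.on_heartbeat("exec-1", now=float(t))
    print(tracker.expire_dead_hosts(now=130.0))
```

In the example, exec-2 has been silent for 130 s, longer than the assumed 120 s timeout, so the check prints ['exec-2'] while exec-1 stays registered. From the executor's side, a driver that is stalled or unreachable looks the same: its heartbeats go unanswered, which is when the "Issue communicating with driver in heartbeater" warning appears.
-->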
    <item>
      <title>Re: Spark failure detection - why datanode not send heartbeat to the master machine ( driver )</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182278#M83180</link>
      <description>&lt;P&gt; &lt;A rel="user" href="https://community.cloudera.com/users/26229/uribarih.html" nodeid="26229"&gt;@Michael Bronson&lt;/A&gt; Check if Driver is doing full garbage collection or if there could be a network issue between executor or driver. You can check the gc pause times in the spark UI and also you can add the gc logs to be printed as part of the output of the driver and executors.&lt;/P&gt;&lt;P&gt;--conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails" &lt;/P&gt;&lt;P&gt;--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails"&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;&lt;P&gt;*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.&lt;/P&gt;</description>
      <pubDate>Fri, 07 Sep 2018 19:54:05 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182278#M83180</guid>
      <dc:creator>falbani</dc:creator>
      <dc:date>2018-09-07T19:54:05Z</dc:date>
    </item>
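<!--
To rule out the network issue mentioned in the answer above, one quick test is to open a plain TCP connection from each worker machine to the driver's RPC endpoint. This is a minimal sketch under stated assumptions: the helper name can_reach_driver is hypothetical, you need to know the driver host (master1 in this thread) and the port the driver listens on, which is set by spark.driver.port and is random unless you pin it. A successful connection only proves basic reachability, not that the driver is healthy.

```python
import socket


def can_reach_driver(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Hypothetical helper: it only checks basic TCP reachability of the driver's
    RPC port, not whether the driver is healthy or answering heartbeats.
    """
    try:
        # create_connection resolves the host and tries each address it gets.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Refused connections, unreachable hosts, DNS failures and timeouts
        # all surface as OSError subclasses in Python 3.3 and later.
        return False
```

For example, run can_reach_driver("master1", 40000) on each worker, where 40000 is only a placeholder for your actual driver port. A True result at one moment does not exclude intermittent drops, so if the warning is sporadic it may help to run the check repeatedly while the job is active.
-->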
    <item>
      <title>Re: Spark failure detection - why datanode not send heartbeat to the master machine ( driver )</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182279#M83181</link>
      <description>&lt;P&gt;@Felix, regarding your answer ("Check if Driver is doing full garbage collection"), could you please describe how to do that?&lt;/P&gt;</description>
      <pubDate>Wed, 12 Sep 2018 13:48:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182279#M83181</guid>
      <dc:creator>mike_bronson7</dc:creator>
      <dc:date>2018-09-12T13:48:00Z</dc:date>
    </item>
    <item>
      <title>Re: Spark failure detection - why datanode not send heartbeat to the master machine ( driver )</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182280#M83182</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/26229/uribarih.html" nodeid="26229"&gt;@Michael Bronson&lt;/A&gt; Using Spark UI you can go to executor tab and there is a column with GC time. Also, by using the above configurations I shared the gc details will be printed as part of the log ouput. You can review those using any tool like &lt;A href="http://gceasy.io/" target="_blank"&gt;http://gceasy.io/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;HTH&lt;/P&gt;</description>
      <pubDate>Wed, 12 Sep 2018 20:46:13 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182280#M83182</guid>
      <dc:creator>falbani</dc:creator>
      <dc:date>2018-09-12T20:46:13Z</dc:date>
    </item>
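<!--
As a local complement to uploading logs to gceasy.io, the output of -verbose:gc -XX:+PrintGCDetails can be summarized with a short script that counts young and full collections and flags long full-GC pauses, since long pauses on the driver can delay its replies to executor heartbeats. This sketch assumes the JDK 8 single-line format of the Parallel collector shown in the code comments; other collectors (G1, CMS) and JDK 9+ unified logging print different formats, the function name summarize_gc is hypothetical, and the 1 s alert threshold is an arbitrary assumption.

```python
import re
from typing import Iterable, List, Tuple

# Assumed JDK 8 Parallel GC format produced by -XX:+PrintGCDetails, e.g.
#   [GC (Allocation Failure) [PSYoungGen: ...] ..., 0.0234567 secs] [Times: user=0.05 sys=0.01, real=0.02 secs]
#   [Full GC (Ergonomics) [PSYoungGen: ...] ..., 1.2345678 secs] [Times: user=3.10 sys=0.02, real=1.24 secs]
# Captures the event kind and the wall-clock "real" time in seconds.
EVENT_RE = re.compile(r"\[(Full GC|GC)\b.*?real=(\d+(?:\.\d+)?) secs\]")


def summarize_gc(lines: Iterable[str], full_gc_alert_s: float = 1.0
                 ) -> Tuple[int, int, float, List[float]]:
    """Return (young_gc_count, full_gc_count, total_real_s, long_full_gc_pauses)."""
    young = full = 0
    total_real_s = 0.0
    long_full: List[float] = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not a GC event line, or a format this sketch does not handle
        kind, real_s = match.group(1), float(match.group(2))
        total_real_s += real_s
        if kind == "Full GC":
            full += 1
            if real_s >= full_gc_alert_s:
                long_full.append(real_s)
        else:
            young += 1
    return young, full, total_real_s, long_full


if __name__ == "__main__":
    import sys
    # Usage: python3 summarize_gc.py driver_gc.log (hypothetical file name)
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
            print(summarize_gc(log))
```

If the driver log shows no full collections and no long pauses, GC on the driver is less likely to explain the missed heartbeats, and the executor GC logs and the network path become the next things to examine.
-->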
    <item>
      <title>Re: Spark failure detection - why datanode not send heartbeat to the master machine ( driver )</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182281#M83183</link>
      <description>&lt;P&gt;So if we check the GC logs with &lt;A href="http://gceasy.io/"&gt;http://gceasy.io/&lt;/A&gt; and see that the driver isn't doing full garbage collection, what are the next steps we need to take?&lt;/P&gt;</description>
      <pubDate>Wed, 12 Sep 2018 23:16:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-failure-detection-why-datanode-not-send-heartbeat-to/m-p/182281#M83183</guid>
      <dc:creator>mike_bronson7</dc:creator>
      <dc:date>2018-09-12T23:16:27Z</dc:date>
    </item>
  </channel>
</rss>

