Created 09-05-2018 06:34 PM
As we all know, a heartbeat is a signal sent periodically to indicate normal operation of a node or to synchronize with other parts of the system.
Our system includes 5 worker machines (datanodes) and 3 master machines; the Hadoop version is 2.6.4. Executors run on 3 of the worker machines.
The Thrift server is installed on the first master machine (master1), and the driver also runs on master1.
In Spark, heartbeats are the messages sent by the executors (on the worker machines) to the driver (the master1 machine). Each message is represented by the case class org.apache.spark.Heartbeat.
The message is received by the driver through the org.apache.spark.HeartbeatReceiver#receiveAndReply(context: RpcCallContext) method.
The main purpose of the heartbeats is to check whether a given node is still alive (from the worker machines to the master1 machine).
The driver verifies this at a fixed interval (defined by the spark.network.timeoutInterval entry) by sending an ExpireDeadHosts message to itself. When that message is handled, the driver checks for executors with no recent heartbeats.
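For reference, here is a minimal sketch of the heartbeat-related settings on a spark-submit invocation (the values shown are the usual defaults as far as I know, and the jar name is just a placeholder); spark.executor.heartbeatInterval should stay well below spark.network.timeout:

spark-submit \
  --master yarn \
  --conf "spark.executor.heartbeatInterval=10s" \
  --conf "spark.network.timeout=120s" \
  my-app.jar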
Up to this point I have only explained the concept.
We noticed that the messages sent by the executors cannot be delivered to the driver, and in the YARN logs we can see this warning:
WARN executor.Executor: Issue communicating with driver in heartbeater
My question is: what could be the reasons that the driver (master1 machine) does not get the heartbeats from the worker machines?
Created 09-07-2018 12:54 PM
@Michael Bronson Check whether the driver is doing full garbage collection, or whether there could be a network issue between the executors and the driver. You can check the GC pause times in the Spark UI, and you can also have the GC logs printed as part of the output of the driver and executors:
--conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails"
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails"
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
Created 09-12-2018 06:48 AM
@Falix , regarding your answer "Check if Driver is doing full garbage collection", could you please describe how to do that?
Created 09-12-2018 01:46 PM
@Michael Bronson In the Spark UI you can go to the Executors tab, where there is a column with the GC time. Also, with the configurations I shared above, the GC details will be printed as part of the log output. You can review those logs using a tool like http://gceasy.io/
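If the job runs on YARN, one way to collect the driver and executor logs that contain the GC output (assuming the yarn CLI is available and you substitute your own application id) is:

yarn logs -applicationId <application_id> > app_logs.txt

You can then load the GC sections from that file into gceasy or a similar analyzer.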
HTH
Created 09-12-2018 04:16 PM
So in case we verify the GC logs with http://gceasy.io/ and we see that the driver isn't doing full garbage collection, what are the next steps that we need to take?