Spark failure detection - why datanode not send heartbeat to the master machine ( driver )

As we all know, a heartbeat is a signal sent periodically to indicate that a node is operating normally, or to synchronize with other parts of the system.

Our cluster has 5 datanode (worker) machines and 3 master machines, and the Hadoop version is 2.6.4. Executors run on 3 of the 5 workers.

The Thrift server is installed on the master1 machine, and the driver also runs on master1.

In Spark, heartbeats are messages sent by the executors (on the worker machines) to the driver (on the master1 machine). The message is represented by the case class org.apache.spark.Heartbeat.

The message is then received by the driver through org.apache.spark.HeartbeatReceiver#receiveAndReply(context: RpcCallContext) method. The driver:

The main purpose of heartbeats is to check whether a given node is still alive (reported from the worker machines to the master1 machine).

The driver verifies this at a fixed interval (defined by the spark.network.timeoutInterval entry) by sending an ExpireDeadHosts message to itself. When that message is handled, the driver checks for executors with no recent heartbeats.
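The expiry check described above can be pictured with a small toy model. This is only a sketch of the idea, not Spark's actual HeartbeatReceiver code: the class name, method names, and the 120-second timeout used in the example are assumptions made for illustration.

```python
class HeartbeatTracker:
    """Toy model of a driver-side heartbeat receiver (hypothetical, not Spark's code)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # executor id -> time of its most recent heartbeat

    def heartbeat(self, executor_id, now):
        # Called whenever a heartbeat from an executor reaches the driver.
        self.last_seen[executor_id] = now

    def expire_dead_hosts(self, now):
        # Called at a fixed interval: drop executors whose last heartbeat is too old.
        dead = [e for e, t in self.last_seen.items() if now - t > self.timeout_s]
        for e in dead:
            del self.last_seen[e]
        return sorted(dead)


tracker = HeartbeatTracker(timeout_s=120)  # assumed timeout, in seconds
tracker.heartbeat("executor-1", now=0)
tracker.heartbeat("executor-2", now=0)
tracker.heartbeat("executor-1", now=100)   # executor-2 stays silent
print(tracker.expire_dead_hosts(now=130))  # ['executor-2']
```

The point of the model is that the driver never learns why a heartbeat is missing. An executor that is alive but whose heartbeats are delayed or lost looks the same as a dead one.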

So much for the concept.

We have noticed that the heartbeat messages sent by the executors cannot be delivered to the driver, and in the YARN logs we see this warning:

WARN executor.Executor: Issue communicating with driver in heartbeater

My question is: what could cause the driver (master1 machine) not to receive heartbeats from the worker machines?

Michael-Bronson
Accepted Solutions

Re: Spark failure detection - why datanode not send heartbeat to the master machine ( driver )

@Michael Bronson Check whether the driver is doing full garbage collection, or whether there is a network issue between the executors and the driver. You can check the GC pause times in the Spark UI, and you can also have the GC logs printed as part of the driver and executor output:

--conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails"

--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails"

HTH


Re: Spark failure detection - why datanode not send heartbeat to the master machine ( driver )

@Falix, regarding your answer "Check if Driver is doing full garbage collection": please describe how to do that.

Michael-Bronson

Re: Spark failure detection - why datanode not send heartbeat to the master machine ( driver )

@Michael Bronson In the Spark UI, go to the Executors tab; it has a column showing GC time. Also, with the configurations I shared above, the GC details will be printed as part of the log output. You can review those with a tool such as http://gceasy.io/
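For a quick first look before uploading anything, the GC output can also be scanned with a few lines of Python. This is a rough sketch that assumes the Java 8 style lines produced by -verbose:gc -XX:+PrintGCDetails, where each collection ends with a pause such as "12.3456789 secs]". The two sample lines are made up to show that shape and are not taken from this cluster, and the function name and the 10-second threshold are illustrative choices.

```python
import re

# Matches a Full GC entry and captures its total pause (the first "... secs]" on the line).
FULL_GC = re.compile(r"\[Full GC.*?(\d+\.\d+) secs\]")


def long_full_gc_pauses(log_lines, threshold_s):
    """Return the Full GC pause durations (seconds) that are >= threshold_s."""
    pauses = []
    for line in log_lines:
        match = FULL_GC.search(line)
        if match and float(match.group(1)) >= threshold_s:
            pauses.append(float(match.group(1)))
    return pauses


# Made-up sample lines in the assumed log format (not real output).
sample = [
    "[GC (Allocation Failure) [PSYoungGen: 65536K->10720K(76288K)] "
    "65536K->10728K(251392K), 0.0123456 secs] "
    "[Times: user=0.03 sys=0.01, real=0.01 secs]",
    "[Full GC (Ergonomics) [PSYoungGen: 10720K->0K(76288K)] "
    "[ParOldGen: 170000K->160000K(175104K)] 180720K->160000K(251392K), "
    "[Metaspace: 20000K->20000K(1067008K)], 12.3456789 secs] "
    "[Times: user=40.00 sys=0.50, real=12.35 secs]",
]

print(long_full_gc_pauses(sample, threshold_s=10.0))  # [12.3456789]
```

Full GC pauses on the driver that approach or exceed the heartbeat timeout are the ones worth investigating; short young-generation collections are ignored by this scan.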

HTH

Re: Spark failure detection - why datanode not send heartbeat to the master machine ( driver )

So, if we check the GC logs with http://gceasy.io/ and see that the driver isn't doing full garbage collection, what are the next steps we need to take?

Michael-Bronson