I have come across a strange issue in our cluster: suddenly all of the services turn yellow and report that the heartbeat is lost. It seems to be intermittent; the services sometimes turn green again, and after some time they turn yellow again and report heartbeat lost.
Ambari version: 126.96.36.199
Ambari Agent version: ambari-agent-188.8.131.52-136
Please find the screenshot attached: heartbeat-lost.png
We have tried restarting ambari-server, ambari-agent, and postgresql, but it did not help. We have checked the logs but did not find anything.
Can anyone please help with a solution to get this fixed? I would also like to know what caused this issue to appear suddenly.
Thanks in advance!
1. What is the size of your cluster? If the cluster is large, we sometimes need to tune "agent.threadpool.size.max".
"agent.threadpool.size.max": this property sets the maximum number of threads used to process heartbeats from Ambari agents. The default value is "25". It basically indicates the size of the Jetty connection pool used for handling incoming Ambari Agent requests.
# grep 'agent.threadpool.size.max' /etc/ambari-server/conf/ambari.properties
agent.threadpool.size.max=50
For more detail on this please refer to: https://community.hortonworks.com/articles/131670/ambari-server-performance-tuning-troubleshooting-c...
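To see what the server is actually using, including the case where the property is not set at all, something like the following could work. It is only a sketch: the helper name get_prop is made up for illustration, it assumes plain key=value lines with no spaces around the "=", and the fallback of 25 is simply the default quoted above.

```shell
#!/usr/bin/env bash
# Print the value of KEY from a Java-style properties file,
# or DEFAULT when the key is not set.
# Usage: get_prop FILE KEY DEFAULT   (get_prop is a hypothetical helper name)
get_prop() {
    local file="$1" key="$2" default="$3" value
    # Match only lines whose text before the first "=" is exactly KEY;
    # if the key appears more than once, the last occurrence wins.
    value=$(awk -F= -v k="$key" '
        $1 == k { sub(/^[^=]*=/, ""); v = $0 }
        END     { print v }
    ' "$file")
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf '%s\n' "$default"
    fi
}

# Example call (path taken from the grep command above):
# get_prop /etc/ambari-server/conf/ambari.properties agent.threadpool.size.max 25
```

If the value is changed in ambari.properties, ambari-server has to be restarted before the new pool size is picked up.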
2. If the heartbeat comes back shortly (within a few seconds), another approach is to increase the "Ambari Agent Heartbeat" interval from 2 minutes to a bit more. Ambari UI --> Alerts --> Search for "Ambari Agent Heartbeat"
3. Please share the ambari-server.log and ambari-agent logs from the same timestamp at which you notice the heartbeat lost, so that we can review them for any strange behaviour.
4. If the heartbeat loss follows a time pattern (it happens at specific times), then we should check whether any heavy job is running on the agent host that might be keeping the agent from sending heartbeats for a few seconds.
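One rough way to look for the time pattern mentioned in point 4 is to count matching log lines per hour of day. This is a sketch under assumptions: count_by_hour is a made-up helper name, the search phrase is whatever actually shows up in your own logs when the heartbeat is lost (the one in the example is only a placeholder), and the hour is taken from the first HH:MM:SS string found on each matching line.

```shell
#!/usr/bin/env bash
# Count log lines matching PATTERN, grouped by hour of day.
# Prints one "HH COUNT" pair per line, sorted by hour.
# Usage: count_by_hour LOGFILE PATTERN   (hypothetical helper name)
count_by_hour() {
    local file="$1" pattern="$2"
    grep -i -- "$pattern" "$file" \
        | awk '
            # Locate the first HH:MM:SS on the line and keep only the hour.
            match($0, /[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) {
                hour = substr($0, RSTART, 2)
                count[hour]++
            }
            END { for (hour in count) print hour, count[hour] }
        ' \
        | sort
}

# Example call (the phrase is a placeholder, adjust it to your logs):
# count_by_hour /var/log/ambari-server/ambari-server.log 'heartbeat lost'
```

If most of the hits fall into the same one or two hours every day, that is a hint to look at cron jobs, backups, or other heavy jobs scheduled on the agent hosts around that time.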
Thanks for sharing the details.
Cluster size: it is a 21-node cluster. We will try the options that you have mentioned.
Sharing the logs would be a little difficult for me; let me try my best.
Could you please share links to similar issues that could help me figure it out?
Hi @Gaurav Bapat,
This error seems to be because of the Python version.
Can you please refer to the following thread?
I hope your issue is the same.
I have Python 2.7.5 installed. Do I need to downgrade or upgrade it?
Is the SSL error related to the heartbeat loss? Also, why does my Metron component fail?
Hi @Gaurav Bapat,
As you are using Python 2.7.5,
you might be hitting the same bug mentioned in the above link.
You can refer to this link: https://access.redhat.com/articles/2039753#controlling-certificate-verification-7 and try disabling certificate verification.
Hope this helps.
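Before changing anything, it may help to check what the host is currently set to. The sketch below assumes the ini-style file with an [https] section and a verify key that the Red Hat article describes (as far as I recall it lives at /etc/python/cert-verification.cfg, please confirm against the article); https_verify_setting is a made-up helper name.

```shell
#!/usr/bin/env bash
# Print the value of "verify" from the [https] section of an ini-style file.
# Prints nothing if the section or key is absent.
# Usage: https_verify_setting FILE   (hypothetical helper name)
https_verify_setting() {
    awk -F= '
        # Track whether the current section header is [https].
        /^\[/ { in_https = ($0 ~ /^\[https\]/) }
        # Inside [https], print the value of the verify key without blanks.
        in_https && $1 ~ /^[ \t]*verify[ \t]*$/ {
            gsub(/[ \t]/, "", $2)
            print $2
        }
    ' "$1"
}

# Example call (the path is an assumption, see the article linked above):
# https_verify_setting /etc/python/cert-verification.cfg
```

Keep in mind that disabling verification affects every Python HTTPS client on that host, so it is better treated as a way to confirm the diagnosis than as a permanent setting.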