
Lost HeartBeats Ambari


New Contributor

Hello,

I don't really know why but I lost all heartbeats on the main node of my cluster.

2813-heartbeats.png

Do you know how I can solve this problem? I have already tried rebooting the node manually.

38 REPLIES

Re: Lost HeartBeats Ambari

Mentor

Please make sure the agent is up on the node:

ambari-agent status

Re: Lost HeartBeats Ambari

@Arthur GREVIN

Ambari Version please?

Can you give me the output of the commands below, run on the node that lost its heartbeat?

ps -ef | grep kPT

ambari-agent status

Re: Lost HeartBeats Ambari

New Contributor

I have version 2.1.1.

ps -ef | grep kPT gives nothing apart from the grep process itself:

root 8364 8276 0 17:43 pts/1 00:00:00 grep kPT

ambari-agent status gives:

Found ambari-agent PID: 2123

ambari-agent running.

Agent PID at: /var/run/ambari-agent/ambari-agent.pid

Agent out at: /var/log/ambari-agent/ambari-agent.out

Agent log at: /var/log/ambari-agent/ambari-agent.log

Re: Lost HeartBeats Ambari

Mentor

please provide logs for agent and server
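
When collecting the logs, it can help to post only the ERROR and WARNING lines rather than the whole file. The following is a minimal Python sketch, not an Ambari tool: the function names `filter_problem_lines` and `tail_problems` are made up for illustration, and it assumes each log line starts with its level, as the agent log does.

```python
import re

# Assumption: every log line starts with its level, for example
# "ERROR 2016-03-16 10:18:31,414 NetUtil.py:77 - ...".
LEVEL_RE = re.compile(r"^(ERROR|WARNING)\b")


def filter_problem_lines(lines):
    """Return only the lines that begin with ERROR or WARNING."""
    return [line.rstrip("\n") for line in lines if LEVEL_RE.match(line)]


def tail_problems(path, limit=50):
    """Read a log file and return its last `limit` ERROR/WARNING lines."""
    with open(path, errors="replace") as handle:
        return filter_problem_lines(handle)[-limit:]
```

For example, `tail_problems("/var/log/ambari-agent/ambari-agent.log")` uses the agent log path reported by `ambari-agent status`. The server log puts the timestamp before the level, so the pattern would need adjusting for it.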

Re: Lost HeartBeats Ambari

New Contributor

Agent log:

INFO 2016-03-16 10:18:31,247 NetUtil.py:59 - Connecting to https://dl-master:8440/ca

ERROR 2016-03-16 10:18:31,414 NetUtil.py:77 - [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)

ERROR 2016-03-16 10:18:31,414 NetUtil.py:78 - SSLError: Failed to connect. Please check openssl library versions. Refer to: https://bugzilla.redhat.com/show_bug.cgi?id=1022468 for more details.

WARNING 2016-03-16 10:18:31,417 NetUtil.py:105 - Server at https://bugzilla.redhat.com/show_bug.cgi?id=1022468 is not reachable, sleeping for 10 seconds...

WARNING 2016-03-16 10:18:31,417 NetUtil.py:105 - Server at https://bugzilla.redhat.com/show_bug.cgi?id=1022468 is not reachable, sleeping for 10 seconds...

Server log, mostly this:

16 Mar 2016 10:21:38,909 INFO [qtp-client-4711] MetricsPropertyProvider:518 - METRICS_COLLECTOR host is not live. Skip populating resources with metrics.

16 Mar 2016 10:21:38,910 INFO [qtp-client-4711] MetricsPropertyProvider:518 - METRICS_COLLECTOR host is not live. Skip populating resources with metrics.

Re: Lost HeartBeats Ambari

New Contributor

Here is the ambari-alert log:

Exception in thread "main" java.lang.RuntimeException: java.net.ConnectException: Call From dl-s01/10.0.0.5 to dl-master:8020 failed on connection exception: java.net.ConnectException: Connection refused

Caused by: java.net.ConnectException: Call From dl-s01/10.0.0.5 to dl-master:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
    at org.apache.hadoop.ipc.Client.call(Client.java:1431)
    at org.apache.hadoop.ipc.Client.call(Client.java:1358)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2116)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
    at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
    at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
    at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:596)
    at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
    ... 8 more

Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:612)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:710)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:373)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1493)
    at org.apache.hadoop.ipc.Client.call(Client.java:1397)
    ... 28 more)

Re: Lost HeartBeats Ambari

Mentor

Please check whether a firewall is running on either machine. If it is, either stop it or open the ports that Ambari and the services need to communicate.
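
A quick way to see whether those ports are reachable from the other machine is a plain TCP connection test. This is only a sketch under assumptions: `port_open` and `check_ports` are illustrative names, and the port list to probe is up to you (8440 and 8020 appear in the logs in this thread; 8441 is, as far as I know, Ambari's default secured agent port, so treat it as an assumption for your setup).

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_ports(host, ports):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_open(host, port) for port in ports}
```

Running `check_ports("dl-master", [8440, 8441, 8020])` from a worker node returns a dict of port to True/False. A False result does not by itself prove a firewall is the cause: a refused connection also occurs when the service behind the port is simply not running.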

Re: Lost HeartBeats Ambari

Mentor

I see it says the Ambari Metrics Collector is not live. Please check the status of all Metrics Monitors and the Collector.

Re: Lost HeartBeats Ambari

New Contributor

I don't know if this is the status you mean:

2841-metrics.png

What would be the next step to solve the problem?