Node lost connection to the internet

Hi,

 

The other day I ran into a weird issue on a staging cluster and could not find the root cause. One of the cluster's gateway nodes completely lost its connection to the cluster and to basically everything else: I could not SSH into it, and it did not respond to ping.

I don't know if it's related, but the day before the issue I switched some Flume agents from MemoryChannel to a (probably misconfigured) FileChannel. Only two Flume agents and the Cloudera SCM agent service run on that node, nothing else.

I did not find many meaningful log lines; the best I found was in the cloudera-scm-agent logs:

[25/Nov/2017 13:08:41 +0000] 24600 MainThread throttling_logger INFO     (14 skipped) Identified java component java7 with full version JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera java version "1.7.0_67" Java(TM) SE Runtime Environment (build 1.7.0_67-b01) Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)  for requested version 7.
[25/Nov/2017 13:11:26 +0000] 24600 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.03 max:0.04 LIFE_MAX:0.22
[25/Nov/2017 13:21:26 +0000] 24600 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.02 max:0.04 LIFE_MAX:0.22
[25/Nov/2017 13:31:27 +0000] 24600 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.02 max:0.04 LIFE_MAX:0.22
[25/Nov/2017 13:36:16 +0000] 24600 DnsResolutionMonitor throttling_logger INFO     DnsTest return=1
stdout=Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create new native thread

stderr=
[25/Nov/2017 13:36:46 +0000] 24600 DnsResolutionMonitor throttling_logger INFO     DnsTest not running. Java not located.
[25/Nov/2017 13:41:27 +0000] 24600 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.03 max:0.04 LIFE_MAX:0.22
[25/Nov/2017 13:51:28 +0000] 24600 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.02 max:0.03 LIFE_MAX:0.22
[25/Nov/2017 14:01:28 +0000] 24600 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.02 max:0.04 LIFE_MAX:0.22
[25/Nov/2017 14:06:47 +0000] 24600 DnsResolutionMonitor throttling_logger INFO     (59 skipped) DnsTest not running. Java not located.
[25/Nov/2017 14:11:29 +0000] 24600 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.02 max:0.04 LIFE_MAX:0.22
[25/Nov/2017 14:21:29 +0000] 24600 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.03 max:0.04 LIFE_MAX:0.22
[25/Nov/2017 14:31:30 +0000] 24600 MainThread heartbeat_tracker INFO     HB stats (seconds): num:40 LIFE_MIN:0.02 min:0.02 mean:0.03 max:0.07 LIFE_MAX:0.22
[25/Nov/2017 14:36:49 +0000] 24600 DnsResolutionMonitor throttling_logger INFO     (59 skipped) DnsTest not running. Java not located.
[25/Nov/2017 14:37:00 +0000] 24600 MainThread agent        ERROR    Heartbeating to cdh-manager.staging.io:7182 failed.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.12.0-py2.7.egg/cmf/agent.py", line 1406, in _send_heartbeat
    response = self.requestor.request('heartbeat', dict(request=heartbeat))
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/avro-1.6.3-py2.7.egg/avro/ipc.py", line 141, in request
    return self.issue_request(call_request, message_name, request_datum)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/avro-1.6.3-py2.7.egg/avro/ipc.py", line 254, in issue_request
    call_response = self.transceiver.transceive(call_request)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/avro-1.6.3-py2.7.egg/avro/ipc.py", line 483, in transceive
    result = self.read_framed_message()
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/avro-1.6.3-py2.7.egg/avro/ipc.py", line 487, in read_framed_message
    response = self.conn.getresponse()
  File "/usr/lib64/python2.7/httplib.py", line 1089, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 444, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 400, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
timeout: timed out
[25/Nov/2017 14:37:00 +0000] 24600 MainThread agent        WARNING  Long HB processing time: 45.0594758987
[25/Nov/2017 14:37:00 +0000] 24600 MainThread agent        WARNING  Delayed HB: 30s since last
[25/Nov/2017 14:37:00 +0000] 24600 MainThread agent        ERROR    Heartbeating to cdh-manager.staging.io:7182 failed.
Traceback (most recent call last):
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/cmf-5.12.0-py2.7.egg/cmf/agent.py", line 1401, in _send_heartbeat
    self.master_port)
  File "/usr/lib64/cmf/agent/build/env/lib/python2.7/site-packages/avro-1.6.3-py2.7.egg/avro/ipc.py", line 469, in __init__
    self.conn.connect()
  File "/usr/lib64/python2.7/httplib.py", line 807, in connect
    self.timeout, self.source_address)
  File "/usr/lib64/python2.7/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno -2] Name or service not known

After that, the agent kept logging the last two stack traces. It apparently could not resolve the hostname of the manager node, which would make sense if the machine had lost its network connection. However, the machine runs on AWS, so it seems unlikely that it lost connectivity out of nowhere. I also don't know what to make of the JVM errors.
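
To see which symptom showed up first, it may help to pull a rough timeline out of the agent log. The following is only a minimal sketch: the function name, the labels, and the `SAMPLE` excerpt are made up for illustration, the marker strings are copied from the log quoted above, and the timestamp parsing assumes an English/C locale for the month abbreviation.

```python
import re
from datetime import datetime

# Timestamp prefix of the cloudera-scm-agent lines quoted above,
# e.g. "[25/Nov/2017 13:36:16 +0000]". The UTC offset is matched but ignored.
STAMP = re.compile(r"^\[(\d{2}/\w{3}/\d{4} \d{2}:\d{2}:\d{2}) [+-]\d{4}\]")

# Hypothetical labels mapped to substrings taken from the quoted log.
MARKERS = {
    "jvm_thread_exhaustion": "unable to create new native thread",
    "heartbeat_timeout": "timeout: timed out",
    "dns_failure": "Name or service not known",
}


def first_occurrences(log_text):
    """Return {label: datetime} for the first appearance of each marker.

    Continuation lines (tracebacks, stdout dumps) have no timestamp of
    their own, so they inherit the one from the last stamped line.
    """
    current = None
    found = {}
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            current = datetime.strptime(match.group(1), "%d/%b/%Y %H:%M:%S")
        if current is None:
            continue  # no timestamp seen yet, nothing to attribute the line to
        for label, needle in MARKERS.items():
            if needle in line and label not in found:
                found[label] = current
    return found


# Shortened excerpt of the log from this post, used only as demo input.
SAMPLE = """\
[25/Nov/2017 13:36:16 +0000] 24600 DnsResolutionMonitor throttling_logger INFO     DnsTest return=1
stdout=Error occurred during initialization of VM
java.lang.OutOfMemoryError: unable to create new native thread
[25/Nov/2017 14:37:00 +0000] 24600 MainThread agent        ERROR    Heartbeating to cdh-manager.staging.io:7182 failed.
timeout: timed out
[25/Nov/2017 14:37:00 +0000] 24600 MainThread agent        ERROR    Heartbeating to cdh-manager.staging.io:7182 failed.
gaierror: [Errno -2] Name or service not known
"""

for label, when in sorted(first_occurrences(SAMPLE).items(), key=lambda kv: kv[1]):
    print(when.isoformat(), label)
```

In the log above, the JVM "unable to create new native thread" error is stamped 13:36:16, about an hour before the first heartbeat failure at 14:37:00. That ordering might mean the node was already running short of threads or processes before the network symptoms started, but the log alone does not prove it.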