
yarn_nodemanager_health on the server XXXXXXXXXXXXXXXXXXX in the environment prod


Expert Contributor

I keep getting this alert continuously from just 2 servers. I ran an HDFS rebalance yesterday thinking it would help, but even after the rebalance I keep getting the alert from the same 2 servers.

As soon as I get this alert, I check the YARN NodeManager UI, and it is always up and running.

Since this alert is of type WEB, I am not able to increase the threshold.

Any suggestions?

3 REPLIES

Re: yarn_nodemanager_health on the server XXXXXXXXXXXXXXXXXXX in the environment prod

Super Collaborator

The YARN NodeManager Health alert is actually a SCRIPT style alert which checks the ws/v1/node/info endpoint on each NodeManager. Depending on your version of Ambari, you should be able to change the parameters of this script alert, including the timeouts.

Can you provide:

- The version of Ambari

- The text of the alert

- Whether you are Kerberized (there are problems with Kerberos ticket expiration on older versions of Ambari)

- The output of http://<node-manager-host>:<port>/ws/v1/node/info
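For anyone who wants to reproduce the check by hand, here is a minimal sketch of what a check like this does, assuming the endpoint returns JSON with `nodeInfo.nodeHealthy` and `nodeInfo.healthReport` when asked for `application/json`. It is written for Python 3 and is not the Ambari agent script itself; the function names `classify` and `check_nodemanager` and the `opener` parameter are hypothetical, chosen only for this illustration.

```python
import json
from urllib.error import URLError
from urllib.request import Request, urlopen


def classify(node_info):
    """Map a parsed /ws/v1/node/info payload to a (state, message) pair."""
    info = node_info.get("nodeInfo", {})
    if info.get("nodeHealthy") is True:
        return "OK", "NodeManager reports healthy"
    report = info.get("healthReport") or "no health report"
    return "CRITICAL", "NodeManager reports unhealthy: " + report


def check_nodemanager(url, timeout=5.0, opener=urlopen):
    """Fetch the endpoint with a timeout; a slow or failed reply is CRITICAL.

    `opener` is injectable only so the function can be exercised offline.
    """
    request = Request(url, headers={"Accept": "application/json"})
    try:
        with opener(request, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (URLError, OSError, ValueError) as err:
        # Timeouts, refused connections and unparsable bodies all end up here.
        return "CRITICAL", "Connection failed to %s (%s)" % (url, err)
    return classify(payload)
```

Calling `check_nodemanager("http://<node-manager-host>:8042/ws/v1/node/info", timeout=5.0)` from the Ambari server or agent host should show whether the node answers within the timeout.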


Re: yarn_nodemanager_health on the server XXXXXXXXXXXXXXXXXXX in the environment prod

Expert Contributor

@jonathan Hurley

- The version of Ambari: 2.2.0

- The text of the alert: I get 2 alerts, and sometimes just one of the two below.

Alert 1

Description: yarn_nodemanager_health on the server XXXXXXXXXXXXXX in the environment prod

Message:

Connection failed to http://xxxxxxxxxxxxxxxxxxx:8042/ws/v1/node/info (Traceback (most recent call last):
  File /var/lib/ambari-agent/cache/common-services/YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py, line 165, in execute
    url_response = urllib2.urlopen(query, timeout=connection_timeout)
  File /usr/lib64/python2.6/urllib2.py, line 126, in urlopen
    return _opener.open(url, data, timeout)
  File /usr/lib64/python2.6/urllib2.py, line 391, in open
    response = self._open(req, data)
  File /usr/lib64/python2.6/urllib2.py, line 409, in _open
    '_open', req)
  File /usr/lib64/python2.6/urllib2.py, line 369, in _call_chain
    result = func(*args)
  File /usr/lib64/python2.6/urllib2.py, line 1190, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File /usr/lib64/python2.6/urllib2.py, line 1165, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
)

Alert 2

Description: yarn_nodemanager_webui on the server XXXXXXXXXXXXXXX in the environment prod

Message: Connection failed to http://xxxxxxxxxxxxxxxxxxx:8042/ws/v1/node/info (<urlopen error timed out>)

- Whether you are Kerberized: not Kerberized

- The output of http://xxxxxxxxxxxxxxxxxxx:8042/ws/v1/node/info (the output below is from one of the NodeManagers):

<nodeInfo>
  <healthReport/>
  <totalVmemAllocatedContainersMB>219340</totalVmemAllocatedContainersMB>
  <totalPmemAllocatedContainersMB>104448</totalPmemAllocatedContainersMB>
  <totalVCoresAllocatedContainers>38</totalVCoresAllocatedContainers>
  <vmemCheckEnabled>false</vmemCheckEnabled>
  <pmemCheckEnabled>true</pmemCheckEnabled>
  <lastNodeUpdateTime>1482243519843</lastNodeUpdateTime>
  <nodeHealthy>true</nodeHealthy>
  <nodeManagerVersion>2.7.1.2.3.0.0-2557</nodeManagerVersion>
  <nodeManagerBuildVersion>2.7.1.2.3.0.0-2557 from 9f17d40a0f2046d217b2bff90ad6e2fc7e41f5e1 by jenkins source checksum 8f35b2e7d68590c926a098f47c56edb</nodeManagerBuildVersion>
  <nodeManagerVersionBuiltOn>2015-07-14T13:15Z</nodeManagerVersionBuiltOn>
  <hadoopVersion>2.7.1.2.3.0.0-2557</hadoopVersion>
  <hadoopBuildVersion>2.7.1.2.3.0.0-2557 from 9f17d40a0f2046d217b2bff90ad6e2fc7e41f5e1 by jenkins source checksum 54f9bbb4492f92975e84e390599b881d</hadoopBuildVersion>
  <hadoopVersionBuiltOn>2015-07-14T13:08Z</hadoopVersionBuiltOn>
  <id><node manager host>:45454</id>
  <nodeHostName><node manager host></nodeHostName>
</nodeInfo>


Re: yarn_nodemanager_health on the server XXXXXXXXXXXXXXXXXXX in the environment prod

Super Collaborator

OK, this is a simple problem of the URL taking too long to return data. The default timeout here is 5 seconds, and it seems like your servers can't return the response within that time period. Unfortunately, the ability to change parameters like timeouts via the web client wasn't added until Ambari 2.4. On Ambari 2.2.0, you'll need to change it manually:

You'll first want to get the current alert definition as it exists today:

GET api/v1/clusters/<clusterName>/alert_definitions?AlertDefinition/name=yarn_nodemanager_health&fields=AlertDefinition/*

{
  "href": "http://localhost:8080/api/v1/clusters/c1/alert_definitions/12345",
  "AlertDefinition": {
    "cluster_name": "c1",
    "component_name": "NODEMANAGER",
    "description": "This host-level alert checks the node health property available from the NodeManager component.",
    "enabled": true,
    "help_url": null,
    "id": 10,
    "ignore_host": false,
    "interval": 1,
    "label": "NodeManager Health",
    "name": "yarn_nodemanager_health",
    "repeat_tolerance": 1,
    "repeat_tolerance_enabled": false,
    "scope": "HOST",
    "service_name": "YARN",
    "source": {
      "parameters": [
        {
          "name": "connection.timeout",
          "display_name": "Connection Timeout",
          "units": "seconds",
          "value": 5,
          "description": "The maximum time before this alert is considered to be CRITICAL",
          "type": "NUMERIC",
          "threshold": "CRITICAL"
        }
      ],
      "path": "YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py",
      "type": "SCRIPT"
    }
  }
}
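As a rough illustration of the GET step above, the following Python 3 sketch builds the query URL and pulls one script parameter out of a definition shaped like the JSON shown. The helper names `definition_query_url` and `script_parameter` are hypothetical, and the base URL and cluster name are placeholders you would replace with your own.

```python
from urllib.parse import urlencode


def definition_query_url(base_url, cluster_name, alert_name):
    """Build the alert_definitions query URL for a single alert name."""
    query = urlencode(
        {"AlertDefinition/name": alert_name, "fields": "AlertDefinition/*"},
        safe="/*",  # keep the path-like keys and the wildcard readable
    )
    return "%s/api/v1/clusters/%s/alert_definitions?%s" % (
        base_url.rstrip("/"), cluster_name, query)


def script_parameter(definition, name):
    """Return the named entry from source.parameters, or None if absent."""
    parameters = definition["AlertDefinition"]["source"].get("parameters", [])
    for parameter in parameters:
        if parameter.get("name") == name:
            return parameter
    return None
```

With the definition above loaded as a dict, `script_parameter(definition, "connection.timeout")["value"]` gives the current timeout in seconds.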

Change the connection.timeout parameter to something higher and PUT the entire JSON back to the specific ID returned to you:

PUT api/v1/clusters/<clusterName>/alert_definitions/12345

{
  "AlertDefinition": {
    "cluster_name": "c1",
    "component_name": "NODEMANAGER",
    "description": "This host-level alert checks the node health property available from the NodeManager component.",
    "enabled": true,
    "help_url": null,
    "id": 10,
    "ignore_host": false,
    "interval": 1,
    "label": "NodeManager Health",
    "name": "yarn_nodemanager_health",
    "repeat_tolerance": 1,
    "repeat_tolerance_enabled": false,
    "scope": "HOST",
    "service_name": "YARN",
    "source": {
      "parameters": [
        {
          "name": "connection.timeout",
          "display_name": "Connection Timeout",
          "units": "seconds",
          "value": 10,
          "description": "The maximum time before this alert is considered to be CRITICAL",
          "type": "NUMERIC",
          "threshold": "CRITICAL"
        }
      ],
      "path": "YARN/2.1.0.2.0/package/alerts/alert_nodemanager_health.py",
      "type": "SCRIPT"
    }
  }
}
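If you would rather script the change than edit the JSON by hand, here is a minimal Python 3 sketch, assuming HTTP basic authentication and the `X-Requested-By` header that Ambari expects on modifying requests. The helpers `with_timeout` and `build_put_request` are hypothetical names for this example, and the credentials are placeholders.

```python
import base64
import copy
import json
from urllib.request import Request


def with_timeout(definition, seconds):
    """Return a copy of the definition with connection.timeout set to `seconds`."""
    updated = copy.deepcopy(definition)
    updated.pop("href", None)  # href comes from the GET response, not the PUT body
    for parameter in updated["AlertDefinition"]["source"]["parameters"]:
        if parameter["name"] == "connection.timeout":
            parameter["value"] = seconds
            return updated
    raise KeyError("connection.timeout parameter not found")


def build_put_request(href, definition, user, password):
    """Prepare (but do not send) the PUT request for the updated definition."""
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return Request(
        href,
        data=json.dumps(definition).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": "Basic " + token,
            "X-Requested-By": "ambari",
        },
    )
```

Passing the result to `urllib.request.urlopen` sends the update. Using the `href` returned by the GET avoids guessing the definition ID.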