NodeManager Web UI connection timeouts; always 5 seconds

Contributor

I have a fairly new cluster with 1 management node (Bright Cluster Manager and Ambari server), 2 NameNodes (1 active, 1 passive) and 17 DataNodes. We're running Hortonworks HDP 2.3.2 and Ambari 2.1.2.

Each node has two 10GbE NICs bonded together, with jumbo frames (MTU=9000) enabled on the interfaces.

From the very beginning of the cluster we have been receiving sporadic NodeManager Web UI alerts in Ambari. All 17 DataNodes show connection timeouts throughout the day. These timeouts are not correlated with any sort of load on the system; they happen no matter what.

When the connection to port 8042 is successful, the connection time is around 5-7 ms, but when the connection fails I get response times of 5 seconds. Never 3 seconds or 6 seconds, always 5 seconds. For example...

    [root@XXXX ~]# python2.7 YARN_response.py
    Testing response time at http://XXXX:8042
    Output is written if http response is > 1 second.
    Press Ctrl-C to exit!

    2016-02-08 07:19:17.877947 Host: XX23:8042 conntime - 5.0073 seconds, HTTP response - 200
    2016-02-08 07:19:22.889430 Host: XX25:8042 conntime - 5.0078 seconds, HTTP response - 200
    2016-02-08 07:19:48.466520 Host: XX15:8042 conntime - 5.0071 seconds, HTTP response - 200
    2016-02-08 07:20:24.423817 Host: XX15:8042 conntime - 5.0073 seconds, HTTP response - 200
    2016-02-08 07:20:29.449196 Host: XX23:8042 conntime - 5.0073 seconds, HTTP response - 200
    2016-02-08 07:21:00.190991 Host: XX19:8042 conntime - 5.0077 seconds, HTTP response - 200
    2016-02-08 07:21:05.210073 Host: XX24:8042 conntime - 5.0073 seconds, HTTP response - 200
    2016-02-08 07:21:28.738996 Host: XX17:8042 conntime - 5.0078 seconds, HTTP response - 200
    2016-02-08 07:21:33.747728 Host: XX18:8042 conntime - 5.0086 seconds, HTTP response - 200
    2016-02-08 07:21:38.764546 Host: XX22:8042 conntime - 5.0075 seconds, HTTP response - 200

If I let the script run long enough then every DataNode will eventually turn up.
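For anyone who wants to reproduce the measurement: the original YARN_response.py is not included in this post, so the following is only a rough sketch of how such a probe could look in Python 3. The function names (`probe`, `format_slow`, `watch`), the 10-second request timeout, and the 1-second pause between rounds are all assumptions made for illustration. Only port 8042, the 1-second reporting threshold, and the line format come from the output above.

```python
import time
import urllib.request
from datetime import datetime

SLOW_THRESHOLD = 1.0  # seconds; mirrors the "> 1 second" rule in the output above


def probe(host, port=8042, timeout=10.0):
    """Return (elapsed_seconds, http_status) for one GET against the NodeManager UI.

    The elapsed time covers name resolution, TCP connect and the HTTP response,
    so a slow DNS lookup shows up here just like a slow server would.
    """
    url = "http://%s:%d/" % (host, port)
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        status = response.status
    return time.monotonic() - start, status


def format_slow(when, host, port, elapsed, status):
    """Format one report line in the same shape as the output shown above."""
    return "%s Host: %s:%d conntime - %.4f seconds, HTTP response - %d" % (
        when, host, port, elapsed, status)


def watch(hosts, port=8042, pause=1.0):
    """Poll every host in turn, forever, and print only the slow responses."""
    while True:
        for host in hosts:
            try:
                elapsed, status = probe(host, port)
            except OSError as exc:  # URLError and socket timeouts are OSError subclasses
                print("%s Host: %s:%d failed: %s" % (datetime.now(), host, port, exc))
                continue
            if elapsed > SLOW_THRESHOLD:
                print(format_slow(datetime.now(), host, port, elapsed, status))
        time.sleep(pause)
```

Calling `watch(["XX15", "XX23"])` with real hostnames would loop until interrupted with Ctrl-C. Because the timer wraps the whole request, the sketch cannot tell by itself which phase (lookup, connect, or response) was slow.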

Has anyone out there ever seen something like this? Because the connection time is so discrete, I'm thinking some kind of timeout must be involved. My network team says that the top-of-rack switches all look good. I'm running out of ideas. Any suggestions?

1 ACCEPTED SOLUTION

Contributor

For once I can solve my own problem. 🙂 It turns out that this is a DNS issue and the solution is to put

    options single-request 

in /etc/resolv.conf on all nodes.

This option is described in the resolv.conf man page as follows:

    single-request (since glibc 2.10)
    Sets RES_SNGLKUP in _res.options.  By default, glibc performs IPv4 and IPv6 lookups in parallel since version 2.9.  Some appliance DNS servers cannot handle these queries properly and make the requests time out. This option disables the behavior and makes glibc perform the IPv6 and IPv4 requests sequentially (at the cost of some slowdown of the resolving process). 
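The fixed 5-second delay is consistent with the resolver's default per-query timeout, which the resolv.conf man page gives as 5 seconds (`options timeout:n`). One way to check whether name resolution is the slow step is to time `getaddrinfo` directly, once with `AF_UNSPEC` (which asks for both IPv4 and IPv6 addresses) and once with `AF_INET` (IPv4 only). The sketch below does that; the helper names are hypothetical, and since the stall is intermittent and lookups may be cached, a single run proves little and the comparison would need to be repeated many times.

```python
import socket
import time


def time_call(func, *args):
    """Run func(*args) and return (elapsed_seconds, error_or_None)."""
    start = time.monotonic()
    try:
        func(*args)
        error = None
    except OSError as exc:  # socket.gaierror is an OSError subclass
        error = exc
    return time.monotonic() - start, error


def time_lookup(host, port, family):
    """Time one getaddrinfo() call restricted to the given address family."""
    return time_call(socket.getaddrinfo, host, port, family, socket.SOCK_STREAM)


def compare_families(host, port=8042):
    """Return lookup times in seconds for the dual-family and IPv4-only cases."""
    return {
        "AF_UNSPEC": time_lookup(host, port, socket.AF_UNSPEC)[0],
        "AF_INET": time_lookup(host, port, socket.AF_INET)[0],
    }
```

If the `AF_UNSPEC` lookups occasionally take about 5 seconds while the `AF_INET` lookups stay fast, that points at the parallel IPv4/IPv6 queries described in the man page rather than at the NodeManager itself.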

Cluster performance is now as expected.
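Since the option has to be present on every node, it can help to verify the file contents after rolling the change out. The following is a small sketch that parses the text of a resolv.conf file and reports whether `single-request` is set. The function names are made up for this example, and treating the tokens from all `options` lines together is an assumption about how the resolver combines them.

```python
def resolver_options(resolv_conf_text):
    """Collect every token that follows an `options` keyword in resolv.conf text."""
    found = []
    for line in resolv_conf_text.splitlines():
        # The man page treats lines with '#' or ';' in the first column as comments.
        if line.startswith("#") or line.startswith(";"):
            continue
        fields = line.split()
        if fields and fields[0] == "options":
            found.extend(fields[1:])
    return found


def has_single_request(resolv_conf_text):
    """True if `single-request` is set (not the separate `single-request-reopen`)."""
    return "single-request" in resolver_options(resolv_conf_text)
```

For example, `has_single_request(open("/etc/resolv.conf").read())` could be run on each node through whatever remote-execution tool the cluster already uses.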

2 REPLIES

Master Mentor

@Jason Breitweg I have accepted your answer. Could you publish an article based on this? This is really helpful.