I'm seeing an issue where HiveServer2 fails after about an hour and a half. According to the logs, it is trying to contact the data node on a random port and failing, which I suspect is due to the firewall/security group configuration. This repeats 45 times on 5 different ports before it gives up and shuts down.
2017-07-29 10:54:22,004 INFO [main]: ipc.Client (Client.java:handleConnectionTimeout(870)) - Retrying connect to server: data.node.host/10.x.x.x:39085. Already tried 7 time(s); maxRetries=45
2017-07-29 10:54:38,697 INFO [HiveServer2-Background-Pool: Thread-93]: ipc.Client (Client.java:handleConnectionTimeout(870)) - Retrying connect to server: data.node.host/10.x.x.x:41192. Already tried 2 time(s); maxRetries=3
2017-07-29 10:54:42,024 INFO [main]: ipc.Client (Client.java:handleConnectionTimeout(870)) - Retrying connect to server: data.node.host/10.x.x.x:39085. Already tried 8 time(s); maxRetries=45
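To document exactly which ports are being attempted, it may help to pull the host/port pairs out of the retry messages. The following is only a rough sketch: the regular expression is written against the three log lines quoted above, and the function name summarize_retries is made up for illustration, not part of Hive or Hadoop.

```python
import re

# Pattern written against the "Retrying connect to server" lines quoted above.
RETRY_RE = re.compile(
    r"Retrying connect to server: (?P<host>[^/\s]+)/(?P<ip>[^:\s]+):(?P<port>\d+)\. "
    r"Already tried (?P<tried>\d+) time\(s\); maxRetries=(?P<max>\d+)"
)

# The log excerpt from the post, used here as sample input.
SAMPLE = (
    "2017-07-29 10:54:22,004 INFO [main]: ipc.Client (Client.java:handleConnectionTimeout(870)) - "
    "Retrying connect to server: data.node.host/10.x.x.x:39085. Already tried 7 time(s); maxRetries=45\n"
    "2017-07-29 10:54:38,697 INFO [HiveServer2-Background-Pool: Thread-93]: ipc.Client "
    "(Client.java:handleConnectionTimeout(870)) - "
    "Retrying connect to server: data.node.host/10.x.x.x:41192. Already tried 2 time(s); maxRetries=3\n"
    "2017-07-29 10:54:42,024 INFO [main]: ipc.Client (Client.java:handleConnectionTimeout(870)) - "
    "Retrying connect to server: data.node.host/10.x.x.x:39085. Already tried 8 time(s); maxRetries=45\n"
)


def summarize_retries(log_text):
    """Map (host, port) to (highest retry count seen, maxRetries)."""
    summary = {}
    for match in RETRY_RE.finditer(log_text):
        key = (match.group("host"), int(match.group("port")))
        tried = int(match.group("tried"))
        max_retries = int(match.group("max"))
        previous = summary.get(key)
        # Keep only the entry with the highest retry count for each target.
        if previous is None or tried > previous[0]:
            summary[key] = (tried, max_retries)
    return summary
```

Running summarize_retries over the full HiveServer2 log gives a list of the distinct ports involved, which can then be compared against the documented port list.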
We're operating in a rather restrictive environment where every open port must be identified and documented, so up to now we have only opened the ports listed in the documentation for each node. I suspect that wider port ranges are required for the internode connections, but I have been unable to find those documented anywhere.
As an update to this, we ran a test with all ports opened between the nodes in the cluster, and that appears to resolve the issue. Is it generally assumed that all ports are open within the cluster, or are there certain port ranges that should be opened?
@hwx: it would be a good idea to check the 'hosts' file on the machine running HiveServer2 and confirm that it contains an entry for the data node host.
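One way to confirm what the HiveServer2 machine actually resolves the data node name to (whether from the hosts file or DNS) is to ask the system resolver directly. This is a small sketch using Python's standard socket module; the function name resolve_host is hypothetical, and data.node.host is the placeholder name from the log above.

```python
import socket


def resolve_host(hostname):
    """Return the sorted, de-duplicated addresses the system resolver gives for hostname.

    Returns an empty list if the name does not resolve.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Calling resolve_host("data.node.host") on the HiveServer2 machine should return the data node's 10.x.x.x address; an empty list or an unexpected address would point to a hosts/DNS problem rather than a firewall one.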
We're looking at the possibility that the ephemeral ports are being blocked when used asynchronously. I forgot to mention in the original post that this is on AWS, which allows synchronous responses to outbound connections on these ports, but it appears that the failing connections are being attempted separately.
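To test the ephemeral-port theory, the ports seen in the log can be compared with the kernel's ephemeral range on the data node. This is a sketch that assumes the nodes run Linux, where the range is exposed in /proc/sys/net/ipv4/ip_local_port_range. The fallback of 32768 to 60999 is the common Linux default and is an assumption if that file is missing; the function names are made up for illustration.

```python
from pathlib import Path

# Linux exposes the ephemeral (local) port range here as "low high".
RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")
# Common Linux default; only an assumption when the file cannot be read.
FALLBACK_RANGE = (32768, 60999)


def ephemeral_range(path=RANGE_FILE):
    """Read the ephemeral port range, falling back to the assumed default."""
    try:
        low, high = path.read_text().split()
        return int(low), int(high)
    except (OSError, ValueError):
        return FALLBACK_RANGE


def in_ephemeral_range(port, port_range=None):
    """Return True if port lies inside the given (or detected) ephemeral range."""
    low, high = port_range or ephemeral_range()
    return low <= port <= high
```

With the default range, both 39085 and 41192 from the log fall inside it, which would be consistent with services on the data node listening on dynamically assigned ports.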
Log in to the target data node host and check that the data node and the HiveServer2 host are reachable from each other over the network (for example, that each can ping the other).
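Note that ping only exercises ICMP, so it can succeed even when the TCP ports in question are blocked. A TCP-level probe from the HiveServer2 host is a closer match to what the log shows failing. Below is a minimal sketch using only the standard library; can_connect and probe_ports are illustrative names, and the host and ports would be whatever appears in your own log.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_ports(host, ports, timeout=3.0):
    """Probe each port in turn and return a {port: reachable} mapping."""
    return {port: can_connect(host, port, timeout) for port in ports}
```

For example, probe_ports("data.node.host", [39085, 41192]) run on the HiveServer2 machine while the retries are occurring would show whether those specific ports are reachable. A port reports False both when it is filtered and when nothing is listening on it, so the result is only meaningful while the service is up.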
Checking whether any Hive components are running on the data node can help in understanding the communication between these two servers.
Check the status of the data node in the NameNode UI.