We are seeing a lot of "No route to host" errors in the DataNode logs, and Impala queries are failing because of them. The errors occur both for connections within a single node and for connections between nodes, and multiple nodes are affected. Host Inspector runs without reporting any issues.
We have done many checks with the OS and network teams but could not find a cause. Any help would be appreciated.
1004:DataXceiver error processing WRITE_BLOCK operation src: /192.168.225.165:55010 dst: /192.168.225.165:1004
java.net.NoRouteToHostException: No route to host
1004:DataXceiver error processing WRITE_BLOCK operation src: /192.168.225.68:35322 dst: /192.168.225.68:1004
java.net.NoRouteToHostException: No route to host
1004:DataXceiver error processing WRITE_BLOCK operation src: /192.168.225.171:40718 dst: /192.168.225.165:1004
java.net.NoRouteToHostException: No route to host
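To see at a glance which source/destination pairs are affected, the DataXceiver lines can be tallied with a short script. This is only a minimal sketch that assumes the log format shown above; the name `tally_pairs` is made up for illustration and is not part of any Hadoop tooling.

```python
import re
from collections import Counter

# Matches the src/dst addresses in DataXceiver error lines like the ones above.
PAIR_RE = re.compile(r"src: /(?P<src>[\d.]+):\d+ dst: /(?P<dst>[\d.]+):\d+")

def tally_pairs(lines):
    """Count failures per (src, dst, same_host) triple."""
    counts = Counter()
    for line in lines:
        m = PAIR_RE.search(line)
        if m:
            src, dst = m.group("src"), m.group("dst")
            counts[(src, dst, src == dst)] += 1
    return counts

# The three log lines quoted above.
sample = [
    "1004:DataXceiver error processing WRITE_BLOCK operation src: /192.168.225.165:55010 dst: /192.168.225.165:1004",
    "1004:DataXceiver error processing WRITE_BLOCK operation src: /192.168.225.68:35322 dst: /192.168.225.68:1004",
    "1004:DataXceiver error processing WRITE_BLOCK operation src: /192.168.225.171:40718 dst: /192.168.225.165:1004",
]

for (src, dst, same_host), n in tally_pairs(sample).items():
    kind = "same host" if same_host else "cross host"
    print(f"{src} -> {dst} ({kind}): {n}")
```

Run against a full DataNode log, this should make it easier to tell whether the failures cluster on particular hosts or are spread evenly.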
Created 03-24-2020 03:46 AM
"No route to host" signals that an error occurred while attempting to connect a socket to a remote address and port. Typically, the remote host cannot be reached because of an intervening firewall or because an intermediate router is down.
If you are not using static IPs, can you check on hosts 192.168.225.165, 192.168.225.68 and 192.168.225.171 that their IPs haven't changed, by running
$ ifconfig
The output should match the IPs in the /etc/hosts table.
Please do that and report back.
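If comparing by eye across many nodes is tedious, the check can be scripted. The sketch below is only an illustration: the function names are made up, and it assumes you paste in the text of `ifconfig` and of /etc/hosts yourself. The regular expression accepts both the older `inet addr:x.x.x.x` layout and the newer `inet x.x.x.x` layout, since the `ifconfig` output format differs between net-tools versions.

```python
import re

# Accepts both "inet addr:1.2.3.4" (older net-tools) and "inet 1.2.3.4" (newer).
INET_RE = re.compile(r"inet (?:addr:)?(\d+\.\d+\.\d+\.\d+)")

def ips_from_ifconfig(output):
    """IPv4 addresses found in ifconfig output."""
    return set(INET_RE.findall(output))

def ips_from_hosts(text):
    """First column of every non-comment, non-empty /etc/hosts line."""
    ips = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            ips.add(line.split()[0])
    return ips

def unlisted_ips(ifconfig_output, hosts_text):
    """Interface addresses with no matching /etc/hosts entry (loopback skipped)."""
    live = {ip for ip in ips_from_ifconfig(ifconfig_output)
            if not ip.startswith("127.")}
    return sorted(live - ips_from_hosts(hosts_text))
```

An empty list from `unlisted_ips(...)` means every non-loopback interface address appears in the hosts table.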
Created 03-24-2020 03:57 AM
@Shelton Thanks for responding. Why would the same error occur for communication within a single host as well, and on different ports? Any clues?
We are using static IPs (private IPs for cluster communication), and they are specified in /etc/hosts on all hosts.
Created 03-24-2020 04:28 AM
Have you checked that the firewalls are off and, most importantly, that these first two lines are present in the /etc/hosts file on each host?
Please uncomment them if they are commented out!
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# Your host entry below here #
192.168.225.68 [FQDN] [ALIAS]
Maybe also try pinging and SSHing from one host to another.
Please report back.
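As a complement to ping and ssh, a small TCP probe can show whether a specific port fails with a refusal, a timeout, or a "no route to host" error (which at the OS level generally corresponds to EHOSTUNREACH). This is a rough sketch under that assumption; the names `classify` and `probe` are illustrative only.

```python
import errno
import socket

def classify(exc):
    """Map a connect() failure to a short label."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    code = getattr(exc, "errno", None)
    if code == errno.EHOSTUNREACH:
        return "no route to host"
    if code == errno.ECONNREFUSED:
        return "connection refused"
    if code == errno.ENETUNREACH:
        return "network unreachable"
    return f"other error: {exc}"

def probe(host, port, timeout=3.0):
    """Try a TCP connect and report 'open' or the classified failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify(exc)
```

For example, `probe("192.168.225.165", 1004)` run from each node (1004 being the DataNode port in the logs above) would show which node pairs hit the error, including a node probing its own address.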
Created 03-24-2020 10:27 PM
We do have the entries below, and we have confirmed there are no firewall rules. One more thing I forgot to mention: we started seeing these issues when we started upgrading the OS on the cluster nodes from OEL 6.x to OEL 7.x.
However, judging from the logs, this seems to be happening on both types of host.
127.0.0.1 localhost.localdomain localhost
# special IPv6 addresses
::1 localhost6.localdomain6 localhost6
fe00::0 ipv6-localnet
ff00::0 ipv6-mcastprefix
ff02::1 ipv6-allnodes
ff02::2 ipv6-allrouters
ff02::3 ipv6-allhosts
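To double-check that these entries actually take effect, one option is to ask the resolver what `localhost` maps to on each node and confirm that every answer is a loopback address. The snippet below is just a sketch using the Python standard library; the helper names are made up for illustration.

```python
import ipaddress
import socket

def is_loopback(addr):
    """True for 127.0.0.0/8 and ::1."""
    return ipaddress.ip_address(addr).is_loopback

def resolve(name):
    """Sorted set of addresses the local resolver returns for name."""
    infos = socket.getaddrinfo(name, None)
    return sorted({info[4][0] for info in infos})

def check_localhost(name="localhost"):
    """Map each resolved address to whether it is a loopback address."""
    return {addr: is_loopback(addr) for addr in resolve(name)}
```

Calling `check_localhost()` on a node should return only `True` values; a `False` would suggest `localhost` is resolving to something other than the loopback interface.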
Created 03-25-2020 05:57 AM
Before starting the upgrade, did you by any chance validate the upgrade path against the Cloudera support matrix? That should have been your first reference.
Created 03-26-2020 03:31 AM
Yes, all of that was done.
Created 08-17-2020 07:43 AM
@npdell Curious to know whether you fixed the above issue and, if so, what you did.