On all machines except the CM host, the SCM agent is trying to heartbeat to the wrong IP address.
The logs show:
ERROR Heartbeating to 172.21.8.84:7182 failed
The agent config.ini shows the correct IP address to the SCM server:
# Hostname of Cloudera SCM Server
# Port that server is listening on
The network is set up correctly. I've run hostname -f on all the hosts and the /etc/hosts file is configured properly on all hosts.
I can't figure out why it's trying to heartbeat to the wrong IP address. Any ideas?
Can you do an nslookup (or a ping) on both hostnames and see what IP is used (e.g. elephant.dynsight.local as well as elephant)? Also, what do you have in the HOSTNAME property in /etc/sysconfig/network?
Things get a bit tricky in multi-homed environments and it seems that the installer somehow picked up the wrong interface. Also, you might want to see if the Host Inspector will give you a report. In CM, on the "Hosts" page, you can run the Host Inspector. Not sure if it will work, though, if your agents are not connecting.
Finally, I would do a reverse lookup on the IP 192.168.56.103 to see if it resolves back to your expected hostname:
dig -x 192.168.56.103
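If nslookup or dig aren't installed on a node, roughly the same checks can be sketched in Python with the standard socket module. This is only an illustration: the function names are made up for this example, and the hostnames and IP are the ones mentioned in this thread, so substitute your own. Note that socket lookups go through the system resolver (so /etc/hosts is consulted according to nsswitch.conf), whereas dig queries DNS directly.

```python
import socket


def forward_lookup(name, resolver=socket.gethostbyname):
    """Return the IPv4 address for name, or None if it does not resolve."""
    try:
        return resolver(name)
    except OSError:  # socket.gaierror and socket.herror are OSError subclasses
        return None


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the primary hostname for ip, or None if there is no reverse entry."""
    try:
        return resolver(ip)[0]
    except OSError:
        return None


def compare_names(names, resolver=socket.gethostbyname):
    """Map each name to its resolved address so mismatches are easy to spot."""
    return {name: forward_lookup(name, resolver) for name in names}


if __name__ == "__main__":
    # Short name and FQDN should come back with the same address.
    for name, ip in compare_names(["elephant", "elephant.dynsight.local"]).items():
        print(name, "->", ip)
    # The reverse lookup should come back with the expected FQDN.
    print("192.168.56.103 ->", reverse_lookup("192.168.56.103"))
```

If the short name and the FQDN resolve to different addresses, or the reverse lookup prints None, that points at the name resolution setup rather than at the agent itself.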
/etc/sysconfig/network looks like this on each host:
When I run dig from tiger I get this response. It looks like it's querying 192.168.1.1 instead of 192.168.56.1. Any thoughts on that?
; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.17.rc1.el6_4.6 <<>> -x 192.168.56.103
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20919
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;; AUTHORITY SECTION:
126.96.36.199.in-addr.arpa. 10800 IN SOA localhost. nobody.invalid. 1 600 1200 604800 10800
;; Query time: 19 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Fri Mar 28 16:19:08 2014
;; MSG SIZE rcvd: 104
That's probably because you have 192.168.1.1 defined as your "nameserver" in /etc/resolv.conf. Or it's your default gateway and you have no DNS servers defined, so dig is trying to ask your router. There is no ANSWER section in the dig output, which means dig was not able to resolve that IP address. What do you have in /etc/resolv.conf and /etc/nsswitch.conf?
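To pull out just the relevant bits of those two files, something like the following sketch could be run on each host. The helper names are invented for this example, and it only looks at the nameserver lines in resolv.conf and the hosts: line in nsswitch.conf, nothing else.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf content, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if line.startswith(("#", ";")):
            continue  # resolv.conf comment lines
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def hosts_lookup_order(nsswitch_text):
    """Return the sources on the 'hosts:' line of nsswitch.conf content."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.startswith("hosts:"):
            return line[len("hosts:"):].split()
    return []


if __name__ == "__main__":
    checks = (
        ("/etc/resolv.conf", nameservers),
        ("/etc/nsswitch.conf", hosts_lookup_order),
    )
    for path, parse in checks:
        try:
            with open(path) as handle:
                print(path, "->", parse(handle.read()))
        except OSError as exc:
            print(path, "could not be read:", exc)
```

If 192.168.1.1 shows up in the first list, that explains the SERVER line in the dig output. If the second list has "files" before "dns", then /etc/hosts is consulted first for ordinary lookups.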
The following is a one-liner that will check forward and reverse lookup on a node:
python -c "import socket; print(socket.getfqdn()); print(socket.gethostbyname(socket.getfqdn()))"
Hosts file layout is important as well (regardless of DNS configuration). Review the discussion here:
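Separately from that discussion, here is a rough sketch of a sanity check for /etc/hosts. It assumes the layout usually recommended for cluster hosts, where each non-loopback line reads "IP FQDN shortname" and the machine's own hostname does not appear on a 127.x line. The function name is hypothetical, the "is it an FQDN" test is just "contains a dot", and IPv6 entries are skipped, so treat the output as hints rather than a verdict.

```python
def check_hosts_layout(hosts_text):
    """Return (line_number, message) pairs for entries that look suspicious."""
    problems = []
    for number, line in enumerate(hosts_text.splitlines(), start=1):
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment, or an address with no names
        ip, names = fields[0], fields[1:]
        if ":" in ip:
            continue  # IPv6 entries are out of scope for this sketch
        if ip.startswith("127."):
            # A real hostname on a loopback line can make it resolve to 127.x.
            extra = [n for n in names if "localhost" not in n]
            if extra:
                problems.append((number, "loopback address maps to " + ", ".join(extra)))
            continue
        if "." not in names[0]:
            problems.append((number, "first name '%s' is not fully qualified" % names[0]))
    return problems


if __name__ == "__main__":
    try:
        with open("/etc/hosts") as handle:
            for number, message in check_hosts_layout(handle.read()):
                print("line %d: %s" % (number, message))
    except OSError as exc:
        print("/etc/hosts could not be read:", exc)
```

No output means nothing obviously out of place was found under those assumptions.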