
Failed to receive heartbeat from agent. (Current Step)

New Contributor

Hi,

 

I am trying to install a development instance of Hadoop on a Microsoft Azure VM (a single-node cluster). I am running Ubuntu 12.04.3 LTS.

 

Everything goes well until the very last step of the installation process, where I get the following:

 

Installation failed. Failed to receive heartbeat from agent.

  • Ensure that the host's hostname is configured properly.
  • Ensure that port 7182 is accessible on the Cloudera Manager server (check firewall rules).
  • Ensure that ports 9000 and 9001 are free on the host being added.
  • Check agent logs in /var/log/cloudera-scm-agent/ on the host being added (some of the logs can be found in the installation details).

I looked at the logs and saw the following errors:

 

[19/Nov/2013 15:00:55 +0000] 1922 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/process
[19/Nov/2013 15:00:55 +0000] 1922 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/supervisor
[19/Nov/2013 15:00:55 +0000] 1922 MainThread agent INFO Re-using pre-existing directory: /run/cloudera-scm-agent/supervisor/include
[19/Nov/2013 15:00:55 +0000] 1922 MainThread agent INFO Connecting to previous supervisor: agent-1304-1384872987.
[19/Nov/2013 15:00:55 +0000] 1922 MainThread _cplogging INFO [19/Nov/2013:15:00:55] ENGINE Bus STARTING
[19/Nov/2013 15:00:55 +0000] 1922 MainThread _cplogging INFO [19/Nov/2013:15:00:55] ENGINE Started monitor thread '_TimeoutMonitor'.
[19/Nov/2013 15:00:55 +0000] 1922 HTTPServer Thread-2 _cplogging ERROR [19/Nov/2013:15:00:55] ENGINE Error in HTTP server: shutting down
Traceback (most recent call last):
  File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/CherryPy-3.2.2-py2.7.egg/cherrypy/process/servers.py", line 187, in _start_http_thread
    self.httpserver.start()
  File "/usr/lib/cmf/agent/build/env/lib/python2.7/site-packages/CherryPy-3.2.2-py2.7.egg/cherrypy/wsgiserver/wsgiserver2.py", line 1825, in start
    raise socket.error(msg)
error: No socket could be created on ('NexusHadoopVM', 9000) -- [Errno 99] Cannot assign requested address

[19/Nov/2013 15:00:55 +0000] 19

 

I checked whether anything is already using ports 9000 and 9001 via

lsof -i :9000 

lsof -i :9001

as well as netstat, and both came up with nothing. In the Azure VM manager I specified that both 9001 and 9002 are open (private and public); I'm not sure what else needs to be configured.
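The netstat check was along these lines (exact flags from memory; the port filter is just for readability):

$ sudo netstat -tlnp | grep ':900[01]'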

 

I am also using the public IP address when adding the node to the cluster.

 

Please help!!!

14 REPLIES

Super Collaborator

Hi there,

 

The "[Errno 99] Cannot assign requested address" points to hostname resolution rather than a port conflict. Please run this on the node:

 

$ python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'

 

This should return the fully-qualified domain name as well as the IP address, confirming forward and reverse name resolution. Sanity-check this output against:

 

$ dig NexusHadoopVM

$ dig -x [IP returned in above dig command]
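For illustration, the forward and reverse answers should agree with each other and with the host's actual interface address; something like this (the address below is made up, and +short just trims the output):

$ dig +short NexusHadoopVM
10.0.0.4
$ dig +short -x 10.0.0.4
NexusHadoopVM.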

 

You may also wish to check your /etc/hosts file to make sure everything is OK there.
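For context: the agent's CherryPy server tries to bind a listening socket to whatever the hostname resolves to, which is why a wrong DNS or hosts entry surfaces as a socket error. A minimal sketch of that bind (hostname taken from your log; traceback trimmed to its last line) fails the same way whenever the resolved address is not assigned to any local interface:

$ python -c "import socket; socket.socket().bind(('NexusHadoopVM', 9000))"
socket.error: [Errno 99] Cannot assign requested address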

 

Regards,

--

Explorer

For various reasons I'm too embarrassed to talk about, we've run into this a few times with dev clusters in our private cloud and DNS getting mangled. We've found that if the Python one-liner that smark provided works, the DNS setup is correct and the agent will start.

 

Thanks smark for sharing.

 

python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'

Explorer

How did you solve the DNS issue?

When I ran the Python command, the FQDN was correct, but I got a 198.x IP. I don't know where this is coming from.

 

I have a 3-node cluster with the hosts file defined correctly on all 3 nodes. The Python command returns an incorrect IP on all 3 nodes.

python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'
node2.hadoopdomain 198.105.254.228

 

while the node is actually on IP 192.168.1.6.

 

Any ideas?

 

Thank You, 

Pranay Vyas

Cloudera Employee

Just saved me an hour of debugging. Thanks!

Expert Contributor

Sounds like the hosts line in your nsswitch.conf is wrong. It should be "files dns", not "dns files".

 

I would definitely check it and verify you don't have it set up wrong.
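A quick way to verify (a sketch; getent follows the nsswitch lookup order, so with "files dns" it should return your /etc/hosts entry rather than the bogus 198.x answer):

$ grep '^hosts:' /etc/nsswitch.conf
hosts:      files dns

$ getent hosts node2.hadoopdomain
192.168.1.6     node2.hadoopdomain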

Explorer

Thanks for your response,

I changed the nsswitch hosts entry to "file dns".

Now I am not able to ping any outside site.

Google.com returns not found, but I get a response when I ping the IP.

 

Any idea what's causing it?

 

Regards,

Pranay Vyas

Explorer

Okay, I was able to get past the DNS issue and changed the nsswitch on all nodes to "files dns".

I uninstalled Cloudera Manager and started all over again.

 

It failed with the same error.

Installation failed. Failed to receive heartbeat from agent.

  • Ensure that the host's hostname is configured properly.
  • Ensure that port 7182 is accessible on the Cloudera Manager server (check firewall rules).
  • Ensure that ports 9000 and 9001 are free on the host being added.

 

[Errno 99] Cannot assign requested address on ('base.hadoopdomain', 9000) -- [Errno 99] Cannot assign requested address

 

The Python socket command still gives an incorrect IP:

[root@base ~]# python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'
base.hadoopdomain 198.105.254.228

 

It gives the same IP on all the nodes. The FQDN comes back correctly.

 

Regards,

Pranay Vyas

 

Explorer

Solved it.

The issue was with the /etc/hosts file.

 

For some reason Cloudera Manager was referring to localdomain, which was not part of my /etc/hosts file.

 

I had to add the entries below to the /etc/hosts file to resolve this error.

 

 

127.0.0.1   localhost.hadoopdomain localhost
::1         localhost.hadoopdomain localhost
127.0.0.1   localhost.localdomain localhost
127.0.0.1   localdomain localhost

192.168.1.8 base.hadoopdomain.com base base.hadoopdomain
192.168.1.6 node1.hadoopdomain.com node1 node1.hadoopdomain
192.168.1.7 node2.hadoopdomain.com node2 node2.hadoopdomain
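With those entries in place, the one-liner from earlier in the thread should return the node's real address instead of the 198.x one; on the base node, something like this (the exact FQDN form depends on the order of names in /etc/hosts):

[root@base ~]# python -c 'import socket; print socket.getfqdn(), socket.gethostbyname(socket.getfqdn())'
base.hadoopdomain.com 192.168.1.8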

 

Regards,

Pranay Vyas

 

Expert Contributor

Great job.

 

I try to keep the names as simple as possible so I can run thousands of scripts.

 

My hosts file looks like:

 

127.0.0.1