ambari-agent can't start

For some unclear reason, when we start the ambari-agent on the master machine, it fails.

From the log we can see that:

ERROR 2017-10-02 11:58:42,597 script_alert.py:123 - [Alert][hive_server_process] Failed with result CRITICAL: ['Connection failed on host machine-master01.pop.com:10000 (Traceback (most recent call last):\n File "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_thrift_port.py", line 211, in execute\n ldap_password=ldap_password)\n File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/hive_check.py", line 79, in check_thrift_port_sasl\n timeout=check_command_timeout)\n File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__\n self.env.run()\n File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run\n

What causes this problem?

Michael-Bronson
1 ACCEPTED SOLUTION

Master Mentor

@uri ben-ari

Yes, we can kill it if other users are not using HiveServer2 (just make sure that they are not running any jobs).

# cat /var/run/hive/hive-server.pid 
# ps -ef | grep `cat /var/run/hive/hive-server.pid`
# netstat -tnlpa | grep `cat /var/run/hive/hive-server.pid`
# kill -9 `cat /var/run/hive/hive-server.pid`

The cat, ps, and netstat commands above are only there to confirm that we are killing the correct process.
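The same sequence can be wrapped in a small helper that sends SIGTERM first and falls back to SIGKILL only if the process does not exit. This is a minimal sketch, not an Ambari or Hive tool: the function name `stop_by_pidfile` and the grace period are assumptions made for illustration, and the pid file path is the one used in the commands above.

```shell
# Hypothetical helper: stop the process named in a pid file.
# Sends SIGTERM, waits up to <grace> seconds, then sends SIGKILL as a last resort.
stop_by_pidfile() {
    local pidfile="$1" grace="${2:-10}" pid waited=0
    if [ ! -r "$pidfile" ]; then
        echo "pid file not readable: $pidfile" >&2
        return 2
    fi
    pid="$(tr -d '[:space:]' < "$pidfile")"
    case "$pid" in
        ''|*[!0-9]*) echo "no numeric pid in $pidfile" >&2; return 2 ;;
    esac
    # Nothing to do if the process is already gone.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is not running"
        return 0
    fi
    ps -o pid=,args= -p "$pid" || true   # show what is about to be stopped
    kill -TERM "$pid"
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done
    echo "process $pid ignored SIGTERM, sending SIGKILL" >&2
    kill -KILL "$pid"
}
```

Run as root on the HiveServer2 host, it could be called as `stop_by_pidfile /var/run/hive/hive-server.pid 15`. Trying SIGTERM first gives the JVM a chance to shut down cleanly before `kill -9` is used.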

8 REPLIES

Master Mentor

@uri ben-ari

It looks like either the HiveServer2 process is not running, or a network issue is making the HiveServer2 host and port inaccessible from that agent machine, and hence we see this alert:

[Alert][hive_server_process] Failed with result CRITICAL: ['Connection failed on host machine-master01.pop.com:10000 

The alert scheduler will keep triggering this alert at its default interval, so we can try the following:

1. Check whether the HiveServer2 process is running and listening on port 10000. On the HiveServer2 host, run the following commands to see whether port 10000 is listening, to rule out the firewall, and to confirm the hostname.

# netstat -tnlpa | grep 10000
# service iptables stop
# hostname -f

2. The error stack trace also contains the string "ldap_password=ldap_password", so the problem might be related to LDAP authentication as well. Checking the HiveServer2 log will also be helpful.

Also, if possible, please kill the HiveServer2 process and then try restarting it to see whether that fixes the issue.

3. From the agent machine, check whether that host and port are accessible:

# nc -v machine-master01.pop.com 10000
(OR)
# telnet machine-master01.pop.com 10000
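
If neither nc nor telnet is installed on the agent host, bash itself can attempt the TCP connection through its /dev/tcp redirection. The sketch below assumes bash (built with network redirections) and the coreutils `timeout` command are available; the function name `check_port` is made up for this example.

```shell
# Hypothetical helper: succeed only if a TCP connection to host:port can be opened.
check_port() {
    local host="$1" port="$2" wait_s="${3:-5}"
    # bash treats /dev/tcp/<host>/<port> as "open a TCP connection";
    # timeout bounds the attempt so a filtered port does not hang forever.
    timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, `check_port machine-master01.pop.com 10000 && echo reachable || echo "not reachable"`. A refused connection returns quickly, while a silently dropped one only fails once the timeout expires, which can hint at a firewall rather than a stopped service.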

On the Ambari side, we can temporarily disable the "HiveServer2 Process" alert to avoid seeing it:

Ambari UI --> "Alerts" (Tab) --> Search for "HiveServer2 Process" alert --> click on "Enabled" toggle button

43432-hiveserver2-process-disable-alert.png

Then, after restarting the ambari-agent, check the ambari-agent log again.

Master Mentor

@uri ben-ari

Also, could you please share the complete "/var/log/ambari-agent/ambari-agent.log" so we can see whether any other issue is preventing the ambari-agent from coming up?

avatar

When we run netstat -tnlpa | grep 10000, we get:

tcp 0 0 45.89.12.111:10000 45.89.12.110:44570 ESTABLISHED 15598/java

tcp 0 0 45.89.12.111:10000 45.89.12.110:55109 ESTABLISHED 15598/java

iptables is stopped, and we get "connection refused" from the nc command. The full machine name is machine-master03.pop.com.

Michael-Bronson

Master Mentor

@uri ben-ari

As you mentioned, the "nc" command is not connecting, which indicates either a network (firewall) issue or an incorrect hostname mapping.

I see that my HiveServer2 process is bound to all interfaces, as follows:

# netstat -tnlpa | grep 10000
tcp        0      0 0.0.0.0:10000               0.0.0.0:*                   LISTEN      1690/java
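
A plain grep for 10000 also matches ESTABLISHED rows, so it can help to filter on the state column and show only listening sockets. This is a small sketch that assumes the usual Linux `netstat -tn` column layout (local address in column 4, state in column 6); `listeners_on_port` is a hypothetical name.

```shell
# Hypothetical helper: read `netstat -tnlpa` output on stdin and print only
# the rows whose socket is in LISTEN state on the given local port.
listeners_on_port() {
    awk -v port="$1" '$6 == "LISTEN" && $4 ~ (":" port "$")'
}
```

Used as `netstat -tnlpa | listeners_on_port 10000`. If it prints nothing, no process on that host is listening on port 10000. If it prints a row, the local address shows which interface the listener is bound to.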

Could you please check whether your "ambari-agent" host (and the other hosts of the cluster) have the correct IP/hostname mapping for the HiveServer2 host in their "/etc/hosts" file?

# cat /etc/hosts | grep 'machine-master03.pop.com'
45.89.12.111 machine-master03.pop.com
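
To compare both hostnames seen in this thread on every host, the lookup can be scripted. The sketch below only parses a hosts-format file; it does not consult DNS, and the function name `hosts_lookup` is an assumption for illustration.

```shell
# Hypothetical helper: print the IP address(es) that a hosts-format file
# maps to the given name (default file: /etc/hosts).
hosts_lookup() {
    local name="$1" file="${2:-/etc/hosts}"
    awk -v name="$name" '
        { sub(/#.*/, "") }    # strip comments
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }
    ' "$file"
}
```

For example, `hosts_lookup machine-master01.pop.com` and `hosts_lookup machine-master03.pop.com` should each print the expected address. Empty output means the name has no entry in the file. On glibc systems, `getent hosts <name>` additionally shows what the resolver as a whole returns.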

In the stack trace posted earlier, the hostname was "machine-master01.pop.com", but in your recent comment you gave the hostname as "machine-master03.pop.com".

[Alert][hive_server_process] Failed with result CRITICAL: ['Connection failed on host machine-master01.pop.com:10000

So please check whether the IP address and hostname mapping is correct in your "/etc/hosts" file.

avatar

Yes, the IPs and hostnames are now OK, but we still can't start the ambari-agent. Do you think we need to restart the process that holds port 10000?

Michael-Bronson

Master Mentor

@uri ben-ari

Yes, please try restarting the HiveServer2 process to see whether it comes up fine and no errors are observed in the HiveServer2 logs. We can also check whether port 10000 was opened successfully.

Then we can try restarting the agent to see whether it starts fine.

If agent startup still fails, please share the *complete* ambari-agent logs.
