Created 11-05-2017 08:21 AM
For some unclear reason, when we start the Ambari agent on the master machine, it fails.
From the log we can see:
ERROR 2017-10-02 11:58:42,597 script_alert.py:123 - [Alert][hive_server_process] Failed with result CRITICAL: ['Connection failed on host machine-master01.pop.com:10000 (Traceback (most recent call last):\n File "/var/lib/ambari-agent/cache/common-services/HIVE/0.12.0.2.0/package/alerts/alert_hive_thrift_port.py", line 211, in execute\n ldap_password=ldap_password)\n File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/hive_check.py", line 79, in check_thrift_port_sasl\n timeout=check_command_timeout)\n File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 155, in __init__\n self.env.run()\n File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 160, in run\n
What causes this problem?
Created on 11-05-2017 08:40 AM - edited 08-18-2019 02:00 AM
It looks like either the HiveServer2 process is not running, or, due to some network issue, the HiveServer2 host and port are not accessible from that agent machine, and hence we see this alert:
[Alert][hive_server_process] Failed with result CRITICAL: ['Connection failed on host machine-master01.pop.com:10000
.
The alert scheduler will keep triggering this alert at its default interval, so we can try the following:
1. Check that the HiveServer2 process is running and listening on port 10000. On the HiveServer2 host, please run the following commands to confirm that port 10000 is listening, that iptables is stopped, and that the hostname is correct.
# netstat -tnlpa | grep 10000
# service iptables stop
# hostname -f
2. The "ldap_password=ldap_password" string appears in the error stack trace, so the failure might also be related to LDAP authentication. Checking the HiveServer2 log will be helpful as well.
Also, please kill the HiveServer2 process (if possible) and then try restarting it to see if that fixes the issue.
.
3. From the agent machine, please check whether that host and port are accessible:
# nc -v machine-master01.pop.com 10000
(OR)
# telnet machine-master01.pop.com 10000
.
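If neither nc nor telnet is installed on the agent host, the same reachability test can be sketched with bash's built-in /dev/tcp redirection. This is only a rough sketch: the function name check_port and the 3 second timeout are my own choices, and it assumes that bash and the coreutils timeout command are available.

```shell
#!/usr/bin/env bash
# check_port HOST PORT
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
# check_port is a hypothetical helper name; the 3 second timeout is arbitrary.
check_port() {
    local host="$1" port="$2"
    # /dev/tcp/HOST/PORT is a bash feature. The connection is closed
    # automatically when the inner bash process exits.
    timeout 3 bash -c 'exec 3<>/dev/tcp/$0/$1' "$host" "$port" 2>/dev/null
}

# Example:
# check_port machine-master01.pop.com 10000 && echo "open" || echo "refused or unreachable"
```

A non-zero result does not say whether the connection was refused or timed out, so nc -v is still the better tool when it is available.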
From the Ambari side, we can try disabling the "HiveServer2 Process" alert temporarily to avoid seeing this alert:
Ambari UI --> "Alerts" (Tab) --> Search for "HiveServer2 Process" alert --> click on "Enabled" toggle button
.
Then, after restarting the Ambari agent, check the ambari-agent log again.
.
Created 11-05-2017 08:51 AM
Also, can you please share the complete "/var/log/ambari-agent/ambari-agent.log" so we can see whether any other issue is preventing ambari-agent from coming up?
Created 11-05-2017 09:22 AM
When we run netstat -tnlpa | grep 10000, we get:
tcp 0 0 45.89.12.111:10000 45.89.12.110:44570 ESTABLISHED 15598/java
tcp 0 0 45.89.12.111:10000 45.89.12.110:55109 ESTABLISHED 15598/java
Regarding iptables, it is stopped, and the nc command returns "connection refused". The full machine name is machine-master03.pop.com.
Created 11-05-2017 09:38 AM
As you mentioned, the "nc" command is not connecting, which indicates either a network (firewall) issue or an incorrect hostname mapping.
I see that my HiveServer2 process is bound to all interfaces, as follows:
# netstat -tnlpa | grep 10000
tcp 0 0 0.0.0.0:10000 0.0.0.0:* LISTEN 1690/java
.
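Because grep 10000 matches ESTABLISHED connections as well as the listening socket, it can help to filter the netstat output on the state column. The following is a minimal sketch, assuming the usual Linux netstat -tn column layout (local address in column 4, state in column 6); listening_on is a hypothetical helper name, not an Ambari or Hive tool.

```shell
#!/usr/bin/env bash
# listening_on PORT
# Reads `netstat -tnlpa` output on stdin and prints the local address of
# every TCP socket that is in LISTEN state on PORT.
# Assumes column 4 = local address and column 6 = state.
listening_on() {
    local port="$1"
    awk -v port="$port" '
        $1 ~ /^tcp/ && $6 == "LISTEN" {
            # The port is whatever follows the last ":" (also works for ":::10000").
            n = split($4, parts, ":")
            if (parts[n] == port) print $4
        }
    '
}

# Example:
# netstat -tnlpa | listening_on 10000
```

An empty result means that nothing in the pasted output is actually listening on that port, even if grep showed matching lines.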
Can you please check whether your "ambari-agent" host machine (and the other hosts of the cluster) have the correct IP/hostname mapping in their "/etc/hosts" file to point to the HiveServer2 host?
# cat /etc/hosts | grep 'machine-master03.pop.com'
45.89.12.111 machine-master03.pop.com
.
In the previously mentioned stack trace I see that the hostname was "machine-master01.pop.com", but in your recent comment you mentioned the hostname as "machine-master03.pop.com":
[Alert][hive_server_process] Failed with result CRITICAL: ['Connection failed on host machine-master01.pop.com:10000
.
So please check that the IP address and hostname mapping is correct in your "/etc/hosts" file.
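A plain grep on "/etc/hosts" also matches commented-out lines and partial names, so an exact-match lookup can be less misleading. Here is a small sketch of that check; hosts_ip is a hypothetical helper name, and it only reads the file you pass in (it does not consult DNS).

```shell
#!/usr/bin/env bash
# hosts_ip FILE HOSTNAME
# Prints every IP address that FILE (in /etc/hosts format) maps to HOSTNAME.
# Comments are ignored and the name must match exactly.
hosts_ip() {
    local file="$1" name="$2"
    awk -v name="$name" '
        { sub(/#.*/, "") }    # strip comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }
    ' "$file"
}

# Example:
# hosts_ip /etc/hosts machine-master01.pop.com
# hosts_ip /etc/hosts machine-master03.pop.com
```

Running it for both hostnames on the agent host shows whether each name resolves to the address that HiveServer2 is really bound to. No output means the file has no entry for that name.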
Created 11-05-2017 10:24 AM
Yes, the IPs and hostname are now OK, but we still can't start the ambari-agent. Do you think we need to restart the process that holds port 10000?
Created 11-05-2017 10:34 AM
Yes, please try restarting the HiveServer2 process to see whether it comes up fine and no errors are observed in the HiveServer2 logs. We can also check whether port 10000 opened successfully or not.
Then we can try restarting the agent to see if it starts fine.
If the agent startup still fails, then please share the *complete* ambari-agent logs.
Created 11-05-2017 10:54 AM
Yes, we can kill it as long as other users are not using HiveServer2 (just to be sure that they are not running any jobs).
# cat /var/run/hive/hive-server.pid
# ps -ef | grep `cat /var/run/hive/hive-server.pid`
# netstat -tnlpa | grep `cat /var/run/hive/hive-server.pid`
# kill -9 `cat /var/run/hive/hive-server.pid`
.
The cat, ps and netstat commands above are there to confirm that we are killing the correct process.
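As a slightly more cautious variant of the commands above, the same steps can be wrapped so that the PID is validated first and kill -9 is only used if a normal SIGTERM does not work. This is just a sketch under my own assumptions: stop_by_pidfile is a hypothetical helper, and the 10 second default wait is an arbitrary value, not something Hive or Ambari prescribes.

```shell
#!/usr/bin/env bash
# stop_by_pidfile PIDFILE [WAIT_SECONDS]
# Sends SIGTERM to the process named in PIDFILE, waits up to WAIT_SECONDS
# (default 10, an arbitrary choice), then falls back to SIGKILL.
# Returns 1 if the pid file is missing, not numeric, or the process is gone.
stop_by_pidfile() {
    local pidfile="$1" wait_secs="${2:-10}" pid
    pid="$(cat "$pidfile" 2>/dev/null)" || return 1
    # Refuse to continue unless the file holds a plain number.
    case "$pid" in ''|*[!0-9]*) return 1 ;; esac
    # kill -0 sends no signal; it only tests that the process exists.
    kill -0 "$pid" 2>/dev/null || return 1
    kill "$pid"    # polite SIGTERM first
    while [ "$wait_secs" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        wait_secs=$((wait_secs - 1))
    done
    # Only use SIGKILL if the process ignored SIGTERM.
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"
    fi
    return 0
}

# Example:
# stop_by_pidfile /var/run/hive/hive-server.pid
```

It is still worth running the ps check by hand first, since a stale pid file could point at an unrelated process that happens to have reused the PID.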