Ambari Server typically learns about service availability from the Ambari Agent, which relies on the '*.pid' files created under /var/run. The following covers a couple of scenarios to troubleshoot:

Scenario 1: Ambari Agent is not communicating appropriately with Ambari Server

If all the services are shown as down on a given node, it is most likely an Ambari Agent issue.

The following steps can be used to troubleshoot Ambari Agent issues:

# ambari-agent status
Found ambari-agent PID: 19715
ambari-agent running.
Agent PID at: /var/run/ambari-agent/ambari-agent.pid
Agent out at: /var/log/ambari-agent/ambari-agent.out
Agent log at: /var/log/ambari-agent/ambari-agent.log

Then check that the reported PID actually exists by running ps -ef.

If the PID doesn't exist, also run ps -ef | grep 'ambari_agent' to see whether a stale process is still around. For example:

# ps -ef | grep "ambari_agent"
root     18626 13528  0 04:45 pts/0    00:00:00 grep ambari_agent
root     19707     1  0 Feb17 ?        00:00:00 /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/AmbariAgent.py start
root     19715 19707  1 Feb17 ?        00:28:01 /usr/bin/python2 /usr/lib/python2.6/site-packages/ambari_agent/main.py start

If the agent process ID and the content of /var/run/ambari-agent/ambari-agent.pid match, then there is probably no issue with the agent process itself.

If there is a mismatch, kill all the stray Ambari Agent processes and remove /var/run/ambari-agent/ambari-agent.pid. Then restart the agent. Once it has restarted, verify that the services are shown as healthy in the Ambari Dashboard.
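The PID file check described above can be scripted. The following is a minimal sketch and not part of Ambari: the function name pid_file_is_live is hypothetical, and it assumes it is run as root (as in the transcripts above), since kill -0 can fail on processes owned by other users.

```shell
# Sketch: check whether the PID recorded in a pid file belongs to a live process.
# Returns 0 when it does, 1 when the file is missing, malformed, or stale.
pid_file_is_live() {
    local pid_file="$1"
    local pid
    [ -r "$pid_file" ] || return 1
    # strip the trailing newline and any other whitespace
    pid="$(tr -d '[:space:]' < "$pid_file")"
    # reject anything that is not a plain positive number
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$pid" -gt 0 ] || return 1
    # kill -0 sends no signal; it only tests that the process can be signalled
    kill -0 "$pid" 2>/dev/null
}
```

For example, pid_file_is_live /var/run/ambari-agent/ambari-agent.pid || echo "stale or missing pid file" flags the mismatch case before any cleanup is attempted.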

At this point, also review /var/log/ambari-agent/ambari-agent.log and ambari-agent.out to see whether there were any issues while starting the process itself.

One possible cause is /var/lib/ambari-agent/data/structured-out-status.json. Cat this file to review its content. Typical content looks like the following:

# cat structured-out-status.json
{"processes": [], "securityState": "UNKNOWN"}

Compare the content with the same file on another node that is working fine.

Stop ambari-agent, move this file aside under a different name, and restart ambari-agent.
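The move-aside step can be done in a way that keeps the old file for later comparison. This is only a sketch: the helper name move_aside and the .bak.<timestamp> suffix are assumptions, not an Ambari convention.

```shell
# Sketch: rename a file to <name>.bak.<timestamp> so it can be restored later.
# Prints the new path on success; returns 1 if the file does not exist.
move_aside() {
    local src="$1"
    local dest
    [ -f "$src" ] || return 1
    dest="${src}.bak.$(date +%Y%m%d%H%M%S)"
    mv -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Intended use on the affected node (run as root):
#   ambari-agent stop
#   move_aside /var/lib/ambari-agent/data/structured-out-status.json
#   ambari-agent start
```

Renaming rather than deleting means the original file can be put back if the restart does not help.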

Scenario 2: Ambari Agent is good, but the HDP services are still shown to be down

If only a few services are shown as down, it could be because the /var/run/PRODUCT/product.pid file does not match the process running on the node.

For example, if the HiveServer2 service is shown as down in Ambari while Hive is actually working fine, check the following files:

# cd /var/run/hive
# ls -lrt
-rw-r--r--   1 hive hadoop    6 Feb 17 07:15 hive.pid
-rw-r--r--   1 hive hadoop    6 Feb 17 07:16 hive-server.pid

Check the content of these files. For example:

# cat hive-server.pid 
31342 
# ps -ef | grep 31342
hive     31342     1  0 Feb17 ?        00:14:36 /usr/jdk64/jdk1.7.0_67/bin/java -Xmx1024m -Dhdp.version=2.2.9.0-3393 -Djava.net.preferIPv4Stack=true -Dhdp.version=2.2.9.0-3393 -Dhadoop.log.dir=/var/log/hadoop/hive -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/hdp/2.2.9.0-3393/hadoop -Dhadoop.id.str=hive -Dhadoop.root.logger=INFO,console -Djava.library.path=:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:/usr/hdp/2.2.9.0-3393/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx1024m -XX:MaxPermSize=512m -Xmx1437m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /usr/hdp/2.2.9.0-3393/hive/lib/hive-service-0.14.0.2.2.9.0-3393.jar org.apache.hive.service.server.HiveServer2 --hiveconf hive.aux.jars.path=file:///usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar -hiveconf hive.metastore.uris=  -hiveconf hive.log.file=hiveserver2.log -hiveconf hive.log.dir=/var/log/hive

If the content of hive-server.pid and the PID of the running HiveServer2 process don't match, Ambari won't report the status correctly.

Ensure that these files have the correct ownership and permissions. For example, the PID files for Hive should be owned by hive:hadoop with mode 644. If they are not, correct the ownership and permissions and update the file with the correct PID of the Hive process. This ensures that Ambari shows the status correctly.
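The ownership and mode check can be sketched as a small helper. The function name check_pid_file_perms is hypothetical, and the sketch assumes GNU stat (the -c option), as found on the Linux distributions HDP runs on.

```shell
# Sketch: verify that a pid file has the expected owner:group and octal mode.
# Returns 0 on a match; otherwise prints what was found to stderr and returns 1.
check_pid_file_perms() {
    local pid_file="$1" want_owner="$2" want_mode="$3"
    local actual
    # %U:%G is owner:group, %a is the octal permission bits (GNU stat)
    actual="$(stat -c '%U:%G %a' "$pid_file" 2>/dev/null)" || return 1
    if [ "$actual" = "$want_owner $want_mode" ]; then
        return 0
    fi
    printf '%s: found %s, expected %s %s\n' \
        "$pid_file" "$actual" "$want_owner" "$want_mode" >&2
    return 1
}
```

For the Hive case above, the call would be check_pid_file_perms /var/run/hive/hive-server.pid hive:hadoop 644; a failure can then be corrected with chown and chmod.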

Take care while doing the above: ensure that this is the only HiveServer2 process running on the system and that HiveServer2 is indeed working fine. If there are multiple HiveServer2 processes, some of them may be stray and need to be killed.
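To see how many HiveServer2 processes are running before editing the PID file, the process list can be filtered on the main class name that appears in the ps output above. This is a sketch only; the function name hiveserver2_pids is hypothetical, and it reads ps output on stdin rather than calling ps itself.

```shell
# Sketch: read `ps -eo pid,args` output on stdin and print the PID of every
# process whose command line contains the HiveServer2 main class, one per line.
hiveserver2_pids() {
    awk '/org\.apache\.hive\.service\.server\.HiveServer2/ { print $1 }'
}

# Intended use on the affected node:
#   ps -eo pid,args | hiveserver2_pids
```

If more than one PID is printed, work out which process is the healthy one before updating hive-server.pid or killing anything.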

After this, if possible, also restart the affected services and ensure that their status is shown correctly.

Comments
Contributor

Nice Article....!!!!

New Contributor

Hello

I am having the same issue. I have checked the above 2 scenarios and everything is fine.

But still in Ambari the heartbeats are not received.

Can you please suggest some actions?

Thanks

Abraham

Super Collaborator

@Abraham Johnson @vpoornalingam

There is yet another cause and cure for this scenario (HDP-2.6.2.0-205). It can also happen that Ambari is looking for the pid files in the wrong place.

In my case the pid files were actually located at:

/var/run/hadoop/hdfs-<clustername>/hadoop-hdfs-<clustername>-namenode.pid

while ambari-agent would look at:

/var/run/hadoop/hdfs/hadoop-hdfs-hdfs-namenode.pid

In this state, with both the dir and the pid file name wrong, Ambari does not detect a running HDFS service, and you would also not be able to (re)start it.

The pid file location is deduced from this snippet in hadoop-env.sh:

export HADOOP_PID_DIR={{hadoop_pid_dir_prefix}}/$USER

I have yet to find out why Ambari decided to change the value of $USER all of a sudden.
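The effect of $USER on the pid directory in the hadoop-env.sh line above can be illustrated with a short sketch. The function name resolve_pid_dir is hypothetical, /var/run/hadoop is taken from the paths quoted in the comment as the rendered value of {{hadoop_pid_dir_prefix}}, and "mycluster" is a made-up cluster name.

```shell
# Sketch: reproduce how the hadoop-env.sh line builds the pid directory.
# pid_dir_prefix stands in for the rendered {{hadoop_pid_dir_prefix}} value,
# user stands in for whatever $USER expands to when the daemon starts.
resolve_pid_dir() {
    local pid_dir_prefix="$1" user="$2"
    printf '%s/%s\n' "$pid_dir_prefix" "$user"
}

# resolve_pid_dir /var/run/hadoop hdfs            prints /var/run/hadoop/hdfs
# resolve_pid_dir /var/run/hadoop hdfs-mycluster  prints /var/run/hadoop/hdfs-mycluster
```

If the daemon and the agent end up with different values, they resolve different directories, which matches the symptom described above.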