Ambari Metrics collector alert

Super Collaborator

I have an issue with the Ambari Metrics Collector that I haven't been able to solve so far.

It started with the Metrics Collector restarting frequently until it eventually stopped. So I followed the solution provided here: https://community.hortonworks.com/questions/121137/ambari-metrics-collector-restarting-again-and-aga... (stopped AMS completely, deleted the HBase files, etc.)

Now when I start the Metrics Collector again, Ambari shows the alert

Connection failed: [Errno 111] Connection refused to cgihdp4.localnet:6188

When I check this on the node, the reason for the alert is obvious:

[root@cgihdp4 ~]# netstat -tulpn | grep 6188
[root@cgihdp4 ~]# 
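
For reference, the Ambari alert is essentially the same connectivity check. An equivalent manual probe, using the endpoint path that appears in the sink log messages further below, would be something like:

# check for a listener on the collector port (ss as an alternative to netstat)
ss -tlnp | grep 6188

# probe the collector REST endpoint the metric sinks use; while nothing is listening this fails with "Connection refused"
curl -v http://cgihdp4.localnet:6188/ws/v1/timeline/metrics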

No process is listening on the port, so I stopped and restarted the AMS on that node again:

[root@cgihdp4 ~]# ambari-metrics-collector status
AMS is not running.
[root@cgihdp4 ~]# ambari-metrics-collector start
Sa 3. Feb 11:18:07 CET 2018 Starting HBase.
starting master, logging to /var/log/ambari-metrics-collector/hbase-root-master-cgihdp4.out
Verifying ambari-metrics-collector process status...
Sa 3. Feb 11:18:10 CET 2018 Collector successfully started.
Sa 3. Feb 11:18:10 CET 2018 Initializing Ambari Metrics data model
Sa 3. Feb 11:18:27 CET 2018 Ambari Metrics data model initialization check 1
Sa 3. Feb 11:18:42 CET 2018 Ambari Metrics data model initialization check 2
Sa 3. Feb 11:18:58 CET 2018 Ambari Metrics data model initialization check 3
Sa 3. Feb 11:19:13 CET 2018 Ambari Metrics data model initialization check 4
Sa 3. Feb 11:19:30 CET 2018 Ambari Metrics data model initialization check 5
Sa 3. Feb 11:19:45 CET 2018 Ambari Metrics data model initialization check 6
Sa 3. Feb 11:20:01 CET 2018 Ambari Metrics data model initialization check 7
Sa 3. Feb 11:20:16 CET 2018 Ambari Metrics data model initialization check 8
Sa 3. Feb 11:20:34 CET 2018 Ambari Metrics data model initialization check 9
Sa 3. Feb 11:20:49 CET 2018 Ambari Metrics data model initialization check 10
[root@cgihdp4 ~]# ambari-metrics-collector status
AMS is running as process 32154.
[root@cgihdp4 ~]# netstat -tulpn | grep 6188
[root@cgihdp4 ~]# ps -ef | grep 32154
root      8187 31808  0 11:46 pts/0    00:00:00 grep 32154
root     32154     1  1 11:18 pts/0    00:00:24 /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xms640m -Xmx640m -Djava.library.path=/usr/lib/ams-hbase/lib/hadoop-native -Djava.security.auth.login.config=/etc/ams-hbase/conf/ams_collector_jaas.conf -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/collector-gc.log-201802031118 -cp /usr/lib/ambari-metrics-collector/*:/etc/ambari-metrics-collector/conf -Djava.net.preferIPv4Stack=true -Dams.log.dir=/var/log/ambari-metrics-collector -Dproc_timelineserver org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer

It looks like I'm missing an important point here. I checked the logs and see the following messages in /var/log/ambari-metrics-collector/hbase-ams-master-cgihdp4.log:

2018-02-03 11:43:59,838 WARN  [ProcedureExecutorThread-2] procedure.CreateTableProcedure: The table SYSTEM.CATALOG does not exist in meta but has a znode. run hbck to fix inconsistencies.
...

2018-02-03 11:50:19,149 ERROR [cgihdp4.localnet,61300,1517601244862_ChoreService_1] master.BackupLogCleaner: Failed to get hbase:backup table, therefore will keep all files
[stacktrace removed...]
2018-02-03 11:51:14,716 INFO  [timeline] timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://cgihdp4.localnet:6188/ws/v1/timeline/metrics
This exceptions will be ignored for next 100 times
2018-02-03 11:51:14,717 WARN  [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://cgihdp4.localnet:6188/ws/v1/timeline/metrics

Some odd things I also noticed while trying to follow the above-mentioned solution:

  • the AMS user had no write permission on the HDFS trash, so all file deletions were failing until I added the -skipTrash parameter (see the example after this list)
  • the directory 'hbase.tmp.dir'/zookeeper did not exist (and still doesn't)
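
A minimal sketch of the kind of cleanup command that only worked with -skipTrash; the HDFS path below is an assumed hbase.rootdir and has to be replaced with the value from your ams-hbase-site:

# assumed AMS HBase root dir in HDFS -- adjust to the hbase.rootdir from ams-hbase-site
hdfs dfs -rm -r -skipTrash /user/ams/hbase/*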

Any ideas on how to resolve this (I will try to run hbck as mentioned in the log)?

1 ACCEPTED SOLUTION

Super Collaborator

I tried a few things. After changing permissions on the HDFS trash and cleaning up the directories again as per https://community.hortonworks.com/questions/121137/ambari-metrics-collector-restarting-again-and-aga...

I was able to start the Ambari Metrics Collector, and it looks like it is running continuously now. Still, when I turn off maintenance mode, the alert comes back:

Connection failed: [Errno 111] Connection refused to cgihdp4.localnet:6188

As far as I know, 6188 is the port of the timeline server. When checking this, the timeline server service is not even installed on cgihdp4, but it is up and running on cgihdp1. So I searched for the timeline server configuration, which in Ambari is under Advanced ams-site -> timeline.metrics.service.webapp.address. The address there was, not surprisingly, cgihdp4.localnet:6188; I changed it to cgihdp1.localnet:6188, restarted the Metrics Collector, and things are running smoothly now.
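
A quick way to double-check the effective value on the collector host and to verify the endpoint after the change (the conf dir is the one visible in the collector's classpath above; the curl call is just a connectivity check):

# show the configured collector webapp address as the collector sees it
grep -A1 'timeline.metrics.service.webapp.address' /etc/ambari-metrics-collector/conf/ams-site.xml

# after the change this should return an HTTP status instead of "Connection refused"
curl -s -o /dev/null -w '%{http_code}\n' http://cgihdp1.localnet:6188/ws/v1/timeline/metrics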

So basically it was just a stupid config error, embarrassing, but many thanks @Jay Kumar SenSharma for supporting me on this issue.


4 REPLIES

Master Mentor
@Harald Berghoff

In the following command I see that you started AMS as the root user:

[root@cgihdp4 ~]# ambari-metrics-collector start

Is there any specific reason you are starting your AMS Collector as the "root" user?

AMS processes like the AMS Collector and AMS Monitors are supposed to be started as the "ams" user.

If you start the AMS Collector as the root user, it will change the permissions of many directories, and you might later face issues when starting it as the "ams" user.
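
A hedged cleanup sketch for that situation is to hand the collector's directories back to the ams user before the next start. The log directory below appears in the output above; other directories (PID dir, AMS HBase tmp dir) may need the same treatment, and the ams:hadoop ownership is an assumption that may differ per install:

# restore ownership of the collector log directory (repeat for PID/tmp dirs if root touched them)
chown -R ams:hadoop /var/log/ambari-metrics-collector
su - ams -c 'ambari-metrics-collector start'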

Super Collaborator

@Jay Kumar SenSharma: Thanks for your answer. I simply wasn't aware that the process changes directory permissions; the only reason I used root to start it was to make sure that any issue I experienced wasn't due to missing permissions.

In the meantime the service has stopped itself:

[root@cgihdp4 ~]# ambari-metrics-collector status
AMS is not running.
[root@cgihdp4 ~]# su - ams
[ams@cgihdp4 ~]$ ambari-metrics-collector status
AMS is not running.
[ams@cgihdp4 ~]$ ambari-metrics-collector start
tee: /var/log/ambari-metrics-collector/ambari-metrics-collector-startup.out: Permission denied
Sun Feb  4 14:31:19 CET 2018 Starting HBase.
tee: /var/log/ambari-metrics-collector/ambari-metrics-collector-startup.out: Permission denied
master is running as process 23182. Continuing
master running as process 23182. Stop it first.
tee: /var/log/ambari-metrics-collector/ambari-metrics-collector-startup.out: Permission denied
Verifying ambari-metrics-collector process status...
Sun Feb  4 14:31:21 CET 2018 Collector successfully started.
Sun Feb  4 14:31:21 CET 2018 Initializing Ambari Metrics data model
...
[ams@cgihdp4 ~]$ ambari-metrics-collector status
AMS is running as process 22414.

I guess the permission denied errors are caused by what you just pointed out, so I will change that again. But I am confused about 'master is running as process 23182': that is the HBase master, running as the 'ams' user, but does it indicate an issue now? Otherwise nothing has changed; there is still no process listening on port 6188.
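
One way to clear the leftover HBase master and the root-owned startup log before trying again (the PID and file path are taken from the output above; the order of steps is only a suggestion):

ambari-metrics-collector stop
kill 23182    # only if the old HBase master process is still running afterwards
chown ams /var/log/ambari-metrics-collector/ambari-metrics-collector-startup.out
su - ams -c 'ambari-metrics-collector start'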

Super Collaborator

The hbck tool is not available on the nodes, so I couldn't try that.
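
For reference, there is normally no standalone hbck binary installed for AMS; assuming the embedded HBase under /usr/lib/ams-hbase (its lib path shows up in the collector process listing above), hbck would typically be invoked with the embedded binaries and the AMS config:

# read-only consistency check against the AMS-embedded HBase
/usr/lib/ams-hbase/bin/hbase --config /etc/ams-hbase/conf hbck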
