Created 02-03-2018 02:35 PM
I have an issue with the Ambari Metrics Collector and up to now I wasn't able to solve it.
It started with the Metrics Collector being restarted frequently until it eventually stopped. So I followed the solution provided here: https://community.hortonworks.com/questions/121137/ambari-metrics-collector-restarting-again-and-aga... (stopped the AMS completely, deleted the HBase files, etc.)
Now when I start the Metrics Collector again, Ambari shows the alert:
Connection failed: [Errno 111] Connection refused to cgihdp4.localnet:6188
When I check this on the node, the reason for the alert is clear:
[root@cgihdp4 ~]# netstat -tulpn | grep 6188
[root@cgihdp4 ~]#
No process is listening on the port, so I stopped and restarted the AMS on that node again:
[root@cgihdp4 ~]# ambari-metrics-collector status
AMS is not running.
[root@cgihdp4 ~]# ambari-metrics-collector start
Sa 3. Feb 11:18:07 CET 2018 Starting HBase.
starting master, logging to /var/log/ambari-metrics-collector/hbase-root-master-cgihdp4.out
Verifying ambari-metrics-collector process status...
Sa 3. Feb 11:18:10 CET 2018 Collector successfully started.
Sa 3. Feb 11:18:10 CET 2018 Initializing Ambari Metrics data model
Sa 3. Feb 11:18:27 CET 2018 Ambari Metrics data model initialization check 1
Sa 3. Feb 11:18:42 CET 2018 Ambari Metrics data model initialization check 2
Sa 3. Feb 11:18:58 CET 2018 Ambari Metrics data model initialization check 3
Sa 3. Feb 11:19:13 CET 2018 Ambari Metrics data model initialization check 4
Sa 3. Feb 11:19:30 CET 2018 Ambari Metrics data model initialization check 5
Sa 3. Feb 11:19:45 CET 2018 Ambari Metrics data model initialization check 6
Sa 3. Feb 11:20:01 CET 2018 Ambari Metrics data model initialization check 7
Sa 3. Feb 11:20:16 CET 2018 Ambari Metrics data model initialization check 8
Sa 3. Feb 11:20:34 CET 2018 Ambari Metrics data model initialization check 9
Sa 3. Feb 11:20:49 CET 2018 Ambari Metrics data model initialization check 10
[root@cgihdp4 ~]# ambari-metrics-collector status
AMS is running as process 32154.
[root@cgihdp4 ~]# netstat -tulpn | grep 6188
[root@cgihdp4 ~]# ps -ef | grep 32154
root 8187 31808 0 11:46 pts/0 00:00:00 grep 32154
root 32154 1 1 11:18 pts/0 00:00:24 /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xms640m -Xmx640m -Djava.library.path=/usr/lib/ams-hbase/lib/hadoop-native -Djava.security.auth.login.config=/etc/ams-hbase/conf/ams_collector_jaas.conf -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/ambari-metrics-collector/collector-gc.log-201802031118 -cp /usr/lib/ambari-metrics-collector/*:/etc/ambari-metrics-collector/conf -Djava.net.preferIPv4Stack=true -Dams.log.dir=/var/log/ambari-metrics-collector -Dproc_timelineserver org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer
It looks like I am missing an important point here. I checked the logs and see the following messages in /var/log/ambari-metrics-collector/hbase-ams-master-cgihdp4.log:
2018-02-03 11:43:59,838 WARN [ProcedureExecutorThread-2] procedure.CreateTableProcedure: The table SYSTEM.CATALOG does not exist in meta but has a znode. run hbck to fix inconsistencies.
...
2018-02-03 11:50:19,149 ERROR [cgihdp4.localnet,61300,1517601244862_ChoreService_1] master.BackupLogCleaner: Failed to get hbase:backup table, therefore will keep all files [stacktrace removed...]
2018-02-03 11:51:14,716 INFO [timeline] timeline.HadoopTimelineMetricsSink: Unable to connect to collector, http://cgihdp4.localnet:6188/ws/v1/timeline/metrics This exceptions will be ignored for next 100 times
2018-02-03 11:51:14,717 WARN [timeline] timeline.HadoopTimelineMetricsSink: Unable to send metrics to collector by address:http://cgihdp4.localnet:6188/ws/v1/timeline/metrics
Some odd things I also noticed when trying to follow the above-mentioned solution:
Any ideas on how to resolve it (I will try to run hbck as mentioned in the log)?
Created 02-09-2018 06:10 PM
I tried a few things. After changing permissions on the HDFS trash and cleaning up the directories again as per https://community.hortonworks.com/questions/121137/ambari-metrics-collector-restarting-again-and-aga...
I was able to start the Ambari Metrics Collector, and it looks like it is running continuously now. Still, when I turn off maintenance mode, I get the alert back:
Connection failed: [Errno 111] Connection refused to cgihdp4.localnet:6188
As far as I know, 6188 is the port of the timeline server. When I checked this, the timeline server service turned out not to be installed on cgihdp4 at all, but it is up and running on cgihdp1. So I looked for the timeline server config, which is in Ambari under Advanced ams-site -> timeline.metrics.service.webapp.address. Unsurprisingly, the address set there was cgihdp4.localnet:6188. I changed it to cgihdp1.localnet:6188 and restarted the Metrics Collector, and things are now running smoothly.
So basically just a stupid config error. Embarrassing, but many thanks to @Jay Kumar SenSharma for supporting me on this issue.
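For anyone hitting the same alert, the check that exposed the problem can be sketched in shell. This is only a sketch under assumptions: the helper names addr_host, addr_port and port_open are made up for illustration, it relies on bash's /dev/tcp redirection and the coreutils timeout command being available, and the address you pass in is whatever timeline.metrics.service.webapp.address holds in your cluster.

```shell
# Hypothetical helpers (not part of Ambari): split a host:port value such as
# the one stored in timeline.metrics.service.webapp.address.
addr_host() { printf '%s\n' "${1%:*}"; }
addr_port() { printf '%s\n' "${1##*:}"; }

# Succeeds only if a TCP connection to host:port can be opened within 3 seconds.
# Assumes bash with /dev/tcp support and the coreutils `timeout` command.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' \
    "$(addr_host "$1")" "$(addr_port "$1")" 2>/dev/null
}
```

Calling `port_open cgihdp4.localnet:6188 && echo listening || echo refused` with the configured value should tell you whether anything is actually listening at that address before you start digging through the HBase logs.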
Created 02-03-2018 08:48 PM
In the following command I see that you started AMS as the root user:
[root@cgihdp4 ~]# ambari-metrics-collector start
Is there any specific reason that you are starting your AMS collector as the "root" user?
AMS processes like the AMS Collector and the AMS monitors are supposed to be started as the "ams" user.
When you start the AMS collector as root, it changes the permissions of many directories, so you might face issues later when starting it as the "ams" user.
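To see which files were affected, something along these lines may help. It is a minimal sketch, not an official AMS tool: the function name not_owned_by is made up for illustration, and the directories to inspect depend on your installation (the log directory /var/log/ambari-metrics-collector appears in this thread; any others are for you to fill in).

```shell
# Hypothetical helper: list everything under the given directories that is
# NOT owned by the given user. Uses only the standard `find ! -user` test.
not_owned_by() {
  owner=$1
  shift
  find "$@" ! -user "$owner" -print
}
```

For example, `not_owned_by ams /var/log/ambari-metrics-collector` prints the files that a start as root left behind with the wrong owner; empty output means the ownership under that directory is consistent.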
Created 02-04-2018 01:39 PM
@Jay Kumar SenSharma: Thanks for your answer. I simply wasn't aware that the process would change directory permissions. The only reason I used root to start it was to make sure that the issues I was seeing weren't caused by missing permissions.
In the meantime the service has stopped itself:
[root@cgihdp4 ~]# ambari-metrics-collector status
AMS is not running.
[root@cgihdp4 ~]# su - ams
[ams@cgihdp4 ~]$ ambari-metrics-collector status
AMS is not running.
[ams@cgihdp4 ~]$ ambari-metrics-collector start
tee: /var/log/ambari-metrics-collector/ambari-metrics-collector-startup.out: Permission denied
Sun Feb 4 14:31:19 CET 2018 Starting HBase.
tee: /var/log/ambari-metrics-collector/ambari-metrics-collector-startup.out: Permission denied
master is running as process 23182. Continuing
master running as process 23182. Stop it first.
tee: /var/log/ambari-metrics-collector/ambari-metrics-collector-startup.out: Permission denied
Verifying ambari-metrics-collector process status...
Sun Feb 4 14:31:21 CET 2018 Collector successfully started.
Sun Feb 4 14:31:21 CET 2018 Initializing Ambari Metrics data model
...
[ams@cgihdp4 ~]$ ambari-metrics-collector status
AMS is running as process 22414.
I guess the "Permission denied" is caused by what you just pointed out, so I will change the permissions back. But I am confused about 'master is running as process 23182': that is the HBase Master, running as user 'ams'. Does it indicate an issue? Otherwise nothing has changed; there is still no process listening on port 6188.
Created 02-04-2018 01:27 PM
There is no hbck tool available on the nodes, so I couldn't try that.