Created 12-20-2016 04:47 PM
The Ambari Metrics Collector is going down, apparently because of exhausted thread pools. How do we increase the thread pool size for the AMS HBase? We are running HBase for metrics in distributed mode. The collector goes down within 5 minutes after a restart.
Created 12-20-2016 04:50 PM
@ARUN Can you please share the error that you see? I assume this is from the AMS log files.
Created 12-20-2016 05:17 PM
Hi @Josh Elser, I have attached the metrics log. I also see some strange YARN-related errors in the metrics log for the first time. 😉 (Attachment: metrics-log.txt)
Created 12-20-2016 06:14 PM
Hah, yes, it seems like you have a port conflict problem.
You could use a tool like netstat to find which process has already bound port 60020, e.g. `sudo netstat -nape | fgrep 60020`; the output shows the PID of the process holding the port. Once you identify the other process, you can determine whether there is a port conflict that needs to be resolved via configuration.
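If netstat is not convenient, roughly the same check can be done with lsof (a sketch assuming a Linux host with lsof installed; <PID> is a placeholder for whatever PID the first command reports):
sudo lsof -nP -iTCP:60020 -sTCP:LISTEN    # list the process, if any, listening on 60020
ps -fp <PID>                              # inspect that process to see which service owns it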
One important note is that 60020 is in the Ephemeral port range which means that there may be transient sockets binding that port. If you do not see any service bound on that port now, this is likely what happened. You can try to just restart the AMS in this case. This is the reason that HBase default ports moved from 600xx to 160xx in recent versions.
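To see whether 60020 actually falls inside your ephemeral range, a quick check on a Linux host (the exact range varies by distribution, but the default upper bound is often around 61000, which would include 60020):
cat /proc/sys/net/ipv4/ip_local_port_range    # prints the lower and upper bounds of the ephemeral port range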
Created 12-20-2016 05:20 PM
Also, I do see errors like this:
ERROR org.apache.hadoop.hbase.client.AsyncProcess: Internal AsyncProcess #1 error for METRIC_RECORD_MINUTE processing for local,61320,1481772798476
java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@db14f9e rejected from java.util.concurrent.ThreadPoolExecutor@a8ef1a0[Shutting down, pool size = 10, active threads = 10, queued tasks = 324, completed tasks = 45]
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
    at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
    at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1256)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1162)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.findAllLocationsOrFail(AsyncProcess.java:940)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(AsyncProcess.java:857)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1186)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.receiveGlobalFailure(AsyncProcess.java:1153)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$1100(AsyncProcess.java:575)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:718)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.sendMultiAction(AsyncProcess.java:977)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(AsyncProcess.java:886)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1186)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.receiveGlobalFailure(AsyncProcess.java:1153)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$1100(AsyncProcess.java:575)
    at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:718)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@db14f9e rejected from java.util.concurrent.ThreadPoolExecutor@a8ef1a0[Shutting down, pool size = 10, active threads = 10, queued tasks = 324, completed tasks = 45]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
Created 12-20-2016 05:30 PM
Basically there are 2 errors:
1). Address already in use (port conflict)
Caused by: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
For that, please check which process is consuming that port. If there is a port conflict, either change the port via configuration or kill the other process that is holding it.
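One way to do both checks from the collector host, reusing the netstat approach above and assuming the usual AMS config locations:
sudo netstat -nape | fgrep 60200                                         # which process currently holds the port
grep -R "60200" /etc/ams-hbase/conf /etc/ambari-metrics-collector/conf   # which AMS property is configured to bind it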
2). The second error looks like data corruption. I would suggest clearing out the old AMS data:
https://cwiki.apache.org/confluence/display/AMBARI/Cleaning+up+Ambari+Metrics+System+Data
Caused by: java.io.InterruptedIOException: Interrupted calling coprocessor service org.apache.phoenix.coprocessor.generated.MetaDataProtos$MetaDataService for row \x00\x00METRIC_RECORD
    at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1769)
    at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1719)
    at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1022)
- Shut down AMS and then clear out the "/var/lib/ambari-metrics-collector" directory for a fresh restart:
- From Ambari -> Ambari Metrics -> Config -> Advanced ams-hbase-site, get the "hbase.rootdir" and "hbase-tmp" directories.
- Delete or move the hbase-tmp and hbase.rootdir directories to an archive folder (a command-line sketch follows below).
- Then restart AMS.
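A rough sketch of those steps from the shell (assuming the default local paths; confirm hbase.rootdir and hbase.tmp.dir in Advanced ams-hbase-site first, and note that in distributed mode hbase.rootdir points at HDFS, so that directory would need hdfs dfs -mv instead of mv):
ambari-metrics-collector stop                                                                    # or stop AMS from the Ambari UI
mv /var/lib/ambari-metrics-collector/hbase /var/lib/ambari-metrics-collector/hbase.bak           # archive the embedded hbase.rootdir
mv /var/lib/ambari-metrics-collector/hbase-tmp /var/lib/ambari-metrics-collector/hbase-tmp.bak   # archive the hbase-tmp directory
ambari-metrics-collector start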
Created 12-20-2016 10:13 PM
What is the size of the cluster?
Can we have the following config items?
/etc/ambari-metrics-collector/conf - ams-site.xml, ams-env.sh
/etc/ams-hbase/conf - hbase-site.xml, hbase-env.sh
Also, the response of http://<AMS_HOST>:6188/ws/v1/timeline/metrics/metadata
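For reference, a simple way to pull that endpoint from any host that can reach the collector (the json.tool pipe is only for pretty-printing and can be dropped):
curl -s "http://<AMS_HOST>:6188/ws/v1/timeline/metrics/metadata" | python -m json.tool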
Created 12-28-2016 05:54 AM
Hi @Aravindan Vijayan, sorry for the delayed reply. Our cluster size is 30 nodes.
I have attached the details you asked for, and this is the output of
http://<AMS_HOST>:6188/ws/v1/timeline/metrics/metadata
{"timestamp":0,"starttime":0,"metrics":{}}
(Attachments: hbase-site.xml, ams-site.xml, ams-env.txt, hbase-env.txt)
Do we need to increase any of the parameters for metrics?
Created 01-04-2017 08:51 AM
Hi @Aravindan Vijayan, I cleared out the Ambari Metrics data and restarted metrics again, but the collector went down again with the following error. I guess it is due to lack of resources. Can you please point out the parameter that needs to be increased for our cluster configuration? I have given the cluster details in the previous message: it is a 30 node cluster and we have 256 GB RAM in 28 slave nodes. I have also attached the entire log after today's restart. (Attachment: ambari-metrics-collector.zip)
2017-01-04 02:22:52,065 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 4 actions to finish
2017-01-04 02:22:52,065 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 4 actions to finish
2017-01-04 02:22:52,065 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 4 actions to finish
2017-01-04 02:22:52,066 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 18 actions to finish
2017-01-04 02:22:52,067 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 1879 actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 6 actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 74 actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 121 actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 74 actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 6 actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 74 actions to finish
2017-01-04 02:22:53,880 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 43 actions to finish
Created 01-04-2017 11:11 PM
Based on the logs, it seems one or more components are flooding the system with too many metrics. It could be the cluster HBase Service.
Can you check that the last 2 lines in the files mentioned in https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_ambari_reference_guide/content/_enabling... are not commented out?
The last 2 lines should look like this (and should not be commented out).
*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
hbase.*.source.filter.exclude=*Regions*
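A quick way to verify them on an HBase node, assuming the usual HDP location of the HBase metrics2 properties file (adjust the path if your layout differs):
grep -n "source.filter" /etc/hbase/conf/hadoop-metrics2-hbase.properties    # both lines should show up without a leading '#'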
Restart HBase Service after these changes.
Also, for a 30 node cluster, AMS should work fine in embedded mode, writing data to local disk. Your cluster's AMS is configured in distributed mode, where AMS HBase writes to the cluster HDFS. Do you have a local DataNode on the Metrics Collector host?
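To confirm how AMS is currently set up, one quick check on the collector host (property names to the best of my knowledge; embedded mode corresponds to timeline.metrics.service.operation.mode=embedded with a local file:// hbase.rootdir):
grep -A1 "timeline.metrics.service.operation.mode" /etc/ambari-metrics-collector/conf/ams-site.xml   # embedded or distributed
grep -A1 "hbase.rootdir" /etc/ams-hbase/conf/hbase-site.xml                                          # file:// (local disk) or hdfs:// (cluster HDFS)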