
Ambari Metrics Collector going down


The Ambari Metrics Collector is going down because of a lack of thread pools. How do we increase the thread pool size for the Ambari Metrics HBase? We are running HBase for metrics in distributed mode. The collector goes down within 5 minutes after a restart.

1 ACCEPTED SOLUTION

Super Collaborator

@ARUN

Based on the logs, it seems one or more components are flooding the system with too many metrics. It could be the cluster HBase Service.

Can you check that the last 2 lines in the files mentioned in https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_ambari_reference_guide/content/_enabling... are not commented out?

The last 2 lines should look like this (and should not be commented out).

*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter

hbase.*.source.filter.exclude=*Regions*
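To confirm quickly that they are active on the HBase hosts, a grep like the one below should help. This is just a sketch; it assumes the default HDP config path /etc/hbase/conf, which may be different on your cluster.

    grep -n 'source.filter' /etc/hbase/conf/hadoop-metrics2-hbase.properties
    # both lines should be printed without a leading '#' in front of them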

Restart HBase Service after these changes.

Also, for a 30-node cluster, AMS should work fine in embedded mode, writing data to local disk. Your cluster's AMS is configured in distributed mode, where the AMS HBase writes to the cluster HDFS. Do you have a local DataNode on the Metrics Collector host?
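If you do decide to move AMS back to embedded mode, the change is roughly the following. This is only a sketch: the property names come from ams-site / ams-hbase-site, and the local rootdir path below is just an example, not necessarily what your cluster uses.

    # ams-site
    timeline.metrics.service.operation.mode=embedded

    # ams-hbase-site
    hbase.cluster.distributed=false
    hbase.rootdir=file:///var/lib/ambari-metrics-collector/hbase

Restart AMS after changing the mode.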


11 REPLIES

Super Guru

@ARUN Can you please share the error that you see? I assume this is from the AMS log files.


Hi @Josh Elser, I have attached the metrics log (metrics-log.txt). I also see some strange YARN-related errors in the metrics log for the first time. 😉

Super Guru

Hah, yes, it seems like you have a port conflict problem.

You could use a tool like netstat to find which process has already bound port 60020, e.g. `sudo netstat -nape | fgrep 60020`, which will show the PID of the process holding that port. Once you identify the other process, you can determine whether there is a port conflict that needs to be resolved via configuration.

One important note is that 60020 is in the ephemeral port range, which means there may be transient sockets binding that port. If you do not see any service bound to that port now, this is likely what happened, and you can try simply restarting AMS. This is the reason the HBase default ports moved from 600xx to 160xx in recent versions.
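If it does turn out to be an ephemeral-port collision, one way to check the configured range and stop the kernel from handing those ports to outgoing connections is sketched below. The reserved range is only an example; adjust it to the ports AMS actually uses, and add the setting to /etc/sysctl.conf if you want it to persist across reboots.

    cat /proc/sys/net/ipv4/ip_local_port_range
    # reserve the AMS/HBase ports so they are never assigned as ephemeral ports (example range)
    sudo sysctl -w net.ipv4.ip_local_reserved_ports=60000-60210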


Also, I see errors like this too:

ERROR org.apache.hadoop.hbase.client.AsyncProcess: Internal AsyncProcess #1 error for METRIC_RECORD_MINUTE processing for local,61320,1481772798476
java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@db14f9e rejected from java.util.concurrent.ThreadPoolExecutor@a8ef1a0[Shutting down, pool size = 10, active threads = 10, queued tasks = 324, completed tasks = 45]
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
        at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
        at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1256)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1162)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.findAllLocationsOrFail(AsyncProcess.java:940)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(AsyncProcess.java:857)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1186)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.receiveGlobalFailure(AsyncProcess.java:1153)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$1100(AsyncProcess.java:575)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:718)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.sendMultiAction(AsyncProcess.java:977)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.groupAndSendMultiAction(AsyncProcess.java:886)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.resubmit(AsyncProcess.java:1186)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.receiveGlobalFailure(AsyncProcess.java:1153)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.access$1100(AsyncProcess.java:575)
        at org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl$SingleServerRequestRunnable.run(AsyncProcess.java:718)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@db14f9e rejected from java.util.concurrent.ThreadPoolExecutor@a8ef1a0[Shutting down, pool size = 10, active threads = 10, queued tasks = 324, completed tasks = 45]
        at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)

Master Mentor

@ARUN

Basically there are 2 errors:

1). Address already in use (port conflict)

Caused by: java.net.BindException: Problem binding to [0.0.0.0:60200] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException

For that, please check which process is using that port; if there is a genuine conflict, either change the port or kill the other process.
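For example, something along these lines; lsof is just one option, and the PID below is a placeholder you need to take from the output.

    sudo lsof -nP -i :60200     # shows which process is bound to the port
    ps -fp <PID>                # inspect that process
    sudo kill <PID>             # only if it is safe to stop it, otherwise change the port instead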


2). The second error looks like data corruption. I would suggest clearing the old AMS data:

https://cwiki.apache.org/confluence/display/AMBARI/Cleaning+up+Ambari+Metrics+System+Data

Caused by: java.io.InterruptedIOException: Interrupted calling coprocessor service org.apache.phoenix.coprocessor.generated.MetaDataProtos$MetaDataService for row \x00\x00METRIC_RECORD
        at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1769)
        at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1719)
        at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1022)

- Shut down AMS and then clear out the "/var/lib/ambari-metrics-collector" directory for a fresh restart.

- From Ambari -> Ambari Metrics -> Config -> Advanced ams-hbase-site, get the "hbase.rootdir" and "hbase.tmp.dir" values.

- Delete or move the hbase.tmp.dir and hbase.rootdir directories to an archive folder.

- Then restart AMS.
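Roughly, the sequence looks like the sketch below. The paths are placeholders; use the actual hbase.rootdir and hbase.tmp.dir values from your ams-hbase-site, and stop AMS from Ambari before touching anything.

    # with AMS stopped (Ambari -> Ambari Metrics -> Service Actions -> Stop)
    mkdir -p /tmp/ams-archive
    mv <hbase.tmp.dir> /tmp/ams-archive/hbase-tmp
    mv <hbase.rootdir> /tmp/ams-archive/hbase-root
    # if hbase.rootdir points at HDFS (distributed mode), archive it with 'hdfs dfs -mv' instead
    rm -rf /var/lib/ambari-metrics-collector/*
    # then start AMS again from Ambari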

Super Collaborator

@ARUN,

What is the size of the cluster?

Can we have the following config items?

/etc/ambari-metrics-collector/conf - ams-site.xml, ams-env.sh

/etc/ams-hbase/conf - hbase-site.xml, hbase-env.sh

Also, the response of http://<AMS_HOST>:6188/ws/v1/timeline/metrics/metadata
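For the metadata endpoint, something like the following should work; curl is just one option, and the json.tool pretty-printing is optional if python is available on that host.

    curl -s http://<AMS_HOST>:6188/ws/v1/timeline/metrics/metadata | python -m json.tool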


Hi @Aravindan Vijayan, sorry for the delayed reply. Our cluster size is 30 nodes.

I have attached the details you asked for, and this is the output of

http://<AMS_HOST>:6188/ws/v1/timeline/metrics/metadata

{"timestamp":0,"starttime":0,"metrics":{}}

Attachments: hbase-site.xml, ams-site.xml, ams-env.txt, hbase-env.txt

Do we need to increase any of the parameters for metrics?


Hi @Aravindan Vijayan, I cleared the Ambari Metrics data and restarted the metrics service again, but the collector went down again with the following error. I guess it is due to a lack of resources. Can you please point out the parameters that need to be increased for our cluster configuration? I have given the cluster details in the previous message: it is a 30-node cluster and we have 256 GB of RAM in the 28 slave nodes. I have also attached the entire log after today's restart (ambari-metrics-collector.zip).

2017-01-04 02:22:52,065 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 4  actions to finish
2017-01-04 02:22:52,065 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 4  actions to finish
2017-01-04 02:22:52,065 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 4  actions to finish
2017-01-04 02:22:52,066 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 18  actions to finish
2017-01-04 02:22:52,067 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 1879  actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 6  actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 74  actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 121  actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 74  actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 6  actions to finish
2017-01-04 02:22:53,877 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 74  actions to finish
2017-01-04 02:22:53,880 INFO org.apache.hadoop.hbase.client.AsyncProcess: #1, waiting for 43  actions to finish
