Created 02-07-2017 02:02 PM
Apache Ambari Version: 2.4.0.1
The alerts indicated CRITICAL because of:
Metrics Collector has been auto-started 5 times since <timestamp>
This happened very frequently.
The main log, ambari-metrics-collector.log, looks like the following:
2017-02-07 11:48:42,465 WARN org.apache.zookeeper.ClientCnxn: Session 0x15a1698abc40001 for server humepcomp117.huawei.com/10.106.134.117:61181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
...
2017-02-07 11:48:45,336 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/table/SYSTEM.CATALOG
2017-02-07 11:48:45,336 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure
...
2017-02-07 11:49:02,678 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 4 attempts
2017-02-07 11:49:02,678 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x433defed-0x15a1698abc40001, quorum=humepcomp117.huawei.com:61181, baseZNode=/ams-hbase-unsecure Unable to get data of znode /ams-hbase-unsecure/table/METRIC_RECORD
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/table/METRIC_RECORD
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:622)
        at org.apache.hadoop.hbase.zookeeper.ZKTableStateClientSideReader.getTableState(ZKTableStateClientSideReader.java:185)
        at org.apache.hadoop.hbase.zookeeper.ZKTableStateClientSideReader.isDisabledTable(ZKTableStateClientSideReader.java:59)
        at org.apache.hadoop.hbase.client.ZooKeeperRegistry.isTableOnlineState(ZooKeeperRegistry.java:127)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isTableDisabled(ConnectionManager.java:960)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1129)
        at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:298)
        at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:150)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:376)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
        at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
And hbase-ams-master-humepcomp117.log:
2017-02-07 11:48:43,542 INFO [main] zookeeper.ZooKeeper: Initiating client connection, connectString=humepcomp117.huawei.com:61181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@4450d156
2017-02-07 11:48:43,583 INFO [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server humepcomp117.huawei.com/10.106.134.117:61181. Will not attempt to authenticate using SASL (unknown error)
2017-02-07 11:48:43,592 WARN [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2017-02-07 11:48:43,712 WARN [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
2017-02-07 11:48:44,702 INFO [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server humepcomp117.huawei.com/10.106.134.117:61181. Will not attempt to authenticate using SASL (unknown error)
2017-02-07 11:48:44,704 WARN [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2017-02-07 11:48:44,805 WARN [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=humepcomp117.huawei.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
2017-02-07 11:48:44,805 ERROR [main] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 1 attempts
2017-02-07 11:48:44,805 WARN [main] zookeeper.ZKUtil: clean znode for master0x0, quorum=humepcomp117.huawei.com:61181, baseZNode=/ams-hbase-unsecure Unable to get data of znode /ams-hbase-unsecure/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:712)
        at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
        at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2529)
2017-02-07 11:48:44,807 ERROR [main] zookeeper.ZooKeeperWatcher: clean znode for master0x0, quorum=humepcomp117.huawei.com:61181, baseZNode=/ams-hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:712)
        at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
        at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2529)
2017-02-07 11:48:44,809 WARN [main] zookeeper.ZooKeeperNodeTracker: Can't get or delete the master znode
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/master
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:712)
        at org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.deleteIfEquals(MasterAddressTracker.java:267)
        at org.apache.hadoop.hbase.ZNodeClearer.clear(ZNodeClearer.java:149)
        at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2529)
2017-02-07 11:48:45,805 INFO [main-SendThread(humepcomp117.huawei.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server humepcomp117.huawei.com/10.106.134.117:61181. Will not attempt to authenticate using SASL (unknown error)
2017-02-07 11:48:45,911 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
2017-02-07 11:48:45,911 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
Due to network limitations at my company, I can't paste or upload the full log file; I will try to paste more in follow-up posts.
Please kindly help solve this problem. Thanks.
I have already tried the best answer in
https://community.hortonworks.com/questions/48107/ambari-metrics-collector.html
but it did not help. Thanks for your great support.
Created 02-10-2017 09:35 AM
Ideally the "MaxMetaspaceSize" has no upper limit. Please see:
http://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/considerations.html
The amount of native memory that can be used for class metadata is by default unlimited. Use the option MaxMetaspaceSize to put an upper limit on the amount of native memory used for class metadata.
Regarding the tuning parameters, you might want to tune them as follows:
Something you should try:
----------------------------------------
ams-hbase-env :: hbase_master_heapsize        1152 MB ===>> 8192 MB
ams-hbase-env :: hbase_master_maxperm_size     128 MB ===>> 128 MB (or 256 MB)
ams-hbase-env :: hbase_regionserver_heapsize   768 MB ===>> 8192 MB
ams-hbase-env :: regionserver_xmn_size         128 MB ===>> 1280 MB to 1536 MB
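If you prefer to apply these from the command line instead of the Ambari UI, something like the following should work. This is only a sketch using the configs.sh helper shipped with Ambari Server; the cluster name "c1" and the admin credentials are placeholders you must replace:

# Run on the Ambari Server host; replace c1 and the credentials with your own
/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin set localhost c1 ams-hbase-env "hbase_master_heapsize" "8192"
/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin set localhost c1 ams-hbase-env "hbase_regionserver_heapsize" "8192"
/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin set localhost c1 ams-hbase-env "regionserver_xmn_size" "1536"
# Then restart Ambari Metrics from the Ambari UI so the new values take effect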
In JDK 1.8, PermGen space is replaced by Metaspace. It is always better to set "MaxMetaspaceSize" so that if there is a classloader leak, Metaspace cannot grow beyond that boundary; otherwise a leak may cause huge system memory utilization.
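For example, a minimal sketch of capping Metaspace, assuming JDK 8 and that you append this to the AMS hbase-env template in Ambari (the 256m value is only illustrative, not a recommendation for every cluster):

# Cap Metaspace for the AMS HBase master and region server JVMs
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:MaxMetaspaceSize=256m"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxMetaspaceSize=256m"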
Also disable (exclude) HBase per-region metrics to avoid data flooding.
Created 02-07-2017 03:00 PM
Along with cleaning up the directories recommended in the mentioned post, take a look at the AMS tuning guide and configure AMS accordingly:
https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning
Created 02-09-2017 01:50 AM
@icocio My cluster has 40 hosts. I have now tuned the parameters using the recommendations for clusters of >50 nodes; I will monitor whether this helps with my problem.
Created 02-15-2017 07:57 AM
Solved by tuning the parameters.
Created 02-07-2017 07:11 PM
Please also look at:
https://cwiki.apache.org/confluence/display/AMBARI/Troubleshooting+Guide
How big is your cluster?
Created 02-09-2017 01:41 AM
40 hosts in my cluster @swagle
Created 02-09-2017 12:53 PM
@Huahua Wei I had this issue after upgrading from Ambari 2.2.0 and also tried the solution from https://community.hortonworks.com/questions/48107/ambari-metrics-collector.html
In my case, the HBase data of the Metrics Collector got corrupted, and I had to delete all the contents of /var/lib/ambari-metrics-collector/hbase/
Try the following:
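(The exact commands were lost from the original post; the following is only a sketch of the likely procedure, assuming the default AMS data directory and the default "ams" service user. Moving the data aside is safer than deleting it outright.)

# Stop Ambari Metrics (Metrics Collector) from the Ambari UI first, then on the collector host:
mv /var/lib/ambari-metrics-collector/hbase /var/lib/ambari-metrics-collector/hbase.bak
mkdir -p /var/lib/ambari-metrics-collector/hbase
chown ams:hadoop /var/lib/ambari-metrics-collector/hbase   # assuming the default ams user and hadoop group
# Then restart Ambari Metrics from the Ambari UI; AMS will recreate its HBase data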
I hope this helps.
Created 02-09-2017 01:36 PM
Please first try to disable auto-start for AMS by commenting out the following lines in the file "/etc/ambari-server/conf/ambari.properties", then restart the Ambari server:
recovery.type=AUTO_START
recovery.enabled_components=METRICS_COLLECTOR
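For example, after commenting them out, that section of ambari.properties would look like this, followed by the restart:

# recovery.type=AUTO_START
# recovery.enabled_components=METRICS_COLLECTOR

ambari-server restart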
The above will help us understand why AMS went down (e.g., due to a memory issue, overload, etc.).
Also, please try to disable (exclude) HBase per-region metrics to avoid data flooding. That can be done by explicitly adding the following lines to the end of the file:
*.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
hbase.*.source.filter.exclude=*Regions*
For more information please refer to: https://docs.hortonworks.com/HDPDocuments/Ambari-2.2.0.0/bk_ambari_reference_guide/content/_enabling...
Created 02-09-2017 05:55 PM
If this is a production cluster, switching to distributed mode will make use of the cluster ZooKeeper, which will make the system a lot more stable.
Embedded mode works perfectly fine for a cluster size of 40 nodes, provided memory and disk are not heavily contended. AMS HBase will write to one disk and talk to the embedded ZooKeeper, so here are straightforward recommendations, made without looking at the full logs and configs and without changing the mode:
ams-env :: metrics_collector_heapsize = 1024
ams-hbase-env :: hbase_regionserver_heapsize = 4096
Make sure hbase.rootdir and hbase.tmp.dir are not pointing to the same location.
The key is to put hbase.rootdir on a non-contended disk.
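For example, in embedded mode the two settings in ams-hbase-site might look like this (the paths are illustrative; the point is that they live on different, non-contended disks):

hbase.rootdir = file:///grid/0/ambari-metrics-collector/hbase
hbase.tmp.dir = /grid/1/ambari-metrics-collector/hbase-tmp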
If you switch to distributed mode, the disk settings do not matter:
https://cwiki.apache.org/confluence/display/AMBARI/AMS+-+distributed+mode
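In short (see the wiki page above for the authoritative steps), switching involves changes along these lines; the HDFS path is illustrative:

ams-site       :: timeline.metrics.service.operation.mode = distributed
ams-hbase-site :: hbase.cluster.distributed = true
ams-hbase-site :: hbase.rootdir = hdfs://<namenode>:8020/user/ams/hbase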
Created 02-09-2017 06:06 PM
The comment from @Jay SenSharma regarding per-region metrics is also important and applicable here.
Note: Additionally, make sure the Xmn setting is about 15% of Xmx in ams-env and ams-hbase-env.
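For example, with the hbase_regionserver_heapsize (Xmx) of 4096 MB suggested above, Xmn would be roughly 0.15 x 4096 ≈ 614 MB, so a regionserver_xmn_size of around 640 MB is a reasonable choice.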