
Can't start Ambari Metrics Collector


We can't start the Ambari Metrics Collector.

From the logs we got the following:

What could be the problem?


2018-03-11 15:50:20,249 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: hconnection-0x4bdeaabb-0x16215bf27850000, quorum=master02.sys673.com:61181, baseZNode=/ams-hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1212)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:354)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:622)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionState(MetaTableLocator.java:491)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionLocation(MetaTableLocator.java:172)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:611)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:592)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:565)
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1195)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1162)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1136)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:298)
    at org.apache.hadoop.hbase.client.ScannerCallable.prepare(ScannerCallable.java:151)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:376)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
grep error  ambari-metrics-collector.log
2018-03-11 14:59:04,182 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master02.sys673.com/130.14.52.8:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-11 14:59:04,185 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
2018-03-11 14:59:04,458 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master02.sys673.com/130.14.52.8:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-11 14:59:04,459 WARN org.apache.zookeeper.ClientCnxn: Session 0x162158dcd9a0001 for server null, unexpected error, closing socket connection and attempting reconnect
2018-03-11 14:59:05,286 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master02.sys673.com/130.14.52.8:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-11 14:59:05,286 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
2018-03-11 14:59:05,812 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master02.sys673.com/130.14.52.8:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-11 14:59:05,813 WARN org.apache.zookeeper.ClientCnxn: Session 0x162158dcd9a0001 for server null, unexpected error, closing socket connection and attempting reconnect
2018-03-11 14:59:06,388 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master02.sys673.com/130.14.52.8:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-11 14:59:06,388 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
2018-03-11 14:59:07,197 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master02.sys673.com/130.14.52.8:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-11 14:59:07,197 WARN org.apache.zookeeper.ClientCnxn: Session 0x162158dcd9a0001 for server null, unexpected error, closing socket connection and attempting reconnect
2018-03-11 14:59:07,489 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master02.sys673.com/130.14.52.8:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-11 14:59:07,490 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
2018-03-11 14:59:08,591 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master02.sys673.com/130.14.52.8:61181. Will not attempt to authenticate using SASL (unknown error)
2018-03-11 14:59:08,592 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
Michael-Bronson
1 ACCEPTED SOLUTION

Master Mentor

@Michael Bronson

It looks like the AMS collector is not able to start properly, and hence ZooKeeper is showing a ConnectionLoss while looking up the znode.

Sometimes this happens when the AMS collector is not tuned properly, especially when the heap settings do not match the number of nodes in the cluster. So can you please refer to the following doc to check whether the heap settings are appropriate for your cluster:

- https://cwiki.apache.org/confluence/display/AMBARI/Configurations+-+Tuning

- Sometimes cleaning up the contents of the "hbase.zookeeper.property.dataDir" and "${hbase.tmp.dir}/phoenix-spool" temp directories helps, since the ZooKeeper and spool data kept there are only temporary.
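
For example, with the Metrics Collector stopped, the cleanup could look like the following. This is only a rough sketch that assumes the default AMS layout (hbase.tmp.dir = /var/lib/ambari-metrics-collector/hbase-tmp, embedded mode); please verify the actual paths in your ams-hbase-site config before deleting anything:

# rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/*        # hbase.zookeeper.property.dataDir
# rm -rf /var/lib/ambari-metrics-collector/hbase-tmp/phoenix-spool/*    # phoenix.spool.directory

Then start the Metrics Collector again from Ambari; both directories hold only temporary data and will be recreated on startup.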


15 REPLIES


@Jay the link has a lot of info; can you please give the important notes from it?

Michael-Bronson


Regarding the ${hbase.tmp.dir}/phoenix-spool temp directory, what is its path on the Ambari server?

Michael-Bronson

Master Mentor

@Michael Bronson

In the tuning link mentioned above, we can start from the "Suggested Memory Settings" section (as most of the time incorrect memory tuning is what causes the AMS collector to go down) to see whether memory is allocated properly for the AMS components.
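
As a quick sanity check (just a sketch; the exact file layout can vary between Ambari versions), you can grep the config directories referenced later in this thread to see the heap-related values currently in effect on the collector host:

# grep -riE "heapsize|xmx|xmn" /etc/ambari-metrics-collector/ /etc/ams-hbase/

Then compare those values against the "Suggested Memory Settings" table for your cluster size.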

The data dirs are as follows:

hbase.tmp.dir=/var/lib/ambari-metrics-collector/hbase-tmp
hbase.zookeeper.property.dataDir=${hbase.tmp.dir}/zookeeper
phoenix.spool.directory=${hbase.tmp.dir}/phoenix-spool
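
Assuming hbase.tmp.dir is left at the default shown above, you can check what is currently stored there (and how large it is) on the collector host before cleaning it up, for example:

# ls -l /var/lib/ambari-metrics-collector/hbase-tmp
# du -sh /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper /var/lib/ambari-metrics-collector/hbase-tmp/phoenix-spool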

.


@Jay is the cluster size (as it appears in the link) the number of worker machines?

Michael-Bronson


@Jay we cleaned both folders and started the metrics again, but we have the same issue.

Michael-Bronson

Master Mentor

@Michael Bronson

The cluster size means the total number of hosts present in the cluster. As the number of hosts increases, the Ambari Metrics data grows accordingly and more data aggregation is needed, so the heap sizes have to be set accordingly. A rough calculation of the heap settings is given in the tuning doc above.

- In this case, as AMS is going down very frequently, we should check the following things:

1. Usually this happens when the heap is not set properly, and increasing the memory works in many cases. So please try to increase the heap memory for the AMS collector and its HBase.

2. We need to know how many nodes are in the cluster to see whether the current tuning is OK.

3. We can check the current AMS tuning by looking at the following configs (so if you can attach them, it will be good to review them):

# tar czhfv ./amsc_etc_$(hostname)_$(date +"%Y%m%d%H%M%S").tar.gz /etc/ambari-metrics-collector/ 
# tar czhfv ./amsh_etc_$(hostname)_$(date +"%Y%m%d%H%M%S").tar.gz /etc/ams-hbase/ 

4. The complete collector logs, along with the following files from "/var/log/ambari-metrics-collector" on the AMS host (an example of how to bundle them is shown after this list):

ambari-metrics-collector.log      (latest one)
hbase-ams-master-*.log    (latest one)
collector-gc.log (latest)
gc.log   (latest)
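
For example, the latest logs could be bundled in the same way as the config tarballs above (just a sketch; the exact GC log file names on your host may differ slightly):

# cd /var/log/ambari-metrics-collector
# tar czfv ./ams_logs_$(hostname)_$(date +"%Y%m%d%H%M%S").tar.gz ambari-metrics-collector.log hbase-ams-master-*.log collector-gc.log* gc.log*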

.


@Jay we have 3 master machines and 4 worker machines in the cluster, so the total is 7 hosts. So from the table we need to select the first entry (1 - 10), am I correct?

Michael-Bronson

Master Mentor

@Michael Bronson

Yes, correct, this one should be good.

[attached screenshot: 62833-ams-tuning.png (AMS tuning table)]

.

Also, please check whether the AMS collector host has enough free memory:

# free -m

.