Ambari Metrics Collector down

Explorer

Hi,

Could you please help me? I am new to Ambari Metrics. After upgrading Ambari and Ambari Metrics, the ambari-metrics-collector goes down shortly after I start it. Below is the error message from the log:

INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server name_of_host/x.x.x.x:61181. Will not attempt to authenticate using SASL (unknown error)
WARN org.apache.zookeeper.ClientCnxn: Session 0x65464619r4fd7e for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=name_of_host:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-secure/meta-region-server

1 ACCEPTED SOLUTION


Hi @lam rab ,

From the error, it looks like ZooKeeper is having an issue and the collector is not able to connect to it.
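"Connection refused" on port 61181 typically means nothing is listening on that port. As a quick check on the collector host, a sketch like the one below could confirm that. It assumes the `ss` utility is available; `listening_on` is a hypothetical helper name, not part of Ambari.

```shell
# Reads `ss -ltn` output on stdin and succeeds (exit 0) if any
# local address in column 4 is listening on the given port.
listening_on() {
  awk -v port="$1" '$4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

# Example (run on the collector host):
#   ss -ltn | listening_on 61181 && echo "61181 is listening" || echo "nothing listens on 61181"
```

If nothing is listening, the clients in the log have no server to reach, which matches the repeated reconnect attempts.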

If the AMS metrics history data is not important to you and you just need to bring the service up, can you try the procedure at: https://cwiki.apache.org/confluence/display/AMBARI/Cleaning+up+Ambari+Metrics+System+Data

Remove the AMS ZooKeeper data by backing up and removing the contents of 'hbase.tmp.dir'/zookeeper, and see if this helps.

Also, please make sure the AMS heap configurations are good: https://docs.hortonworks.com/HDPDocuments/Ambari-2.6.2.0/bk_ambari-operations/content/ams_general_gu...

Please accept my answer if you found this helpful.


REPLIES


Hi @lam rab,

Have you done the mandatory post-upgrade tasks for the Ambari upgrade?

If you run this command, every Ambari package in the output should show the same version:

rpm -qa | grep -i ambari

Refer to this doc for more (this one is for Ambari 2.4.3; choose the one for your version of Ambari): https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.3.0/bk_ambari-upgrade/content/upgrade_ambari_me...
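The version comparison can also be scripted rather than checked by eye. The sketch below assumes package names follow the usual RPM name-version-release.arch pattern (for example ambari-agent-2.6.2.2-1.x86_64); `distinct_versions` and `same_version` are hypothetical helper names.

```shell
# Prints the distinct version-release strings found in a list of
# RPM package names read from stdin (one package name per line).
distinct_versions() {
  sed -E 's/^.*-([0-9][0-9.]*-[0-9]+)\.[^.]+$/\1/' | sort -u
}

# Succeeds (exit 0) only if every package name on stdin
# carries the same version-release string.
same_version() {
  [ "$(distinct_versions | grep -c .)" -eq 1 ]
}

# Example:
#   rpm -qa | grep -i ambari | same_version && echo "versions match" || echo "version mismatch"
```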

Please mark the answer as accepted if it's helpful.

Explorer

Hi @Akhil S Naik

Yes, I did all the post-upgrade tasks for the Ambari upgrade, especially for ambari-metrics.

Below is the output:

[root@myhost ~]# rpm -qa |grep -i ambari
ambari-agent-2.6.2.2-1.x86_64
ambari-metrics-collector-2.6.2.2-1.x86_64
ambari-metrics-hadoop-sink-2.6.2.2-1.x86_64
ambari-metrics-monitor-2.6.2.2-1.x86_64

Contributor

Hi @lam rab

Is the issue resolved? If yes, please let me know how it was done.

If not, is your cluster Kerberized? Can you also add the HBase logs from the Metrics Collector?

A few things I have tried:

In Ambari, go to the host where the Metrics Collector is installed, refresh the configs, and try restarting the Metrics Collector again.

In the cases I have faced so far, the issue was due to either the values stored in zkClient or something wrong in the Metrics Collector files stored on the hosts.

If you don't need the previously stored metrics, you can follow the steps below "at your own risk":

1. Stop all the Metrics Collector, Metrics Monitor, and Grafana services.

2. Delete the service.

3. Rename/delete the folder ambari-metrics-collector at path /var/log/var/lib/ and /var/var/lib/

4. Add the Ambari Metrics service from Ambari again.

The above worked for me.


Hi @lam rab ,

Were you able to resolve the issue? The exception you posted mostly happens because of an upgrade.

If not, please attach the ambari-metrics-collector logs.

Explorer

Hi,

I have not resolved the problem yet. Here are the logs from ambari-metrics and hbase-ams.

Thanks

ambari-metrics-collector log:

--------------------------------

2018-08-31 17:17:29,367 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server host102/x.x.x.x:61181. Will not attempt to authenticate using SASL (unknown error)
2018-08-31 17:17:29,367 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2018-08-31 17:17:29,834 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server host102/x.x.x.x:61181. Will not attempt to authenticate using SASL (unknown error)
2018-08-31 17:17:29,834 WARN org.apache.zookeeper.ClientCnxn: Session 0x1659083ffc20001 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2018-08-31 17:17:29,991 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server host102/x.x.x.x:61181. Will not attempt to authenticate using SASL (unknown error)
2018-08-31 17:17:29,991 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2018-08-31 17:17:30,091 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=host102:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-secure/meta-region-server

###################################################################

hbase-ams log

--------------------------------------------------

2018-08-31 17:08:13,773 INFO [main-SendThread(host102:61181)] zookeeper.ClientCnxn: Opening socket connection to server host102/x.x.x.x:61181. Will not attempt to authenticate using SASL (unknown error)
2018-08-31 17:08:13,773 WARN [main-SendThread(host102:61181)] zookeeper.ClientCnxn: Session 0x1659083ffc20002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2018-08-31 17:08:14,944 INFO [RS:0;host102:50466-SendThread(host102:61181)] zookeeper.ClientCnxn: Opening socket connection to server host102/x.x.x.x:61181. Will not attempt to authenticate using SASL (unknown error)
2018-08-31 17:08:14,944 WARN [RS:0;host102:50466-SendThread(host102:61181)] zookeeper.ClientCnxn: Session 0x1659083ffc20004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2018-08-31 17:08:15,536 INFO [main-SendThread(host102:61181)] zookeeper.ClientCnxn: Opening socket connection to server host102/x.x.x.x:61181. Will not attempt to authenticate using SASL (unknown error)
2018-08-31 17:08:15,536 WARN [main-SendThread(host102:61181)] zookeeper.ClientCnxn: Session 0x1659083ffc20002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2018-08-31 17:08:16,228 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: Master not initialized after 200000ms seconds
at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:230)
at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:445)
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:229)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:139)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2838)

Master Mentor

@lam rab

Cleaning up the AMS data will remove all the historical AMS data available.

Step-by-step guide

1. Using Ambari

a. Set AMS to maintenance mode

b. Stop AMS from Ambari

c. Identify the following from the AMS Configs screen

i. 'Metrics Service operation mode' (embedded or distributed)

ii. hbase.rootdir

iii. hbase.zookeeper.property.dataDir

2. AMS data is stored in the 'hbase.rootdir' identified above. Back up and remove the AMS data.

a. If the Metrics Service operation mode

i. is 'embedded', then the data is stored in OS files. Use regular OS commands to back up and remove the files in hbase.rootdir

ii. is 'distributed', then the data is stored in HDFS. Use 'hdfs dfs' commands to back up and remove the files in hbase.rootdir

3. Remove the AMS ZooKeeper data by backing up and removing the contents of 'hbase.tmp.dir'/zookeeper

4. Remove any Phoenix spool files from the 'hbase.tmp.dir'/phoenix-spool folder

5. Restart AMS using Ambari
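For embedded mode, the back-up-and-remove steps above might look like the sketch below. It assumes AMS is already stopped. `backup_and_clear` is a hypothetical helper, and the variables in the usage comment are placeholders to fill in with the values from your own AMS Configs screen.

```shell
# Copies a directory into a timestamped folder under a backup root,
# then empties the original directory while keeping the directory itself.
# Usage: backup_and_clear <dir> <backup_root>
backup_and_clear() {
  src=$1
  backup_root=$2
  [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
  dest="$backup_root/$(basename "$src").$(date +%Y%m%d%H%M%S)"
  mkdir -p "$backup_root" || return 1
  # Stop here if the copy fails, so nothing is deleted without a backup.
  cp -a "$src" "$dest" || return 1
  # Delete everything below $src (dotfiles included), not $src itself.
  find "$src" -mindepth 1 -delete
}

# Example (placeholders, not real paths):
#   backup_and_clear "$HBASE_TMP_DIR/zookeeper" "$BACKUP_ROOT"
#   backup_and_clear "$HBASE_TMP_DIR/phoenix-spool" "$BACKUP_ROOT"
```

The copy runs before the delete and the function returns early if it fails, so a full disk or a permissions problem does not leave you without both the data and the backup. In distributed mode the same order applies, but with 'hdfs dfs' commands as noted in step 2.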

HTH

Explorer

Hi all,

Thanks for the help. The problem was solved by cleaning up the AMS data, but I still don't understand why this happened after the upgrade.

Regards

Master Mentor

@lam rab

Did you do the post-upgrade steps mentioned in Migrate Ambari Metrics Data? Skipping them could probably be the cause. Always do your research before any upgrade so as not to miss any post-upgrade tasks.

Upgrades are NEVER all smooth, otherwise no fun 🙂
Please accept the answer to close the thread.