Cannot start Ambari-metrics-collector

Contributor

Hi,

I am having difficulties getting the ambari-metrics-collector to start. I have HBase running in distributed mode.

I have attached the ambari-metrics-collector.log (ambari-metrics-collectorlog.txt).

I already tried the suggestions from this thread: https://community.hortonworks.com/questions/15818/ambari-metrics-collector-now-starting.html as well as the workaround for issue 6 here https://cwiki.apache.org/confluence/display/AMBARI/Known+Issues

Any tips would be much appreciated.

1 ACCEPTED SOLUTION

Super Collaborator

@Angel Kafazov Were you able to verify that the AMS keytabs work? Most of the config changes performed above were not needed, for example the changes to the ZooKeeper and znode settings. For distributed mode, the only config changes needed are these:

https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.1.0/bk_ambari_reference_guide/content/_configur...

When you enable security through Ambari, the keytabs and principals are generated by Ambari and applied to the AMS configs.

Before looking into ambari-metrics-collector.log or ambari-metrics-monitor.out, make sure the ams-hbase daemon is up and running. If it is not, the connection timeouts are expected and tell us nothing. Based on the HBase logs posted, the HBase daemon tried to log in and failed, so we need to figure out why it failed. Note: if the collector was moved to another host, the older keytabs would become invalid because the hostname changed, and they would have to be regenerated.

Example of keytab commands:

http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP1/HDP-1.2.0/bk_installing_manually_book/...


24 REPLIES

Master Mentor

@Angel Kafazov

Did you take note of no. 9, "zookeeper.znode.parent", and restart all the components?

Super Collaborator

@Geoffrey Shelton Okot The document suggests setting zookeeper.znode.parent to the same value as the HBase service, which is somewhat incorrect for versions > 2.1.2.1. Prior to 2.2, the value was not set, so it defaulted to /hbase; this was fine because AMS started its own ZK. From 2.2 onward, AMS talks to the cluster ZK, so the znode is set to /ams-hbase-(unsecure/secure).

The logs would indicate whether the TGT was acquired correctly or whether the problem is something entirely different.

Contributor

Hi Geoffrey,

In HBase config I have:

zookeeper.znode.parent=/hbase-secure

In Ambari metrics:

zookeeper.znode.parent=/ams-hbase-secure

I am not sure if those must be the same.
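Going by the reply above, the two values are not expected to match on Ambari 2.2 and later: AMS keeps its own znode, separate from the cluster HBase one. A minimal Python sketch for reading the property out of a Hadoop-style site file so the two settings can be compared side by side. The helper name `read_property` and the inline XML sample are illustrative assumptions, not part of Ambari:

```python
import xml.etree.ElementTree as ET

def read_property(site_xml: str, name: str):
    """Return the value of a property from Hadoop-style *-site.xml text, or None."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

# Illustrative sample only; on a real host, load the AMS HBase site file contents instead.
SAMPLE_AMS_HBASE_SITE = """
<configuration>
  <property>
    <name>zookeeper.znode.parent</name>
    <value>/ams-hbase-secure</value>
  </property>
</configuration>
"""

ams_znode = read_property(SAMPLE_AMS_HBASE_SITE, "zookeeper.znode.parent")
print(ams_znode)
```

Running the same helper against the cluster HBase site file would show /hbase-secure in the setup described in this post.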

Contributor

Hi Geoffrey,

I tried the doc and reinstalled the service, but it hangs again while starting the metrics collector. This is from the log file:

at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-07 01:42:14,211 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-07 01:42:14,212 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-07 01:42:15,309 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-07 01:42:15,311 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)

Contributor

Hi Geoffrey,

I reinstalled the service in Ambari, but it hung again while trying to start the metrics collector. This is the error from the log file:

2016-07-07 01:42:14,211 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-07 01:42:14,212 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
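
The repeated "Connection refused" for localhost/127.0.0.1:61181 means nothing is accepting connections at that address, which fits the ZooKeeper that AMS expects there not being up. A small Python sketch to confirm this from the collector host. `port_is_open` is a hypothetical helper name, and the host and port are simply the ones from the log above:

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

if __name__ == "__main__":
    # Address taken from the ZooKeeper client log lines in this post.
    print(port_is_open("localhost", 61181))
```

A False result here only says the port is closed; the reason still has to come from the ams-hbase daemon logs.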

Super Collaborator

@Angel Kafazov We need to look at the Master, RS and ZK logs to identify the issue. Can you upload them from /var/log/ambari-metrics-collector/ ?

Super Collaborator

Both the Master and RegionServer logs indicate a failure to log in to Kerberos.

Login failure for amshbasemaster/m2.DOMAIN@DOMAIN from keytab /etc/security/keytabs/ams-hbase.regionserver.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user

Can you check whether you can manually log in with the AMS keytabs under /etc/security/keytabs/ams*.keytab?

Example: kinit -kt /etc/security/keytabs/ams-hbase.regionserver.keytab amshbasemaster/m2.DOMAIN@DOMAIN
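
In practice, "Unable to obtain password from user" from Krb5LoginModule often comes down to the keytab being missing, unreadable by the service user, or not containing the requested principal. A hedged Python sketch of a file-level pre-check to run before the kinit test. The helper names `keytab_problems` and `kinit_command` are hypothetical, and the path and principal are the ones quoted from the log:

```python
import os

def keytab_problems(path: str) -> list:
    """Return a list of obvious file-level problems with a keytab (empty if none found)."""
    problems = []
    if not os.path.isfile(path):
        problems.append("keytab file does not exist")
        return problems
    if not os.access(path, os.R_OK):
        problems.append("keytab is not readable by the current user")
    if os.path.getsize(path) == 0:
        problems.append("keytab file is empty")
    return problems

def kinit_command(keytab: str, principal: str) -> list:
    """Build the kinit invocation from the example above as an argument list."""
    return ["kinit", "-kt", keytab, principal]

keytab = "/etc/security/keytabs/ams-hbase.regionserver.keytab"
principal = "amshbasemaster/m2.DOMAIN@DOMAIN"
print(keytab_problems(keytab))
print(" ".join(kinit_command(keytab, principal)))
```

Readability depends on who runs the check, so it is worth running it as the user that owns the AMS HBase process. Listing the keytab with `klist -kt <keytab>` would additionally show whether the principal from the log is actually stored in that file.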

Contributor

I have attached the master, zk and rs logs.

hbase-ams-master-m2-trunkatedlog.txt

hbase-ams-zookeeper-m2log.txt

hbase-ams-regionserver-m2log.txt

There seems to be an authentication issue in the RS:

2016-07-06 11:42:43,688 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: Failed construction of Regionserver: class org.apache.hadoop.hbase.regionserver.HRegionServer
at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2636)
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:64)
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2651)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2634)
... 5 more
Caused by: java.io.IOException: Login failure for amshbasemaster/m2.DOMAIN@DOMAIN from keytab /etc/security/keytabs/ams-hbase.regionserver.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:962)
at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:246)
at org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:386)
at org.apache.hadoop.hbase.security.User.login(User.java:253)
at org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:115)
at org.apache.hadoop.hbase.regionserver.HRegionServer.login(HRegionServer.java:612)
at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:550)
... 10 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:897)
at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:760)
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:953)
... 16 more
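
With a nested trace like this one, the last "Caused by:" line is usually the one to act on. A small Python sketch that pulls it out of a log excerpt. `innermost_cause` is a hypothetical helper, and the sample text is abbreviated from the trace above:

```python
def innermost_cause(log_text: str):
    """Return the text of the last 'Caused by:' line, or None if there is none."""
    marker = "Caused by:"
    cause = None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(marker):
            # Keep overwriting so the deepest cause wins.
            cause = stripped[len(marker):].strip()
    return cause

# Abbreviated from the RegionServer trace in this post.
SAMPLE = """java.lang.RuntimeException: Failed construction of Regionserver
Caused by: java.lang.reflect.InvocationTargetException
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
"""

print(innermost_cause(SAMPLE))
```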

ZOOKEEPER:

ERROR [main] quorum.QuorumPeerConfig: Invalid configuration, only one server specified (ignoring)

MASTER:

2016-07-06 10:48:18,075 WARN  [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Session 0x155bfd1eb150003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-06 10:48:20,103 INFO  [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-06 10:48:20,104 WARN  [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Session 0x155bfd1eb150003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-06 10:48:21,724 INFO  [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-06 10:48:21,725 WARN  [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Session 0x155bfd1eb150003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-06 10:48:21,825 WARN  [RS:0;m2:49385] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/rs/m2.tmaut.tlabsdata.com,49385,1467802057410
2016-07-06 10:48:21,826 ERROR [RS:0;m2:49385] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 4 attempts
2016-07-06 10:48:21,826 WARN  [RS:0;m2:49385] regionserver.HRegionServer: Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/rs/m2.tmaut.tlabsdata.com,49385,1467802057410
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1345)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1334)
at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1403)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1079)
at java.lang.Thread.run(Thread.java:745)

Contributor

So I removed the ambari-metrics service and added it again (moving to another node didn't work). I also made some changes:

- switched to distributed mode

- modified zookeeper.znode.parent=/hbase-secure

- manually recreated ams.collector.keytab and zk.service.keytab due to authentication errors in the log

- changed hbase.zookeeper.property.clientPort to 2181 from 61181

- changed rootdir from local to HDFS

I think I am getting close, as AMS can connect to ZooKeeper:

INFO org.apache.phoenix.query.ConnectionQueryServicesImpl: Successfull login to secure cluster!!

However, I am getting an error connecting to HBase:

WARN org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.query.DefaultPhoenixDataSource: Unable to connect to HBase store using Phoenix.
java.sql.SQLException: ERROR 103 (08004): Unable to establish connection.

I'll attach the logs in a comment.