Created 07-06-2016 03:33 PM
Hi,
I am having difficulties getting the ambari-metrics-collector to start. I have HBase running in distributed mode.
I have attached the ambari-metrics-collector.log: ambari-metrics-collectorlog.txt
I already tried the suggestions from this thread: https://community.hortonworks.com/questions/15818/ambari-metrics-collector-now-starting.html as well as the workaround for issue 6 here: https://cwiki.apache.org/confluence/display/AMBARI/Known+Issues
Any tips would be much appreciated.
Created 07-09-2016 06:28 PM
@Angel Kafazov Were you able to verify that the AMS keytabs work? Most of the config changes you performed were not needed, for example the changes to the ZooKeeper and znode settings. For distributed mode, the only config changes needed are these:
When you enable security through Ambari, the keytabs and principals are generated by Ambari and applied to the AMS configs.
Before looking into ambari-metrics-collector.log or ambari-metrics-monitor.out, make sure the ams-hbase daemon is up and running. If it is not, the connection timeouts are of no help, since they are expected. Based on the HBase logs posted, the HBase daemon tried to log in and failed, so we need to figure out why it failed. Note: if the collector was moved, the older keytabs would become invalid because the hostname changed, and they would have to be regenerated.
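One way to apply that ordering when reading the logs is to look for the Kerberos login failure first and treat the ZooKeeper connection errors as a symptom. A minimal Python sketch, assuming the marker strings are the ones that appear in the logs posted in this thread; the function name `triage_ams_log` and the returned labels are hypothetical and not part of Ambari:

```python
# Hypothetical helper: classify an AMS log excerpt using markers seen in this thread.
LOGIN_FAILURE_MARKER = "Login failure for"
CONNECTION_REFUSED_MARKER = "java.net.ConnectException: Connection refused"


def triage_ams_log(log_text):
    """Return a coarse label for the most likely root cause in log_text."""
    # A failed keytab login makes the ams-hbase daemon exit, so it takes
    # priority over the connection errors that follow from it.
    if LOGIN_FAILURE_MARKER in log_text:
        return "kerberos-login-failure"
    if CONNECTION_REFUSED_MARKER in log_text:
        return "connection-refused"
    return "unknown"
```

If the result is "kerberos-login-failure", the keytab is the thing to check before anything else.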
Example of keytab commands:
Created 07-07-2016 04:36 AM
Did you take note of no. 9, "zookeeper.znode.parent", and restart all the components?
Created 07-07-2016 06:02 AM
@Geoffrey Shelton Okot The document suggests setting zookeeper.znode.parent to the same value as the HBase service, which is somewhat incorrect for versions > 2.1.2.1. Prior to 2.2, the value was not set, so it defaulted to /hbase; this was OK because AMS started its own ZK. Post 2.2, AMS talks to the cluster ZK, and therefore the znode is set to /ams-hbase-(unsecure/secure).
The logs would indicate whether the TGT was acquired correctly or whether the problem is something totally different.
Created 07-07-2016 07:08 AM
Hi Geoffrey,
In HBase config I have:
zookeeper.znode.parent=/hbase-secure
In Ambari metrics:
zookeeper.znode.parent=/ams-hbase-secure
I am not sure if those must be the same.
Created 07-09-2016 10:29 AM
Hi Geoffrey,
I tried the doc and reinstalled the service, but it hangs while starting the metrics collector again. This is from the log file:
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-07 01:42:14,211 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-07 01:42:14,212 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-07 01:42:15,309 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-07 01:42:15,311 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
Created 07-09-2016 10:29 AM
Hi Geoffrey,
I reinstalled the service in Ambari, but it hung while trying to start the metrics collector again. This is the error from the log file:
2016-07-07 01:42:14,211 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-07 01:42:14,212 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
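A "Connection refused" on localhost:61181 usually means that nothing is accepting connections on that port, which can be confirmed from the collector host with a plain TCP connect. A minimal Python sketch; the function name `port_is_open` is hypothetical, and the host and port to pass in are the ones from the log:

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and attempts a TCP connect;
        # the context manager closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution errors.
        return False
```

Calling `port_is_open("localhost", 61181)` on the collector host would tell you whether the ZooKeeper that the log is trying to reach is listening at all.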
Created 07-07-2016 04:19 AM
@Angel Kafazov We need to look at the Master, RS and ZK logs to identify the issue. Can you upload them from /var/log/ambari-metrics-collector/?
Created 07-07-2016 02:38 PM
Created 07-07-2016 07:03 PM
Both the Master and RegionServer logs indicate a failure to log in to Kerberos.
Login failure for amshbasemaster/m2.DOMAIN@DOMAIN from keytab /etc/security/keytabs/ams-hbase.regionserver.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
Can you check whether you can manually log in with the AMS keytabs under /etc/security/keytabs/ams*.keytab?
Example: kinit -kt /etc/security/keytabs/ams-hbase.regionserver.keytab amshbasemaster/m2.DOMAIN@DOMAIN
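To avoid typos when copying the principal and keytab path out of the stack trace, the check commands can be derived from the log line itself. A minimal Python sketch, assuming the "Login failure for ... from keytab ..." format shown in the log and the MIT Kerberos `klist` and `kinit` tools; the function name `keytab_check_commands` is hypothetical:

```python
import re
import shlex

# Matches the Hadoop message "Login failure for <principal> from keytab <path>:".
LOGIN_FAILURE = re.compile(
    r"Login failure for (?P<principal>\S+) from keytab (?P<keytab>[^\s:]+)"
)


def keytab_check_commands(log_line):
    """Build klist/kinit commands for the principal and keytab named in log_line."""
    match = LOGIN_FAILURE.search(log_line)
    if match is None:
        return []
    principal = shlex.quote(match.group("principal"))
    keytab = shlex.quote(match.group("keytab"))
    return [
        # List the principals stored in the keytab, with timestamps.
        "klist -kt " + keytab,
        # Try to obtain a ticket for the principal using that keytab.
        "kinit -kt " + keytab + " " + principal,
    ]
```

Running the `klist -kt` command first shows whether the principal from the log is actually present in that keytab file before you try `kinit`.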
Created 07-07-2016 07:50 PM
I have attached the master, zk and rs logs.
hbase-ams-master-m2-trunkatedlog.txt
hbase-ams-regionserver-m2log.txt
There seems to be an authentication issue in the RS:
2016-07-06 11:42:43,688 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: Failed construction of Regionserver: class org.apache.hadoop.hbase.regionserver.HRegionServer
    at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2636)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:64)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2651)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2634)
    ... 5 more
Caused by: java.io.IOException: Login failure for amshbasemaster/m2.DOMAIN@DOMAIN from keytab /etc/security/keytabs/ams-hbase.regionserver.keytab: javax.security.auth.login.LoginException: Unable to obtain password from user
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:962)
    at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:246)
    at org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:386)
    at org.apache.hadoop.hbase.security.User.login(User.java:253)
    at org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:115)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.login(HRegionServer.java:612)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:550)
    ... 10 more
Caused by: javax.security.auth.login.LoginException: Unable to obtain password from user
    at com.sun.security.auth.module.Krb5LoginModule.promptForPass(Krb5LoginModule.java:897)
    at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:760)
    at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
    at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
    at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:953)
    ... 16 more
ZOOKEEPER:
ERROR [main] quorum.QuorumPeerConfig: Invalid configuration, only one server specified (ignoring)
MASTER:
2016-07-06 10:48:18,075 WARN [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Session 0x155bfd1eb150003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-06 10:48:20,103 INFO [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-06 10:48:20,104 WARN [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Session 0x155bfd1eb150003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-06 10:48:21,724 INFO [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:61181. Will not attempt to authenticate using SASL (unknown error)
2016-07-06 10:48:21,725 WARN [main-SendThread(localhost:61181)] zookeeper.ClientCnxn: Session 0x155bfd1eb150003 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2016-07-06 10:48:21,825 WARN [RS:0;m2:49385] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/rs/m2.tmaut.tlabsdata.com,49385,1467802057410
2016-07-06 10:48:21,826 ERROR [RS:0;m2:49385] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 4 attempts
2016-07-06 10:48:21,826 WARN [RS:0;m2:49385] regionserver.HRegionServer: Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/rs/m2.tmaut.tlabsdata.com,49385,1467802057410
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1345)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1334)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1403)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1079)
    at java.lang.Thread.run(Thread.java:745)
Created 07-08-2016 11:59 AM
So I removed the ambari-metrics service and added it again (moving to another node didn't work). I also made some changes:
- switched to distributed mode
- set zookeeper.znode.parent=/hbase-secure
- manually recreated ams.collector.keytab and zk.service.keytab due to authentication errors in the log
- changed hbase.zookeeper.property.clientPort to 2181 from 61181
- changed rootdir from local to HDFS
I think I am getting close, as AMS can now connect to ZooKeeper:
INFO org.apache.phoenix.query.ConnectionQueryServicesImpl: Successfull login to secure cluster!!
However, I am getting an error connecting to HBase:
WARN org.apache.hadoop.yarn.server.applicationhistoryservice.metrics.timeline.query.DefaultPhoenixDataSource: Unable to connect to HBase store using Phoenix. java.sql.SQLException: ERROR 103 (08004): Unable to establish connection.
I'll attach the logs in a comment.