I am launching HBase (1.1.2) on a Kerberized cluster (AD). The HBase region server fails to connect to the master with the following error:
2017-01-20 18:17:23,944 WARN [regionserver/a1.example.com/xxxxx] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.io.IOException: Couldn't setup connection for srvuser/a1.example.com@ADC.EXAMPLE.COM to srvuser/a2.example.com@ADC.EXAMPLE.COM
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
    at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2270)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed
    at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.readStatus(HBaseSaslRpcClient.java:153)
    at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:189)
I turned on detailed debug logging for both Kerberos and HBase. I can see that host a1 successfully obtains the service ticket for a2:
Found ticket for srvuser/a1.example.com@ADC.EXAMPLE.COM to go to krbtgt/ADC.EXAMPLE.COM@ADC.EXAMPLE.COM expiring on Sat Jan 21 04:17:10 PST 2017
Found ticket for srvuser/a1.example.com@ADC.EXAMPLE.COM to go to srvuser/a2.example.com@ADC.EXAMPLE.COM expiring on Sat Jan 21 04:17:10 PST 2017
Client Principal = srvuser/a1.example.com@ADC.EXAMPLE.COM
Server Principal = srvuser/a2.example.com@ADC.EXAMPLE.COM Session Key = EncryptionKey: keyType=23 keyBytes
I do not see any errors after the above lines in the detailed Kerberos logs, so I assume the "GSS initiate failed" problem is no longer a Kerberos issue; otherwise I would expect some error to be reported (for example, a corrupted ticket).
I notice that the "GSS initiate failed" message, reported without any details, is called out by experts as one of the most useless error messages (see Steve's "error messages to fear").
I have already verified that the unlimited-strength JCE policy files are present and that both hosts are using the same encryption algorithm. Can anyone help here, even if only with next steps I can take to debug this?
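For anyone else verifying the JCE policy, a quick sanity check is to ask the JVM itself (a sketch; the class name is mine, and it must be run with the same JVM that launches HBase, on both hosts):

```java
import javax.crypto.Cipher;

public class JcePolicyCheck {
    public static void main(String[] args) throws Exception {
        // With the unlimited-strength policy installed this is effectively
        // unbounded (Integer.MAX_VALUE); with the limited (export) policy it
        // is 128, which silently breaks AES-256 Kerberos tickets.
        int max = Cipher.getMaxAllowedKeyLength("AES");
        System.out.println("Max AES key length: " + max);
        System.out.println(max >= 256
                ? "Unlimited JCE policy is active"
                : "LIMITED policy in effect - install the unlimited JCE files");
    }
}
```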
Try turning on Java Kerberos logging by adding -Dsun.security.krb5.debug=true to HADOOP_OPTS. It usually helps to understand what exactly fails during GSS initialization.
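For example (a sketch; the file location and variable name depend on your distribution and on how the daemons are launched):

```shell
# In conf/hbase-env.sh (or the equivalent env script your distro uses):
export HBASE_OPTS="$HBASE_OPTS -Dsun.security.krb5.debug=true"
# For HDFS/YARN daemons the analogous setting is:
export HADOOP_OPTS="$HADOOP_OPTS -Dsun.security.krb5.debug=true"
```

The debug output goes to the daemon's stdout log, not the usual log4j files.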
@Sergey Soldatov - Thanks for the suggestion. I had already added that flag, and it only showed me that I have a valid service ticket (as mentioned above). The SecurityAuth.audit log of the HBase master shows the following error:
2017-01-20 18:17:08,221 WARN SecurityLogger.org.apache.hadoop.hbase.Server: Auth failed for x.y.z.q:55872:null
2017-01-20 18:17:11,857 WARN SecurityLogger.org.apache.hadoop.hbase.Server: Auth failed for x.y.z.q:46270:null
Not sure if this points to any problem. The IP for which auth fails above is the one where the HRegion server is running.
"I had added that flag already and that only showed me that I have a valid service ticket (as mentioned above)."
I would suggest sharing that information anyway. Sometimes the details printed by that option are subtle yet telling.
You can also try setting the log4j level to DEBUG or TRACE for org.apache.hadoop.hbase.ipc to see if there is more context there.
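As a sketch, the log4j fragment for that suggestion would look like this (the second logger is my addition to also capture the SASL handshake):

```properties
# conf/log4j.properties -- raise RPC logging verbosity
log4j.logger.org.apache.hadoop.hbase.ipc=TRACE
# Optionally also watch the security/SASL layer itself:
log4j.logger.org.apache.hadoop.hbase.security=TRACE
```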
Hi @Josh Elser - I have attached two logs (captured with the Kerberos debug flag) for the HMaster and the HRegion server. The full logs are quite large, so I removed some portions, such as the hex dumps of tickets. Do let me know if I should attach the full logs.
Some general comments: the "GSS initiate failed" error shows up even when the master and region server come up on the same host.
After enabling debug logs for HBase, the HMaster shows:
2017-01-20 18:17:11,699 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /srvuser/hbase/rs/a1.example.com,52412,1484889430172
2017-01-20 18:17:11,823 DEBUG [RpcServer.listener,port=42263] ipc.RpcServer: RpcServer.listener,port=42263: connection from 10.64.130.53:46270; # active connections: 1
2017-01-20 18:17:11,856 DEBUG [RpcServer.reader=2,bindAddress=a1.example.com,port=42263] ipc.RpcServer: Kerberos principal name is srvuser/a1.example.com@ADC.EXAMPLE.COM
2017-01-20 18:17:11,857 DEBUG [RpcServer.reader=2,bindAddress=a1.example.com,port=42263] ipc.RpcServer: Created SASL server with mechanism = GSSAPI
2017-01-20 18:17:11,857 DEBUG [RpcServer.reader=2,bindAddress=a1.example.com,port=42263] ipc.RpcServer: Have read input token of size 1824 for processing by saslServer.evaluateResponse()
2017-01-20 18:17:11,857 DEBUG [RpcServer.reader=2,bindAddress=a1.example.com,port=42263] ipc.RpcServer: RpcServer.listener,port=42263: Caught exception while reading:GSS initiate failed
2017-01-20 18:17:11,857 DEBUG [RpcServer.reader=2,bindAddress=a1.example.com,port=42263] ipc.RpcServer: RpcServer.listener,port=42263: DISCONNECTING client 10.64.130.53:46270 because read count=-1. Number of active connections: 1
Attachments: hbase-regionsvr-kerberos-output.txt, hmaster-kerberos-flag-output.txt
Hi @Josh Elser - I hope the logs I attached reveal the issue; I am not able to find anything amiss myself. Also, with a simple socket server test program I can successfully obtain a service ticket and send data back and forth, so I assume the issue is not on the Kerberos side. Is something on the HBase side messing things up, or perhaps the permissions of the user or user groups in HDFS or LDAP?
"GSS initiate failed" without any detailed error message is like a catch-all in the class com.sun.security.sasl.gsskerb.GssKrb5Server in the method evaluateResponse(byte responseData). There are many lines in that big try-catch that can throw this exception.
So, after attaching a debugger to the running HBase master process, I can clearly see that the stack trace leading to this GSSException actually originates in the method acceptSecContext(InputStream is, int mechTokenSize) in the class Krb5Context. The actual exception is a KrbException being rethrown as a GSSException. Assuming the JCE policy and encryption types on both nodes are appropriate, this points to a mismatch between the keytab on the master node and the details in the service ticket that the region server presents.
Hi @Josh Elser - I hope you are able to spot a problem in the data I uploaded. Is there any hint in it that corroborates my assessment above, or points to a different problem altogether? Any input is appreciated.
Nothing jumped out at me from your logs. Sorry I can't be of more help with the information you provided. You should consider exercising your HDP support contract if you need further assistance.