Hi everyone,
 
I’m encountering a critical issue with HBase 2.x: the RegionServer fails to connect to the Master, throwing a “GSS initiate failed” error.
 
Environment:
- Master node: host117
- RegionServer node: host121
- Kerberos security is enabled
 
To troubleshoot this, I’ve performed the following checks and fixes, all of which passed:
 
- Time Synchronization
 Clock skew across cluster nodes is only 8 seconds, well within Kerberos tolerance (typically ≤ 5 minutes).
 
- Hostname Resolution
 Added explicit entries in /etc/hosts for both host117 and host121 on each node so that hostname resolution works in both directions, to rule out Kerberos failures caused by DNS issues.
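
 For anyone who wants to repeat this check, here is a minimal Python sketch of a forward-plus-reverse lookup comparison. It only illustrates the idea; the function name check_forward_reverse and the injectable resolver arguments are my own choices, not anything from HBase or Kerberos.

```python
import socket


def check_forward_reverse(hostname,
                          resolve=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, reverse-resolve that IP, and compare.

    Returns (ip, primary_reverse_name, consistent). consistent is True when
    the original hostname appears as the primary reverse name or as one of
    its aliases (case-insensitive).
    """
    ip = resolve(hostname)
    primary, aliases, _addresses = reverse(ip)
    names = {primary.lower()} | {alias.lower() for alias in aliases}
    return ip, primary, hostname.lower() in names
```

 Calling check_forward_reverse("host117") and check_forward_reverse("host121") on each node shows whether the name the node resolves to is also the name it maps back to. A failed lookup raises OSError.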
 
- Network Connectivity
 Confirmed TCP connectivity to the Master’s RPC port using telnet host117 16000.
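
 The same reachability test can be scripted. This is a small sketch equivalent to the telnet check; can_connect is a hypothetical helper name, and a successful TCP connect only shows the port is reachable, not that the SASL/GSSAPI handshake will succeed.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

 For the setup described here, the call would be can_connect("host117", 16000), run from host121.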
 
- Kerberos Client Configuration (/etc/krb5.conf)
 Verified that the KDC is reachable and TGS requests succeed. Confirmed support for AES256 and AES128 encryption types, matching HBase requirements.
 
- JAAS Configuration Fix
 Added the Server login module in the JAAS config file. Ensured the critical parameters are set correctly: useKeyTab=true, a valid keyTab path, and the correct principal. Explicitly set useTicketCache=false to prevent the ticket cache from interfering with keytab-based authentication.
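
 To make the JAAS check repeatable, here is a rough Python sketch that pulls the options out of one JAAS section and flags the keytab-related ones. The stanza in SAMPLE_JAAS is a made-up illustration (the keytab path, principal, and realm are placeholders, not my actual values), and the helper names are hypothetical. The regex parsing is deliberately simple and would not handle values containing spaces.

```python
import re

# Hypothetical JAAS stanza; keytab path, principal, and realm are placeholders.
SAMPLE_JAAS = """
Server {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/path/to/hbase.keytab"
  useTicketCache=false
  principal="hbase/host121@EXAMPLE.COM";
};
"""


def jaas_options(text, section):
    """Return the key=value options of one JAAS section as a dict of strings."""
    pattern = r"\b%s\s*\{(.*?)\}\s*;" % re.escape(section)
    match = re.search(pattern, text, re.S)
    if match is None:
        raise KeyError(section)
    # Values may be bare (true) or double-quoted ("/path/to/file").
    pairs = re.findall(r'(\w+)\s*=\s*"?([^"\s;]+)"?', match.group(1))
    return dict(pairs)


def check_keytab_login(options):
    """Return a list of problems with the keytab options (empty if none found)."""
    problems = []
    if options.get("useKeyTab") != "true":
        problems.append("useKeyTab is not true")
    if not options.get("keyTab"):
        problems.append("keyTab is missing")
    if not options.get("principal"):
        problems.append("principal is missing")
    if options.get("useTicketCache", "false") != "false":
        problems.append("useTicketCache is not false")
    return problems
```

 Running check_keytab_login(jaas_options(text, "Server")) against the real file contents returns an empty list when all four settings look as described above. It does not verify that the keytab file exists or that it contains a key for the principal.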
 
- HBase Security Settings (hbase-site.xml)
 Confirmed hbase.security.authentication=kerberos. Validated that hbase.master.kerberos.principal and hbase.regionserver.kerberos.principal are configured correctly.
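
 The hbase-site.xml check can be sketched the same way. The XML in SAMPLE_SITE is an illustrative fragment with a placeholder realm, not a copy of my file, and read_properties and security_problems are hypothetical helper names. The principal check only looks for the service/host@REALM shape; it cannot tell whether the principal exists in the KDC.

```python
import xml.etree.ElementTree as ET

# Hypothetical hbase-site.xml fragment; the realm is a placeholder.
SAMPLE_SITE = """<configuration>
  <property>
    <name>hbase.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hbase.master.kerberos.principal</name>
    <value>hbase/_HOST@EXAMPLE.COM</value>
  </property>
  <property>
    <name>hbase.regionserver.kerberos.principal</name>
    <value>hbase/_HOST@EXAMPLE.COM</value>
  </property>
</configuration>"""

PRINCIPAL_KEYS = (
    "hbase.master.kerberos.principal",
    "hbase.regionserver.kerberos.principal",
)


def read_properties(xml_text):
    """Parse Hadoop-style <property> elements into a name -> value dict."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): prop.findtext("value", "").strip()
        for prop in root.iter("property")
    }


def security_problems(props):
    """Return a list of problems with the Kerberos settings (empty if none found)."""
    problems = []
    if props.get("hbase.security.authentication") != "kerberos":
        problems.append("hbase.security.authentication is not kerberos")
    for key in PRINCIPAL_KEYS:
        value = props.get(key, "")
        if "/" not in value or "@" not in value:
            problems.append(key + " is not of the form service/host@REALM")
    return problems
```

 With the real file, security_problems(read_properties(open(path).read())) should return an empty list if the three settings are present in the expected shape.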
 
- Kerberos Ticket Acquisition & Validation
 Successfully obtained tickets using kinit -kt <keytab> <principal>. Verified ticket validity, service principal, and encryption type via klist.
 
- Ticket Cache Cleanup
 Ran kdestroy to clear any stale tickets that might cause conflicts.
 
 
Despite all the above checks passing, the issue persists.
Has anyone else encountered a similar “GSS initiate failed” error?
Any suggestions on what I might have missed or additional debugging steps would be greatly appreciated!