
YARN TimelineServer error after cluster Kerberization: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)

New Contributor

Hello,

I have recently Kerberized our Hadoop cluster using Apache Ambari, and HDFS seems to be working fine. However, the YARN ResourceManager and the TimelineServer do not seem to be able to communicate:

 

yarn-yarn-resourcemanager-XXXXXX.log:

2020-03-16 22:58:27,382 ERROR metrics.SystemMetricsPublisher (SystemMetricsPublisher.java:putEntity(549)) - Error when publishing entity [YARN_APPLICATION,application_1581534326709_0003]
com.sun.jersey.api.client.ClientHandlerException: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: Authentication failed, URL: http://w.x.y.z:8188/ws/v1/timeline/?user.name=yarn, status: 403, message: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
        at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:237)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:186)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:250)
        at com.sun.jersey.api.client.Client.handle(Client.java:648)
        at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
        at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
        at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPostingObject(TimelineWriter.java:156)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:115)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:112)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPosting(TimelineWriter.java:112)
        at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.putEntities(TimelineWriter.java:92)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:348)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.putEntity(SystemMetricsPublisher.java:536)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.publishApplicationFinishedEvent(SystemMetricsPublisher.java:349)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.handleSystemMetricsEvent(SystemMetricsPublisher.java:254)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:564)
        at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:559)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: Authentication failed, URL: http://w.x.y.z:8188/ws/v1/timeline/?user.name=yarn, status: 403, message: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:481)
        at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:159)
        at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
        ... 24 more
Caused by: org.apache.hadoop.security.authentication.client.AuthenticationException: Authentication failed, URL: http://w.x.y.z:8188/ws/v1/timeline/?user.name=yarn, status: 403, message: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
        at org.apache.hadoop.security.authentication.client.AuthenticatedURL.extractToken(AuthenticatedURL.java:281)
        at org.apache.hadoop.security.authentication.client.PseudoAuthenticator.authenticate(PseudoAuthenticator.java:77)
        at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:133)
        at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:212)
        at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:133)
        at org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:216)
        at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.openConnection(DelegationTokenAuthenticatedURL.java:322)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:476)
        ... 26 more

Similarly, in yarn-yarn-timelineserver-XXXXXX.log:

2020-03-16 23:12:11,902 INFO  timeline.RollingLevelDBTimelineStore (RollingLevelDBTimelineStore.java:evictOldStartTimes(1440)) - Searching for start times to evict earlier than 1581736331902
2020-03-16 23:12:11,906 INFO  timeline.RollingLevelDBTimelineStore (RollingLevelDBTimelineStore.java:evictOldStartTimes(1496)) - Deleted 0/43 start time entities earlier than 1581736331902
2020-03-16 23:12:11,906 INFO  timeline.RollingLevelDB (RollingLevelDB.java:evictOldDBs(344)) - Evicting indexes-ldb DBs scheduled for eviction
2020-03-16 23:12:11,906 INFO  timeline.RollingLevelDB (RollingLevelDB.java:evictOldDBs(344)) - Evicting entity-ldb DBs scheduled for eviction
2020-03-16 23:12:11,907 INFO  timeline.RollingLevelDBTimelineStore (RollingLevelDBTimelineStore.java:discardOldEntities(1519)) - Discarded 0 entities for timestamp 1581736331902 and earlier in 0.005 seconds
2020-03-16 23:12:35,561 WARN  server.AuthenticationFilter (AuthenticationFilter.java:doFilter(588)) - Authentication exception: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
2020-03-16 23:13:36,501 WARN  server.AuthenticationFilter (AuthenticationFilter.java:doFilter(588)) - Authentication exception: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
2020-03-16 23:14:36,256 WARN  server.AuthenticationFilter (AuthenticationFilter.java:doFilter(588)) - Authentication exception: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
2020-03-16 23:15:37,280 WARN  server.AuthenticationFilter (AuthenticationFilter.java:doFilter(588)) - Authentication exception: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)

/etc/krb5.conf:

[libdefaults]
  rdns = false
  ignore_acceptor_hostname = true
  renew_lifetime = 7d
  forwardable = true
  default_realm = atlas.local
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
  default_ccache_name = /tmp/krb5cc_%{uid}
  #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
  #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

[logging]
  default = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log
  kdc = FILE:/var/log/krb5kdc.log

[realms]
  YYYYYY.local = {
    admin_server = XXXXXX
    kdc = XXXXXX
  }

I have been struggling with this for a couple of days. Any help would be appreciated.

Thank you

1 ACCEPTED SOLUTION

New Contributor

The problem was related to our multi-homed host configuration: in our cluster, the short hostname and the host FQDN were different. In such environments, it is important to make sure that _HOST in the Hadoop configurations translates to the correct name.

This page covers the issue in more detail, but in short, _HOST is by default substituted with InetAddress.getLocalHost().getCanonicalHostName().toLowerCase(), unless hadoop.security.dns.interface is set:

import java.net.InetAddress;

public class CheckHostResolution {
  public static void main(String[] args) {
    try {
      // Print the canonical name of the local host; this is what Hadoop
      // substitutes for _HOST when hadoop.security.dns.interface is unset.
      String s = InetAddress.getLocalHost().getCanonicalHostName();
      System.out.println(s);
    } catch (Exception ex) {
      System.err.println(ex);
    }
  }
}

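To try it (assuming a JDK is available on the host), compile and run it with javac CheckHostResolution.java followed by java CheckHostResolution; the single line it prints is what _HOST will expand to on that machine.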
 

 

Using this snippet, you can double-check what _HOST resolves to on a machine; it should match the hostnames in the keytab principals. In our case, since no DNS interface was specified in the configuration, _HOST resolved to the value of /etc/hostname, which was the short form (say, plaza instead of plaza.localdomain.com). However, in the keytabs generated by Ambari, the principals used the FQDN form, plaza.localdomain.com.
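
If the Hadoop client libraries are on the classpath, you can also let Hadoop itself perform the substitution. The sketch below uses SecurityUtil.getServerPrincipal(), the same helper the daemons use to expand _HOST at login time; HTTP/_HOST@YOUR.REALM is a placeholder pattern, so substitute the SPNEGO principal and realm of your own cluster:

import org.apache.hadoop.security.SecurityUtil;

public class CheckPrincipalExpansion {
  public static void main(String[] args) throws Exception {
    // Passing "0.0.0.0" (or null) as the hostname tells getServerPrincipal()
    // to substitute _HOST with the canonical hostname of the local machine,
    // the same way the daemons expand their principals from configuration.
    String principal = SecurityUtil.getServerPrincipal(
        "HTTP/_HOST@YOUR.REALM", "0.0.0.0");
    System.out.println(principal);
  }
}

The printed principal should appear verbatim in the service keytab; if it shows the short hostname while the keytab holds the FQDN form, you have reproduced this exact mismatch.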

 

Hence, what solved the problem was simply swapping the order of those names in the /etc/hosts file, which is used for resolution. It used to be:

 

192.168.100.101          plaza plaza.localdomain.com

 

And the problem was solved by changing it to:

 

192.168.100.101          plaza.localdomain.com plaza
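
After the change, re-running the CheckHostResolution snippet above should print plaza.localdomain.com (assuming name resolution consults /etc/hosts first, which is the usual nsswitch default), matching the host part of the Ambari-generated principals; the YARN daemons pick up the corrected name once restarted.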

 

Cheers.

 
