Created on 03-16-2020 08:18 PM - last edited on 03-16-2020 10:28 PM by VidyaSargur
Hello,
I have recently Kerberized our Hadoop cluster using Apache Ambari, and HDFS seems to be working fine. However, the YARN ResourceManager and Timeline Server do not seem to be able to communicate:
yarn-yarn-resourcemanager-XXXXXX.log:
2020-03-16 22:58:27,382 ERROR metrics.SystemMetricsPublisher (SystemMetricsPublisher.java:putEntity(549)) - Error when publishing entity [YARN_APPLICATION,application_1581534326709_0003]
com.sun.jersey.api.client.ClientHandlerException: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: Authentication failed, URL: http://w.x.y.z:8188/ws/v1/timeline/?user.name=yarn, status: 403, message: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:237)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:186)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:250)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPostingObject(TimelineWriter.java:156)
at org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:115)
at org.apache.hadoop.yarn.client.api.impl.TimelineWriter$1.run(TimelineWriter.java:112)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.doPosting(TimelineWriter.java:112)
at org.apache.hadoop.yarn.client.api.impl.TimelineWriter.putEntities(TimelineWriter.java:92)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:348)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.putEntity(SystemMetricsPublisher.java:536)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.publishApplicationFinishedEvent(SystemMetricsPublisher.java:349)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher.handleSystemMetricsEvent(SystemMetricsPublisher.java:254)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:564)
at org.apache.hadoop.yarn.server.resourcemanager.metrics.SystemMetricsPublisher$ForwardingEventHandler.handle(SystemMetricsPublisher.java:559)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: org.apache.hadoop.security.authentication.client.AuthenticationException: Authentication failed, URL: http://w.x.y.z:8188/ws/v1/timeline/?user.name=yarn, status: 403, message: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:481)
at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:159)
at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
... 24 more
Caused by: org.apache.hadoop.security.authentication.client.AuthenticationException: Authentication failed, URL: http://w.x.y.z:8188/ws/v1/timeline/?user.name=yarn, status: 403, message: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
at org.apache.hadoop.security.authentication.client.AuthenticatedURL.extractToken(AuthenticatedURL.java:281)
at org.apache.hadoop.security.authentication.client.PseudoAuthenticator.authenticate(PseudoAuthenticator.java:77)
at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:133)
at org.apache.hadoop.security.authentication.client.KerberosAuthenticator.authenticate(KerberosAuthenticator.java:212)
at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:133)
at org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:216)
at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.openConnection(DelegationTokenAuthenticatedURL.java:322)
at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:476)
... 26 more
Similarly, in yarn-yarn-timelineserver-XXXXXX.log:
2020-03-16 23:12:11,902 INFO timeline.RollingLevelDBTimelineStore (RollingLevelDBTimelineStore.java:evictOldStartTimes(1440)) - Searching for start times to evict earlier than 1581736331902
2020-03-16 23:12:11,906 INFO timeline.RollingLevelDBTimelineStore (RollingLevelDBTimelineStore.java:evictOldStartTimes(1496)) - Deleted 0/43 start time entities earlier than 1581736331902
2020-03-16 23:12:11,906 INFO timeline.RollingLevelDB (RollingLevelDB.java:evictOldDBs(344)) - Evicting indexes-ldb DBs scheduled for eviction
2020-03-16 23:12:11,906 INFO timeline.RollingLevelDB (RollingLevelDB.java:evictOldDBs(344)) - Evicting entity-ldb DBs scheduled for eviction
2020-03-16 23:12:11,907 INFO timeline.RollingLevelDBTimelineStore (RollingLevelDBTimelineStore.java:discardOldEntities(1519)) - Discarded 0 entities for timestamp 1581736331902 and earlier in 0.005 seconds
2020-03-16 23:12:35,561 WARN server.AuthenticationFilter (AuthenticationFilter.java:doFilter(588)) - Authentication exception: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
2020-03-16 23:13:36,501 WARN server.AuthenticationFilter (AuthenticationFilter.java:doFilter(588)) - Authentication exception: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
2020-03-16 23:14:36,256 WARN server.AuthenticationFilter (AuthenticationFilter.java:doFilter(588)) - Authentication exception: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
2020-03-16 23:15:37,280 WARN server.AuthenticationFilter (AuthenticationFilter.java:doFilter(588)) - Authentication exception: org.apache.hadoop.security.authentication.client.AuthenticationException: GSSException: Failure unspecified at GSS-API level (Mechanism level: Checksum failed)
/etc/krb5.conf
[libdefaults]
rdns = false
ignore_acceptor_hostname = true
renew_lifetime = 7d
forwardable = true
default_realm = atlas.local
ticket_lifetime = 24h
dns_lookup_realm = false
dns_lookup_kdc = false
default_ccache_name = /tmp/krb5cc_%{uid}
#default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
#default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
[logging]
default = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
kdc = FILE:/var/log/krb5kdc.log
[realms]
YYYYYY.local = {
admin_server = XXXXXX
kdc = XXXXXX
}
I have been struggling with this for a couple of days. Any help would be appreciated.
Thank you
Created 04-02-2020 05:54 PM
The problem was related to multi-homed host configuration. In our cluster, the short hostname and the host's FQDN were different. In such environments, it is important to make sure that _HOST in Hadoop configurations translates to the correct name.
This page covers the issue in more detail, but in short, _HOST is by default substituted with InetAddress.getLocalHost().getCanonicalHostName().toLowerCase(), unless hadoop.security.dns.interface is set:
import java.net.InetAddress;

public class CheckHostResolution {
    public static void main(String[] args) {
        try {
            // Mirrors Hadoop's default _HOST substitution.
            String s = InetAddress.getLocalHost().getCanonicalHostName().toLowerCase();
            System.out.println(s);
        } catch (Exception ex) {
            System.err.println(ex);
        }
    }
}
Using this snippet, you can double-check what _HOST resolves to on a machine. It should match the principal names in the keytabs. In our case, since no DNS interface was set in the configuration, _HOST resolved to the value of /etc/hostname, which was the short form (say, plaza instead of plaza.localdomain.com). However, in the keytabs generated by Ambari, the principals used the FQDN form plaza.localdomain.com.
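For illustration, the substitution itself is essentially a string replacement of the _HOST token with the lowercased canonical hostname (a simplified sketch; the helper name below is hypothetical, not the actual Hadoop API):

```java
public class PrincipalSubstitution {
    // Hypothetical helper: replace _HOST with the lowercased canonical hostname,
    // roughly what Hadoop does when expanding service principal templates.
    static String substituteHost(String principal, String canonicalHostname) {
        return principal.replace("_HOST", canonicalHostname.toLowerCase());
    }

    public static void main(String[] args) {
        String template = "yarn/_HOST@EXAMPLE.COM";
        // If _HOST resolves to the short name, the result will not match
        // the FQDN-based principal stored in the keytab.
        System.out.println(substituteHost(template, "plaza"));
        System.out.println(substituteHost(template, "plaza.localdomain.com"));
    }
}
```

With the short name you get yarn/plaza@EXAMPLE.COM, which fails to match the keytab entry yarn/plaza.localdomain.com@EXAMPLE.COM, producing exactly the kind of GSS "Checksum failed" error above.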
Hence, what solved the problem was simply updating the order of those names in the /etc/hosts file, which is used for name resolution. That is, it used to be:
192.168.100.101 plaza plaza.localdomain.com
The problem was solved by changing it to:
192.168.100.101 plaza.localdomain.com plaza
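The ordering matters because the resolver treats the first name after the IP address in an /etc/hosts line as the canonical name, and any following names as aliases. A minimal sketch of that parsing rule (illustrative only; it splits a hosts-style line rather than consulting the real resolver):

```java
public class HostsLineCanonicalName {
    // For an /etc/hosts line, the first name after the IP address
    // is the canonical name; any remaining names are aliases.
    static String canonicalName(String hostsLine) {
        String[] fields = hostsLine.trim().split("\\s+");
        return fields.length > 1 ? fields[1] : null;
    }

    public static void main(String[] args) {
        // Before the fix: the short name is canonical, so _HOST gets the short form.
        System.out.println(canonicalName("192.168.100.101 plaza plaza.localdomain.com"));
        // After the fix: the FQDN is canonical, matching the keytab principals.
        System.out.println(canonicalName("192.168.100.101 plaza.localdomain.com plaza"));
    }
}
```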
Cheers.