Support Questions
Find answers, ask questions, and share your expertise

Falcon fails after a few hours

Falcon fails after a few hours

Contributor

The falcon service fails after a few hours with a GSSException error saying it can't get a kerberos key. We have to restart the service after every few hours. After restarting, things are OK and we are able to run the jobs, but then after 3-4 hours, everything start going to an "UNKNOWN" state.

There's currently a JIRA https://issues.apache.org/jira/browse/FALCON-1595 that is related to what we have but my understanding is that the affected version is 0.8 and we're using 0.6 (HDP 2.4.2).

Based on the logs there's background timer that runs to check for the keys. I've lowered the ticket validity setting from 10 hours to 5 hours which makes me believe this will force falcon check the tickets. Our ticket expires after 24 hours so not sure why this will help, but I gave it a shot.

Here's the error log:

2016-10-21 14:56:42,992 INFO  - [Timer-2:] ~ Logging in user1 (CurrentUser:65)
2016-10-21 14:56:43,019 INFO  - [Timer-2:] ~ Creating FS impersonating user user1 (HadoopClientFactory:196)
2016-10-21 14:56:43,023 WARN  - [Timer-2:] ~ Exception encountered while connecting to the server :  (Client:685)
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:563)
	at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:378)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:732)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:728)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:727)
	at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:378)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1492)
	at org.apache.hadoop.ipc.Client.call(Client.java:1402)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy23.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:575)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
	at com.sun.proxy.$Proxy24.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2140)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2123)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:849)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:107)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:911)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:907)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:907)
	at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
	at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1655)
	at org.apache.falcon.cleanup.AbstractCleanupHandler.getAllLogs(AbstractCleanupHandler.java:102)
	at org.apache.falcon.cleanup.AbstractCleanupHandler.delete(AbstractCleanupHandler.java:143)
	at org.apache.falcon.cleanup.ProcessCleanupHandler.cleanup(ProcessCleanupHandler.java:39)
	at org.apache.falcon.service.LogCleanupService$CleanupThread.run(LogCleanupService.java:68)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
	at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
	at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
	at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
	at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
	... 40 more
2016-10-21 14:56:43,025 INFO  - [Timer-2:] ~ Exception while invoking getListing of class ClientNamenodeProtocolTranslatorPB over <falcon_host>/xxx.xxx.56.168:8020. Trying to fail over immediately. (RetryInvocationHandler:148)
java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "<falcon_host>/xxx.xxx.56.168"; destination host is: "<falcon_host>":8020; 
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:782)
	at org.apache.hadoop.ipc.Client.call(Client.java:1430)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy23.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:575)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
	at com.sun.proxy.$Proxy24.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2140)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2123)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:849)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:107)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:911)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:907)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:907)
	at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
	at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1655)
	at org.apache.falcon.cleanup.AbstractCleanupHandler.getAllLogs(AbstractCleanupHandler.java:102)
	at org.apache.falcon.cleanup.AbstractCleanupHandler.delete(AbstractCleanupHandler.java:143)
	at org.apache.falcon.cleanup.ProcessCleanupHandler.cleanup(ProcessCleanupHandler.java:39)
	at org.apache.falcon.service.LogCleanupService$CleanupThread.run(LogCleanupService.java:68)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
	at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:690)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
	at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:653)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:740)
	at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:378)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1492)
	at org.apache.hadoop.ipc.Client.call(Client.java:1402)
	... 28 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:563)
	at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:378)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:732)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:728)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:727)
	... 31 more
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
	at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
	at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
	at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
	at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
	... 40 more
2016-10-21 14:56:43,028 WARN  - [Timer-2:] ~ Exception encountered while connecting to the server :  (Client:685)
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
	at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
	at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:563)
	at org.apache.hadoop.ipc.Client$Connection.access$1900(Client.java:378)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:732)
	at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:728)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:727)
	at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:378)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1492)
	at org.apache.hadoop.ipc.Client.call(Client.java:1402)
	at org.apache.hadoop.ipc.Client.call(Client.java:1363)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy23.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:575)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
	at com.sun.proxy.$Proxy24.getListing(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2140)
	at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2123)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:849)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:107)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:911)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:907)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:907)
	at org.apache.hadoop.fs.Globber.listStatus(Globber.java:69)
	at org.apache.hadoop.fs.Globber.glob(Globber.java:217)
	at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1655)
	at org.apache.falcon.cleanup.AbstractCleanupHandler.getAllLogs(AbstractCleanupHandler.java:102)
	at org.apache.falcon.cleanup.AbstractCleanupHandler.delete(AbstractCleanupHandler.java:143)
	at org.apache.falcon.cleanup.ProcessCleanupHandler.cleanup(ProcessCleanupHandler.java:39)
	at org.apache.falcon.service.LogCleanupService$CleanupThread.run(LogCleanupService.java:68)
	at java.util.TimerThread.mainLoop(Timer.java:555)
	at java.util.TimerThread.run(Timer.java:505)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
	at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
	at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
	at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
	at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
	at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
	at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
	... 40 more