Support Questions


GSS exception error seen even after having a valid Kerberos ticket

Explorer

We have an application that ingests files from the local file system into HDFS in an AD Kerberos-enabled environment. It basically moves files from a local directory to an HDFS path. About 20 hours after the ingestion process starts, we see the error below appear randomly; after some time the error appears continuously, and finally no files are moved at all.

Error:

java.io.IOException: java.io.IOException: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "example1.com/xxxxx"; destination host is: "example2.com":8020;

We have the application running in two environments, Env-1 and Env-2.

The same ingestion process works fine without any error in Env-1, while in Env-2 we see the GSS exception.

The load and the number of incoming files differ between Env-1 and Env-2.

Env-1 - 5 files per day are moved to HDFS without any error, and the same process repeats every day.

Env-2 - 6000 files are moved to HDFS every 5 minutes, and the GSS exception appears after about 20 hours. The 6000 files are moved into HDFS from 42 different directories simultaneously, using a total of 150 threads. Up to 150 files can be moved at a time; once threads are released, they pick up the next files, and so the process goes on. A simplified sketch of the mover is shown below.
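To make the setup concrete, here is a rough, simplified sketch of how the mover is structured. It is not our exact code: class names, paths, and the principal are placeholders, and the real job may log in from a ticket cache (via kinit) rather than from a keytab.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

// Simplified sketch of the Env-2 ingestion: 150 worker threads copy files
// from 42 local directories into HDFS under a Kerberos login.
public class HdfsIngest {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Placeholder principal and keytab path.
        UserGroupInformation.loginUserFromKeytab("ingest@EXAMPLE.COM",
                "/etc/security/keytabs/ingest.keytab");

        FileSystem fs = FileSystem.get(conf);
        ExecutorService pool = Executors.newFixedThreadPool(150);

        // One task per file; in reality the files are picked up from 42 watched directories.
        for (int i = 0; i < 6000; i++) {
            final Path src = new Path("/data/in/dir" + (i % 42) + "/file" + i);
            final Path dst = new Path("/ingest/landing/");
            pool.submit(() -> {
                try {
                    // true = delete the local source after a successful copy
                    fs.copyFromLocalFile(true, src, dst);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
    }
}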

Can anyone comment on the following concerns:

1. Could this be related to load on the KDC server?

2. Are there any parameters on the AD server that restrict the number of TGTs that can be generated from the KDC at a time?

3. Could this be related to the Kerberos clock-skew tolerance? On the AD server, the maximum tolerance time is set to 5 minutes.

4. Please suggest whether any parameters need to be added in krb5.conf to handle the load and the huge number of requests coming into AD at a time.

We have already checked the following on the AD server and in Env-2:

1. The AD server and Env-2 clocks are in sync.

2. The Kerberos ticket is not expired. We have set up a cron job to renew the Kerberos ticket every 4 hours.

3. The ticket lifetimes are set in krb5.conf as follows:

renew_lifetime = 7d

ticket_lifetime = 24h

Can anyone suggest what the issue might be?

Thank you.

5 REPLIES

Super Collaborator

Are the machine names in the error log the expected ones? So this basically means env-2 is example1.com and example2.com is your HDFS master node (port 8020 should be the HDFS file service on the NameNode)?

  • Are all the issues related to the communication between env-2 and your NameNode, or do you have other hosts involved as well?
  • Does the process on env-1 start 5 times a day, or is it started once and continues to run (sleeping instead of terminating)?
  • Is the ticket renewal on env-1 identical to the ticket renewal on env-2?

I am just wondering whether your process on env-2 only picks up the ticket at start-up, and when that ticket expires it simply doesn't pick up the renewed one. If, after a restart of your processes on env-2, all authentication issues are gone for roughly the next 20 hours, this might be the case. And if the process on env-1 is started 5 times a day instead of running continuously, that might be why the issue does not occur on env-1.
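If your application uses Hadoop's UserGroupInformation API (an assumption on my side, I have not seen your code), one way to rule this out is to let the long-running process refresh its own credentials instead of relying only on the external cron renewal. A minimal sketch, to be called by the workers before or around each batch of copies:

import java.io.IOException;

import org.apache.hadoop.security.UserGroupInformation;

// Sketch: refresh the Kerberos credentials of a long-running client so it
// does not keep using only the ticket it read at start-up.
public class KerberosRefresher {

    public static void refresh() throws IOException {
        UserGroupInformation ugi = UserGroupInformation.getLoginUser();
        if (ugi.isFromKeytab()) {
            // Re-acquires a TGT from the keytab when the current one is close to expiry.
            ugi.checkTGTAndReloginFromKeytab();
        } else {
            // Re-reads the ticket cache, picking up the ticket renewed by your cron job.
            ugi.reloginFromTicketCache();
        }
    }
}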

Explorer

@Harald Berghoff

Thank you for responding. Please find my responses inline below, after the dashes:

Are the machine names in the error log the expected ones? So this basically means env-2 is example1.com and example2.com is your HDFS master node (port 8020 should be the HDFS file service on the NameNode)?

------ Yes, env-2 is example1.com and example2.com is the HDFS master node.

  • Are all the issues related to the communication between env-2 and your NameNode, or do you have other hosts involved as well? -------- Yes, other hosts are involved as well. The issue is seen on other nodes too.
  • Does the process on env-1 start 5 times a day, or is it started once and continues to run (sleeping instead of terminating)?
    --------- In env-1, the process runs once and continues to run. Only 5 files are moved to HDFS in a single day. We don't restart any process.
  • Is the ticket renewal on env-1 identical to the ticket renewal on env-2? ------ The ticket renewal is identical in both environments.

I am just wondering whether your process on env-2 only picks up the ticket at start-up, and when that ticket expires it simply doesn't pick up the renewed one. If, after a restart of your processes on env-2, all authentication issues are gone for roughly the next 20 hours, this might be the case. And if the process on env-1 is started 5 times a day instead of running continuously, that might be why the issue does not occur on env-1.

------ In env-1, the process runs continuously and we don't restart it.

------- In env-2, when the GSS issue pops up, we restart our process and all authentication issues are gone for roughly the next 20 hours.

My question here is why a restart is needed in env-2, while in env-1 everything works fine without any restart.

As I mentioned in my previous comment, the only difference between env-1 and env-2 is the load: the number of files moved to HDFS simultaneously is huge in env-2.


Please comment if you need any more information for analysis.

Thank you


Super Collaborator

My guess would be that this is a race condition on env-2, leading to a situation where your process doesn't really see the renewed ticket. Can you change the logging so that the threads log the ticket dates when they fail?
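A minimal sketch of what that logging could look like, using plain JAAS to read the default ticket cache (the login-module name and options are the usual ones for the Oracle/OpenJDK Krb5LoginModule and may need to be adapted to your environment):

import java.util.HashMap;
import java.util.Map;

import javax.security.auth.kerberos.KerberosTicket;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;
import javax.security.auth.login.LoginContext;

// Sketch: dump the validity dates of the tickets currently visible in the
// default credential cache; call this from a worker's catch block when the
// GSS error appears.
public class TicketCacheLogger {

    public static void logTicketDates() throws Exception {
        // Programmatic equivalent of a jaas.conf entry that only reads the
        // existing ticket cache (no keytab, no password prompt).
        Configuration jaasConf = new Configuration() {
            @Override
            public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
                Map<String, String> opts = new HashMap<>();
                opts.put("useTicketCache", "true");
                opts.put("doNotPrompt", "true");
                return new AppConfigurationEntry[] {
                    new AppConfigurationEntry(
                        "com.sun.security.auth.module.Krb5LoginModule",
                        AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
                        opts)
                };
            }
        };

        LoginContext lc = new LoginContext("ticket-check", null, null, jaasConf);
        lc.login();
        for (KerberosTicket t : lc.getSubject().getPrivateCredentials(KerberosTicket.class)) {
            System.out.println("ticket for " + t.getServer()
                    + " valid from " + t.getStartTime()
                    + " until " + t.getEndTime()
                    + " (renewable until " + t.getRenewTill() + ")");
        }
        lc.logout();
    }
}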

If it were a load issue on the KDC, you would see the error messages on other clients as well, and the errors should also occasionally go away again on their own. My assumption is that your threads hit the authentication error one by one, and once all of them are 'down' you see that no files are moved anymore.

Just to be sure (as mentioned, I don't think the root cause is here): check that the HDFS NameNode and your AD are also in time sync.

New Contributor

@vishakhaa9  @arald  Did you find any solution? We are also facing the same issue.

Community Manager

@sourabhhh Welcome to our community! As this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.

 



Regards,

Vidya Sargur,
Community Manager


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.