Created on 12-26-2017 09:14 AM - edited 09-16-2022 05:40 AM
We have an application that ingests files from the local file system into HDFS in an AD Kerberos-enabled environment. It basically moves files from a local directory to an HDFS path. Once the ingestion process starts, after about 20 hours we see the error below appear randomly; after some time the error appears continuously, and finally no files are moved at all.
Error:
java.io.IOException: java.io.IOException: java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "example1.com/xxxxx"; destination host is: "example2.com":8020;
We have the application running in two environments, Env-1 and Env-2.
The same ingestion process works fine without any error in Env-1, while in Env-2 we see the GSS exception.
The load and the rate of incoming files differ between Env-1 and Env-2:
Env-1 - 5 files per day are moved to HDFS, without any error, and the same process runs every day.
Env-2 - About 6000 files are moved to HDFS every 5 minutes, and the GSS exception appears after roughly 20 hours. The 6000 files are moved into HDFS from 42 different directories simultaneously, using a total of 150 threads. At any time 150 files can be moved in parallel; as soon as a thread is released it picks up the next file, so the process runs continuously. A rough sketch of this pattern is shown below.
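The sketch below only illustrates the pattern described above and is not our actual code; the class name, paths, and pool size are placeholders:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch of the ingestion pattern: a fixed pool of worker threads,
// each task moving one local file into HDFS.
public class LocalToHdfsMover {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem hdfs = FileSystem.get(conf);        // on a secured cluster this needs a valid Kerberos login

        ExecutorService pool = Executors.newFixedThreadPool(150);   // 150 parallel moves, as in Env-2

        // In the real application the work items come from 42 watched directories;
        // here one hard-coded file stands in for a single unit of work.
        pool.submit(() -> {
            try {
                // delSrc=true deletes the local source, i.e. a "move" rather than a copy
                hdfs.copyFromLocalFile(true,
                        new Path("file:///data/incoming/file1.csv"),
                        new Path("/user/ingest/landing/file1.csv"));
            } catch (Exception e) {
                e.printStackTrace();
            }
        });

        pool.shutdown();
    }
}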
Can anyone comment on the following concerns:
1. Could this be related to load on the KDC server?
2. Are there any parameters on the AD server that restrict how many TGTs the KDC will issue at a time?
3. Could this be related to the Kerberos clock-skew tolerance? On the AD server, the maximum tolerance is set to 5 minutes.
4. Please suggest any parameters that should be added to krb5.conf to handle the load and the large number of requests hitting AD at a time.
We have already checked the following on the AD server and in Env-2:
1. The AD server and Env-2 are in time sync.
2. The Kerberos ticket is not expired; a cron job renews the ticket every 4 hours.
3. The ticket lifetimes are set in krb5.conf accordingly:
renew_lifetime = 7d
ticket_lifetime = 24h
Can anyone suggest what the issue might be?
Thank you.
Created 12-26-2017 11:29 AM
Are the machine names in the error log the expected ones? That would mean env-2 is example1.com and example2.com is your HDFS master node (port 8020 should be the HDFS service on the NameNode)?
I am wondering whether your process on env-2 only obtains the ticket at start-up, and when that ticket expires it simply doesn't pick up the renewed one. If, after a restart of your processes on env-2, all authentication issues are gone for roughly the next 20 hours, this might be the case. And if on env-1 the process starts 5 times a day instead of running continuously, that might be why the issue does not occur on env-1. One possible way to rule this out is sketched below.
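In case it helps, a common pattern for a long-running Hadoop client is to log in from a keytab and re-login on a schedule, so the worker threads never depend on a ticket cache renewed outside the JVM. This is only a minimal sketch, assuming a keytab is available to the process; the principal, keytab path, and schedule are placeholders, not your application's actual setup:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch only: log in from a keytab once, then re-check periodically so that the
// worker threads never rely on an externally renewed ticket cache.
public class KeytabRelogin {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // assumes hadoop.security.authentication=kerberos in core-site.xml
        UserGroupInformation.setConfiguration(conf);

        // Placeholder principal and keytab path.
        UserGroupInformation.loginUserFromKeytab("ingest@EXAMPLE.COM",
                "/etc/security/keytabs/ingest.keytab");

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // Re-acquires a TGT from the keytab when the current one is close to expiry.
                UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 1, 1, TimeUnit.HOURS);
    }
}

Whether this applies depends on how your application authenticates today (ticket cache vs. keytab).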
Created 12-26-2017 12:01 PM
Thank you for responding. Please find my responses inline below:
Are the machine names in the error log the expected ones? That would mean env-2 is example1.com and example2.com is your HDFS master node (port 8020 should be the HDFS service on the NameNode)?
------ Yes, Env-2 is example1.com and example2.com is the HDFS master node.
I am wondering whether your process on env-2 only obtains the ticket at start-up, and when that ticket expires it simply doesn't pick up the renewed one. If, after a restart of your processes on env-2, all authentication issues are gone for roughly the next 20 hours, this might be the case. And if on env-1 the process starts 5 times a day instead of running continuously, that might be why the issue does not occur on env-1.
------ In Env-1, the process runs continuously and we do not restart it.
------ In Env-2, when the GSS issue pops up, we restart our process and all authentication issues are gone for roughly the next 20 hours.
My question is why a restart is needed in Env-2, when everything in Env-1 works fine without any restart.
As mentioned in my previous comment, the only difference between Env-1 and Env-2 is the load: the number of files moved to HDFS simultaneously is much larger in Env-2.
Please let me know if you need any more information for the analysis.
Thank you
Created 12-27-2017 07:43 AM
My guess would be that this is a race condition on env-2, leading to a situation where your process doesn't actually see the renewed ticket. Can you change the logging so that the threads log the ticket dates when they fail? One way to do that is sketched at the end of this post.
If it were a load issue on the KDC, you would see the error messages on other clients as well, and it should also occasionally go away again on its own. My assumption is that your threads hit the authentication error one by one, and once all of them are 'down' you see that no files are moved anymore.
Just to be sure (as mentioned, I don't think the root cause lies here): check that the HDFS NameNode and your AD are in time sync as well.
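On the logging suggestion: one possible way for a thread to dump the ticket dates it actually sees when a GSS failure occurs is to read the ticket cache through a throwaway JAAS login. This is only a sketch and assumes the default credential cache and the Oracle/OpenJDK Krb5LoginModule:

import java.util.HashMap;
import java.util.Map;

import javax.security.auth.Subject;
import javax.security.auth.kerberos.KerberosTicket;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;
import javax.security.auth.login.LoginContext;

// Sketch only: read the existing ticket cache via a throwaway JAAS login and
// print the start/end/renew-until times of every ticket found in it.
public class TicketDiagnostics {

    public static void logTicketDates() throws Exception {
        Map<String, String> options = new HashMap<>();
        options.put("useTicketCache", "true");   // use the cache that the cron job renews
        options.put("doNotPrompt", "true");      // never fall back to a password prompt

        Configuration jaasConf = new Configuration() {
            @Override
            public AppConfigurationEntry[] getAppConfigurationEntry(String name) {
                return new AppConfigurationEntry[] {
                    new AppConfigurationEntry(
                            "com.sun.security.auth.module.Krb5LoginModule",
                            AppConfigurationEntry.LoginModuleControlFlag.REQUIRED,
                            options)
                };
            }
        };

        LoginContext lc = new LoginContext("ticket-diagnostics", null, null, jaasConf);
        lc.login();
        Subject subject = lc.getSubject();

        for (KerberosTicket ticket : subject.getPrivateCredentials(KerberosTicket.class)) {
            System.out.println("server=" + ticket.getServer()
                    + " start=" + ticket.getStartTime()
                    + " end=" + ticket.getEndTime()
                    + " renewUntil=" + ticket.getRenewTill());
        }
    }
}

If the end time printed at the moment of failure is already in the past even though the cron kinit ran, that would point at the JVM holding on to a stale TGT rather than at load on the KDC.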
Created 06-27-2023 10:08 PM
@vishakhaa9 @arald Did you find any solution? We are also facing the same issue.
Created 06-29-2023 09:53 PM
@sourabhhh Welcome to our community! As this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.
Regards,
Vidya Sargur,