Yarn Delegation Token

Contributor

Hi Community,

I am facing a strange issue with a secure Hadoop setup. The challenge is the Kerberos ticket_lifetime, which is 10 hours here rather than the more common 24 hours. Please note that the ticket_lifetime limit cannot be changed right now.

The job runs on YARN, and every 10 hours YARN removes the application. I believe this is related to the delegation token not being renewed in a timely manner, but I don't see any reliable logs to confirm this.

My questions are:

Does this look to be related to the Kerberos ticket lifetime?

Is there a delegation token renewal for YARN, or only for HDFS?

Should I try to modify the value of dfs.namenode.delegation.token.renew-interval? Right now this value is set to 24h (86400000 ms).
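
For reference, the two durations mentioned in the question can be put side by side. This is only a minimal sketch of the arithmetic, assuming the property value is expressed in milliseconds; the variable names are hypothetical, and the comparison by itself does not show which setting, if any, causes the removal.

```python
from datetime import timedelta

# Values quoted in the question above.
ticket_lifetime = timedelta(hours=10)                # Kerberos ticket_lifetime
renew_interval = timedelta(milliseconds=86_400_000)  # dfs.namenode.delegation.token.renew-interval

# Express the renew interval in hours to compare it with the ticket lifetime.
renew_interval_hours = renew_interval / timedelta(hours=1)
ticket_expires_first = ticket_lifetime < renew_interval

print(f"renew-interval in hours: {renew_interval_hours:g}")
print(f"ticket lifetime is shorter than the renew interval: {ticket_expires_first}")
```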

1 ACCEPTED SOLUTION

Contributor

This turned out to be caused by a different issue in HAWQ 2.0.0. libhdfs3 and libyarn use the same Kerberos keyfile, and the application failed whenever there was no activity through libyarn for a long time, because in that case the login() function was not called.

The problem is documented in https://issues.apache.org/jira/browse/HAWQ-940

The issue has been identified and addressed.

Also, there was nothing in the ResourceManager logs apart from the container release messages.

REPLIES

Guru

Hello @Gagan Brahmi,

> Does this look to be related to the Kerberos ticket lifetime?

If this happens every time, exactly after 10 hours, then it does look related. But I'd rather confirm before concluding.

> Is there a delegation token renewal for yarn? Or is it just the hdfs?

Yes, there is a delegation token renewal mechanism in YARN.

As for clues on why YARN is removing the application every 10 hours, I'd look in the ResourceManager log for any warnings and errors around the application ID, especially around the 10-hour mark after job submission. Also, what does the application log say about this? Any errors or warnings?
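
The log check suggested above could be sketched roughly as follows. This is an illustration only: it assumes a log4j-style line layout (date, time, level, logger, message), the function name `suspicious_lines` is hypothetical, and the sample lines are made up rather than real ResourceManager output.

```python
import re
from typing import Iterable, List

# Assumed line layout: "<date> <time> <LEVEL> <logger>: <message>".
LEVEL_RE = re.compile(r"^\S+ \S+ (WARN|ERROR|FATAL)\b")


def suspicious_lines(lines: Iterable[str], app_id: str) -> List[str]:
    """Return WARN/ERROR/FATAL lines that mention the given application ID."""
    return [ln.rstrip("\n") for ln in lines if app_id in ln and LEVEL_RE.match(ln)]


# Made-up sample lines for illustration; real log content will differ.
sample = [
    "2016-08-01 02:00:00,000 INFO rm.Example: application_0000000000000_0001 submitted",
    "2016-08-01 12:00:01,000 WARN rm.Example: application_0000000000000_0001 token problem",
    "2016-08-01 12:00:02,000 ERROR rm.Example: application_0000000000000_0002 unrelated",
]

for line in suspicious_lines(sample, "application_0000000000000_0001"):
    print(line)
```

To run it against a real file, pass an open file object instead of `sample`; the path to the ResourceManager log depends on the installation. Comparing the timestamps of the matching lines with the submission time would show whether they cluster around the 10-hour mark.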
