Community Articles
Find and share helpful community-sourced technical articles
Labels (1)

The HDFS Balancer program can be invoked to rebalance HDFS blocks when data nodes are added to or removed from the cluster. For more information about the HDFS balancer, see this HCC article.

Since Kerberos tickets are designed to expire, a common question that arises in secure clusters is whether one needs to account for ticket expiration (namely, TGT) when invoking long-running Balancer jobs. To cut to the chase: the answer depends on just how long the job takes to run. Let's discuss some background context (I am referencing Chris Nauroth's excellent answer on Stack Overflow as well as HDFS-9698 below).

The primary use case for Kerberos authentication in the Hadoop ecosystem is Hadoop's RPC framework, which uses SASL for authentication. Many daemon processes, i.e., non-interactive processes, call UserGroupInformation#loginUserFromKeytab at process startup, using a keytab to authenticate to the KDC.

Moreover, Hadoop implements an automatic re-login mechanism directly inside the RPC client layer. The code for this is visible in the RPC Client#handleSaslConnectionFailure method:

// try re-login
	if (UserGroupInformation.isLoginKeytabBased()) {
	} else if (UserGroupInformation.isLoginTicketBased()) {

However, the Balancer is not designed to be run as a daemon (in Hadoop 2.7.3, i.e., HDP 2.6 and earlier)! Please see HDFS-9804, which introduces this capability. With this in place, the Balancer would log in with a keytab and the above re-login mechanism would take care of everything.

Since the Balancer is designed to be run interactively, the assumption is that kinit has already run, and there is a TGT sitting in the ticket cache. Now we need to understand some Kerberos configuration settings, in particular the distinction between ticket_lifetime and renew_lifetime.

Every ticket, including TGTs, have a ticket_lifetime (usually around 18 hours), this strikes the balance between annoying the user by requiring they log in multiple times during their workday and mitigating the risk of TGTs being stolen (note there is separate support for preventing replay of authenticators).

A ticket can be renewed, but only up to its renew_lifetime (usually around 7 days). which it can be renewed to extend to a maximum value of the later. Since a TGT is generated by the user and provided to the balancer (which means in the balancer context, UserGroupInformation.isLoginTicketBased() == true), client#handleSaslConnectionFailure is behaving correctly on extending the ticket_lifetime.

But there's no way to extend beyond the renew_lifetime!

To summarize, if your balancer job is going to run longer than the configured renew_lifetime in your environment (a week by default), then you need to worry about ticket renewal (or you need HDFS-9804). Otherwise, you will be fine relying on the RPC framework to renew the TGT's ticket_lifetime (as long it doesn't eclipse the renew_lifetime).