Hard to say, but the timeout indicates that the client could not reach the KDC via UDP from that host. Could be firewall, DNS, etc.
UDP has packet-size restrictions that Active Directory tickets often exceed. Generally, the KDC will tell the client the response is too big and the client will retry over TCP, but on your one host it seems a connection to the KDC cannot be made at all. Firewall rules are the prime suspect, but a number of things could cause this.
Always using TCP is fine.
Thank you @bgooley. Will you be able to clarify the query below?
When comparing to another host (like below), does that mean it tried to connect with UDP first and then switched to TCP?
>>> KrbKdcReq send: kdc=****** UDP:88, timeout=3000, number of retries =3, #bytes=2247
>>> KDCCommunication: kdc=****** UDP:88, timeout=3000,Attempt =1, #bytes=2247
>>> KrbKdcReq send: #bytes read=104
>>> KrbKdcReq send: kdc=****** TCP:88, timeout=3000, number of retries =3, #bytes=2247
>>> KDCCommunication: kdc=****** TCP:88, timeout=3000,Attempt =1, #bytes=2247
>>>DEBUG: TCPClient reading 2722 bytes
>>> KrbKdcReq send: #bytes read=2722
If so, what could be the reason the affected host is not switching in a similar manner? And if the firewall is the cause, then how does it work when we add the parameter udp_preference_limit to connect with TCP?
First, firewalls can easily block UDP and allow TCP. I mentioned that was a possible cause.
Also, depending on how you have your /etc/krb5.conf configured, a different KDC could have been contacted.
You can see distinctly in the UDP failure that there is a socket timeout for each attempt to connect to the KDC. That is a failure at the networking level: the client cannot reach the server at all. Since no connection was ever made via UDP, there was no chance for the client to learn that it should try TCP. That "switching" is triggered, I believe, by a KRB5KRB_ERR_RESPONSE_TOO_BIG response from the KDC, so if no response arrives, no "switching" to TCP will occur.
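As for why udp_preference_limit helps: it tells the Kerberos client to skip UDP entirely for any message larger than the limit, so the broken UDP path is never attempted. A minimal sketch of the relevant /etc/krb5.conf fragment (setting the limit to 1 effectively forces TCP for all KDC traffic):
[libdefaults]
    # Any request larger than 1 byte goes straight to TCP,
    # bypassing UDP and the firewall problem entirely.
    udp_preference_limit = 1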
If you really want to get to the bottom of this, recreate the problem while capturing packets via tcpdump like this:
# tcpdump -i any -w ~/kerberos_broken.pcap port 88
Then, with the problem fixed, reproduce again while capturing packets:
# tcpdump -i any -w ~/kerberos_fixed.pcap port 88
Use Wireshark (it does a great job of decoding Kerberos packets) and you will be able to see the entire interaction.
This will show us information to help determine the cause.
Wireshark is here: https://www.wireshark.org/
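If you prefer to stay on the command line, tshark (Wireshark's CLI companion) can decode the same captures; for example, to print only the Kerberos packets from each file (file names match the tcpdump commands above):
# tshark -r ~/kerberos_broken.pcap -Y kerberos
# tshark -r ~/kerberos_fixed.pcap -Y kerberos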
I am having my own Kerberos problem, but I thought I'd share this in case it solves yours. Cloudera stores its own Kerberos keytabs in the runtime directory. See if you can authenticate against that keytab; if you cannot, the runtime keytab is not correct and you may have to redistribute the keytabs (which requires shutting down the roles).
Here is the info you need:
1) On a data node, the runtime keytab is located under /run/cloudera-scm-agent/process/XXX-DATANODE/, for example:
# ls -l */hdfs.keytab
-rw------- 1 hdfs hdfs 1570 Mar 14 23:25 166-hdfs-DATANODE/hdfs.keytab
-rw------- 1 hdfs hdfs 1570 Mar 15 20:28 197-hdfs-DATANODE/hdfs.keytab
-rw------- 1 hdfs hdfs 1570 Mar 15 21:33 203-hdfs-DATANODE/hdfs.keytab
-rw------- 1 hdfs hdfs 1570 Mar 16 18:07 207-hdfs-DATANODE/hdfs.keytab
2) Use kinit to authenticate against the keytab.
# kinit -kt hdfs.keytab user/host@realm
If you can successfully authenticate against that keytab, then your keytab is good. If not, you'll have to redistribute the keytabs. I hope this helps.
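As a quick sanity check before running kinit, klist can list the principals inside each runtime keytab without contacting the KDC at all. The paths follow the example above; adjust the glob for your own role directories:
# for kt in /run/cloudera-scm-agent/process/*-DATANODE/hdfs.keytab; do echo "== $kt =="; klist -kt "$kt"; done
If the principal or KVNO in a keytab doesn't match what the KDC expects, authentication will fail even though the file exists, which is another sign the keytabs need to be redistributed.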