NameNode not starting after Kerberos setup on an HDP 2.6 cluster

Contributor
I installed MIT Kerberos on one Linux server, and we tried to kerberize our dev cluster through Ambari's automated wizard.
Ambari created the principals for every node (3 DataNodes, 2 NameNodes, and one edge node), and I can see them in the KDC.
On the last step, starting all the services failed: the NameNode services are not coming up.
Before doing this on our dev cluster I performed the same steps on the Sandbox, and it worked.

But there is one difference on the cluster: it is an HA cluster, and each node has two IPs, an external IP we use for SSH logins and an internal IP used for internode communication over InfiniBand.

NameNode error message:

2018-04-01 16:19:26,580 - call['hdfs haadmin -ns ABCHADOOP01 -getServiceState nn2'] {'logoutput': True, 'user': 'hdfs'}
18/04/01 16:19:28 INFO ipc.Client: Retrying connect to server: c1master02-nn.abc.corp/29.6.6.17:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
Operation failed: Call From c1master01-nn.abc.corp/29.6.6.16 to c1master02-nn.abc.corp:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
2018-04-01 16:19:28,783 - call returned (255, '18/04/01 16:19:28 INFO ipc.Client: Retrying connect to server: c1master02-nn.abc.corp/29.6.6.16:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)\nOperation failed: Call From c1master01-nn.abc.corp/29.6.6.16 to c1master02-nn.abc.corp:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused')
2018-04-01 16:19:28,783 - NameNode HA states: active_namenodes = [], standby_namenodes = [], unknown_namenodes = [('nn1', 'c1master01-nn.abc.corp:50070'), ('nn2', 'c1master02-nn.abc.corp:50070')]
2018-04-01 16:19:28,783 - Will retry 2 time(s), caught exception: No active NameNode was found.. Sleeping for 5 sec(s)
2018-04-01 16:19:33,787 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl --negotiate -u : -s '"'"'http://c1master01-nn.abc.corp:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpKVcTXy 2>/tmp/tmpy6hgoj''] {'quiet': False}
2018-04-01 16:19:33,837 - call returned (7, '')
2018-04-01 16:19:33,837 - Getting jmx metrics from NN failed. URL: http://c1master01-nn.abc.corp:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
    _, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
  File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
    raise ExecutionFailed(err_msg, code, files_output[0], files_output[1])
ExecutionFailed: Execution of 'curl --negotiate -u : -s 'http://c1master01-nn.abc.corp:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmpKVcTXy 2>/tmp/tmpy6hgoj' returned 7. 
2018-04-01 16:19:33,837 - call['hdfs haadmin -ns ABCHADOOP01 -getServiceState nn1'] {'logoutput': True, 'user': 'hdfs'}
Command failed after 1 tries
- From each node I am able to run kadmin and list/add principals.
- I SSHed to the NameNode and tried to obtain a ticket; that also worked.
abc># kinit  -kt /etc/security/keytabs/nn.service.keytab nn/c1master01-nn.abc.corp@ABCHDP.COM
abc># klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: nn/c1master01-nn.abc.corp@ABCHDP.COM
Valid starting     Expires            Service principal
04/01/18 16:03:42  04/02/18 16:03:42  krbtgt/ABCHDP.COM@ABCHDP.COM
        renew until 04/01/18 16:03:42

Since the cluster is empty, I also tried hadoop namenode -format, but got the issue below:

java.io.IOException: Login failure for nn/c1master01-nn.abc.corp@ABCHDP.COM from keytab /etc/security/keytabs/nn.service.keytab: javax.security.auth.login.LoginException: Receive timed out
        at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:1098)
        at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:307)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1160)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1631)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1769)
Caused by: javax.security.auth.login.LoginException: Receive timed out
        at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:808)
        at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
        at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
        at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
        at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
        at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
        at org.apache.hadoop.security.UserGroupInformation.loginUserFromKeytab(UserGroupInformation.java:1089)
        ... 4 more
Caused by: java.net.SocketTimeoutException: Receive timed out
        at java.net.PlainDatagramSocketImpl.receive0(Native Method)
        at java.net.AbstractPlainDatagramSocketImpl.receive(AbstractPlainDatagramSocketImpl.java:143)
        at java.net.DatagramSocket.receive(DatagramSocket.java:812)
        at sun.security.krb5.internal.UDPClient.receive(NetClient.java:206)
        at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:411)
        at sun.security.krb5.KdcComm$KdcCommunication.run(KdcComm.java:364)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.security.krb5.KdcComm.send(KdcComm.java:348)
        at sun.security.krb5.KdcComm.sendIfPossible(KdcComm.java:253)
        at sun.security.krb5.KdcComm.send(KdcComm.java:229)
        at sun.security.krb5.KdcComm.send(KdcComm.java:200)
        at sun.security.krb5.KrbAsReqBuilder.send(KrbAsReqBuilder.java:316)
        at sun.security.krb5.KrbAsReqBuilder.action(KrbAsReqBuilder.java:361)
        at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:776)
        ... 17 more
18/04/01 15:45:03 INFO util.ExitUtil: Exiting with status 1
18/04/01 15:45:03 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at c1master01-nn.abc.corp/29.6.6.17

This 29.6.6.17 is the internal IP. Can anybody tell me what the issue is?

Do I need to manually add entries for the internal IPs in the KDC? If so, why didn't Ambari add them to the KDC the way it did for the external IPs?

And if that is required: since every machine has only one hostname, why would we need two entries?

1 ACCEPTED SOLUTION

Master Mentor

@Anwaar Siddiqui

The Kerberos KDC listens on port 88 (by default) on both TCP and UDP. By default, the NameNode's Kerberos library tries to reach the KDC over UDP.

To force the Kerberos library to use TCP:
1. In the Ambari UI, go to Services > Kerberos > Configs.
2. In the 'Advanced krb5-conf' section, find the 'krb5-conf Template' field and add udp_preference_limit = 1 under the [libdefaults] stanza.
3. Save the config and restart the affected components.
Setting udp_preference_limit = 1 forces the Kerberos library to use TCP for every request.
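After that change, the rendered /etc/krb5.conf on each host should contain a [libdefaults] stanza roughly like the following (the default_realm below is taken from the klist output earlier in this thread; the rest of your template stays as Ambari generated it):

```ini
[libdefaults]
  default_realm = ABCHDP.COM
  # Any Kerberos message larger than this many bytes is sent over TCP.
  # A limit of 1 effectively forces TCP for every request.
  udp_preference_limit = 1
```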

Can you also share the output of:

# iptables -nvL

If you don't see UDP port 88 allowed there, add the following rule:

# iptables -I INPUT 5 -p udp --dport 88 -j ACCEPT

Rerun the first command; you should now see a line like this:

0822 2908K ACCEPT udp -- * * 0.0.0.0/0 0.0.0.0/0 udp dpt:88
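To double-check basic connectivity once TCP is forced, a plain socket probe from a cluster node can confirm that the KDC answers on TCP port 88. This is only a reachability sketch, not a Kerberos exchange, and kdc.example.com is a placeholder; substitute the kdc host from your /etc/krb5.conf.

```python
import socket

def kdc_tcp_reachable(host, port=88, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False

# Placeholder KDC host; replace with the kdc entry from /etc/krb5.conf:
# print(kdc_tcp_reachable("kdc.example.com"))
```

If this returns False while kinit over UDP still works, a firewall is probably dropping TCP 88, and the iptables rules above would need an equivalent -p tcp entry.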



2 REPLIES

Contributor

@Geoffrey Shelton Okot Thanks for the update.

It worked. I also want to add one thing: a port on one of my NameNodes was still occupied by a previously running instance (java.net.BindException: Port in use: 0.0.0.0:50070), and Ambari was not showing any message about it, so I checked the NameNode logs on the server itself.

Killing the old PID and restarting did the trick.
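A stale listener like the one behind that BindException can be spotted before a restart with a simple bind check. Note this is only a sketch: 50070 is the HDP 2.x default NameNode HTTP port, so substitute your own dfs.namenode.http-address value.

```python
import socket

def port_in_use(port, host="0.0.0.0"):
    """Return True if something is already bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # bind() fails with EADDRINUSE when another process holds the port.
            return True
        return False

# Example: check the NameNode HTTP port before starting the service; if it is
# in use, find the stale PID (e.g. with `netstat -tlnp`) and kill it first.
# print(port_in_use(50070))
```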