Hi,
I got an error during an attempt to access a remote cluster secured by Kerberos, and I don't know why the client is trying to look up the hdfs principal in the local KDC.
The setup is as follows (I intentionally omit full domain names and host names to keep it tidy):
- each cluster (CLUSTERDEV and CLUSTERPROD) has its own KDC (DEVREALM and PRODREALM)
- the KDCs trust each other (verified by kvno hdfs/<namenodehost>@REMOTE_REALM from both sides)
- both clusters are running NameNode in HA mode
I have configured the Trusted Kerberos Realms setting in Cloudera Manager for CLUSTERDEV, set to CLUSTERPROD (this triggered the RULE change in auth_to_local in core-site.xml). I have done the same for CLUSTERPROD, setting CLUSTERDEV as trusted.
- each krb5.conf in CLUSTERDEV also has PRODREALM in [realms] (I can kinit with a "remote" account)
- each krb5.conf has [capaths] DEVREALM = { PRODREALM = . }
- and vice versa: in CLUSTERPROD each krb5.conf has DEVREALM added to [realms] plus [capaths] PRODREALM = { DEVREALM = . }
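For reference, a minimal sketch of what the krb5.conf additions on the DEV side look like (the KDC host name is a placeholder, not my actual value):

[realms]
  PRODREALM = {
    kdc = <prod_kdc_host>
    admin_server = <prod_kdc_host>
  }

[capaths]
  DEVREALM = {
    PRODREALM = .
  }

The PROD side mirrors this with the realms swapped.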
Now on the DEV cluster I want to access the PROD cluster:
I have prepared a custom hdfs-site.xml, where I have added the PROD cluster's namenode info, stored in distcpconf (a copy of the actual hadoop-conf from a gateway host).
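Roughly, the extra entries look like this (a sketch only; "prodnameservice" and the namenode IDs are placeholders for whatever the remote cluster actually uses):

<!-- both the local and the remote nameservice must be listed -->
<property>
  <name>dfs.nameservices</name>
  <value>devnameservice,prodnameservice</value>
</property>
<property>
  <name>dfs.ha.namenodes.prodnameservice</name>
  <value>namenode1,namenode2</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.prodnameservice</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

plus the per-namenode address properties (rpc-address, http-address, etc.).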
Tried to test the new configuration (on a gateway host in DEV cluster):
HADOOP_CONF=/home/centos/distcpconf hdfs dfs -ls hdfs://prodnameservice/tmp

export HADOOP_CONF=/home/centos/distcpconf
hdfs dfs -ls hdfs://prodnameservice/tmp
None of the above worked; the client does not know prodnameservice:
-ls: java.net.UnknownHostException: prodnameservice
First question: why is the client not taking the modified environment variable into account?
I had to put this custom hdfs-site.xml into /etc/hadoop/conf/, and then it suddenly knew what "prodnameservice" is.
The hdfs ls returns (I have logged in as tomas2@PRODREALM on the DEV gateway):
PriviledgedActionException as:user@PRODREALM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]
and at the same time the DEV KDC reports:
TGS_REQ (2 etypes {18 17}) 10.85.150.42: LOOKING_UP_SERVER: authtime 0, user@PRODREALM for hdfs/prod.namenode.fqn@DEVREALM, Server not found in Kerberos database
and at the same time the PROD KDC reports:
TGS_REQ (2 etypes {18 17}) 10.85.150.42: ISSUE: authtime 1550044165, etypes {rep=18 tkt=18 ses=18}, user@PRODREALM for krbtgt/DEVREALM@PRODREALM
So I don't understand why the client is trying to look for an hdfs/PRODUCTION_NAMENODE principal in the DEV KDC. As you can see, the PROD KDC correctly reports the ticket-granting service for cross-realm trust using krbtgt/DEVREALM@PRODREALM.
So I went back to the modified hdfs-site.xml and changed everything from DEV to PROD in these items, so it now points to PROD:
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>hdfs/_HOST@PRODREALM</value>
</property>
<property>
  <name>dfs.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@PRODREALM</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>hdfs/_HOST@PRODREALM</value>
</property>
Ran the ls again, with the same error.
Then I reverted this change in hdfs-site.xml and changed the krb5.conf default_realm on the DEV gateway where I try to do the "ls".
After this I was able to do "ls" on the remote cluster, BUT I want to access the remote cluster without changing the default_realm in the gateway's krb5.conf on DEV.
[centos@ip-10-85-150-42 ~]$ hdfs dfs -ls hdfs://prodnameservice/tmp
Found 5 items ...
[centos@ip-10-85-150-42 ~]$ kinit tomas2@DEVREALM
Password for user@DEVREALM:
[centos@ip-10-85-150-42 ~]$ hdfs dfs -ls hdfs://prodnameservice/tmp
19/02/13 09:44:28 INFO util.KerberosName: No auth_to_local rules applied to user@DEVREALM
Found 5 items ...
As the client reports in the second case, auth_to_local is not applied. But as I said before, the "Trusted Kerberos Realms" setting generated these rules in core-site.xml:
<name>hadoop.security.auth_to_local</name>
<value>
  RULE:[1:$1@$0](.*@\QDEVREALM\E$)s/@\QDEVREALM\E$//
  RULE:[2:$1@$0](.*@\QDEVREALM\E$)s/@\QDEVREALM\E$//
  DEFAULT
</value>
(and the same rules are in DEV cluster but just with the opposite REALM).
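For reference, rules like these can be exercised outside the cluster with the HadoopKerberosName helper class shipped with Hadoop (a sketch; it needs the core-site.xml in question on the classpath, and newer Hadoop versions expose the same check as "hadoop kerbname"):

export HADOOP_CONF_DIR=/etc/hadoop/conf
hadoop org.apache.hadoop.security.HadoopKerberosName user@PRODREALM user@DEVREALM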
Why is it not using the RULEs from core-site.xml?
And the most important question: why is the hdfs client trying to find the prod namenode in the DEV KDC? How can I do "ls" from the DEV gateway without changing the default realm?
Thanks for any advice,
T.
Created 02-18-2019 12:23 AM
After many searches I think I have found the solution. None of the blogs on Cloudera or Hortonworks states the solution, because I think in all those cases the hosts running the clusters use custom DNS. Thus the krb5.conf maps nicely to the cluster's REALM, or if not, a simple line of configuration ensures the mapping.
In my case all the host names are managed by AWS DNS, so no custom domain names are used. This was the reason my client tried to look up the namenode in the local KDC: it used the default_realm to get the service ticket. But after adding the following into krb5.conf on the DEV node:
[domain_realm]
ip-xx-xx-xx-xx.eu-west-1.compute.internal = PRODREALM
ip-xx-xx-xx-xx.eu-west-1.compute.internal = PRODREALM
i.e.: <fully_qualified_host_name_of_remote_namenode1> = <REMOTE REALM>
<fully_qualified_host_name_of_remote_namenode2> = <REMOTE REALM>
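A quick way to check that the new mapping is picked up is to request a service ticket for the remote namenode without spelling out the realm, e.g. (a sketch using MIT Kerberos tools; the host name is the placeholder from above):

# -S builds the host-based principal, so domain_realm decides the realm
kvno -S hdfs <fully_qualified_host_name_of_remote_namenode1>
klist

If the mapping works, klist should show hdfs/<fully_qualified_host_name_of_remote_namenode1>@PRODREALM.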
I was able to ls the remote HDFS. Now to the high-availability part: I had to add additional nameservice info into hdfs-site.xml:
dfs.ha.namenodes.hanameservice      <- ADD here the remote nameservice
dfs.namenode.rpc-address.*          <- Add the remote nameservice FQDNs
dfs.namenode.https-address.*        <- Add the remote nameservice FQDNs
dfs.namenode.http-address.*         <- Add the remote nameservice FQDNs
dfs.namenode.servicerpc-address.*   <- Add the remote nameservice FQDNs
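For example, the entries for the remote nameservice end up looking roughly like this (a sketch; the namenode IDs, host names and the 8020 RPC port are assumptions and should match whatever the PROD cluster really uses):

<property>
  <name>dfs.ha.namenodes.prodnameservice</name>
  <value>namenode1,namenode2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.prodnameservice.namenode1</name>
  <value><fully_qualified_host_name_of_remote_namenode1>:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.prodnameservice.namenode2</name>
  <value><fully_qualified_host_name_of_remote_namenode2>:8020</value>
</property>

with the http-address, https-address and servicerpc-address properties following the same pattern.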
Then I was able to use the HA nameservice name to access HDFS.
Also, during distcp I had to use:
-Dmapreduce.job.hdfs-servers.token-renewal.exclude=name_of_the_prod_nameservice
when launching a distcp from dev and copying data from prod to dev.
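Put together, the invocation from the DEV side looks roughly like this (a sketch; the paths and nameservice names are placeholders):

hadoop distcp \
  -Dmapreduce.job.hdfs-servers.token-renewal.exclude=prodnameservice \
  hdfs://prodnameservice/path/on/prod \
  hdfs://devnameservice/path/on/dev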
And the answer to the last question, regarding HADOOP_CONF: I am not sure here, but I think the hdfs scripts in the Cloudera bin directory override this environment variable, so regardless of what you set HADOOP_CONF to, it will not be applied. So when Cloudera's guide states:
export HADOOP_CONF_DIR=path_to_working_directory
you have to be sure that the script does not override this setting.
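One way to check which configuration directory is actually in effect is to query a key that only exists in the custom config, for example (assuming dfs.nameservices lists the remote nameservice only in the distcpconf copy):

export HADOOP_CONF_DIR=/home/centos/distcpconf
hdfs getconf -confKey dfs.nameservices

If the remote nameservice does not show up, the script has overridden HADOOP_CONF_DIR.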
Tested the cross-realm auth based on the suggestion from HarshJ:
kinit user@REMOTEREALM
kvno hdfs/namenode-host@LOCALREALM
Running on CLUSTERDEV:
kinit user@PRODREALM
kvno hdfs/dev_name_node_host@DEVREALM
klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: user@PRODREALM

Valid starting       Expires              Service principal
02/13/2019 13:59:41  02/14/2019 13:59:41  krbtgt/PRODREALM@PRODREALM
        renew until 02/20/2019 13:59:41
02/13/2019 14:00:05  02/14/2019 13:59:41  krbtgt/DEVREALM@PRODREALM
        renew until 02/20/2019 13:59:41
02/13/2019 14:00:55  02/14/2019 13:59:41  hdfs/<dev namenode fqdn>@DEVREALM
        renew until 02/18/2019 14:00:55
-> OK.
Created 02-18-2019 06:06 AM
Congratulations on solving your issue and thank you for sharing it for others who may run into something similar. 🙂