Created 04-04-2018 07:35 AM
Hi,
we have a Kerberos-secured cluster and are currently facing issues with Ambari Metrics.
After starting Ambari Metrics everything is fine, but after a couple of days we get alerts from Ambari like this:
NameNode Service RPC Processing Latency (Hourly) Unable to retrieve metrics from the Ambari Metrics service.
When I check the logs of the Metrics Collector I find entries like:
2018-03-28 11:19:47,013 WARN org.apache.hadoop.security.UserGroupInformation: Exception encountered while running the renewal command for amshbase/s0202.cl.psiori.com@PSIORI.COM. (TGT end time:1522228847000, renewalFailures: org.apache.hadoop.metrics2.lib.MutableGaugeInt@388f50cd,renewalFailuresTotal: org.apache.hadoop.metrics2.lib.MutableGaugeLong@7d8dc9b8)
ExitCodeException exitCode=1: kinit: KDC can't fulfill requested option while renewing credentials
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:954)
    at org.apache.hadoop.util.Shell.run(Shell.java:855)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1163)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:1257)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:1239)
    at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:987)
    at java.lang.Thread.run(Thread.java:745)
2018-03-28 11:19:47,014 ERROR org.apache.hadoop.security.UserGroupInformation: TGT is expired. Aborting renew thread for amshbase/s0202.cl.psiori.com@PSIORI.COM.
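For reference, the "TGT end time" in the warning looks like a Unix timestamp in milliseconds (an assumption, but it fits the authtime values in the KDC log, which are in seconds). A small sketch to convert it; the function name is only for illustration, and the +02:00 offset assumes the log timestamps are in CEST:

```python
from datetime import datetime, timedelta, timezone

def tgt_end_utc(end_time_ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(end_time_ms / 1000, tz=timezone.utc)

end = tgt_end_utc(1522228847000)
print(end.isoformat())  # 2018-03-28T09:20:47+00:00

# Assumed local offset for the log timestamps (CEST = UTC+2).
cest = timezone(timedelta(hours=2))
print(end.astimezone(cest).isoformat())  # 2018-03-28T11:20:47+02:00
```

So the renewal attempt at 11:19:47 happened one minute before the ticket's end time.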
After that I see aggregation errors:
2018-03-28 11:27:08,188 INFO TimelineClusterAggregatorMinute: Started Timeline aggregator thread @ Wed Mar 28 11:27:08 CEST 2018
2018-03-28 11:27:08,189 INFO TimelineClusterAggregatorMinute: Skipping aggregation function not owned by this instance.
2018-03-28 11:27:08,205 ERROR TimelineMetricHostAggregatorHourly: Exception during aggregating metrics.
java.sql.SQLTimeoutException: Operation timed out.
    at org.apache.phoenix.exception.SQLExceptionCode$14.newException(SQLExceptionCode.java:364)
    at org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:150)
    at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:831)
So this seems to be related to Kerberos. When I check the log of the KDC there is not much information:
Mar 28 11:19:47 sql.cl.psiori.com krb5kdc[879](info): TGS_REQ (8 etypes {18 17 20 19 16 23 25 26}) 10.11.1.21: TICKET NOT RENEWABLE: authtime 0, amshbase/s0202.cl.psiori.com@PSIORI.COM for krbtgt/PSIORI.COM@PSIORI.COM, KDC can't fulfill requested option
...
Mar 28 11:20:48 sql.cl.psiori.com krb5kdc[879](info): AS_REQ (4 etypes {18 17 16 23}) 10.11.1.21: ISSUE: authtime 1522228848, etypes {rep=18 tkt=18 ses=18}, amshbase/s0202.cl.psiori.com@PSIORI.COM for krbtgt/PSIORI.COM@PSIORI.COM
Mar 28 11:20:48 sql.cl.psiori.com krb5kdc[879](info): TGS_REQ (4 etypes {18 17 16 23}) 10.11.1.21: ISSUE: authtime 1522228848, etypes {rep=18 tkt=18 ses=18}, amshbase/s0202.cl.psiori.com@PSIORI.COM for nn/m0201.cl.psiori.com@PSIORI.COM
When I check the principal amshbase/s0202.cl.psiori.com@PSIORI.COM in the KDC I get the following:
Principal: amshbase/s0202.cl.psiori.com@PSIORI.COM
Expiration date: [never]
Last password change: Mo Mär 19 11:24:05 CET 2018
Password expiration date: [never]
Maximum ticket life: 1 day 00:00:00
Maximum renewable life: 0 days 00:00:00
Last modified: Mo Mär 19 11:24:05 CET 2018 (admin/admin@PSIORI.COM)
Last successful authentication: [never]
Last failed authentication: [never]
Failed password attempts: 0
Number of keys: 2
Key: vno 1, aes256-cts-hmac-sha1-96
Key: vno 1, aes128-cts-hmac-sha1-96
MKey: vno 1
Attributes:
Policy: [none]
Is that normal? Maximum renewable life is set to 0, so ticket renewal is not possible. But that is also true for all other principals in the KDC, and all other services work normally.
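To check many principals for this, the "Maximum renewable life" line of the getprinc output can be parsed. A minimal sketch, assuming the output format shown above; the function name is made up for illustration:

```python
import re

def max_renewable_seconds(getprinc_output: str) -> int:
    """Return the 'Maximum renewable life' of a principal in seconds."""
    m = re.search(
        r"Maximum renewable life:\s*(\d+) days? (\d+):(\d+):(\d+)",
        getprinc_output,
    )
    if m is None:
        raise ValueError("no 'Maximum renewable life' line found")
    days, hours, minutes, seconds = map(int, m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

sample = (
    "Maximum ticket life: 1 day 00:00:00\n"
    "Maximum renewable life: 0 days 00:00:00\n"
)
# 0 means tickets for this principal cannot be renewed.
print(max_renewable_seconds(sample))  # 0
```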
This is the content of krb5.conf:
[libdefaults]
  renew_lifetime = 7d
  forwardable = true
  default_realm = PSIORI.COM
  ticket_lifetime = 24h
  dns_lookup_realm = false
  dns_lookup_kdc = false
  default_ccache_name = /tmp/krb5cc_%{uid}
  #default_tgs_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5
  #default_tkt_enctypes = aes des3-cbc-sha1 rc4 des-cbc-md5

[domain_realm]
  .cl.psiori.com = PSIORI.COM
  cl.psiori.com = PSIORI.COM

[logging]
  default = FILE:/var/log/krb5kdc.log
  admin_server = FILE:/var/log/kadmind.log
  kdc = FILE:/var/log/krb5kdc.log

[realms]
  PSIORI.COM = {
    admin_server = sql.cl.psiori.com
    kdc = sql.cl.psiori.com
  }
I have not made any changes to kdc.conf, so it still has the default content:
[kdcdefaults]
  kdc_ports = 88
  kdc_tcp_ports = 88

[realms]
  EXAMPLE.COM = {
    #master_key_type = aes256-cts
    acl_file = /var/kerberos/krb5kdc/kadm5.acl
    dict_file = /usr/share/dict/words
    admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
    supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
  }
Is there any misconfiguration?
Unfortunately the Hortonworks installation documentation doesn't explain in detail how to configure the Kerberos KDC correctly; it just refers to the official MIT KDC documentation.
When I restart the service then everything is fine again (for some time).
Any suggestions or help are very welcome.
Best regards,
Alex
Created 04-04-2018 09:23 AM
The KDC log says "TICKET NOT RENEWABLE", which can happen if the principal does not have the renewable attribute.
Please check the principal attributes using the "kadmin.local" utility, something like this:
kadmin.local: getprinc <PRINCIPAL>
If you find that the principal does not have the renewable attribute, please modify the principal and add the renewable flag, something like the following:
kadmin.local: modprinc -maxrenewlife "7 days" +allow_renewable krbtgt/XXXX.COM@XXXX.COM
kadmin.local: modprinc -maxrenewlife "7 days" +allow_renewable "amshbase/s0202.cl.xxxx.com@XXXX.COM"
Please use the correct principal names. I have masked the principal names in the sample commands above.
After modifying the principals, please run "kinit -R" to see whether renewing still throws an error.
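If several principals need the same change, the command lines can be generated instead of typed. This is only a sketch that builds argument lists for kadmin.local with its -q option; the helper name and the "7 days" default are just for illustration, and nothing is executed here:

```python
import shlex
from typing import List

def modprinc_command(principal: str, max_renew_life: str = "7 days") -> List[str]:
    """Build (but do not run) a kadmin.local call that makes a principal renewable."""
    query = f'modprinc -maxrenewlife "{max_renew_life}" +allow_renewable {principal}'
    return ["kadmin.local", "-q", query]

# Masked principal names, as in the sample commands above.
for principal in ("krbtgt/XXXX.COM@XXXX.COM", "amshbase/s0202.cl.xxxx.com@XXXX.COM"):
    print(shlex.join(modprinc_command(principal)))
```

Each list could be passed to subprocess.run on the KDC host. Note that tickets issued before the change keep their old flags, so a fresh kinit is needed before "kinit -R" can succeed.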
Created 04-04-2018 09:32 AM
I modified the principals such that they can issue renewable tickets. I don't get errors now when renewing the ticket.
But I'm wondering why this is necessary at all? None of the principals in the KDC can issue renewable tickets and all other services work fine. If a ticket is not renewable, the service could simply request a new ticket. Or do I misunderstand something here?
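To make the two options in this question concrete, here is a rough sketch of the decision a client could make. This is not Hadoop's actual logic, only an illustration of the Kerberos rule that a ticket can be renewed only while it is still valid and before its renew-till time; the function and its parameters are hypothetical:

```python
from typing import Optional

def next_action(now: int, end_time: int, renew_till: Optional[int]) -> str:
    """Decide how to get a fresh TGT. All times are Unix seconds.

    renew_till is None for a ticket that was issued as non-renewable.
    """
    if renew_till is not None and now < end_time and now < renew_till:
        return "renew"       # TGS_REQ with the renew option, what `kinit -R` does
    return "new-ticket"      # fresh AS_REQ, for example from a keytab

# Values taken from the logs above: renewal attempt at 11:19:47 CEST,
# TGT end time 1522228847, ticket not renewable.
print(next_action(1522228787, 1522228847, None))  # new-ticket
```

The KDC log above shows both steps: the rejected TGS_REQ at 11:19:47 and a successful AS_REQ at 11:20:48.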
Created 04-04-2018 09:35 AM
Also, please check whether your KDC configuration has default settings like the following:
# cat /var/kerberos/krb5kdc/kdc.conf
max_life = 24h 0m 0s
max_renewable_life = 7d 0h 0m 0s
default_principal_flags = +renewable, +forwardable
Created 04-04-2018 12:53 PM
I updated the KDC configuration. But I also had to create a realm definition for my realm under [realms] in kdc.conf; just putting the configuration values under [kdcdefaults] didn't help.
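The default kdc.conf I posted earlier only defines EXAMPLE.COM under [realms], which is presumably why the extra stanza was needed. A small sketch to list the realms a kdc.conf defines; the function name is made up, and it assumes the simple layout of the default file (one "REALM = {" per line, no nested includes):

```python
import re
from typing import List

def realms_in_kdc_conf(text: str) -> List[str]:
    """Return the realm names that have a stanza in the [realms] section."""
    realms: List[str] = []
    in_realms = False
    depth = 0  # brace nesting inside [realms]
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            in_realms = line == "[realms]"
            depth = 0
            continue
        if not in_realms:
            continue
        m = re.match(r"([^\s=]+)\s*=\s*\{$", line)
        if m and depth == 0:
            realms.append(m.group(1))
        depth += line.count("{") - line.count("}")
    return realms

default_conf = """
[kdcdefaults]
 kdc_ports = 88
[realms]
 EXAMPLE.COM = {
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
 }
"""
print("PSIORI.COM" in realms_in_kdc_conf(default_conf))  # False
```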
But still, I'm confused why this is necessary at all. Why does the Metrics Collector not simply issue a new ticket instead of renewing it?