I have seen phantom (false) alerts generated by Ambari Alerts that clear on the next check cycle without any evidence of an actual outage of the component. This may be caused by a race condition in the Python scripts: the check for whether the Kerberos ticket is still valid happens right at the end of the ticket's 5-minute lifetime, and the ticket expires before the actual alert check can run against the component.

This causes the script to error out, which is treated as a failure of the alert check. To avoid the race condition, it has been suggested (and there is some early evidence to support it) that the Kerberos ticket lifetime be set to a different, longer value than the alert check interval, which is 5 minutes by default.

The parameter "-l 5m" (that's a lower-case L) is used within these two python scripts to set the lifetime to 5 minutes:

  • /usr/lib/ambari-server/lib/resource_management/libraries/functions/curl_krb_request.py
  • /var/lib/ambari-server/resources/common-services/OOZIE/<version>/package/alerts/alert_check_oozie_server.py

We want to extend that lifetime to 5 minutes and 20 seconds, or "-l 5m20s".
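To make the arithmetic concrete, here is a minimal Python sketch that converts a lifetime string such as "5m" or "5m20s" to seconds and compares it with the default 5-minute alert interval. The helper lifetime_to_seconds and the constant ALERT_INTERVAL_SECONDS are hypothetical names used only for illustration; they are not part of Ambari, and the parser handles only the simple NdNhNmNs form used in this article.

```python
import re

# Default alert check interval described in this article (5 minutes).
ALERT_INTERVAL_SECONDS = 5 * 60

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def lifetime_to_seconds(value):
    """Convert a lifetime such as '5m' or '5m20s' to seconds.

    Hypothetical helper for illustration; it accepts only the
    <number><unit> form with the units d, h, m and s.
    """
    parts = re.findall(r"(\d+)([dhms])", value)
    # Reject anything that is not made up entirely of number/unit pairs.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError("unsupported lifetime: %r" % value)
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)


# Margin between ticket lifetime and alert interval, in seconds.
print(lifetime_to_seconds("5m") - ALERT_INTERVAL_SECONDS)     # 0
print(lifetime_to_seconds("5m20s") - ALERT_INTERVAL_SECONDS)  # 20
```

With the original setting the ticket lifetime and the alert interval are identical, so there is no margin at all; the proposed setting leaves 20 seconds.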

For the curl_krb_request.py script, the parameter is found on line 79 of the default version of the script provided with Ambari 2.1.2:

On the Ambari-Server node, edit the file:

/usr/lib/ambari-server/lib/resource_management/libraries/functions/curl_krb_request.py

and change the following line (line 79 of the default version of the file):

shell.checked_call("{0} -l 5m -c {1} -kt {2} {3} > /dev/null".format(kinit_path_local, ccache_file_path, keytab, principal), user=user) 

to look like this:

shell.checked_call("{0} -l 5m20s -c {1} -kt {2} {3} > /dev/null".format(kinit_path_local, ccache_file_path, keytab, principal), user=user) 

Specifically, the "5m" in the original line represents 5 minutes for the Kerberos lifespan parameter, while the "5m20s" in the modified line represents 5 minutes 20 seconds.

The modification procedure for the alert_check_oozie_server.py script is similar to the curl_krb_request.py script and looks like this:

On the Ambari-Server node, edit the file:

/var/lib/ambari-server/resources/common-services/OOZIE/<version>/package/alerts/alert_check_oozie_server.py

and change the following line (line 146 of the default version of the file):

kinit_command = format("{kinit_path_local} -l 5m -kt {smokeuser_keytab} {smokeuser_principal}; ") 

to look like this:

kinit_command = format("{kinit_path_local} -l 5m20s -kt {smokeuser_keytab} {smokeuser_principal}; ") 

This makes the same configuration change for Oozie that was made for the other component alerts where we increase the Kerberos lifetime from 5 minutes to 5 minutes 20 seconds.
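If you would rather script the two edits than make them by hand, the substitution can be sketched in Python as below. This is only an illustration under stated assumptions: extend_kinit_lifetime is a hypothetical helper, not something shipped with Ambari, and it works on text you have already read from the file. Back up both files before writing anything back.

```python
import re

# Matches "-l 5m" only when nothing else follows the "m", so a line
# that already reads "-l 5m20s" is left untouched.
_OLD_LIFETIME = re.compile(r"-l 5m(?![0-9A-Za-z])")


def extend_kinit_lifetime(source_text, new_lifetime="5m20s"):
    """Return (patched_text, number_of_replacements).

    Hypothetical helper for illustration only.
    """
    return _OLD_LIFETIME.subn("-l " + new_lifetime, source_text)


# The Oozie line quoted above, used here as sample input.
line = ('kinit_command = format("{kinit_path_local} -l 5m '
        '-kt {smokeuser_keytab} {smokeuser_principal}; ")')
patched, count = extend_kinit_lifetime(line)
print(count)    # 1
print(patched)
```

Running the helper a second time on already-patched text makes no further changes, so it is safe to re-run after an Ambari upgrade has possibly restored the default files.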

Once both files are updated, you will need to restart ambari-server:

ambari-server stop 
ambari-server start 

The update will be cluster-wide after the ambari-server restart.

Comments

After dealing with this a few more times, it was pointed out to me that a lifetime of 5 minutes 19 seconds would be even better for avoiding the race condition: 19 is a prime number, so multiples of 19 seconds coincide with multiples of 60 seconds far less often than multiples of 20 seconds do.

Again, the idea is not that there isn't enough time provided -- it's that we want to avoid the kinit timing out just as we're trying to do the transaction.
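The arithmetic behind the 19-second suggestion can be checked with a short Python sketch. It only shows how often multiples of the extra seconds line up with whole minutes (their least common multiple); it does not by itself demonstrate any effect on the alerts. The lcm helper is defined here for illustration.

```python
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


# Seconds until a multiple of the offset next coincides with a whole minute.
print(lcm(20, 60))  # 60
print(lcm(19, 60))  # 1140
```

Multiples of 20 seconds land on a whole minute every 60 seconds, while multiples of 19 seconds do so only every 1140 seconds (19 minutes).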

I am facing the same issue in production, which is running Ambari 2.1.2. I have a question: if this is a Python Kerberos issue, why are we editing the Oozie Python file? Can you please explain?