Member since
03-07-2019
24
Posts
14
Kudos Received
1
Solution
05-01-2019
08:25 PM
Hi Vedant! You state: "num.io.threads should be greater than the number of disks dedicated for Kafka. I strongly recommend to start with same number of disks first." Is num.io.threads calculated from the number of disks per node allocated to Kafka, or from the total number of disks for Kafka across the entire cluster? I'm guessing disks per node dedicated to Kafka, but I wanted to confirm. Thanks, Jeff G.
08-17-2018
07:57 PM
The option to use a file that contains the node FQDNs does not seem to work; the command treats the filename itself as a node FQDN. Also, what is the expected format of clusternodes1.txt: one host per line, or all hosts comma-separated on a single line?
07-13-2017
02:58 PM
1 Kudo
When a Kafka cluster is over-subscribed, the loss of a single broker can be a jarring experience for the cluster as a whole. This is especially true when trying to bring a previously failed broker back into a cluster. To help mitigate the impact of returning a broker that has been out of the cluster for a number of days, it can help to first remove that broker's ID from the Replicas list of all of its partitions.

Generally, you want a Kafka cluster that is sized properly to handle single-node failures, but as is often the case, the size of the use case on the Kafka cluster can quickly start to exceed the physical limitations. In those situations, while you're waiting for new hardware to arrive to augment your cluster, you still need to keep the existing cluster working as well as possible. To that end, there are some AWK scripts available on GitHub that help create the JSON files needed to essentially spoon-feed partitions back onto a broker. This collection of scripts, playfully called Kawkfa, is still alpha at best and has its bugs, but someone may find them useful in the above situation.

The high-level procedure is as follows:

1. For each partition entry that includes the broker.id of the failed node, remove that broker ID from the Replicas list.
2. Bring the wayward broker back into the cluster.
3. Add the wayward broker ID back to the Replicas list, but do so without making it the preferred replica.
4. Once the broker has been added back to its partitions, make the broker the preferred replica for a random subset of the partitions.

Caveats about the scripts:

- You are using the scripts at your own risk. Be careful and understand what the scripts are doing prior to use.
- There are bugs in the scripts -- most notable is an extra comma at the end of the last partition entry that should not be there. Simply removing that comma will allow the JSON file to be properly read.

Have fun!
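Steps 1 and 3 above boil down to editing the replica lists in a partition-reassignment JSON file of the shape produced by kafka-reassign-partitions. As a rough illustration (this is not the Kawkfa scripts themselves, and the topic name and sample assignment are made up), a minimal Python sketch that strips a broker ID from every partition's replica list might look like this:

```python
import json

def drop_broker(assignment, broker_id):
    """Return a copy of a partition-reassignment JSON document with
    broker_id removed from every partition's replica list."""
    out = {"version": assignment.get("version", 1), "partitions": []}
    for p in assignment["partitions"]:
        replicas = [b for b in p["replicas"] if b != broker_id]
        out["partitions"].append({**p, "replicas": replicas})
    return out

# Hypothetical assignment, in the format used by the reassignment tool.
assignment = {
    "version": 1,
    "partitions": [
        {"topic": "events", "partition": 0, "replicas": [1, 2, 3]},
        {"topic": "events", "partition": 1, "replicas": [2, 3, 1]},
    ],
}

# Remove broker 3 from all replica lists before feeding the JSON back
# to the cluster.
print(json.dumps(drop_broker(assignment, 3)))
```

Adding the broker back without making it the preferred replica (step 3) is the reverse operation: append the broker ID to the end of each replica list rather than the front, since the first entry is the preferred replica.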
04-20-2017
08:57 PM
Note that the <strong> and </strong> strings in the code block above should be removed; they are HTML formatting tags that somehow became visible in the formatted text of the code block.
03-03-2016
03:40 PM
After dealing with this a few more times, it was pointed out to me that making the interval 5 minutes 19 seconds would be even better for avoiding the race condition: since 19 is a prime number, multiples of 19 seconds coincide with multiples of 60 seconds far less often than multiples of 20 seconds do. Again, the idea is not that there isn't enough time provided -- it's that we want to avoid the kinit ticket timing out just as we're trying to do the transaction.
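The 19-versus-20-second reasoning can be checked with a quick least-common-multiple calculation (a small illustrative snippet, not part of Ambari):

```python
from math import gcd

def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)

# With a 20-second offset, the ticket-expiry schedule realigns with the
# 60-second minute boundary every lcm(20, 60) = 60 seconds.
print(lcm(20, 60))   # 60

# With 19 seconds (prime), realignment happens only every
# lcm(19, 60) = 1140 seconds, so expiry collides with the check
# interval far less often.
print(lcm(19, 60))   # 1140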
01-18-2016
10:54 PM
6 Kudos
I have seen phantom or false alerts generated by Ambari Alerts that clear on the next check cycle without any evidence of an actual outage of the component. This issue may be related to a race condition within the Python alert scripts: the check of whether the Kerberos ticket is still valid occurs right at the end of the ticket's 5-minute lifetime, and the ticket becomes invalid before the actual alert check can be run against the component. This causes the script to error out, which is taken as a failure of the alert check.

To avoid this race condition, it has been suggested (and there is some early evidence to support it) that the Kerberos ticket lifetime should be set to a different, longer time than the alert check interval (which is 5 minutes by default). The parameter "-l 5m" (that's a lower-case L) is used within these two Python scripts to set the lifetime to 5 minutes:

/usr/lib/ambari-server/lib/resource_management/libraries/functions/curl_krb_request.py
/var/lib/ambari-server/resources/common-services/OOZIE/<version>/package/alerts/alert_check_oozie_server.py

We want to extend that lifetime setting to 5 minutes and 20 seconds, or "-l 5m20s".

For the curl_krb_request.py script, the parameter is found on line 79 of the default version of the script provided with Ambari 2.1.2. On the Ambari-Server node, edit the file:

/usr/lib/ambari-server/lib/resource_management/libraries/functions/curl_krb_request.py

and change the following line (line 79 of the default version of the file):

shell.checked_call("{0} -l 5m -c {1} -kt {2} {3} > /dev/null".format(kinit_path_local, ccache_file_path, keytab, principal), user=user)

to look like this:

shell.checked_call("{0} -l 5m20s -c {1} -kt {2} {3} > /dev/null".format(kinit_path_local, ccache_file_path, keytab, principal), user=user)

The "5m" in the original line sets the Kerberos ticket lifetime to 5 minutes, while the "5m20s" in the modified line sets it to 5 minutes 20 seconds.

The procedure for the alert_check_oozie_server.py script is similar. On the Ambari-Server node, edit the file:

/var/lib/ambari-server/resources/common-services/OOZIE/<version>/package/alerts/alert_check_oozie_server.py

and change the following line (line 146 of the default version of the file):

kinit_command = format("{kinit_path_local} -l 5m -kt {smokeuser_keytab} {smokeuser_principal}; ")

to look like this:

kinit_command = format("{kinit_path_local} -l 5m20s -kt {smokeuser_keytab} {smokeuser_principal}; ")

This makes the same change for the Oozie alert that was made for the other component alerts: the Kerberos ticket lifetime is increased from 5 minutes to 5 minutes 20 seconds.

Once both files are updated, you will need to restart ambari-server:

ambari-server stop
ambari-server start

The update takes effect cluster-wide after the ambari-server restart.
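If you prefer not to edit the files by hand, the substitution itself is easy to script. The helper below is hypothetical (it is not part of Ambari), and it is demonstrated on a sample string rather than the real files; it rewrites only the value of the kinit "-l" lifetime flag and leaves the rest of the command line untouched:

```python
import re

def extend_kinit_lifetime(script_text, old="5m", new="5m20s"):
    # Match "-l " followed by the old lifetime as a whole token, so a
    # line already containing "-l 5m20s" is left unchanged.
    return re.sub(r"(-l )%s\b" % re.escape(old), r"\g<1>" + new, script_text)

# Sample line mimicking the kinit invocation (illustrative only).
line = '{0} -l 5m -c {1} -kt {2} {3} > /dev/null'
print(extend_kinit_lifetime(line))
# prints: {0} -l 5m20s -c {1} -kt {2} {3} > /dev/null
```

Whatever method you use, make a backup copy of each script first, and remember that an Ambari upgrade may overwrite these files and revert the change.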