When a Kafka cluster is over-subscribed, the loss of a single broker can be a jarring experience for the cluster as a whole. This is especially true when trying to bring a previously failed broker back into a cluster.
When a broker has been out of the cluster for a number of days, bringing it back can be disruptive. One way to mitigate the impact is to first remove the returning broker's ID from the Replicas list of every partition that references it.
Ideally, a Kafka cluster is sized to absorb single-node failures, but as is often the case, demand can quickly outgrow the cluster's physical capacity. While you wait for new hardware to arrive to augment the cluster, you still need to keep the existing cluster working as well as possible.
This collection of scripts, playfully called Kawkfa, is still alpha at best and has its bugs, but someone may find it useful in the situation above.
The high-level procedure is as follows:

1. For each partition entry that includes the broker.id of the failed node, remove that broker ID from the Replicas list
2. Bring the wayward broker back into the cluster
3. Add the wayward broker ID back to the Replicas list, but do so without making it the preferred replica
4. Once the broker has been added back to its partitions, make it the preferred replica for a random subset of those partitions
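The replica-list manipulations above can be sketched in a few lines of Python operating on a reassignment-style JSON structure (a dict with a `partitions` list, each entry carrying a `replicas` list). This is a simplified illustration of the idea, not the Kawkfa scripts themselves; the function names and the `fraction` parameter are assumptions for the sketch:

```python
import random

def remove_broker(assignment, broker_id):
    """Step 1: drop broker_id from every partition's replica list."""
    for partition in assignment["partitions"]:
        partition["replicas"] = [r for r in partition["replicas"] if r != broker_id]
    return assignment

def add_broker_last(assignment, broker_id):
    """Step 3: append broker_id to the end of each replica list so it is
    never the preferred (first) replica."""
    for partition in assignment["partitions"]:
        if broker_id not in partition["replicas"]:
            partition["replicas"].append(broker_id)
    return assignment

def promote_randomly(assignment, broker_id, fraction=0.3):
    """Step 4: move broker_id to the front (making it the preferred
    replica) for a random subset of the partitions that host it."""
    hosting = [p for p in assignment["partitions"] if broker_id in p["replicas"]]
    for partition in random.sample(hosting, int(len(hosting) * fraction)):
        partition["replicas"].remove(broker_id)
        partition["replicas"].insert(0, broker_id)
    return assignment
```

In practice the modified structure would be written back out as JSON and fed to Kafka's partition reassignment tooling.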
Caveats about the scripts:
- You are using the scripts at your own risk. Be careful and understand what the scripts are doing prior to use.
- There are bugs in the scripts -- most notably, an extra comma is added after the last partition entry that should not be there. Simply removing that comma will allow the JSON file to be parsed properly.
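If editing the file by hand is a nuisance, the trailing-comma bug can also be worked around programmatically before parsing. A minimal sketch, assuming the only defect is a stray comma immediately before a closing bracket or brace (note the regex would also touch such sequences inside string values, which reassignment JSON does not normally contain):

```python
import json
import re

def clean_reassignment_json(raw):
    """Strip any comma that sits directly before a closing ] or },
    working around the extra-trailing-comma bug, then parse the JSON."""
    cleaned = re.sub(r",\s*([\]}])", r"\1", raw)
    return json.loads(cleaned)
```

This lets the generated file be consumed as-is instead of hand-editing it each time.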