No controller for Kafka cluster after re-enabling Kerberos

The cluster was kerberized. After disabling Kerberos for a fix and kerberizing again, Kafka topics do not get in sync:

Topic:ATLAS_HOOK PartitionCount:1 ReplicationFactor:2 Configs:
Topic: ATLAS_HOOK Partition: 0 Leader: 1003 Replicas: 1001,1003 Isr: 1003

Topic:ATLAS_ENTITIES PartitionCount:1 ReplicationFactor:2 Configs:
Topic: ATLAS_ENTITIES Partition: 0 Leader: -1 Replicas: 1001,1002 Isr: 1001
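For anyone who wants to check many topics at once, here is a minimal sketch that scans describe output in the format shown above and flags partitions with no leader (Leader: -1) or with fewer in-sync replicas than assigned replicas. The function name and the problem labels are made up for illustration; it only assumes the line layout pasted in this post.

```python
import re

# Matches partition lines such as:
# "Topic: ATLAS_HOOK Partition: 0 Leader: 1003 Replicas: 1001,1003 Isr: 1003"
# Summary lines ("Topic:X PartitionCount:1 ...") do not match and are skipped.
PARTITION_LINE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def find_unhealthy_partitions(describe_output):
    """Return (topic, partition, problem) tuples for leaderless or under-replicated partitions."""
    problems = []
    for line in describe_output.splitlines():
        match = PARTITION_LINE.search(line)
        if match is None:
            continue
        topic = match.group("topic")
        partition = int(match.group("partition"))
        leader = int(match.group("leader"))
        replicas = [r for r in match.group("replicas").split(",") if r]
        isr = [r for r in match.group("isr").split(",") if r]
        if leader == -1:
            problems.append((topic, partition, "no leader"))
        elif len(isr) < len(replicas):
            problems.append((topic, partition, "under-replicated"))
    return problems
```

Fed the two topics above, this reports ATLAS_HOOK partition 0 as under-replicated and ATLAS_ENTITIES partition 0 as having no leader.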

The controller log shows the following error:

[2018-07-26 02:50:04,971] WARN Failed to parse the controller info as json. Probably this controller is still using the old format [null] to store the broker id in zookeeper (kafka.controller.KafkaController$)
[2018-07-26 02:50:04,972] ERROR [controller-event-thread]: Error processing event Startup (kafka.controller.ControllerEventManager$ControllerEventThread)
kafka.common.KafkaException: Failed to parse the controller info: null. This is neither the new or the old format.
at kafka.controller.KafkaController$.parseControllerId(KafkaController.scala:147)
at kafka.controller.KafkaController.getControllerID(KafkaController.scala:1198)
at kafka.controller.KafkaController.elect(KafkaController.scala:1662)
at kafka.controller.KafkaController$Startup$.process(KafkaController.scala:1581)
at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:53)
at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:53)
at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:53)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:52)
Caused by: java.lang.NumberFormatException: null
at java.lang.Integer.parseInt(
at java.lang.Integer.parseInt(
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:273)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
... 10 more

Running get on /controller in zkCli returns "null".
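To make the two log messages easier to read: they suggest the broker first tries to parse the controller info as JSON and then falls back to treating it as a bare broker id, failing only when both attempts fail. Below is a rough Python illustration of that two-step parse. It is not Kafka's code; the "brokerid" key and the exception class are assumptions made for the sketch.

```python
import json


class ControllerInfoError(Exception):
    """Stand-in for kafka.common.KafkaException in this sketch."""


def parse_controller_id(controller_info):
    """Try the JSON format first, then a bare integer, as the log messages suggest."""
    try:
        # Newer format (assumed): a JSON object carrying the broker id.
        return int(json.loads(controller_info)["brokerid"])
    except (TypeError, ValueError, KeyError):
        pass  # not usable JSON; fall through to the older format
    try:
        # Older format: the znode holds just the broker id as text.
        return int(controller_info)
    except (TypeError, ValueError) as exc:
        raise ControllerInfoError(
            "Failed to parse the controller info: %s" % controller_info
        ) from exc
```

With an empty znode there is nothing to parse in either format, so both attempts fail and the controller startup event errors out, which matches the trace above.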

I also tried to create topics using "kafka" as the principal, but they do not get a leader.

Any clues?

Thank you!


Answering my own question: something seems to go wrong during this re-kerberization procedure (reproducible on two clusters), and the /controller znode ends up null. I deleted the znode and restarted the brokers. A new controller was elected and the replicas caught up and rejoined the ISRs. Hope this proves useful to someone.
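If you would rather script the cleanup than do it by hand in zkCli, here is a hedged sketch of the same idea: delete /controller only when its payload is empty, and leave it alone otherwise. The helper names are hypothetical. The `zk` argument is any client exposing exists(path), get(path) returning (data, stat), and delete(path), which is the shape of kazoo's KazooClient; connecting and authenticating against a kerberized ZooKeeper is not covered here.

```python
CONTROLLER_PATH = "/controller"


def controller_znode_is_empty(data):
    """True when the znode payload is missing, blank, or the literal text 'null'."""
    if data is None:
        return True
    text = data.decode("utf-8", "replace") if isinstance(data, bytes) else str(data)
    return text.strip() in ("", "null")


def clear_stale_controller(zk, path=CONTROLLER_PATH):
    """Delete the controller znode only if its payload is empty.

    Returns True if the znode was deleted, False if it was absent or
    still holds controller info (in which case it is left untouched).
    """
    if not zk.exists(path):
        return False
    data, _stat = zk.get(path)
    if not controller_znode_is_empty(data):
        return False  # a controller is registered; do not remove it
    zk.delete(path)
    return True
```

After the znode is gone, the brokers still need a restart, as described above, so that a new controller gets elected.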

Another +1 from me for the response. I spent a couple of hours investigating and comparing the configuration with a working cluster before I removed the path. The worst part is that I was not able to find any indication in the Kafka/ZooKeeper logs that something was wrong.

@Ricardo Junior

Thanks for your answer to yourself 🙂 It helped me after many many many hours of Kafka debugging.

BTW, in my case it was exactly the same scenario: Kerberos -> De-Kerberize -> Re-Kerberize
