Kafka active controllers failing

Hi,

 

I have installed Kafka in Cloudera, and it was pretty stable until this week. The first time the leader failed, I increased the replication factor of NavigatorAuditEvents from the default of 1 to 3. It appears that the __consumer_offsets partitions keep dropping out for some reason: in the listing below, several of them are missing a replica from the ISR, and partitions 12 and 30 have no leader. I can't figure out the regex syntax in the log search tool, and I can't get to the logs directly because they are in a secure location (/var/logs/).

I would like to know how to investigate this problem and how to bring a failed active controller back up. The current state of the topics is listed below:

Topic:NavigatorAuditEvents PartitionCount:1 ReplicationFactor:3 Configs:
Topic: NavigatorAuditEvents Partition: 0 Leader: 358 Replicas: 358,357,359 Isr: 357,359,358
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:3 Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 356 Replicas: 356,361,362 Isr: 356,362,361
Topic: __consumer_offsets Partition: 1 Leader: 357 Replicas: 357,362,356 Isr: 356,357,362
Topic: __consumer_offsets Partition: 2 Leader: 358 Replicas: 358,356,357 Isr: 356,358,357
Topic: __consumer_offsets Partition: 3 Leader: 359 Replicas: 359,357,358 Isr: 357,359,358
Topic: __consumer_offsets Partition: 4 Leader: 360 Replicas: 360,358,359 Isr: 359,358,360
Topic: __consumer_offsets Partition: 5 Leader: 361 Replicas: 361,359,360 Isr: 359,361,360
Topic: __consumer_offsets Partition: 6 Leader: 361 Replicas: 362,360,361 Isr: 361,360
Topic: __consumer_offsets Partition: 7 Leader: 356 Replicas: 356,362,357 Isr: 356,357,362
Topic: __consumer_offsets Partition: 8 Leader: 357 Replicas: 357,356,358 Isr: 356,358,357
Topic: __consumer_offsets Partition: 9 Leader: 358 Replicas: 358,357,359 Isr: 357,359,358
Topic: __consumer_offsets Partition: 10 Leader: 359 Replicas: 359,358,360 Isr: 360,358,359
Topic: __consumer_offsets Partition: 11 Leader: 360 Replicas: 360,359,361 Isr: 359,361,360
Topic: __consumer_offsets Partition: 12 Leader: -1 Replicas: 361,360,362 Isr: 362
Topic: __consumer_offsets Partition: 13 Leader: 356 Replicas: 362,361,356 Isr: 356,361
Topic: __consumer_offsets Partition: 14 Leader: 356 Replicas: 356,357,358 Isr: 356,358,357
Topic: __consumer_offsets Partition: 15 Leader: 357 Replicas: 357,358,359 Isr: 359,358,357
Topic: __consumer_offsets Partition: 16 Leader: 358 Replicas: 358,359,360 Isr: 360,358,359
Topic: __consumer_offsets Partition: 17 Leader: 359 Replicas: 359,360,361 Isr: 361,359,360
Topic: __consumer_offsets Partition: 18 Leader: 360 Replicas: 360,361,362 Isr: 361,360,362
Topic: __consumer_offsets Partition: 19 Leader: 361 Replicas: 361,362,356 Isr: 356,361,362
Topic: __consumer_offsets Partition: 20 Leader: 356 Replicas: 362,356,357 Isr: 356,357
Topic: __consumer_offsets Partition: 21 Leader: 356 Replicas: 356,358,359 Isr: 359,356,358
Topic: __consumer_offsets Partition: 22 Leader: 357 Replicas: 357,359,360 Isr: 359,357,360
Topic: __consumer_offsets Partition: 23 Leader: 358 Replicas: 358,360,361 Isr: 361,360,358
Topic: __consumer_offsets Partition: 24 Leader: 359 Replicas: 359,361,362 Isr: 359,361
Topic: __consumer_offsets Partition: 25 Leader: 360 Replicas: 360,362,356 Isr: 356,360,362
Topic: __consumer_offsets Partition: 26 Leader: 361 Replicas: 361,356,357 Isr: 356,361,357
Topic: __consumer_offsets Partition: 27 Leader: 358 Replicas: 362,357,358 Isr: 358,357
Topic: __consumer_offsets Partition: 28 Leader: 356 Replicas: 356,359,360 Isr: 356,359,360
Topic: __consumer_offsets Partition: 29 Leader: 357 Replicas: 357,360,361 Isr: 361,357,360
Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 358,361,362 Isr: 362
Topic: __consumer_offsets Partition: 31 Leader: 359 Replicas: 359,362,356 Isr: 356,359
Topic: __consumer_offsets Partition: 32 Leader: 360 Replicas: 360,356,357 Isr: 356,357,360
Topic: __consumer_offsets Partition: 33 Leader: 361 Replicas: 361,357,358 Isr: 358,361,357
Topic: __consumer_offsets Partition: 34 Leader: 359 Replicas: 362,358,359 Isr: 359,358
Topic: __consumer_offsets Partition: 35 Leader: 356 Replicas: 356,360,361 Isr: 356,361,360
Topic: __consumer_offsets Partition: 36 Leader: 357 Replicas: 357,361,362 Isr: 361,357,362
Topic: __consumer_offsets Partition: 37 Leader: 358 Replicas: 358,362,356 Isr: 356,358
Topic: __consumer_offsets Partition: 38 Leader: 359 Replicas: 359,356,357 Isr: 356,359,357
Topic: __consumer_offsets Partition: 39 Leader: 360 Replicas: 360,357,358 Isr: 358,360,357
Topic: __consumer_offsets Partition: 40 Leader: 361 Replicas: 361,358,359 Isr: 359,358,361
Topic: __consumer_offsets Partition: 41 Leader: 359 Replicas: 362,359,360 Isr: 359,360
Topic: __consumer_offsets Partition: 42 Leader: 356 Replicas: 356,361,362 Isr: 356,362,361
Topic: __consumer_offsets Partition: 43 Leader: 357 Replicas: 357,362,356 Isr: 356,357,362
Topic: __consumer_offsets Partition: 44 Leader: 358 Replicas: 358,356,357 Isr: 356,358,357
Topic: __consumer_offsets Partition: 45 Leader: 359 Replicas: 359,357,358 Isr: 357,359,358
Topic: __consumer_offsets Partition: 46 Leader: 360 Replicas: 360,358,359 Isr: 359,358,360
Topic: __consumer_offsets Partition: 47 Leader: 361 Replicas: 361,359,360 Isr: 359,361,360
Topic: __consumer_offsets Partition: 48 Leader: 360 Replicas: 362,360,361 Isr: 360,361
Topic: __consumer_offsets Partition: 49 Leader: 356 Replicas: 356,362,357 Isr: 356,357,362
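
To make a listing like the one above easier to read, a small script can flag the partitions without a leader and count how often each broker is missing from the ISR. This is only a rough sketch: it assumes the describe output has been pasted into a string in the same format as above, and the function names (`parse_describe`, `summarize`) are made up for illustration, not part of any Kafka tooling.

```python
import re
from collections import Counter

# Matches the per-partition lines of the listing; the summary lines
# (the ones containing PartitionCount) do not match and are skipped.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)\s+"
    r"Isr:\s*(?P<isr>[\d,]*)"
)


def parse_describe(text):
    """Turn the describe listing into a list of dicts, one per partition."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        rows.append({
            "topic": m["topic"],
            "partition": int(m["partition"]),
            "leader": int(m["leader"]),
            "replicas": [int(b) for b in m["replicas"].split(",")],
            "isr": [int(b) for b in m["isr"].split(",") if b],
        })
    return rows


def summarize(rows):
    """Return (partitions with Leader: -1, per-broker count of ISR absences)."""
    offline = [(r["topic"], r["partition"]) for r in rows if r["leader"] == -1]
    missing = Counter()
    for r in rows:
        for broker in r["replicas"]:
            if broker not in r["isr"]:
                missing[broker] += 1
    return offline, missing


# A few lines copied from the listing above, used as sample input.
SAMPLE = """\
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:3 Configs:segment.bytes=104857600,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 5 Leader: 361 Replicas: 361,359,360 Isr: 359,361,360
Topic: __consumer_offsets Partition: 6 Leader: 361 Replicas: 362,360,361 Isr: 361,360
Topic: __consumer_offsets Partition: 12 Leader: -1 Replicas: 361,360,362 Isr: 362
"""

if __name__ == "__main__":
    offline, missing = summarize(parse_describe(SAMPLE))
    print("Partitions without a leader:", offline)
    for broker, count in missing.most_common():
        print(f"Broker {broker} is out of the ISR for {count} partition(s)")
```

Running it over the full listing instead of the three sample lines should show which broker IDs are most often absent from the ISR, which is probably the first thing to check on the broker side.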

 

I also found these errors:

[Controller id=356 epoch=41] Encountered error while electing leader for partition __consumer_offsets-12 due to: Preferred replica 361 for partition __consumer_offsets-12 is either not alive or not in the isr. Current leader and ISR: [{"leader":-1,"leader_epoch":32,"isr":[362]}]

[Controller id=356 epoch=41] Initiated state change for partition __consumer_offsets-12 from OfflinePartition to OnlinePartition failed
kafka.common.StateChangeFailedException: [Controller id=356 epoch=41] Encountered error while electing leader for partition __consumer_offsets-12 due to: Preferred replica 361 for partition __consumer_offsets-12 is either not alive or not in the isr. Current leader and ISR: [{"leader":-1,"leader_epoch":32,"isr":[362]}]
	at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:324)
	at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:163)
	at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:110)
	at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:109)
	at scala.collection.immutable.Set$Set1.foreach(Set.scala:94)
	at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:109)
	at kafka.controller.KafkaController.onPreferredReplicaElection(KafkaController.scala:632)
	at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3$$anonfun$apply$13.apply(KafkaController.scala:1189)
	at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3$$anonfun$apply$13.apply(KafkaController.scala:1182)
	at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
	at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
	at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
	at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
	at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:1182)
	at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:1169)
	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
	at kafka.controller.KafkaController.kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance(KafkaController.scala:1169)
	at kafka.controller.KafkaController$AutoPreferredReplicaLeaderElection$.process(KafkaController.scala:1449)
	at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:53)
	at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:53)
	at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:53)
	at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
	at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:52)
	at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:64)
Caused by: kafka.common.StateChangeFailedException: Preferred replica 361 for partition __consumer_offsets-12 is either not alive or not in the isr. Current leader and ISR: [{"leader":-1,"leader_epoch":32,"isr":[362]}]
	at kafka.controller.PreferredReplicaPartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:157)
	at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:303)
	... 25 more
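
Since I was struggling with the regex side of the log search, here is a minimal sketch of a pattern that pulls the topic, partition, and preferred replica out of messages like the ones above. It assumes the log text is available as a plain string; the function name `failed_elections` is hypothetical, and the named-group syntax is Python's, so a log search tool may need the pattern adapted.

```python
import re

# Pattern built from the wording of the controller error shown above.
# \S+ is greedy, so topic names that contain hyphens are still handled:
# only the final "-<digits>" is taken as the partition number.
PREFERRED_RE = re.compile(
    r"Preferred replica (?P<broker>\d+) for partition "
    r"(?P<topic>\S+)-(?P<partition>\d+) is either not alive or not in the isr"
)


def failed_elections(log_text):
    """Return sorted unique (topic, partition, preferred_broker) tuples."""
    found = set()
    for m in PREFERRED_RE.finditer(log_text):
        found.add((m["topic"], int(m["partition"]), int(m["broker"])))
    return sorted(found)


# The first error message from above, used as sample input.
SAMPLE_LOG = (
    "[Controller id=356 epoch=41] Encountered error while electing leader for "
    "partition __consumer_offsets-12 due to: Preferred replica 361 for partition "
    "__consumer_offsets-12 is either not alive or not in the isr. Current leader "
    'and ISR: [{"leader":-1,"leader_epoch":32,"isr":[362]}]'
)

if __name__ == "__main__":
    for topic, partition, broker in failed_elections(SAMPLE_LOG):
        print(f"{topic}-{partition}: preferred replica {broker} unavailable")
```

Duplicates are collapsed on purpose, because the same message is repeated in the exception and in its "Caused by" line.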

 

Any pointers would be greatly appreciated. I'm relatively new to Kafka and don't really know what to ask or how to ask it.

 

Thanks!
