Created on 05-21-2018 04:28 PM - edited 09-16-2022 06:15 AM
Hi,
Has anyone seen this error? Please let me know.
2018-05-21 22:56:20,126 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - Discovered coordinator ss879.xxx.xxx.xxx.com:9092 (id: 2144756551 rack: null) for group prod-abc-events.
2018-05-21 22:56:20,126 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - (Re-)joining group prod-abc-events
2018-05-21 22:56:20,126 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - Marking the coordinator ss879.xxx.xxx.xxx.com:9092 (id: 2144756551 rack: null) dead for group prod-abc-events
Created 05-22-2018 02:12 AM
This can occur, for example, when there are network communication errors between a consumer and the consumer group coordinator (a designated Kafka broker: the leader of the partition of the internal offsets topic used to track the consumers' progress in the group). If that broker is down for some reason, the consumer will mark it as dead. In that case a new consumer coordinator will be selected from the ISR set (assuming offsets.topic.replication.factor=3 and min.insync.replicas for the internal topic is 2).
Some questions (a small config sketch follows the list):
- How did you configure session.timeout.ms, heartbeat.interval.ms, and request.timeout.ms?
- Does your consumer poll and send heartbeats to the coordinator Kafka broker on time?
- How do you assign partitions in your consumer group?
- How do you commit the offsets?
- Can you share the Kafka version you are using?
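For reference, a minimal sketch of how those settings can be supplied to a console consumer for testing (the broker host, file path, topic name, and timeout values below are placeholders, not your actual configuration):
# hypothetical consumer settings written to a temporary properties file
cat > /tmp/test-consumer.properties <<'EOF'
group.id=prod-abc-events
session.timeout.ms=10000
heartbeat.interval.ms=3000
request.timeout.ms=40000
EOF
# consume with the same group id as the application to see whether the
# coordinator lookup succeeds outside of your service
kafka-console-consumer --bootstrap-server broker1:9092 --topic your-topic \
  --consumer.config /tmp/test-consumer.properties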
Created 05-22-2018 05:41 AM
- How did you configure session.timeout.ms, heartbeat.interval.ms, and request.timeout.ms?
session.timeout.ms - 20 seconds
heartbeat.interval.ms - 3 seconds
request.timeout.ms(broker) - 30 seconds
request.timeout.ms(connect) - 40 seconds
- min.insync.replicas is 1 in our cluster.
- Does your consumer poll and send heartbeats to the coordinator Kafka broker on time? Yes.
- How do you assign partitions in your consumer group? We set a cluster-wide default of 50 partitions and topic auto-creation is enabled.
- How do you commit the offsets? - Checking
- Can you share the Kafka version you are using? 0.11.0.1
We actually ran kafka-reassign-partitions, and then the disks filled up to 100% and some brokers went offline, so we had to stop the reassignment. We then started deleting the excess partitions using:
kafka-reassign-partitions --reassignment-json-file topics123.json --zookeeper xxxxx:2181/kafka --execute
The cluster came back online and all brokers are up with no under-replicated partitions.
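For reference, a hedged way to double-check that state from the CLI (the JSON file and ZooKeeper connect string are the same placeholders used in the command above):
# verify whether the earlier reassignment completed or was left partially applied
kafka-reassign-partitions --reassignment-json-file topics123.json --zookeeper xxxxx:2181/kafka --verify
# confirm there are no under-replicated or leaderless partitions anywhere in the cluster
kafka-topics --describe --zookeeper xxxxx:2181/kafka --under-replicated-partitions
kafka-topics --describe --zookeeper xxxxx:2181/kafka --unavailable-partitions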
Created 05-22-2018 08:02 AM
If min.insync.replicas is 1 and some brokers went offline, that can be the root cause of the issue (assuming minISR is 1 for __consumer_offsets too). In this case, the broker that is the coordinator for the given consumer group is not alive (i.e. there is no partition leader for the partition of the internal __consumer_offsets topic that stores this group's offsets).
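You can check that directly; for example (a hedged sketch, substitute your ZooKeeper connect string):
# lists any __consumer_offsets partitions that currently have no leader;
# an empty result means every partition of the internal topic is available
kafka-topics --describe --zookeeper xxxxx:2181/kafka --topic __consumer_offsets --unavailable-partitions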
Created on 05-22-2018 08:09 AM - edited 05-22-2018 08:14 AM
This is the topic that's causing the issue.
desind@xxx:#> kafka-topics --describe --zookeeper xxx:2181/kafka --topic messages-events
Topic:messages-events PartitionCount:50 ReplicationFactor:3 Configs:retention.ms=86400000
Topic: messages-events Partition: 0 Leader: 155 Replicas: 155,97,98 Isr: 155,97,98
Topic: messages-events Partition: 1 Leader: 157 Replicas: 157,97,98 Isr: 157,97,98
Topic: messages-events Partition: 2 Leader: 156 Replicas: 156,98,154 Isr: 156,154,98
Topic: messages-events Partition: 3 Leader: 157 Replicas: 154,157,95 Isr: 154,157,95
Topic: messages-events Partition: 4 Leader: 96 Replicas: 96,155,157 Isr: 155,157,96
Topic: messages-events Partition: 5 Leader: 155 Replicas: 95,155,156 Isr: 156,155,95
Topic: messages-events Partition: 6 Leader: 98 Replicas: 98,158,95 Isr: 95,158,98
Topic: messages-events Partition: 7 Leader: 157 Replicas: 157,97,96 Isr: 157,96,97
Topic: messages-events Partition: 8 Leader: 95 Replicas: 95,98,158 Isr: 95,158,98
Topic: messages-events Partition: 9 Leader: 96 Replicas: 96,95,99 Isr: 95,96,99
Topic: messages-events Partition: 10 Leader: 157 Replicas: 157,97,98 Isr: 157,97,98
Topic: messages-events Partition: 11 Leader: 98 Replicas: 98,99,155 Isr: 155,98,99
Topic: messages-events Partition: 12 Leader: 95 Replicas: 95,154,156 Isr: 156,95,154
Topic: messages-events Partition: 13 Leader: 96 Replicas: 96,157,158 Isr: 157,158,96
Topic: messages-events Partition: 14 Leader: 155 Replicas: 95,155,156 Isr: 156,155,95
Topic: messages-events Partition: 15 Leader: 157 Replicas: 156,157,95 Isr: 156,157,95
Topic: messages-events Partition: 16 Leader: 97 Replicas: 97,99,158 Isr: 158,97,99
Topic: messages-events Partition: 17 Leader: 97 Replicas: 97,95,154 Isr: 95,154,97
Topic: messages-events Partition: 18 Leader: 98 Replicas: 98,96,95 Isr: 95,96,98
Topic: messages-events Partition: 19 Leader: 97 Replicas: 97,99,156 Isr: 156,97,99
Topic: messages-events Partition: 20 Leader: 98 Replicas: 98,99,154 Isr: 154,98,99
Topic: messages-events Partition: 21 Leader: 95 Replicas: 95,155,99 Isr: 155,95,99
Topic: messages-events Partition: 22 Leader: 96 Replicas: 96,158,95 Isr: 95,158,96
Topic: messages-events Partition: 23 Leader: 97 Replicas: 97,95,96 Isr: 95,96,97
Topic: messages-events Partition: 24 Leader: 98 Replicas: 98,96,97 Isr: 96,97,98
Topic: messages-events Partition: 25 Leader: 157 Replicas: 157,95,158 Isr: 157,158,95
Topic: messages-events Partition: 26 Leader: 96 Replicas: 96,95,158 Isr: 95,158,96
Topic: messages-events Partition: 27 Leader: 95 Replicas: 95,96,97 Isr: 95,96,97
Topic: messages-events Partition: 28 Leader: 157 Replicas: 157,155,158 Isr: 155,157,158
Topic: messages-events Partition: 29 Leader: 158 Replicas: 158,157,95 Isr: 157,158,95
Topic: messages-events Partition: 30 Leader: 95 Replicas: 95,158,96 Isr: 95,158,96
Topic: messages-events Partition: 31 Leader: 155 Replicas: 95,155,156 Isr: 156,155,95
Topic: messages-events Partition: 32 Leader: 97 Replicas: 97,96,98 Isr: 96,97,98
Topic: messages-events Partition: 33 Leader: 98 Replicas: 98,97,99 Isr: 98,97,99
Topic: messages-events Partition: 34 Leader: 157 Replicas: 154,157,95 Isr: 154,157,95
Topic: messages-events Partition: 35 Leader: 96 Replicas: 96,95,158 Isr: 95,158,96
Topic: messages-events Partition: 36 Leader: 95 Replicas: 95,96,97 Isr: 95,96,97
Topic: messages-events Partition: 37 Leader: 157 Replicas: 157,158,95 Isr: 157,158,95
Topic: messages-events Partition: 38 Leader: 158 Replicas: 158,95,96 Isr: 158,95,96
Topic: messages-events Partition: 39 Leader: 95 Replicas: 95,98,154 Isr: 95,154,98
Topic: messages-events Partition: 40 Leader: 96 Replicas: 96,97,98 Isr: 96,97,98
Topic: messages-events Partition: 41 Leader: 97 Replicas: 97,98,99 Isr: 97,98,99
Topic: messages-events Partition: 42 Leader: 98 Replicas: 98,99,154 Isr: 154,98,99
Topic: messages-events Partition: 43 Leader: 157 Replicas: 157,95,158 Isr: 157,158,95
Topic: messages-events Partition: 44 Leader: 95 Replicas: 95,96,154 Isr: 95,154,96
Topic: messages-events Partition: 45 Leader: 97 Replicas: 97,95,154 Isr: 95,154,97
Topic: messages-events Partition: 46 Leader: 95 Replicas: 95,98,96 Isr: 95,96,98
Topic: messages-events Partition: 47 Leader: 98 Replicas: 98,97,99 Isr: 98,97,99
Topic: messages-events Partition: 48 Leader: 95 Replicas: 95,98,154 Isr: 95,154,98
Topic: messages-events Partition: 49 Leader: 155 Replicas: 155,99,154 Isr: 155,154,99
###Consumer_offsets
Topic:__confluent.support.metrics PartitionCount:1 ReplicationFactor:3 Configs:leader.replication.throttled.replicas=0:99,0:95,0:96,follower.replication.throttled.replicas=0:155,0:156,retention.ms=31536000000
Topic: __confluent.support.metrics Partition: 0 Leader: 95 Replicas: 95,155,156 Isr: 95,155,156
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:3 Configs:segment.bytes=104857600,leader.replication.throttled.replicas=19:96,19:95,19:97,30:97,30:95,30:96,47:99,47:96,47:97,41:98,41:99,41:95,29:96,29:98,29:99,39:96,39:95,39:97,10:97,10:95,10:96,17:99,17:98,17:95,14:96,14:99,14:95,40:97,40:98,40:99,18:95,18:99,18:96,0:97,0:98,0:99,26:98,26:95,26:96,24:96,24:97,24:98,33:95,33:98,33:99,20:97,20:98,20:99,21:98,21:99,21:95,22:99,22:95,22:96,5:97,5:99,5:95,12:99,12:97,12:98,8:95,8:97,8:98,23:95,23:96,23:97,15:97,15:96,15:98,48:95,48:97,48:98,11:98,11:96,11:97,13:95,13:98,13:99,28:95,28:97,28:98,49:96,49:98,49:99,6:98,6:95,6:96,37:99,37:98,37:95,44:96,44:97,44:98,31:98,31:96,31:97,34:96,34:99,34:95,42:99,42:95,42:96,46:98,46:95,46:96,25:97,25:99,25:95,27:99,27:96,27:97,45:97,45:99,45:95,43:95,43:96,43:97,32:99,32:97,32:98,36:98,36:97,36:99,35:97,35:96,35:98,7:99,7:96,7:97,38:95,38:99,38:96,9:96,9:98,9:99,1:98,1:99,1:95,16:98,16:97,16:99,2:99,2:95,2:96,follower.replication.throttled.replicas=32:158,16:154,16:155,49:155,49:97,44:155,44:156,28:154,28:157,28:158,17:155,17:156,23:98,23:99,7:154,7:155,29:155,29:158,29:95,35:155,35:156,24:99,24:154,41:157,0:156,0:157,0:158,38:154,38:158,13:97,8:154,8:155,8:156,5:98,39:155,36:156,36:157,40:156,45:156,45:157,15:99,15:154,33:154,37:157,37:158,21:157,21:96,21:97,6:99,6:154,11:157,11:95,20:156,20:95,20:96,47:158,47:95,2:158,27:156,27:157,34:154,34:155,9:155,9:156,9:157,22:158,22:97,22:98,42:158,42:154,14:98,25:154,25:155,10:156,10:158,48:154,48:96,31:157,18:154,18:156,18:157,19:155,19:157,19:158,12:158,12:96,46:157,46:158,43:154,43:155,1:157,1:158,26:155,26:156,30:156,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 156 Replicas: 156,157,158 Isr: 156,157,158
Topic: __consumer_offsets Partition: 1 Leader: 157 Replicas: 157,158,95 Isr: 157,158,95
Topic: __consumer_offsets Partition: 2 Leader: 158 Replicas: 158,95,96 Isr: 158,95,96
Topic: __consumer_offsets Partition: 3 Leader: 95 Replicas: 95,96,97 Isr: 95,96,97
Topic: __consumer_offsets Partition: 4 Leader: 96 Replicas: 96,97,98 Isr: 96,97,98
Topic: __consumer_offsets Partition: 5 Leader: 97 Replicas: 97,98,99 Isr: 97,98,99
Topic: __consumer_offsets Partition: 6 Leader: 98 Replicas: 98,99,154 Isr: 154,98,99
Topic: __consumer_offsets Partition: 7 Leader: 99 Replicas: 99,154,155 Isr: 155,154,99
Topic: __consumer_offsets Partition: 8 Leader: 154 Replicas: 154,155,156 Isr: 154,155,156
Topic: __consumer_offsets Partition: 9 Leader: 155 Replicas: 155,156,157 Isr: 155,156,157
Topic: __consumer_offsets Partition: 10 Leader: 156 Replicas: 156,158,95 Isr: 95,156,158
Topic: __consumer_offsets Partition: 11 Leader: 157 Replicas: 157,95,96 Isr: 157,95,96
Topic: __consumer_offsets Partition: 12 Leader: 158 Replicas: 158,96,97 Isr: 158,96,97
Topic: __consumer_offsets Partition: 13 Leader: 95 Replicas: 95,97,98 Isr: 95,97,98
Topic: __consumer_offsets Partition: 14 Leader: 96 Replicas: 96,98,99 Isr: 96,98,99
Topic: __consumer_offsets Partition: 15 Leader: 97 Replicas: 97,99,154 Isr: 154,97,99
Topic: __consumer_offsets Partition: 16 Leader: 98 Replicas: 98,154,155 Isr: 155,154,98
Topic: __consumer_offsets Partition: 17 Leader: 99 Replicas: 99,155,156 Isr: 155,156,99
Topic: __consumer_offsets Partition: 18 Leader: 154 Replicas: 154,156,157 Isr: 154,156,157
Topic: __consumer_offsets Partition: 19 Leader: 155 Replicas: 155,157,158 Isr: 155,157,158
Topic: __consumer_offsets Partition: 20 Leader: 156 Replicas: 156,95,96 Isr: 95,156,96
Topic: __consumer_offsets Partition: 21 Leader: 157 Replicas: 157,96,97 Isr: 157,96,97
Topic: __consumer_offsets Partition: 22 Leader: 158 Replicas: 158,97,98 Isr: 158,97,98
Topic: __consumer_offsets Partition: 23 Leader: 95 Replicas: 95,98,99 Isr: 95,98,99
Topic: __consumer_offsets Partition: 24 Leader: 96 Replicas: 96,99,154 Isr: 154,96,99
Topic: __consumer_offsets Partition: 25 Leader: 97 Replicas: 97,154,155 Isr: 154,155,97
Topic: __consumer_offsets Partition: 26 Leader: 98 Replicas: 98,155,156 Isr: 155,156,98
Topic: __consumer_offsets Partition: 27 Leader: 99 Replicas: 99,156,157 Isr: 156,157,99
Topic: __consumer_offsets Partition: 28 Leader: 154 Replicas: 154,157,158 Isr: 154,157,158
Topic: __consumer_offsets Partition: 29 Leader: 155 Replicas: 155,158,95 Isr: 155,95,158
Topic: __consumer_offsets Partition: 30 Leader: 156 Replicas: 156,96,97 Isr: 156,96,97
Topic: __consumer_offsets Partition: 31 Leader: 157 Replicas: 157,97,98 Isr: 157,97,98
Topic: __consumer_offsets Partition: 32 Leader: 158 Replicas: 158,98,99 Isr: 158,98,99
Topic: __consumer_offsets Partition: 33 Leader: 95 Replicas: 95,99,154 Isr: 95,154,99
Topic: __consumer_offsets Partition: 34 Leader: 96 Replicas: 96,154,155 Isr: 155,154,96
Topic: __consumer_offsets Partition: 35 Leader: 97 Replicas: 97,155,156 Isr: 156,155,97
Topic: __consumer_offsets Partition: 36 Leader: 98 Replicas: 98,156,157 Isr: 156,157,98
Topic: __consumer_offsets Partition: 37 Leader: 99 Replicas: 99,157,158 Isr: 157,158,99
Topic: __consumer_offsets Partition: 38 Leader: 154 Replicas: 154,158,95 Isr: 154,95,158
Topic: __consumer_offsets Partition: 39 Leader: 155 Replicas: 155,95,96 Isr: 155,95,96
Topic: __consumer_offsets Partition: 40 Leader: 156 Replicas: 156,97,98 Isr: 156,97,98
Topic: __consumer_offsets Partition: 41 Leader: 157 Replicas: 157,98,99 Isr: 157,98,99
Topic: __consumer_offsets Partition: 42 Leader: 158 Replicas: 158,99,154 Isr: 154,158,99
Topic: __consumer_offsets Partition: 43 Leader: 95 Replicas: 95,154,155 Isr: 95,154,155
Topic: __consumer_offsets Partition: 44 Leader: 96 Replicas: 96,155,156 Isr: 155,156,96
Topic: __consumer_offsets Partition: 45 Leader: 97 Replicas: 97,156,157 Isr: 156,157,97
Topic: __consumer_offsets Partition: 46 Leader: 98 Replicas: 98,157,158 Isr: 157,158,98
Topic: __consumer_offsets Partition: 47 Leader: 99 Replicas: 99,158,95 Isr: 95,158,99
Topic: __consumer_offsets Partition: 48 Leader: 154 Replicas: 154,95,96 Isr: 154,95,96
Topic: __consumer_offsets Partition: 49 Leader: 155 Replicas: 155,96,97 Isr: 155,96,97
offsets.topic.replication.factor is 3 in the cluster.
The leader and the preferred replica are not the same for some partitions of this topic. Is that the issue?
What is the best course of action next? Can we drain all messages from this topic?
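In case leader placement turns out to matter here, a hedged sketch of realigning leaders with the preferred replica (the first broker in each Replicas list), using the same ZooKeeper placeholder as above:
# triggers preferred replica leader election for all partitions;
# a JSON file passed via --path-to-json-file can limit it to specific partitions
kafka-preferred-replica-election --zookeeper xxxxx:2181/kafka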
Created 05-23-2018 01:14 AM
Created 05-23-2018 06:28 AM
FAILS (the session hangs) with consumer group prod-abc-events:
kafka-console-consumer --bootstrap-server xxxx:9092 --topic messages-events --consumer-property group.id=prod-abc-events
WORKS with a new consumer group:
kafka-console-consumer --bootstrap-server xxxx:9092 --topic messages-events --consumer-property group.id=test-id
So when I use the consumer group name "prod-abc-events", it fails.
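A hedged way to inspect the failing group from the CLI (assuming the kafka-consumer-groups tool shipped with this broker version):
# describes partition assignments, committed offsets, and lag for the group;
# if the group's coordinator is unavailable, this typically errors out or hangs
kafka-consumer-groups --bootstrap-server xxxx:9092 --describe --group prod-abc-events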
Created 05-23-2018 07:29 AM
I'm glad to hear you were able to drain messages with the new consumer group.
Does it fail for the same reason (coordinator marked dead)? Please note that consumers in the "prod-abc-events" consumer group have already committed offsets to consume from; if no new messages are produced, they would look as if they were hanging.
Actually, the coordinator (designated broker) for a consumer group is derived from the group.id: the group maps to one partition of __consumer_offsets (abs(group.id.hashCode()) modulo the number of offsets partitions, 50 here), and the leader of that partition acts as the coordinator (in the consumer, the lookup is sent from sendGroupCoordinatorRequest()). So the second time you start a consumer with the same group id, it will go to the same broker. If you don't specify a group.id for kafka-console-consumer, one is generated.
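If you want to see whether commits for "prod-abc-events" still reach the coordinator's partition, one hedged option is to dump the internal topic with the offsets formatter (the formatter class name below is the one used by 0.11.x and may differ in other versions):
# prints decoded offset commit messages; grep narrows the output to the affected group
kafka-console-consumer --bootstrap-server xxxx:9092 --topic __consumer_offsets \
  --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" \
  --consumer-property exclude.internal.topics=false --from-beginning | grep prod-abc-events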
Created 05-23-2018 08:22 AM
Yes. After draining the topic completely, we still see this error:
2018-05-23 15:19:49,449 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - Discovered coordinator 315.xxx.com:9092 (id: 2147483551 rack: null) for group prod-abc-events.
2018-05-23 15:19:49,449 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - (Re-)joining group prod-abc-events
2018-05-23 15:19:49,450 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - Marking the coordinator 315.xxx.com:9092 (id: 2147483551 rack: null) dead for group prod-abc-events
What do you suggest we do next?
One thing I can think of is to restart the producer.
Created 05-24-2018 04:31 AM
What are the implications of deleting a .log file that contains this consumer group's data from the "__consumer_offsets" topic?