
Error in Kafka consumer

Expert Contributor

Hi,

Has anyone seen this error? Please let me know:

 

2018-05-21 22:56:20,126 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - Discovered coordinator ss879.xxx.xxx.xxx.com:9092 (id: 2144756551 rack: null) for group prod-abc-events.
2018-05-21 22:56:20,126 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - (Re-)joining group prod-abc-events
2018-05-21 22:56:20,126 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - Marking the coordinator ss879.xxx.xxx.xxx.com:9092 (id: 2144756551 rack: null) dead for group prod-abc-events

 

13 Replies

Cloudera Employee

This can occur, for example, when there are network communication errors between a consumer and the consumer group coordinator (a designated Kafka broker that is the leader of the group's partition of the internal offsets topic used for tracking the consumers' progress). If that broker is down for some reason, the consumer will mark it as dead. In that case a new coordinator will be selected from the ISR set (assuming offsets.topic.replication.factor=3 and min.insync.replicas for the internal topic is 2).

Some questions:
- How did you configure session.timeout.ms, heartbeat.interval.ms and request.timeout.ms? (A rough sketch of where these settings live in the Java consumer follows below.)
- Does your consumer poll and send heartbeats to the coordinator Kafka broker on time?
- How do you assign partitions in your consumer group?
- How do you commit the offsets?
- Can you share the Kafka version you are using?
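
For reference, a minimal sketch of where those settings live when using the new Java consumer. The broker address, topic name, timeout values and the manual commit are illustrative placeholders only, not recommendations:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerTimeoutsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "prod-abc-events");
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "20000");         // session.timeout.ms
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");       // heartbeat.interval.ms, well below the session timeout
        props.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "40000");         // request.timeout.ms
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually below
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("your-topic"));      // placeholder topic
            while (true) {
                // poll() must be called regularly; heartbeats are sent from a background thread,
                // but long gaps between polls can still trigger a rebalance
                ConsumerRecords<String, String> records = consumer.poll(1000);
                // ... process records ...
                consumer.commitSync();                                        // one way to commit offsets
            }
        }
    }
}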

Expert Contributor

- How did you configure session.timeout.ms, heartbeat.interval.ms, request.timeout.ms?

session.timeout.ms - 20 seconds
heartbeat.interval.ms - 3 seconds
request.timeout.ms (broker) - 30 seconds
request.timeout.ms (connect) - 40 seconds

- min.insync.replicas is 1 in our cluster.
- Does your consumer poll and send heartbeats to the coordinator Kafka broker on time? Yes.
- How do you assign partitions in your consumer group? We set a cluster default of 50 partitions and topic auto-creation is enabled.
- How do you commit the offsets? Still checking.
- Can you share the Kafka version you are using? 0.11.0.1

 

We actually ran kafka-reassign-partitions, and then the disks filled up to 100% and some brokers went offline, so we had to stop the reassignment. We then started deleting the excess partitions using:

kafka-reassign-partitions --reassignment-json-file topics123.json --zookeeper xxxxx:2181/kafka --execute

 

The cluster was back online and all brokers are up with no under-replicated partitions.

 

Cloudera Employee

If min.insync.replicas is 1 and some brokers went offline, that can be the root cause of the issue (assuming min ISR is 1 for __consumer_offsets too). In that case, the broker that acts as coordinator for the given consumer group is not alive (i.e. there is no partition leader for partition XY of the internal __consumer_offsets topic).

- Can you verify this by running kafka-topics --describe --zookeeper $(hostname -f):2181? (A programmatic alternative with the AdminClient is sketched below.)
- What is offsets.topic.replication.factor set to for __consumer_offsets? It is recommended to set it to 3.
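
If it is easier, the same check can be done programmatically; a rough sketch using the Java AdminClient (available since Kafka 0.11), with a placeholder bootstrap address:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class DescribeOffsetsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder broker
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singletonList("__consumer_offsets"))
                    .values().get("__consumer_offsets").get();
            // A null leader for a partition would mean no coordinator is available
            // for the consumer groups mapped to that partition.
            for (TopicPartitionInfo p : desc.partitions()) {
                System.out.printf("partition %d leader=%s isr=%s%n",
                        p.partition(), p.leader(), p.isr());
            }
        }
    }
}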

Expert Contributor

This is the topic that's causing the issue.

 

desind@xxx:#> kafka-topics --describe --zookeeper xxx:2181/kafka --topic messages-events
Topic:messages-events PartitionCount:50 ReplicationFactor:3 Configs:retention.ms=86400000
Topic: messages-events Partition: 0 Leader: 155 Replicas: 155,97,98 Isr: 155,97,98
Topic: messages-events Partition: 1 Leader: 157 Replicas: 157,97,98 Isr: 157,97,98
Topic: messages-events Partition: 2 Leader: 156 Replicas: 156,98,154 Isr: 156,154,98
Topic: messages-events Partition: 3 Leader: 157 Replicas: 154,157,95 Isr: 154,157,95
Topic: messages-events Partition: 4 Leader: 96 Replicas: 96,155,157 Isr: 155,157,96
Topic: messages-events Partition: 5 Leader: 155 Replicas: 95,155,156 Isr: 156,155,95
Topic: messages-events Partition: 6 Leader: 98 Replicas: 98,158,95 Isr: 95,158,98
Topic: messages-events Partition: 7 Leader: 157 Replicas: 157,97,96 Isr: 157,96,97
Topic: messages-events Partition: 8 Leader: 95 Replicas: 95,98,158 Isr: 95,158,98
Topic: messages-events Partition: 9 Leader: 96 Replicas: 96,95,99 Isr: 95,96,99
Topic: messages-events Partition: 10 Leader: 157 Replicas: 157,97,98 Isr: 157,97,98
Topic: messages-events Partition: 11 Leader: 98 Replicas: 98,99,155 Isr: 155,98,99
Topic: messages-events Partition: 12 Leader: 95 Replicas: 95,154,156 Isr: 156,95,154
Topic: messages-events Partition: 13 Leader: 96 Replicas: 96,157,158 Isr: 157,158,96
Topic: messages-events Partition: 14 Leader: 155 Replicas: 95,155,156 Isr: 156,155,95
Topic: messages-events Partition: 15 Leader: 157 Replicas: 156,157,95 Isr: 156,157,95
Topic: messages-events Partition: 16 Leader: 97 Replicas: 97,99,158 Isr: 158,97,99
Topic: messages-events Partition: 17 Leader: 97 Replicas: 97,95,154 Isr: 95,154,97
Topic: messages-events Partition: 18 Leader: 98 Replicas: 98,96,95 Isr: 95,96,98
Topic: messages-events Partition: 19 Leader: 97 Replicas: 97,99,156 Isr: 156,97,99
Topic: messages-events Partition: 20 Leader: 98 Replicas: 98,99,154 Isr: 154,98,99
Topic: messages-events Partition: 21 Leader: 95 Replicas: 95,155,99 Isr: 155,95,99
Topic: messages-events Partition: 22 Leader: 96 Replicas: 96,158,95 Isr: 95,158,96
Topic: messages-events Partition: 23 Leader: 97 Replicas: 97,95,96 Isr: 95,96,97
Topic: messages-events Partition: 24 Leader: 98 Replicas: 98,96,97 Isr: 96,97,98
Topic: messages-events Partition: 25 Leader: 157 Replicas: 157,95,158 Isr: 157,158,95
Topic: messages-events Partition: 26 Leader: 96 Replicas: 96,95,158 Isr: 95,158,96
Topic: messages-events Partition: 27 Leader: 95 Replicas: 95,96,97 Isr: 95,96,97
Topic: messages-events Partition: 28 Leader: 157 Replicas: 157,155,158 Isr: 155,157,158
Topic: messages-events Partition: 29 Leader: 158 Replicas: 158,157,95 Isr: 157,158,95
Topic: messages-events Partition: 30 Leader: 95 Replicas: 95,158,96 Isr: 95,158,96
Topic: messages-events Partition: 31 Leader: 155 Replicas: 95,155,156 Isr: 156,155,95
Topic: messages-events Partition: 32 Leader: 97 Replicas: 97,96,98 Isr: 96,97,98
Topic: messages-events Partition: 33 Leader: 98 Replicas: 98,97,99 Isr: 98,97,99
Topic: messages-events Partition: 34 Leader: 157 Replicas: 154,157,95 Isr: 154,157,95
Topic: messages-events Partition: 35 Leader: 96 Replicas: 96,95,158 Isr: 95,158,96
Topic: messages-events Partition: 36 Leader: 95 Replicas: 95,96,97 Isr: 95,96,97
Topic: messages-events Partition: 37 Leader: 157 Replicas: 157,158,95 Isr: 157,158,95
Topic: messages-events Partition: 38 Leader: 158 Replicas: 158,95,96 Isr: 158,95,96
Topic: messages-events Partition: 39 Leader: 95 Replicas: 95,98,154 Isr: 95,154,98
Topic: messages-events Partition: 40 Leader: 96 Replicas: 96,97,98 Isr: 96,97,98
Topic: messages-events Partition: 41 Leader: 97 Replicas: 97,98,99 Isr: 97,98,99
Topic: messages-events Partition: 42 Leader: 98 Replicas: 98,99,154 Isr: 154,98,99
Topic: messages-events Partition: 43 Leader: 157 Replicas: 157,95,158 Isr: 157,158,95
Topic: messages-events Partition: 44 Leader: 95 Replicas: 95,96,154 Isr: 95,154,96
Topic: messages-events Partition: 45 Leader: 97 Replicas: 97,95,154 Isr: 95,154,97
Topic: messages-events Partition: 46 Leader: 95 Replicas: 95,98,96 Isr: 95,96,98
Topic: messages-events Partition: 47 Leader: 98 Replicas: 98,97,99 Isr: 98,97,99
Topic: messages-events Partition: 48 Leader: 95 Replicas: 95,98,154 Isr: 95,154,98
Topic: messages-events Partition: 49 Leader: 155 Replicas: 155,99,154 Isr: 155,154,99

 

__consumer_offsets:

 

Topic:__confluent.support.metrics PartitionCount:1 ReplicationFactor:3 Configs:leader.replication.throttled.replicas=0:99,0:95,0:96,follower.replication.throttled.replicas=0:155,0:156,retention.ms=31536000000
Topic: __confluent.support.metrics Partition: 0 Leader: 95 Replicas: 95,155,156 Isr: 95,155,156
Topic:__consumer_offsets PartitionCount:50 ReplicationFactor:3 Configs:segment.bytes=104857600,leader.replication.throttled.replicas=19:96,19:95,19:97,30:97,30:95,30:96,47:99,47:96,47:97,41:98,41:99,41:95,29:96,29:98,29:99,39:96,39:95,39:97,10:97,10:95,10:96,17:99,17:98,17:95,14:96,14:99,14:95,40:97,40:98,40:99,18:95,18:99,18:96,0:97,0:98,0:99,26:98,26:95,26:96,24:96,24:97,24:98,33:95,33:98,33:99,20:97,20:98,20:99,21:98,21:99,21:95,22:99,22:95,22:96,5:97,5:99,5:95,12:99,12:97,12:98,8:95,8:97,8:98,23:95,23:96,23:97,15:97,15:96,15:98,48:95,48:97,48:98,11:98,11:96,11:97,13:95,13:98,13:99,28:95,28:97,28:98,49:96,49:98,49:99,6:98,6:95,6:96,37:99,37:98,37:95,44:96,44:97,44:98,31:98,31:96,31:97,34:96,34:99,34:95,42:99,42:95,42:96,46:98,46:95,46:96,25:97,25:99,25:95,27:99,27:96,27:97,45:97,45:99,45:95,43:95,43:96,43:97,32:99,32:97,32:98,36:98,36:97,36:99,35:97,35:96,35:98,7:99,7:96,7:97,38:95,38:99,38:96,9:96,9:98,9:99,1:98,1:99,1:95,16:98,16:97,16:99,2:99,2:95,2:96,follower.replication.throttled.replicas=32:158,16:154,16:155,49:155,49:97,44:155,44:156,28:154,28:157,28:158,17:155,17:156,23:98,23:99,7:154,7:155,29:155,29:158,29:95,35:155,35:156,24:99,24:154,41:157,0:156,0:157,0:158,38:154,38:158,13:97,8:154,8:155,8:156,5:98,39:155,36:156,36:157,40:156,45:156,45:157,15:99,15:154,33:154,37:157,37:158,21:157,21:96,21:97,6:99,6:154,11:157,11:95,20:156,20:95,20:96,47:158,47:95,2:158,27:156,27:157,34:154,34:155,9:155,9:156,9:157,22:158,22:97,22:98,42:158,42:154,14:98,25:154,25:155,10:156,10:158,48:154,48:96,31:157,18:154,18:156,18:157,19:155,19:157,19:158,12:158,12:96,46:157,46:158,43:154,43:155,1:157,1:158,26:155,26:156,30:156,cleanup.policy=compact,compression.type=producer
Topic: __consumer_offsets Partition: 0 Leader: 156 Replicas: 156,157,158 Isr: 156,157,158
Topic: __consumer_offsets Partition: 1 Leader: 157 Replicas: 157,158,95 Isr: 157,158,95
Topic: __consumer_offsets Partition: 2 Leader: 158 Replicas: 158,95,96 Isr: 158,95,96
Topic: __consumer_offsets Partition: 3 Leader: 95 Replicas: 95,96,97 Isr: 95,96,97
Topic: __consumer_offsets Partition: 4 Leader: 96 Replicas: 96,97,98 Isr: 96,97,98
Topic: __consumer_offsets Partition: 5 Leader: 97 Replicas: 97,98,99 Isr: 97,98,99
Topic: __consumer_offsets Partition: 6 Leader: 98 Replicas: 98,99,154 Isr: 154,98,99
Topic: __consumer_offsets Partition: 7 Leader: 99 Replicas: 99,154,155 Isr: 155,154,99
Topic: __consumer_offsets Partition: 8 Leader: 154 Replicas: 154,155,156 Isr: 154,155,156
Topic: __consumer_offsets Partition: 9 Leader: 155 Replicas: 155,156,157 Isr: 155,156,157
Topic: __consumer_offsets Partition: 10 Leader: 156 Replicas: 156,158,95 Isr: 95,156,158
Topic: __consumer_offsets Partition: 11 Leader: 157 Replicas: 157,95,96 Isr: 157,95,96
Topic: __consumer_offsets Partition: 12 Leader: 158 Replicas: 158,96,97 Isr: 158,96,97
Topic: __consumer_offsets Partition: 13 Leader: 95 Replicas: 95,97,98 Isr: 95,97,98
Topic: __consumer_offsets Partition: 14 Leader: 96 Replicas: 96,98,99 Isr: 96,98,99
Topic: __consumer_offsets Partition: 15 Leader: 97 Replicas: 97,99,154 Isr: 154,97,99
Topic: __consumer_offsets Partition: 16 Leader: 98 Replicas: 98,154,155 Isr: 155,154,98
Topic: __consumer_offsets Partition: 17 Leader: 99 Replicas: 99,155,156 Isr: 155,156,99
Topic: __consumer_offsets Partition: 18 Leader: 154 Replicas: 154,156,157 Isr: 154,156,157
Topic: __consumer_offsets Partition: 19 Leader: 155 Replicas: 155,157,158 Isr: 155,157,158
Topic: __consumer_offsets Partition: 20 Leader: 156 Replicas: 156,95,96 Isr: 95,156,96
Topic: __consumer_offsets Partition: 21 Leader: 157 Replicas: 157,96,97 Isr: 157,96,97
Topic: __consumer_offsets Partition: 22 Leader: 158 Replicas: 158,97,98 Isr: 158,97,98
Topic: __consumer_offsets Partition: 23 Leader: 95 Replicas: 95,98,99 Isr: 95,98,99
Topic: __consumer_offsets Partition: 24 Leader: 96 Replicas: 96,99,154 Isr: 154,96,99
Topic: __consumer_offsets Partition: 25 Leader: 97 Replicas: 97,154,155 Isr: 154,155,97
Topic: __consumer_offsets Partition: 26 Leader: 98 Replicas: 98,155,156 Isr: 155,156,98
Topic: __consumer_offsets Partition: 27 Leader: 99 Replicas: 99,156,157 Isr: 156,157,99
Topic: __consumer_offsets Partition: 28 Leader: 154 Replicas: 154,157,158 Isr: 154,157,158
Topic: __consumer_offsets Partition: 29 Leader: 155 Replicas: 155,158,95 Isr: 155,95,158
Topic: __consumer_offsets Partition: 30 Leader: 156 Replicas: 156,96,97 Isr: 156,96,97
Topic: __consumer_offsets Partition: 31 Leader: 157 Replicas: 157,97,98 Isr: 157,97,98
Topic: __consumer_offsets Partition: 32 Leader: 158 Replicas: 158,98,99 Isr: 158,98,99
Topic: __consumer_offsets Partition: 33 Leader: 95 Replicas: 95,99,154 Isr: 95,154,99
Topic: __consumer_offsets Partition: 34 Leader: 96 Replicas: 96,154,155 Isr: 155,154,96
Topic: __consumer_offsets Partition: 35 Leader: 97 Replicas: 97,155,156 Isr: 156,155,97
Topic: __consumer_offsets Partition: 36 Leader: 98 Replicas: 98,156,157 Isr: 156,157,98
Topic: __consumer_offsets Partition: 37 Leader: 99 Replicas: 99,157,158 Isr: 157,158,99
Topic: __consumer_offsets Partition: 38 Leader: 154 Replicas: 154,158,95 Isr: 154,95,158
Topic: __consumer_offsets Partition: 39 Leader: 155 Replicas: 155,95,96 Isr: 155,95,96
Topic: __consumer_offsets Partition: 40 Leader: 156 Replicas: 156,97,98 Isr: 156,97,98
Topic: __consumer_offsets Partition: 41 Leader: 157 Replicas: 157,98,99 Isr: 157,98,99
Topic: __consumer_offsets Partition: 42 Leader: 158 Replicas: 158,99,154 Isr: 154,158,99
Topic: __consumer_offsets Partition: 43 Leader: 95 Replicas: 95,154,155 Isr: 95,154,155
Topic: __consumer_offsets Partition: 44 Leader: 96 Replicas: 96,155,156 Isr: 155,156,96
Topic: __consumer_offsets Partition: 45 Leader: 97 Replicas: 97,156,157 Isr: 156,157,97
Topic: __consumer_offsets Partition: 46 Leader: 98 Replicas: 98,157,158 Isr: 157,158,98
Topic: __consumer_offsets Partition: 47 Leader: 99 Replicas: 99,158,95 Isr: 95,158,99
Topic: __consumer_offsets Partition: 48 Leader: 154 Replicas: 154,95,96 Isr: 154,95,96
Topic: __consumer_offsets Partition: 49 Leader: 155 Replicas: 155,96,97 Isr: 155,96,97

 

offsets.topic.replication.factor is 3 in the cluster.

The leader and the preferred replica are not the same for some partitions of this topic; is that the issue?

What is the best course of action next? Can we drain all messages from this topic?

Cloudera Employee
  •  "The cluster was back online and all brokers are up with no underreplicated partitons."
    Is the problem still present (consumers mark coordinators as dead) or it was being observed only during the time period while the brokers were offline (around 2018-05-21 22:56:20,126) .
  • "can we drain all messages from this topic ? "
    If you create a new consumer group, you shall be able to poll all messages (quick test: kafka-console-consumer --group test_group_id --bootstrap-server $(hostname):9092 --from-beginning  --topic messages-events ). If you use the new Java client / KafkaConsumer in your consumers, you can also seek() to a given offset and start consuming messages from that point. 
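
For the seek() route, a rough sketch with the Java KafkaConsumer; the broker address, partition and offset are just illustrative, and poll(long) is used since you are on the 0.11 client:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SeekSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");       // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "test_group_id");               // a fresh group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("messages-events", 0);         // illustrative partition
            consumer.assign(Collections.singletonList(tp));                       // manual assignment, no rebalance
            consumer.seek(tp, 12345L);                                            // illustrative starting offset
            ConsumerRecords<String, String> records = consumer.poll(5000);
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value());
            }
        }
    }
}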

Expert Contributor

With the consumer group prod-abc-events it FAILS (the session hangs):

kafka-console-consumer --bootstrap-server xxxx:9092 --topic messages-events --consumer-property group.id=prod-abc-events

With a different group id it WORKS:

kafka-console-consumer --bootstrap-server xxxx:9092 --topic messages-events --consumer-property group.id=test-id

So when I use the consumer group name "prod-abc-events", it fails.

Cloudera Employee

I'm glad to hear you were able to drain the messages with the new consumer group.

Does it fail for the same reason (coordinator marked dead)? Please note that consumers in the "prod-abc-events" consumer group have already committed offsets to consume from; if no new messages are being produced, they would look as if they were hanging.

Actually, the coordinator for a consumer group (the designated broker) is derived from the group.id (note: in the consumer, the request is sent from sendGroupCoordinatorRequest()). So the second time you start a consumer with the same group id, it will go to the same broker. If you don't specify a group.id for kafka-console-consumer, one will be generated.
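
To make that concrete, here is roughly how a group maps to a partition of __consumer_offsets; this mirrors Kafka's internal logic, the broker leading that partition acts as the group coordinator, and 50 is the partition count shown in your describe output:

public class GroupCoordinatorPartition {
    public static void main(String[] args) {
        String groupId = "prod-abc-events";
        int offsetsTopicPartitions = 50;   // __consumer_offsets PartitionCount from the describe output

        // Non-negative hash of the group id, modulo the number of offsets partitions.
        int partition = (groupId.hashCode() & 0x7fffffff) % offsetsTopicPartitions;

        System.out.println("group '" + groupId + "' -> __consumer_offsets partition " + partition);
    }
}

Checking the leader and ISR of that particular partition in your earlier describe output should point at the broker your consumers keep marking as dead.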

Expert Contributor

Yes, even after draining the topic completely we still see this error:

2018-05-23 15:19:49,449 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - Discovered coordinator 315.xxx.com:9092 (id: 2147483551 rack: null) for group prod-abc-events.
2018-05-23 15:19:49,449 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - (Re-)joining group prod-abc-events
2018-05-23 15:19:49,450 INFO adPoolTaskExecutor-1 s.consumer.internals.AbstractCoordinator - Marking the coordinator 315.xxx.com:9092 (id: 2147483551 rack: null) dead for group prod-abc-events

What do you suggest we do next?

One thing I can think of is to restart the producer.

Expert Contributor

What are the implications of deleting a .log file that contains this consumer group's data from the "__consumer_offsets" topic?