
[HDP-2.3.2] Kafka brokers (topics) suddenly become unavailable.

Expert Contributor

After an HDP rolling upgrade, we're experiencing a weird event on our Kafka cluster.

Everything was fine before the upgrade.

After the upgrade, we're hitting an issue where all Kafka brokers/topics suddenly become unavailable.

Checking ZooKeeper shows that the brokers were unregistered. The logs also show "cached zkversion is not equal to that in zookeeper".
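
For anyone who wants to check the same thing, the ZooKeeper shell shipped with Kafka can list the registered broker IDs (the HDP path and the zk1:2181 connect string below are examples; adjust for your cluster):

# A healthy cluster should list all broker IDs, e.g. [1001, 1002, 1003]
/usr/hdp/current/kafka-broker/bin/zookeeper-shell.sh zk1:2181 ls /brokers/ids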

Our quick fix was to restart ZooKeeper and Kafka. This has already happened twice on our prod env. 😞 #dataloss

Does anyone have this same issue?

Please help.

5 REPLIES

Expert Contributor

I can see these logs in server.log:

[2016-03-06 13:04:24,414] INFO Partition [NSN_IN_RECHARGE2,0] on broker 1001: Shrinking ISR for partition [NSN_IN_RECHARGE2,0] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,417] INFO Partition [NSN_IN_ZEROEXP,2] on broker 1001: Shrinking ISR for partition [NSN_IN_ZEROEXP,2] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,419] INFO Partition [NSN_IN_DATA2,0] on broker 1001: Shrinking ISR for partition [NSN_IN_DATA2,0] from 1002,1001 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,421] INFO Partition [NSN_IN_AIRTIMEREL,1] on broker 1001: Shrinking ISR for partition [NSN_IN_AIRTIMEREL,1] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,423] INFO Partition [RESPONSE_SMS,0] on broker 1001: Shrinking ISR for partition [RESPONSE_SMS,0] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,425] INFO Partition [MEF1.1,3] on broker 1001: Shrinking ISR for partition [MEF1.1,3] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,434] INFO Partition [NSN_IN_AIRTIMEREL,5] on broker 1001: Shrinking ISR for partition [NSN_IN_AIRTIMEREL,5] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,436] INFO Partition [TNT_IN_DATA,0] on broker 1001: Shrinking ISR for partition [TNT_IN_DATA,0] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,438] INFO Partition [URM_MWTRF,0] on broker 1001: Shrinking ISR for partition [URM_MWTRF,0] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,439] INFO Partition [AUDIT_SPARK_INGEST,1] on broker 1001: Shrinking ISR for partition [AUDIT_SPARK_INGEST,1] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,441] INFO Partition [TNT_IN_VOU,0] on broker 1001: Shrinking ISR for partition [TNT_IN_VOU,0] from 1001,1003 to 1001 (kafka.cluster.Partition)
[2016-03-06 13:04:24,443] INFO Partition [MEF1.1-SSQCx,2] on broker 1001: Shrinking ISR for partition [MEF1.1-SSQCx,2] from 1001,1003 to 1001 (kafka.cluster.Partition)
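
If it helps others correlate, the per-topic ISR state can be checked with the topics tool (the topic name and connect string are just examples taken from the logs above):

# The Isr column shrinking to a single broker matches the log lines above
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --describe --zookeeper zk1:2181 --topic NSN_IN_RECHARGE2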

Explorer

It looks like you might be hitting KAFKA-2729 and KAFKA-3042. After a controller failover, the metadata cache may not include the leader details in the live-broker information, which causes the followers to error out.
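
One thing worth checking when this happens is which broker currently holds the controller role, since both JIRAs involve controller failover (zk1:2181 is an example connect string):

# The /controller znode shows the active controller's broker ID and timestamp
/usr/hdp/current/kafka-broker/bin/zookeeper-shell.sh zk1:2181 get /controller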

Expert Contributor

Thanks for the links @Alberto Romero, will check them later.

Hi @Michael Dennis Uanang, have you resolved this? We recently did a rolling upgrade of a cluster with Kafka and ended up manually changing the broker IDs in meta.properties on each broker's volume from 1001, 1002, ... to 0, 1, 2, ... Also, if you are using a custom port (different from the default 6667), make sure it's set in the "listeners" property.
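
For illustration, the file looks roughly like this (the log-dir path is an example; keep "version" as-is and change only "broker.id"):

# /kafka-logs/meta.properties on the first broker
version=0
broker.id=0

# server.properties: a custom port must also appear in "listeners"
listeners=PLAINTEXT://broker1.example.com:6667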

Expert Contributor

Manually changing the broker.id will also work.

For the ISR shrinking, we adjusted the replica settings on the brokers so that all brokers stay in sync (in the ISR). We also raised the broker heap size from 1G to 2G. :) Thanks!
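
The exact values aren't captured in this thread, but for anyone else landing here, these are the kinds of knobs involved (the values below are illustrative, not necessarily what we used):

# server.properties: a higher follower lag tolerance keeps slow followers in the ISR longer (example value)
replica.lag.time.max.ms=30000

# kafka-env.sh: broker heap raised from 1G to 2G
export KAFKA_HEAP_OPTS="-Xmx2g -Xms2g"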