
Kafka Replica out-of-sync for over 24 hrs

Expert Contributor

Is there a way I can force the replica to catch up to the leader? The replica has been out of sync for over 24 hours. I tried restarting, and I don't see any movement. I tried moving the replica to different brokers, but the reassignment got stuck. I also created an additional replica, and that command is stuck as well, waiting for the out-of-sync replica to catch up to the leader.

 

Unclean leader election (unclean.leader.election.enable) is enabled in the cluster.

 

Logs: 

 

ERROR kafka.server.ReplicaFetcherThread: [ReplicaFetcher replicaId=99, leaderId=157, fetcherId=0] Error due to
kafka.common.KafkaException: Error processing data for partition dev-raw-events-35 offset 111478948 Caused by: kafka.common.UnexpectedAppendOffsetException: Unexpected offset in append to dev-raw-events-35. First offset or last offset of the first batch 111478933 is less than the next offset 111478948. First 10 offsets in append: List(111478933, 111478934, 111478935, 111478936, 111478937, 111478938, 111478939, 111478940, 111478941, 111478942), last offset in append: 111479224. Log start offset = 95104666

 

Tried restarting the broker, and the under-replicated partitions change.

Tried moving the replica to another node, and it was unsuccessful.

Tried creating a new replica, and kafka-reassign-partitions is stuck waiting for the out-of-sync replica to catch up (the commands I used to check this are below).
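
Roughly, these are the status checks I have been running (the ZooKeeper host and the reassignment JSON file name are placeholders for our actual values):

# List every partition that is currently under-replicated
kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions

# Check the replica list and ISR for the affected topic
kafka-topics.sh --zookeeper zk1:2181 --describe --topic dev-raw-events

# Check whether the pending reassignment has completed
kafka-reassign-partitions.sh --zookeeper zk1:2181 --reassignment-json-file reassignment.json --verify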

 

What can I do to fix this issue?

Rising Star

Hi desind,

I'm not sure there's a way to force it to sync here. From what you're describing and the error you shared, I think what's happening is that the replica fetcher thread fails and the broker stops replicating data from the leader. That would explain why you see the broker out of sync for such a long time.
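
One way to confirm that is to check the lagging broker's server.log and see whether the fetcher thread for that partition has errored out and not come back (the log path below is just an example; it depends on how your installation is laid out):

# Look for the replica fetcher thread failing on the lagging broker
grep -i "ReplicaFetcher" /var/log/kafka/server.log | tail -n 50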

Are you using Cloudera's distribution of Kafka or is this Apache Kafka? 

What version are you using?

I see that someone reported a similar issue very recently:
https://issues.apache.org/jira/browse/KAFKA-7635

Expert Contributor

I am using Apache Kafka, version 1.1.1.

 

Rising Star

It's a fairly new issue that I personally haven't seen with any of the current customers running the Cloudera Distribution of Kafka, but the latest releases (Cloudera Distribution of Kafka 3.1.1, and the Kafka in CDH 6.0) are based on Apache Kafka 1.0.1. The plan for CDH 6.1 is to rebase Cloudera Kafka onto Apache Kafka 2.0, so it's probably just a matter of time until this becomes a more common issue.

You mentioned that restarting the Kafka service causes the problematic partitions to change. Is that also the case when you only shut down a single broker and start it up again? I'm asking because one potential way to work around this is to identify which broker is lagging behind and not joining the ISR, shut down that broker, delete the topic partition data (for the affected partitions only) from disk, and then start the broker up again.

The broker will start up and self-heal by replicating all the data from the current leader of those partitions. Obviously this can take a long time, depending on how many partitions are affected and how much data needs to be replicated.
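
To make that concrete, a rough sketch of the steps, assuming the broker's log.dirs points at /var/lib/kafka/data and using the partition from your error as the example:

# 1. Stop the lagging broker (however you normally manage the Kafka service)
# 2. On that broker only, remove just the affected partition directories, e.g.:
rm -rf /var/lib/kafka/data/dev-raw-events-35
# 3. Start the broker again and watch the partition re-replicate from the leader:
kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions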

Expert Contributor

I agree with your suggestion, and we are in the process of testing this in staging. Unfortunately, we don't want to try this first on the problematic cluster, which is production, as we might corrupt something.

 

It's hard to replicate the issue in the staging environment, but at the very least we will try an rm -rf on one replica's partition data, restart the broker, and see how it behaves.

 

After doing some research, this is the issue we are facing: https://issues.apache.org/jira/browse/KAFKA-6361.

Rising Star

Just to be clear: you're only deleting data for the specific partitions that are impacted, not everything under the broker's data directory, right? I just wasn't sure what you meant by rm -rf here, so I wanted to clarify.
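
In other words, something like the following is what I had in mind, where only the matching <topic>-<partition> folders are removed, never the parent data directory itself (the path is just an example):

# Each partition has its own directory under the broker's log.dirs, e.g. dev-raw-events-35
ls /var/lib/kafka/data | grep dev-raw-events
# Only the directories for the affected partitions should be deleted; the parent directory
# also holds other topics plus checkpoint files the broker needs.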

Good luck, and please do let us know of the outcome.

Expert Contributor

Yes, we only tried deleting the out-of-sync partition. It did not work.

After a lot of research, we came to the conclusion that we should increase replica.lag.time.max.ms to 8 days, since it had been around 8 days that a few replicas were out of sync.

This resolved our issue, although it took a few hours for the followers to fetch and replicate the 7 days of data.
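
For anyone who hits the same thing, the change is essentially this one line, assuming it goes into each broker's server.properties followed by a restart (the value is just our case, 8 days in milliseconds):

# Temporarily stop the leader from dropping slow followers out of the ISR
# 8 days = 8 * 24 * 60 * 60 * 1000 ms
replica.lag.time.max.ms=691200000

A value this large effectively means the leader never considers a follower out of sync, so it only makes sense as a temporary setting while the followers catch up.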

 

This post helped me understand ISRs: https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/

Champion

Did you have a chance to put a fix in place for this issue, or did you follow the workaround stated by w@leed?

Curious to know.