Kafka Stream reprocessing old messages on rebalancing - apache-kafka

I have a Kafka Streams application which reads data from a few topics, joins the data and writes it to another topic.
This is the configuration of my Kafka cluster:
5 Kafka brokers
Kafka topics - 15 partitions and replication factor 3.
My Kafka Streams applications are running on the same machines as my Kafka brokers.
A few million records are consumed/produced per hour. Whenever I take a broker down, the application goes into rebalancing state and after rebalancing many times it starts consuming very old messages.
Note: When the Kafka Streams application was running fine, its consumer lag was almost 0. But after rebalancing, its lag went from 0 to 10 million.
Can this be because of offsets.retention.minutes?
This is the log and offset retention policy configuration of my Kafka brokers:
log retention policy: 3 days
offsets.retention.minutes: 1 day
In the link below I read that this could be the cause:
Offset Retention Minutes reference
Any help with this would be appreciated.

Offset retention can have an impact. Cf. this FAQ: https://docs.confluent.io/current/streams/faq.html#why-is-my-application-re-processing-data-from-the-beginning
Also cf. How to commit manually with Kafka Stream? about how commits work.
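To make the mechanism concrete, here is a minimal sketch in plain Java (no Kafka dependency) of how a consumer chooses its starting position for one partition. If a committed offset exists and is still inside the log, the consumer resumes from it. Otherwise the auto.offset.reset policy decides, and Kafka Streams defaults that policy to earliest. The class name, method name, and offsets below are hypothetical and only model the documented behavior; this is not the client's actual code.

```java
import java.util.OptionalLong;

/** Sketch of how a consumer picks its starting position for one partition. */
final class StartPositionSketch {

    /** Mirrors the three documented values of the consumer setting auto.offset.reset. */
    enum ResetPolicy { EARLIEST, LATEST, NONE }

    /**
     * @param committed      committed offset of the group; empty if never stored or expired
     * @param logStartOffset oldest offset still retained in the partition
     * @param logEndOffset   offset the next produced record will receive
     */
    static long startPosition(OptionalLong committed, ResetPolicy policy,
                              long logStartOffset, long logEndOffset) {
        if (committed.isPresent()) {
            long c = committed.getAsLong();
            if (c >= logStartOffset && c <= logEndOffset) {
                return c; // normal case: resume where the group left off
            }
            // otherwise the committed offset is out of range and the policy applies
        }
        switch (policy) {
            case EARLIEST: return logStartOffset; // replay everything still retained
            case LATEST:   return logEndOffset;   // read only records produced from now on
            default:
                throw new IllegalStateException("no usable committed offset and reset policy is none");
        }
    }

    public static void main(String[] args) {
        long logStart = 1_000L;  // hypothetical partition bounds
        long logEnd = 11_000L;
        // committed offset still present: resume close to the end of the log
        System.out.println(startPosition(OptionalLong.of(10_990L), ResetPolicy.EARLIEST, logStart, logEnd));
        // committed offset expired: earliest rewinds to the start of the retained log
        System.out.println(startPosition(OptionalLong.empty(), ResetPolicy.EARLIEST, logStart, logEnd));
        // committed offset expired: latest jumps to the end instead
        System.out.println(startPosition(OptionalLong.empty(), ResetPolicy.LATEST, logStart, logEnd));
    }
}
```

With a log retention of 3 days and an offset retention of 1 day, the second case is the one that would produce a sudden jump in lag: the data is still in the log, but the position in it has been forgotten.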

Related

kafka consumers in consumer group not resuming messages after restart

Hope you are having a good day.
I have an issue with Kafka consumers on Kubernetes. I am running 3 replicas inside a consumer group.
I have a topic with 3 partitions and 3 brokers, with the offsets replication factor set to 3. The offset reset for the consumer group is set to earliest.
When I start the consumer group, all are working fine with each consumer replica taking different partition and processing the data.
Issue: Whenever a consumer replica inside the consumer group "abc-consumer-group" restarts, or a broker (the leader) restarts, the consumer does not resume from the point where it stopped. It reports that it is up to date and has no messages to process.
Any suggestions please where to look at?
Tried increasing rebalance, heartbeat, session timeout on broker level, no luck.
And yes, whenever a consumer is added to or removed from the consumer group, rebalancing is taken care of by Kafka. I do see it happening, but the consumers are still not resuming messages. They report that there is nothing to process.

Kafka stream consumer skipping a few offset no log compaction enabled

kafka server version: 3.2.0
kstreams version : 2.7.2
I have a producer which is producing to topic foo, and I can see the offsets from the producer in the logs.
We have a Kafka Streams application reading from the same topic foo. What I am observing is that the consumer skips offsets. Sometimes the skip is over 30 to 40 offsets. I am printing the offset in the process method using the ProcessorContext.offset() method.
Skipping of offsets seems to be very common. Will using ProcessorContext.offset() result in every offset being printed?
Some points
No kafka rebalance has occurred.
No restarts of the container
We have 3 state store defined in the streams application, and the change log topic has replication factor of 1.
We had a Kafka broker outage about 3 weeks back, where a few brokers were down for an extended period of time. I don't know how that would affect the messages I should consume today.
We have NOT set processing.guarantee, so the default should be AT_LEAST_ONCE. We do not have transactions enabled, so the skipped messages can't be transactional messages.
The log statement that prints the offset is the first line in the process method.
Question:
What internal Kafka Streams logs can I look at to see if messages are consumed?
Is there any reason why the messages could be skipped?
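One thing worth ruling out is that offsets are only contiguous within a single partition, so a log of ProcessorContext.offset() values can look like it has gaps unless each value is paired with ProcessorContext.partition(). The following is a minimal sketch in plain Java of a per-partition gap check. The class OffsetGapTracker and its methods are hypothetical helpers for illustration, not part of Kafka Streams; in a real processor, record(...) would be fed from context.partition() and context.offset().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reports offset gaps separately for each partition. */
final class OffsetGapTracker {
    private final Map<Integer, Long> lastSeen = new HashMap<>();
    private final List<String> gaps = new ArrayList<>();

    /** Call once per processed record with its partition and offset. */
    void record(int partition, long offset) {
        Long previous = lastSeen.put(partition, offset);
        // a gap exists only if this partition jumped by more than one offset
        if (previous != null && offset > previous + 1) {
            gaps.add("partition " + partition + ": missing "
                    + (previous + 1) + ".." + (offset - 1));
        }
    }

    List<String> gaps() {
        return gaps;
    }

    public static void main(String[] args) {
        OffsetGapTracker tracker = new OffsetGapTracker();
        // interleaved records from two partitions: no real gap here
        tracker.record(0, 10);
        tracker.record(1, 40);
        tracker.record(0, 11);
        tracker.record(1, 41);
        // partition 0 jumps from 11 to 15: a real gap of 12..14
        tracker.record(0, 15);
        tracker.gaps().forEach(System.out::println);
    }
}
```

If a check like this reports no gaps per partition, the apparent skips come from interleaving of partitions in the log rather than from lost records.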

Apache Kafka partition offset rewinds during rebalance

We have a Kafka consumer application implemented using Spring Boot and Apache Camel with manual commit. The topic we consume has 30 partitions and a retention period of 7 days. The consumer application is deployed in 2 instances for HA, and parallel processing is implemented using the Apache Camel concurrent consumer configuration. Once we consume the data, we do message transformation and send it to a REST endpoint. We have implemented the circuit breaker pattern (Apache Camel throttling route policy): for any runtime issue with the REST endpoint, the circuit breaker kicks in and stops message consumption from the Kafka topic. Also, we are using
max.poll.records = 100 and heartbeat interval = 1 ms
instead of the default values to address frequent commit offset failure exceptions. Load on the topic = 200 tps.
Problem statement:
Last week we saw an issue: the REST endpoint was slow in processing, and we saw consumer group rebalance activity in the broker logs. During this rebalance, the consumer of one of the partitions rewound to a 5-day-old offset (almost 1 million offsets back) and started reprocessing the messages, causing huge duplication.
I looked into both the consumer log and the broker log and did not see any exception or error. We are using the offset strategy Latest. Also, as I mentioned above, we are using manual commit, and since the consumer commits the offsets for every batch, I would expect a rebalance to rewind by at most one batch, not to a 5-day-old offset.
We have had this implementation for more than a year and saw this issue for the first time. We are using default values for most of the broker and consumer configuration, other than max.poll.records, heartbeat interval and session timeout.
Kafka Broker = 2.4
Apache Camel= 3.0
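As a rough way to reason about why a slow REST endpoint can lead to a rebalance, the sketch below compares the time needed to work through one poll batch with the consumer's max.poll.interval.ms, whose default is 300000 ms. It assumes that the records of one batch are processed sequentially before the next poll, which may not match how the Camel route is actually configured, and the per-record latencies are hypothetical. It only covers the rebalance itself, not the size of the rewind.

```java
import java.time.Duration;

/** Sketch: does one poll batch fit into max.poll.interval.ms? */
final class PollBudgetSketch {

    /** Time to process a full batch, assuming strictly sequential processing. */
    static Duration batchTime(int maxPollRecords, Duration perRecord) {
        return perRecord.multipliedBy(maxPollRecords);
    }

    /** True if the batch takes longer than the allowed gap between two polls. */
    static boolean exceedsPollInterval(int maxPollRecords, Duration perRecord,
                                       Duration maxPollInterval) {
        return batchTime(maxPollRecords, perRecord).compareTo(maxPollInterval) > 0;
    }

    public static void main(String[] args) {
        int maxPollRecords = 100;                          // value from the question
        Duration maxPollInterval = Duration.ofMinutes(5);  // consumer default, 300000 ms

        Duration healthy = Duration.ofMillis(200);  // hypothetical latency, endpoint healthy
        Duration degraded = Duration.ofSeconds(4);  // hypothetical latency, endpoint slow

        // 100 * 200 ms = 20 s, well inside the 5 minute budget
        System.out.println(exceedsPollInterval(maxPollRecords, healthy, maxPollInterval));
        // 100 * 4 s = 400 s, longer than 5 minutes: the consumer would be
        // considered failed and its partitions reassigned
        System.out.println(exceedsPollInterval(maxPollRecords, degraded, maxPollInterval));
    }
}
```

When the budget is exceeded, the member is removed from the group and a later commit for the in-flight batch fails, which would fit the commit offset failure exceptions mentioned above.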

Kafka consumer is not reading from only one partition out of 4

I was using Kafka 0.9 and recently migrated to Kafka 1.0, but the client I am using is still 0.9. Irrespective of this, I was facing a problem where our consumers intermittently stop consuming from one or two of the partitions.
I have 5 consumers reading from 24 partitions. These are consumer JVM threads created by an application deployed on a single server. Frequently one of the consumers (threads) will stop reading from one of the partitions it was consuming from.
E.g.: One consumer thread would be reading from partitions 1, 2, 3 and 4. It will stop reading from partition 1, and lag builds up. I have to restart the consumer for it to start picking up messages from that particular partition again.
I want to understand the issue here.
My consumer configuration
session.timeout.ms=150000
request.timeout.ms=300000
max.partition.fetch.bytes=153600

Kafka Stream Internal Topics lag increases on taking Kafka Broker down

I have a Kafka Streams application (version 0.11.0.1) which takes data from a few topics, joins the data and puts it in another topic.
Kafka Configuration:
5 kafka brokers - version 0.11
Kafka Topics - 15 partitions and 3 replication factor.
A few million records are consumed/produced every hour.
Note: Whenever I take any Kafka broker down, it brings down a few consumers, the Kafka Streams application (consumer) rebalances, and the lag of the internal topics increases from 0 to a few million (1-10 million).
Is this because of any local state store config or something? How can I handle this?