Kafka log lag negative - apache-kafka

I was writing a Java program to consume messages from Kafka.
The Kafka version was 2.4.0.
I used spring-kafka 2.5.0.RELEASE.
I was using CMAK 3.0.0.5 to monitor Kafka.
All seemed well when I first deployed the service. But after I killed the Java program and restarted it, I saw that many of the topic partitions' lags were negative, and I had no clue why these lags could be negative. We have no requirements on message consistency, so I was not sure whether messages were lost or consumed repeatedly when I restarted the program; my guess is that they were not lost but consumed repeatedly.
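For reference, monitoring tools such as CMAK typically report the lag of a partition as the log end offset minus the consumer group's committed offset, so a negative value means the committed offset was ahead of the end offset the tool last fetched. Below is a minimal sketch of that calculation with the plain Java client; the bootstrap servers, group id, and topic name are placeholders.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LagCheck {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");       // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("my-topic").stream() // placeholder topic
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());

            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long committedOffset = committed == null ? 0L : committed.offset();
                // Lag as most monitoring tools report it: log end offset minus committed offset.
                System.out.printf("%s lag=%d%n", tp, endOffsets.get(tp) - committedOffset);
            }
        }
    }
}
```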

Related

Records associated with a Kafka batch listener are not consumed for some partitions after several rebalances (resiliency testing)

A few weeks ago my project was updated to use Kafka clients 3.2.1 instead of the version that comes with Spring Boot 2.7.3 (3.1.1). We made this upgrade to avoid an issue in Kafka Streams where illegal state and argument exceptions were not ending up in the uncaught exception handler.
On the consumer side, we also moved to the cooperative sticky assignor.
In parallel, we started some resiliency tests, and we began to have issues with Kafka records that are no longer consumed on some partitions when using a Kafka batch listener. The issue occurred after several rebalances caused by the tests (deployment is done in Kubernetes and we stopped some pods, microservices, and broker instances). The issue is not present on every listener. The Kafka brokers and microservices are up and running.
During our investigations:
we enabled Kafka events and we can clearly see that the consumer is started;
we can see in the logs that the partitions that are not consuming events are assigned;
debug logging has been enabled on the KafkaMessageListenerContainer, and we see many occurrences of Receive: 0 records and Commit list: {}.
Are there any blocking points to using Kafka clients 3.2.1 with Spring Boot 2.7.3 / Spring Kafka 2.8.8?
Any help or other advice to move our investigation forward is more than welcome.
Multiple listeners are defined; the retry seems to be fired from another listener (shared error handler?).
This is a known bug, fixed in the next release:
https://github.com/spring-projects/spring-kafka/issues/2382
https://github.com/spring-projects/spring-kafka/commit/3de1e89ba697ead04de171cfa35273bb0daddbe6
A temporary workaround is to give each container its own error handler.
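A sketch of that workaround, assuming Spring Kafka 2.8.x, String records, and a batch listener as in the question; the bean name and back-off values are illustrative, not taken from the issue.

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaListenerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true);
        // Give every container its own DefaultErrorHandler instance instead of
        // sharing one error handler bean across all listeners.
        factory.setContainerCustomizer(container ->
                container.setCommonErrorHandler(new DefaultErrorHandler(new FixedBackOff(1000L, 2L))));
        return factory;
    }
}
```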

Kafka ConcurrentMessageListenerContainer stops consuming abruptly

I am using Spring Kafka to consume messages with a ConcurrentMessageListenerContainer. In production I see it stop consuming messages abruptly, without any errors; sometimes even a single consumer within a JVM stops consuming while the other consumers keep consuming (I have 15 partitions and 3 JVMs, each with a concurrency of 5).
When I restart the JVM, it starts consuming again!
Is there any way I can periodically check whether a consumer has died and restart it without restarting the JVM?
Most likely, the consumer thread is "stuck" in your code somewhere. I suggest you take a thread dump when this happens, to see what the thread is doing.
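The usual way to do that is jstack <pid> against the running JVM; the snippet below is a minimal programmatic alternative you could trigger when consumption stalls (the class name is illustrative).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public class ThreadDumpUtil {

    // Print the state and full stack trace of every live thread, so you can see
    // where the listener container's consumer threads are blocked.
    public static void dumpAllThreads() {
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(true, true);
        for (ThreadInfo info : infos) {
            System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }

    public static void main(String[] args) {
        dumpAllThreads();
    }
}
```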

Kafka broker occasionally takes much longer than usual to load logs on startup

We are observing that Kafka brokers occasionally take much more time to load logs on startup than usual. Much longer in this case means 40 minutes instead of at most 1 minute. This happens during a rolling restart following the procedure described by Confluent, after the broker has reported that controlled shutdown was successful.
Kafka Setup
Confluent Platform 5.5.0
Kafka Version 2.5.0
3 Replicas (minimum 2 in sync)
Controlled broker shutdown enabled
1TB of AWS EBS for Kafka log storage
Other potentially useful information
We make extensive use of Kafka Streams
We use exactly-once processing and transactional producers/consumers
Observations
It is not always the same broker that takes a long time.
It does not only occur when the broker is the active controller.
A log partition that loads quickly (15 ms) can take a long time (9549 ms) for the same broker a day later.
We experienced this issue before on Kafka 2.4.0, but after upgrading to 2.5.0 it did not occur for a few weeks.
Does anyone have an idea what could be causing this? Or what additional information would be useful to track down the issue?

Kafka streams 1.0: processing timeout with high max.poll.interval.ms and session.timeout.ms

I am using a stateless processor with Kafka Streams 1.0 against a Kafka 1.0.1 broker.
The problem is that my CustomProcessor gets closed every few seconds, which results in a rebalance. I am using the following configs:
session.timeout.ms=15000
heartbeat.interval.ms=3000 // set to 1/3 of session.timeout.ms
max.poll.interval.ms=Integer.MAX_VALUE // set this large because I am doing intensive computational operations (NLP) that can take up to 10 minutes to process one Kafka message
max.poll.records=1
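For reference, a minimal sketch of how this configuration might be passed to a Kafka Streams 1.0 application; the application id and bootstrap servers are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProps {

    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "nlp-processor");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        // Consumer tuning from the question: quick failure detection through the
        // session timeout/heartbeat, but effectively unlimited time between polls
        // because one record can take ~10 minutes of NLP processing.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 15000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 3000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), Integer.MAX_VALUE);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 1);
        return props;
    }
}
```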
Despite this configuration and my understanding of how the Kafka timeout configurations work, I still see the consumer rebalancing every few seconds.
I have already gone through the article and Stack Overflow questions below about how to tune long-running operations and avoid a very long session timeout that would make failure detection too late, but I still see the unexpected behavior, unless I am misunderstanding something.
KIP-62
Diff between session.timeout.ms and max.poll.interval
Kafka kstreams processing timeout
For the consumer environment setup, I have 8 machines, each with 16 cores, consuming from one topic with 100 partitions, and I am following what this Confluent doc here recommends.
Any pointers?
I figured it out. After lots of debugging and enabling verbose logging for both the Kafka Streams client and the broker, it turned out to be two things:
There is a critical bug in Streams 1.0.0 (HERE), so I upgraded my client version from 1.0.0 to 1.0.1.
I updated the value of the consumer property default.deserialization.exception.handler from org.apache.kafka.streams.errors.LogAndFailExceptionHandler to org.apache.kafka.streams.errors.LogAndContinueExceptionHandler.
After the above two changes everything has worked perfectly, with no restarts. I am using Grafana to monitor restarts, and for the past 48 hours there has not been a single one.
I might do more troubleshooting to confirm which of the two items above is the real fix, but I am in a hurry to deploy to production, so if anybody is interested in starting from there, go ahead; otherwise, once I have time I will do the further analysis and update the answer!
So happy to get this fixed!!!
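For illustration, the second change (swapping the deserialization exception handler) would look roughly like this in the streams properties; only the handler class changes.

```java
import java.util.Properties;

import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

public class DeserializationHandlerFix {

    // Log and skip records that fail deserialization instead of failing the
    // stream thread, as LogAndFailExceptionHandler does.
    public static Properties apply(Properties props) {
        props.put("default.deserialization.exception.handler",
                LogAndContinueExceptionHandler.class.getName());
        return props;
    }
}
```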

Apache Kafka/Storm, persistence during maintenance

I have Ubuntu 14.04 LTS. I use a Node.js -> Kafka -> Storm -> MongoDB chain. During initial development everything went well, and messages were eventually stored in MongoDB.
In Kafka, I have one Zookeeper and broker0 on kafka1, and broker1 on kafka2. For Storm, Zookeeper, Nimbus, and DRPC are located on storm1; the supervisor and worker are located on storm2.
Now the question is what happens when I update storm1 and storm2. I stopped all processes on storm1 and storm2, assuming Kafka would buffer the messages coming from Node.js. After I restarted both storm1 and storm2 and redeployed the topology, I found that messages produced while storm1 and storm2 were down were lost. So it seems Kafka did not keep the messages persisted for me during the Storm maintenance period.
In my mind, Kafka remembers the last index of the message for which it has received an acknowledgement.
In short, how can I prevent messages from being lost while Storm is under maintenance?
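This is not Storm-specific, but for reference, the Kafka behaviour the question is reasoning about works like this: the broker keeps messages for its configured retention period regardless of whether they have been consumed, and a consumer that commits its offsets under a fixed group id resumes from the last committed offset after a restart; whether the Storm spout resumes correctly therefore depends on where its offsets are stored and on its start-offset configuration. Below is a minimal sketch with a plain Java consumer; the bootstrap servers, group id, and topic are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ResumeFromCommittedOffsets {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092,kafka2:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "storm-maintenance-demo");           // placeholder
        // Only used when the group has no committed offset yet; afterwards the
        // consumer always resumes from its last committed position.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // Commit only after processing, so a restart re-reads anything
                // that was not fully processed instead of losing it.
                consumer.commitSync();
            }
        }
    }
}
```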