I am using Spring Kafka to consume messages with a ConcurrentMessageListenerContainer. In production it abruptly stops consuming messages without any errors; sometimes a single consumer within one JVM stops consuming while the other consumers keep going (I have 15 partitions and 3 JVMs, each with a concurrency of 5).
When I restart the JVM, it starts consuming again!
Is there any way I can periodically check whether a consumer has died and restart it without restarting the JVM?
Most likely, the consumer thread is "stuck" in your code somewhere. I suggest you take a thread dump when this happens, to see what the thread is doing.
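If the container has actually stopped (as opposed to its thread being stuck, which only a thread dump will reveal), it can be restarted without bouncing the JVM through the KafkaListenerEndpointRegistry. A minimal sketch, assuming a hypothetical listener id "myListener" and @EnableScheduling on the application:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.config.KafkaListenerEndpointRegistry;
import org.springframework.kafka.listener.MessageListenerContainer;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ListenerRestarter {

    @Autowired
    private KafkaListenerEndpointRegistry registry;

    // Periodically restart a container that has stopped; "myListener" is a
    // hypothetical id set via @KafkaListener(id = "myListener", ...).
    // Note: a container whose thread is merely stuck still reports isRunning() == true,
    // so this does not replace the thread dump suggested above.
    @Scheduled(fixedDelay = 60_000)
    public void restartIfStopped() {
        MessageListenerContainer container = registry.getListenerContainer("myListener");
        if (container != null && !container.isRunning()) {
            container.start();
        }
    }
}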
Some weeks ago my project was updated to use Kafka clients 3.2.1 instead of the version that ships with Spring Boot 2.7.3 (3.1.1). We made this upgrade to avoid an issue in Kafka Streams: illegal state and argument exceptions were not ending up in the uncaught exception handler.
On the consumer side, we also moved to the cooperative sticky assignor.
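For reference, a minimal sketch of the consumer properties involved in that switch (the bootstrap address is a placeholder and the rest of the container factory setup is assumed):

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

public class AssignorProps {

    // Consumer properties switching the group to incremental cooperative rebalancing
    public static Map<String, Object> consumerProps() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        return props;
    }
}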
In parallel, we started some resiliency tests, and we began to see Kafka records that are no longer consumed on some partitions when using a Kafka batch listener. The issue occurred after several rebalances caused by the test (the deployment runs in Kubernetes, and we stopped some pods, microservices, and broker instances). The issue is not present on every listener. The Kafka brokers and microservices are up and running.
During our investigations:
we enabled Kafka events and we can clearly see that the consumer is started
we can see in the logs that the partitions that are not consuming events are assigned
debug has been enabled on the KafkaMessageListenerContainer; we see many occurrences of Receive: 0 records and Commit list: {}
Are there any blocking points to using Kafka clients 3.2.1 with Spring Boot 2.7.3 / Spring Kafka 2.8.8?
Any help or other advice to progress our investigation is more than welcome.
Multiple listeners are defined; the retry seems to be fired from another listener (a shared error handler?).
This is a known bug, fixed in the next release:
https://github.com/spring-projects/spring-kafka/issues/2382
https://github.com/spring-projects/spring-kafka/commit/3de1e89ba697ead04de171cfa35273bb0daddbe6
A temporary workaround is to give each container its own error handler.
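A sketch of one way to do that, assuming you define the container factories yourself (the bean names, generic types, and backoff values are illustrative): each factory gets its own DefaultErrorHandler instance instead of sharing a single error-handler bean.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class ListenerFactoryConfig {

    // Each factory gets its own error handler instance, so no state is shared between containers
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> firstFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setCommonErrorHandler(new DefaultErrorHandler(new FixedBackOff(1000L, 2)));
        return factory;
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> secondFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setCommonErrorHandler(new DefaultErrorHandler(new FixedBackOff(1000L, 2)));
        return factory;
    }
}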
I was writing a Java program to consume messages from Kafka.
The Kafka version was 2.4.0.
I used spring-kafka 2.5.0.RELEASE.
I was using CMAK 3.0.0.5 to monitor Kafka.
All seemed well when I first deployed the service. But when I killed the Java program and restarted it, I saw that many of the topic partitions' lags were negative, and I have no clue why
these lags could be negative. We have no requirements on message consistency, so I was not sure whether the messages were lost or consumed repeatedly when I restarted the program, but I guess they were not lost and were consumed repeatedly instead.
We are noticing that Streams app threads fail transactions during rolling restarts of our Kafka brokers. The transaction failure causes stream-thread fencing, which in turn causes a restart of the thread and a rebalance. The rebalancing causes some delay in processing. Our goal is to make broker restarts as smooth as possible and to prevent processing delays as much as possible.
For our rolling Broker restarts we use the controlled.shutdown=true configuration, and before each restart we wait for all partitions to be in-sync across all replicas.
For our Streams apps we have properly configured group.instance.id and an appropriate session.timeout.ms so that rolling restarts of the Streams apps themselves are smooth and do not trigger rebalances.
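For context, those two settings are plain consumer-level configs passed through the Streams properties; a minimal sketch with placeholder values (the instance id would typically come from the pod name):

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsStaticMembership {

    // Static membership settings forwarded to the embedded consumers;
    // the application id, bootstrap address and timeout are placeholders.
    public static Properties streamsProps(String instanceId) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG),
                instanceId); // e.g. the Kubernetes pod name
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 30000);
        return props;
    }
}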
From the Kafka Streams app logs I have identified a sequence of events leading up to the fencing:
Broker starts shutting down
App logs an error producing to a topic due to NOT_LEADER_OR_FOLLOWER
App heartbeats fail because the group coordinator is on the restarting broker
App discovers a new group coordinator (this bounces a bit between the restarting broker and the live brokers)
App stabilizes
Broker starts up again
App fails a fetch request to the starting broker due to FETCH_SESSION_ID_NOT_FOUND
App discovers the starting broker as transaction coordinator
App transaction fails due to one of two reasons:
InvalidProducerEpochException: Producer attempted to produce with an old epoch.
ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one.
Stream threads end up in a fatal error state, get fenced and restarted, which causes a rebalance.
What could be causing the two exceptions that make the stream thread transactions fail? My intuition is that the broker starting up is assigned as transaction coordinator before it has synced its transaction state with the in-sync brokers. That could explain why that broker knows about old epochs or different transactional ids.
How can we further identify what is going wrong here and how it can be improved?
You can set request.timeout.ms in Kafka Streams, which will make the Streams API wait for a longer period of time. Only if the Kafka broker is not back within that period will it throw an exception, which can be handled using a ProductionExceptionHandler as described in Handling exceptions in Kafka streams.
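A rough sketch of how those two settings might be wired together (the handler class, the timeout value, and whether CONTINUE is the right response for your delivery guarantees are all assumptions, not a drop-in fix):

import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;

public class StreamsResilienceConfig {

    // Example handler: log the failed record and keep the stream thread alive.
    // Whether CONTINUE is acceptable depends on your delivery guarantees.
    public static class LogAndContinueProductionHandler implements ProductionExceptionHandler {

        @Override
        public ProductionExceptionHandlerResponse handle(ProducerRecord<byte[], byte[]> record,
                Exception exception) {
            System.err.println("Failed to produce to " + record.topic() + ": " + exception);
            return ProductionExceptionHandlerResponse.CONTINUE;
        }

        @Override
        public void configure(Map<String, ?> configs) {
            // no configuration needed for this sketch
        }
    }

    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Give the brokers more time to come back before a produce request is failed
        props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG), 60000);
        props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
                LogAndContinueProductionHandler.class);
        return props;
    }
}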
I am using Spring Boot 2.1.1.RELEASE and Spring Cloud Greenwich.RC2, and the managed version for spring-cloud-stream-binder-kafka is 2.1.0RC4. The Kafka version is 1.1.0. I have set the following properties as the messages should not be consumed if there is an error.
spring.cloud.stream.bindings.input.group=consumer-gp-1
...
spring.cloud.stream.kafka.bindings.input.consumer.autoCommitOnError=false
spring.cloud.stream.kafka.bindings.input.consumer.enableDlq=false
spring.cloud.stream.bindings.input.consumer.max-attempts=3
spring.cloud.stream.bindings.input.consumer.back-off-initial-interval=1000
spring.cloud.stream.bindings.input.consumer.back-off-max-interval=3000
spring.cloud.stream.bindings.input.consumer.back-off-multiplier=2.0
....
There are 20 partitions in the Kafka topic and Kerberos is used for authentication (not sure if this is relevant).
The Kafka consumer calls a web service for every message it processes, and if the web service is unavailable I expect the consumer to try to process the message 3 times before it moves on to the next message. So for my test I disabled the web service, and therefore none of the messages could be processed correctly. From the logs I can see that this is happening.
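For reference, the kind of listener described might look roughly like this (the binding, URL, and RestTemplate client are illustrative assumptions); the exception escaping the handler is what drives the binder's retry up to max-attempts:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.stream.annotation.EnableBinding;
import org.springframework.cloud.stream.annotation.StreamListener;
import org.springframework.cloud.stream.messaging.Sink;
import org.springframework.web.client.RestTemplate;

@SpringBootApplication
@EnableBinding(Sink.class)
public class ConsumerApplication {

    private final RestTemplate restTemplate = new RestTemplate();

    @StreamListener(Sink.INPUT)
    public void handle(String payload) {
        // If the web service call fails, the exception propagates out of the listener
        // and the binder retries the same record up to max-attempts before moving on.
        restTemplate.postForEntity("http://example.org/process", payload, Void.class); // placeholder URL
    }

    public static void main(String[] args) {
        SpringApplication.run(ConsumerApplication.class, args);
    }
}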
After a while I stopped and then restarted the Kafka consumer (the web service was still disabled). I was expecting that, after the restart, the consumer would attempt to process the messages that were not successfully processed the first time around. From the logs (I print out each message with its fields) I could not see this happening after the restart. I thought the partitioning might be influencing something, but I checked the logs and all 20 partitions were assigned to this single consumer.
Is there a property I have missed? I thought the expected behaviour was that, when I restart the consumer, the Kafka broker would deliver the records that were not successfully processed to the consumer again.
Thanks
Parameters working as expected. See comment.
I have a Kafka Streams app running (0.10.2.1). When I shut down the Kafka cluster, the Streams app keeps waiting for the next message, and when the cluster is brought back up it resumes consuming messages. For the duration that the cluster is down, the app appears to be working fine. I have tested this for over 45 minutes.
I would expect Kafka to throw an exception or stop. I have configured a StateListener to log when KafkaStreams shuts down, but it is never invoked.
kafkaStreams.setStateListener((newState, oldState) => {
  if (newState == KafkaStreams.State.NOT_RUNNING) {
    Log.error("Kafka died unexpectedly.")
  }
})
How do I get Kafka to throw an exception or shutdown when it cannot connect to the cluster?
Note: this assumes that the cluster goes down after the app has started.
Why would you want the Kafka Streams app to go down?
The app should be resilient to broker failures, that is, keep going patiently until the broker recovers, and that seems to be what it is doing. If you have multiple instances of the Kafka Streams application and one of them loses connectivity to the broker, the load will be rebalanced onto the remaining instances. If each instance that lost connectivity just shut itself down, you would be losing instances, and with them redundancy and parallelism, even if the broker connectivity recovered. As it stands, Kafka Streams is designed for resilience, and I'd argue that this is the correct behaviour.
IMHO if you want to detect broker (or connectivity) failures, that's a use case for monitoring, not for introducing failures into Kafka Streams applications.
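If you do want a signal that the brokers are unreachable, a minimal monitoring-side sketch (separate from the Streams app, and using a newer kafka-clients AdminClient rather than the 0.10.2-era client; the bootstrap address and timeouts are placeholders) could look like this:

import java.util.Properties;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class BrokerConnectivityCheck {

    // Returns true if at least one broker answers a cluster-metadata request in time.
    public static boolean brokersReachable(String bootstrapServers) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, 5000);
        try (AdminClient admin = AdminClient.create(props)) {
            admin.describeCluster().nodes().get(5, TimeUnit.SECONDS);
            return true;
        } catch (Exception e) {
            return false; // timeout or connection failure: treat the cluster as unreachable
        }
    }

    public static void main(String[] args) {
        System.out.println("Brokers reachable: " + brokersReachable("localhost:9092")); // placeholder
    }
}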