Context:
I built a fraud detection application. The business logic works fine, but after deploying to production the Kafka cluster becomes unstable.
The topic the application writes to receives roughly 80 events/sec. After running for more than an hour, the Kafka broker's offsets become corrupted, which causes the consumers on that broker to fail.
Load testing in non-production environments did not surface any issues, so I am not sure what is causing this.
Topic config: single partition, replication 3.
Kafka version - 2.1
Flink version - 1.11.2-scala-2.12
The Kafka consumer and producer use the same client versions.
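For reference, here is a minimal sketch of the write path, assuming the Flink 1.11 universal Kafka connector and a plain string payload (the topic name, broker list, and sample record are placeholders, not the real job):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class FraudEventsSinkSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the real fraud-detection pipeline output (~80 events/sec).
        DataStream<String> alerts = env.fromElements("{\"alert\":\"example\"}");

        Properties producerProps = new Properties();
        // Placeholder broker list; the single partition / replication factor 3 is configured on the topic itself.
        producerProps.setProperty("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");

        // Flink 1.11 universal Kafka connector, writing plain strings to the topic.
        alerts.addSink(new FlinkKafkaProducer<>(
                "fraud-events",              // hypothetical topic name
                new SimpleStringSchema(),
                producerProps));

        env.execute("fraud-detection-sink");
    }
}
```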
Any suggestions on what could be causing this are welcome.
A few weeks ago my project was updated to use Kafka 3.2.1 instead of the version that ships with Spring Boot 2.7.3 (3.1.1). We made this upgrade to avoid an issue in Kafka Streams: illegal state and argument exceptions were not ending up in the uncaught exception handler.
On the consumer side, we also moved to the cooperative sticky assignor.
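Concretely, that change corresponds to the standard Kafka consumer assignment-strategy property, sketched below (the broker address and group id are placeholders; in a Spring Boot application the same property would typically be supplied through configuration rather than code):

```java
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

public class ConsumerAssignorConfig {

    // Switch the partition assignment strategy to the cooperative sticky assignor.
    // Broker address and group id below are placeholders.
    public static Map<String, Object> consumerProps() {
        return Map.of(
                ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092",
                ConsumerConfig.GROUP_ID_CONFIG, "resiliency-test-group",
                ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
    }
}
```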
In parallel, we started some resiliency tests, and we began to see Kafka records that are no longer consumed on some partitions when using a Kafka batch listener. The issue occurred after several rebalances caused by the test (deployment is done in Kubernetes, and we stopped some pods, microservices, and broker instances). The issue is not present on every listener. The Kafka brokers and microservices are up and running.
During our investigations:
we enabled Kafka events and can clearly see that the consumer is started;
we can see in the logs that the partitions that are not consuming events are assigned;
debug logging was enabled on the KafkaMessageListenerContainer, and we see many occurrences of "Receive: 0 records" and "Commit list: {}".
Are there any known blocking points to using Kafka 3.2.1 with Spring Boot 2.7.3 / Spring Kafka 2.8.8?
Any help or other advice to move our investigation forward is more than welcome.
Multiple listeners are defined, and the retry seems to be fired from another listener (a shared error handler?).
This is a known bug, fixed in the next release:
https://github.com/spring-projects/spring-kafka/issues/2382
https://github.com/spring-projects/spring-kafka/commit/3de1e89ba697ead04de171cfa35273bb0daddbe6
A temporary workaround is to give each container its own error handler.
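A sketch of that workaround, assuming Spring Kafka 2.8's DefaultErrorHandler and two hypothetical batch listener factories; the point is only that no error handler instance is shared between containers:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class ListenerFactoryConfig {

    // Each container factory gets its own DefaultErrorHandler instance, so retry
    // state is never shared across listeners.
    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> ordersFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true);
        factory.setCommonErrorHandler(new DefaultErrorHandler(new FixedBackOff(1000L, 2L)));
        return factory;
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> paymentsFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true);
        factory.setCommonErrorHandler(new DefaultErrorHandler(new FixedBackOff(1000L, 2L)));
        return factory;
    }
}
```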
I am testing exactly-once semantics in Kafka Streams by shutting down multiple brokers.
But when I restart the brokers, the same message is produced multiple times on the outbound topic.
I am using Confluent Platform 6.1.0.
The processing guarantee is set to exactly_once_beta and acks is set to all.
Can anyone please help me understand whether I am missing any configuration?
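For reference, a sketch of the configuration described above (the application id and broker list are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceStreamsConfig {

    // The configuration described above, spelled out. Application id and
    // broker list are placeholders.
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-test-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA);
        // Kafka Streams already enforces acks=all on its internal producers under
        // exactly-once, but it can also be set explicitly via the producer prefix.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");
        return props;
    }
}
```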
I have a five-node Kafka cluster (Confluent 5.5 Community Edition) with three ZooKeeper nodes, each on a different AWS instance.
While doing failover testing, I noticed that the Kafka cluster works fine even if all ZooKeeper nodes are down.
I was able to produce, consume, and also create new consumers.
Why does the Kafka cluster not stop if it cannot connect to any ZooKeeper node?
What would the possible issues be if we are unaware of such a failure scenario in production and the Kafka cluster continues to run without ZooKeeper connectivity?
How do we handle such a scenario?
Broker leader election, topic creation, and simple ACLs (if you use them) still depend on ZooKeeper. Other basic functions that rely on the Kafka bootstrap protocol might still work, sure. There should definitely be broker logs indicating that the ZooKeeper connection was lost.
Ideally you'd have basic process health checking and incident-management tooling in place so that you don't miss critical services going down in prod.
How to handle it? Restart ZooKeeper...
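As a sketch of the kind of basic check meant here, ZooKeeper's "ruok" four-letter command can be probed directly (host and port are placeholders; on ZooKeeper 3.5+ the command has to be allowed via 4lw.commands.whitelist):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkRuokCheck {

    // Minimal liveness probe: send "ruok" to a ZooKeeper node; a healthy server
    // answers "imok". This only proves the node responds, nothing more.
    public static boolean isZkHealthy(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 2000);
            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4];
            int total = 0;
            int read;
            while (total < 4 && (read = in.read(buf, total, 4 - total)) != -1) {
                total += read;
            }
            return total == 4 && "imok".equals(new String(buf, StandardCharsets.US_ASCII));
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder hostname and port.
        System.out.println("zk1 healthy: " + isZkHealthy("zk1.internal", 2181));
    }
}
```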
We are observing that Kafka brokers occasionally take much more time to load logs on startup than usual. Much longer in this case means 40 minutes instead of at most 1 minute. This happens during a rolling restart following the procedure described by Confluent, and after the broker has reported that controlled shutdown was successful.
Kafka Setup
Confluent Platform 5.5.0
Kafka Version 2.5.0
3 Replicas (minimum 2 in sync)
Controlled broker shutdown enabled
1TB of AWS EBS for Kafka log storage
Other potentially useful information
We make extensive use of Kafka Streams
We use exactly-once processing and transactional producers/consumers
Observations
It is not always the same broker that takes a long time.
It does not only occur when the broker is the active controller.
A log partition that loads quickly (15ms) can take a long time (9549 ms) for the same broker a day later.
We experienced this issue before on Kafka 2.4.0 but after upgrading to 2.5.0 it did not occur for a few weeks.
Does anyone have an idea what could be causing this? Or what additional information would be useful to track down the issue?
I've been having all sorts of instabilities related to Kafka and offsets: workers crashing on startup with exceptions related to invalid offsets, and other things I don't understand.
I read that it is recommended to migrate offsets so they are stored in Kafka instead of ZooKeeper. I found the following in the Kafka documentation:
Migrating offsets from ZooKeeper to Kafka
Kafka consumers in earlier releases store their offsets by default in ZooKeeper. It is possible to migrate these consumers to commit offsets into Kafka by following these steps:
1. Set offsets.storage=kafka and dual.commit.enabled=true in your consumer config.
2. Do a rolling bounce of your consumers and then verify that your consumers are healthy.
3. Set dual.commit.enabled=false in your consumer config.
4. Do a rolling bounce of your consumers and then verify that your consumers are healthy.
A roll-back (i.e., migrating from Kafka back to ZooKeeper) can also be performed using the above steps if you set offsets.storage=zookeeper.
http://kafka.apache.org/documentation.html#offsetmigration
But, again, I don't understand what this is instructing me to do. I don't see anywhere in my topology config where I configure where offsets are stored. Is it buried in the cluster yaml?
Any advice on whether storing offsets in Kafka, rather than ZooKeeper, is a good idea? And how can I perform this change?
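For concreteness, I believe step 1 of the quoted procedure corresponds to old (pre-0.9) high-level consumer properties along these lines, though I'm not sure where they would live in my topology (the ZooKeeper address and group id are placeholders):

```java
import java.util.Properties;

public class OldConsumerOffsetStorageProps {

    // Step 1 of the quoted migration procedure, expressed as old high-level
    // consumer properties. ZooKeeper address and group id are placeholders.
    public static Properties migrationPhaseOneProps() {
        Properties props = new Properties();
        props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
        props.put("group.id", "my-consumer-group");
        props.put("offsets.storage", "kafka");      // commit offsets to Kafka...
        props.put("dual.commit.enabled", "true");   // ...while still mirroring them to ZooKeeper
        return props;
    }
}
```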
At the time of this writing Storm's Kafka spout (see documentation/README at https://github.com/apache/storm/tree/master/external/storm-kafka) only supports managing consumer offsets in ZooKeeper. That is, all current Storm versions (up to 0.9.x and including 0.10.0 Beta) still rely on ZooKeeper for storing such offsets. Hence you should not perform the ZK->Kafka offset migration you referenced above because Storm isn't compatible yet.
You will need to wait until the Storm project -- specifically, its Kafka spout -- supports managing consumer offsets via Kafka (instead of ZooKeeper). And yes, in general it is better to store consumer offsets in Kafka rather than ZooKeeper, but alas Storm isn't there yet.
Update November 2016:
The situation in Storm has improved in the meantime. There's now a new, second Kafka spout that is based on Kafka's new 0.10 consumer client, which stores consumer offsets in Kafka (and not in ZooKeeper): https://github.com/apache/storm/tree/master/external/storm-kafka-client.
However, at the time I am writing this, there are still several issues being reported by users on the storm-user mailing list (such as "Urgent help! kafka-spout stops fetching data after running for a while"), so I'd use this new Kafka spout with care, and only after thorough testing.