dealing with Kafka's exactly once processing edge-cases - apache-kafka

Folks,
Trying to do a POC for processing messages using Kafka for an implementation which absolutely requires only once processing. Example: as a payment system, process a credit card transaction only once
What edge cases should we protect against?
One failure scenario covered here is:
1.) If a consumer fails, and does not commit that it has read through a particular offset, the message will be read again.
Lets say consumers live in Kubernetes pods, and one of the hosts goes offline. We will potentially have messages that have been processed, but not marked as processed in Kafka before the pods went away due to underlying hardware issue. Do i understand this error scenario correctly?
Are there other failure scenarios which we need to fully understand on the producer/consumer side when thinking of Kafka doing only-once processing?
Thanks!

im going to basically repeat and exand on an answer i gave here:
a few scenarios can result in duplication:
consumers only periodically checkpoint their positions. a consumer crash can result in duplicate processing of some range or records
producers have client-side timeouts. this means the producer may think a request timed out and re-transmit while broker-side it actually succeeded.
if you mirror data between kafka clusters thats usually done with a producer + consumer pair of some sort that can lead to more duplication.
there are also scenarios that end in data loss - look up "unclean leader election" (disabling that trades with availability).
also - kafka "exactly once" configurations only work if all you inputs, outputs, and side effects happen on the same kafka cluster. which often makes it of limited use in real life.
there are a few kafka features you could try using to reduce the likelihood of this happening to you:
set enable.idempotence to true in your producer configs (see https://kafka.apache.org/documentation/#producerconfigs) - incurs some overhead
use transactions when producing - incurs overhead and adds latency
set transactional.id on the producer in case your fail over across machines - gets complicated to manage at scale
set isolation.level to read_committed on the consumer - adds latency (needs to be done in combination with 2 above)
shorten auto.commit.interval.ms on the consumer - just reduces the window of duplication, doesnt really solve anything. incurs overhead at really low values.
I have to say that as someone who's been maintaining a VERY large kafka installation for the past few years I'd never use a bank that relied on kafka for its core transaction processing though ...

Related

Consuming messages in a Kafka topic ASAP

Imagine a scenario in which a producer is producing 100 messages per second, and we're working on a system that consuming messages ASAP matters a lot, even 5 seconds delay might result in a decision not to take care of that message anymore. also, the order of messages does not matter.
So I don't want to use a basic queue and a single pod listening on a single partition to consume messages, since in order to consume a message, the consumer needs to make multiple remote API calls and this might take time.
In such a scenario, I'm thinking of a single Kafka topic, with 100 partitions. and for each partition, I'm gonna have a separate machine (pod) listening for partitions 0 to 99.
Am I thinking right? this is my first project with Kafka. this seems a little weird to me.
For your use case, think of partitions = max number of instances of the service consuming data. Don't create extra partitions if you'll have 8 instances. This will have a negative impact if consumers need to be rebalanced and probably won't give you any performace improvement. Also 100 messages/s is very, very little, you can make this work with almost any technology.
To get the maximum performance I would suggest:
Use a round robin partitioner
Find a Parallel consumer implementation for your platform (for jvm)
And there a few producer and consumer properties that you'll need to change, but they depend your environment. For example batch.size, linger.ms, etc. I would also check about the need to set acks=all as it might be ok for you to lose data if a broker dies given that old data is of no use.
One warning: In Java, the standard kafka consumer is single threaded. This surprises many people and I'm not sure if the same is true for other platforms. So having 100s of partitions won't give any performance benefit with these consumers, and that's why it's important to use a Parallel Consumer.
One more warning: Kafka is a complex broker. It's trivial to start using it, but it's a very bumpy journey to use it correctly.
And a note: One of the benefits of Kafka is that it keeps the messages rather than delete them once they are consumed. If messages older than 5 seconds are useless for you, Kafka might be the wrong technology and using a more traditional broker might be easier (activeMQ, rabbitMQ or go to blazing fast ones like zeromq)
Your bottleneck is your application processing the event, not Kafka.
when you have ten consumers, there is overhead for connecting each consumer to Kafka so it will lower the performance.
I advise focusing on your application performance rather than message broker.
Kafka p99 Latency is 5 ms with 200 MB/s load.
https://developer.confluent.io/learn/kafka-performance/

Kafka - how to avoid losing data in emergency situations

Recently, we had a production incident when Kafka consumers were repeatedly processing the same Kafka records again and again, and Kafka was rebalancing all the time. But I do not want to write here about this issue - we resolved it (by lowering the max-poll-records) and it works fine, now.
But the incident made me wonder - could we have lost some messages during this incident?
For instance: The documentation for auto-offset-reset says that this parameter applies "...if an offset is out of range". According to Kafka auto.offset.reset query it may happen e.g. "if the Consumer offset is less than the smallest offset". That is, if we had auto-offset-reset=latest and topic cleanup was triggered during the incident, we could have lost all the unprocessed data in the topic (because the offset would be set to the end of the topic, in this case). Therefore, IMO, it is never a good idea to have auto-offset-reset=latest if you need at-least-once delivery.
Actually, there are plenty of other situations where there is a threat of data loss in Kafka if not everything is set up correctly. For instance:
When the schema registry is not available, messages can get lost:
How to avoid losing messages with Kafka streams
After application restart, unprocessed messages are skipped despite that auto-offset-reset=earliest. We had this problem too in a topic (=not in every topic). Perhaps this is the same case.
etc.
Is there a cook-book how to set everything related to Kafka properly in order to make the application robust (with respect to Kafka) and prevent data loss? We've set up everything we consider important, but I'm not sure that we haven't overlooked something. And I cannot imagine all bad things that are possible in order to prevent them. For instance:
We have Kafka consumers with the same groupId running in different (geographically separated) networks. Does it matter? Nowadays probably not, but in the past probably yes, according to this answer.

Kafka as a message queue for long running tasks

I am wondering if there is something I am missing about my set up to facilitate long running jobs.
For my purposes it is ok to have At most once message delivery, this means it is not required to think about committing offsets (or at least it is ok to commit each message offset upon receiving it).
I have the following in order to achieve the competing consumer pattern:
A topic
X consumers in the same group
P partitions in a topic (where P >= X always)
My problem is that I have messages that can take ~15 minutes (but this may fluctuate by up to 50% lets say) in order to process. In order to avoid consumers having their partition assignments revoked I have increased the value of max.poll.interval.ms to reflect this.
However this comes with some negative consequences:
if some message exceeds this length of time then in a worst case scenario a the consumer processing this message will have to wait up to the value of max.poll.interval.ms for a rebalance
if I need to scale and increase the number of consumers based on load then any new consumers might also have to wait the value of max.poll.interval.ms for a rebalance to occur in order to process any new messages
As it stands at the moment I see that I can proceed as follows:
Set max.poll.interval.ms to be a small value and accept that every consumer processing every message will time out and go through the process of having assignments revoked and waiting a small amount of time for a rebalance
However I do not like this, and am considering looking at alternative technology for my message queue as I do not see any obvious way around this.
Admittedly I am new to Kafka, and it is just a gut feeling that the above is not desirable.
I have used RabbitMQ in the past for these scenarios, however we need Kafka in our architecture for other purposes at the moment and it would be nice not to have to introduce another technology if Kafka can achieve this.
I appreciate any advise that anybody can offer on this subject.
Using Kafka as a Job queue for scheduling long running process is not a good idea as Kafka is not a queue in the strictest sense and semantics for failure handling and retries are limited. Though you might be able to achieve a compromise by playing around with certain configuration for rebalance or timeout, it is likely to remain brittle design. Simple answer is that Kafka was not designed for these kind of usecases.
The idea of max.poll.interval.ms is to prevent livelock situation (see), but in your case, consumer will send a false positive to the Kafka broker and will trigger a rebalance as there is no way to distinguish between a livelock and a legitimate long process.
You should think about the tradeoffs between living with the negative consequences you mentioned Vs. introducing a new technology which helps you to model a job queue in a better way. For a more complex usecase, check out how slack is doing it.
The way we got around the issues we were having was as suggested in the comments.
We decided to decouple the message processing from the consumer polling.
On each worker/consumer there were 2 threads, one for doing the actual processing and the other for phoning home to Kafka periodically.
We also did some work with trying to reduce the processing times for messages.
However some messages still take time that can be measured in minutes.
This has worked for us now for some time with no issues.
Thanks for this suggestions in comments #Donal

Kafka KStream OutOfOrderSequenceException

Our application intermittently encounters OutOfOrderSequenceException in our streams code. Which causes stream thread to stop.
Implementation is simple, 2 KStreams join and output to another topic.
When searching for a solution to this OutOfOrderSequenceException
I have found below documentation on Confluent
https://docs.confluent.io/current/streams/concepts.html#out-of-order-handling
But could not find what settings, config or trade-offs are being referred here ?
How to manually do bookkeeping ?
If users want to handle such out-of-order data, generally they need to
allow their applications to wait for longer time while bookkeeping
their states during the wait time, i.e. making trade-off decisions
between latency, cost, and correctness. In Kafka Streams, users can
configure their window operators for windowed aggregations to achieve
such trade-offs (details can be found in the Developer Guide).
From the JavaDocs of OutOfOrderSequenceException:
This exception indicates that the broker received an unexpected sequence number from the producer, which means that data may have been lost. If the producer is configured for idempotence only (i.e. if enable.idempotence is set and no transactional.id is configured), it is possible to continue sending with the same producer instance, but doing so risks reordering of sent records. For transactional producers, this is a fatal error and you should close the producer.
Sequence numbers are internally assigned numbers to each message that is written into a topic.
Because it is an internal error, it's hard to tell what the root cause could be though.
Updates :
After updating Kafka Brokers and KStream version, issue seems to have subsided.
Also, as per the recommendation,
https://kafka.apache.org/10/documentation/streams/developer-guide/config-streams.html#recommended-configuration-parameters-for-resiliency
I have updated acks to all. replication factor was already 3.

Reliable fire-n-forget Kafka producer implementation strategy

I'm in middle of a 1st mile problem with Kafka. Everybody deals with partitioning, etc. but how to handle the 1st mile?
My system consists of many applications producing events distributed on nodes. I need to deliver these events to a set of applications acting as consumers in a reliable/fail-safe way. The messaging system of choice is Kafka (due its log nature) but it's not set in stone.
The events should be propagated in a decoupled fire-n-forget manner as most as possible. This means the producers should be fully responsible for reliable delivering their messages. This means apps producing events shouldn't worry about the event delivery at all.
Producer's reliability schema has to account for:
box connection outage - during an outage producer can't access network at all; Kafka cluster is thus not reachable
box restart - both producer and event producing app restart (independently); producer should persist in-flight messages (during retrying, batching, etc.)
internal Kafka exceptions - message size was too large; serialization exception; etc.
No library I've examined so far covers these cases. Is there a suggested strategy how to solve this?
I know there are retriable and non-retriable errors during Producer's send(). On those retriable, the library usually handles everything internally. However, non-retriable ends with an exception in async callback...
Should I blindly replay these to infinity? For network outages it should work but how about Kafka internal errors - say message too large. There might be a DeadLetterQueue-like mechanism + replay. However, how to deal with message count...
About the persistence - a lightweight DB backend should solve this. Just creating a persistent queue and then removing those already send/ACKed. However, I'm afraid that if it was this simple it would be already implemented in standard Kafka libraries long time ago. Performance would probably go south.
Seeing things like KAFKA-3686 or KAFKA-1955 makes me a bit worried.
Thanks in advance.
We have a production system whose primary use case is reliable message delivery. I can't go in much detail, however i can share a high level design on how we achieve this. However this system is guarantees "atleast once delivery" messaging sematics.
Source
First we designed a message schema, and all the message sent to this
system must follow it.
Then we write the message to the a mysql message table, which is sharded by
date, with a field marked as delivered or not
We have a app constantly polling db, with rows marked un-delivered, picks up a row, constructs the message and send it to the load balancer, this is a blocking call and
updates the message row to delivered, only when returned 200
In case of 5xx, the app will retry the message with sleep back off. Also you can make the retries configurable as per your need.
Each source system maintains their own polling app and db.
Producer Array
This is basically a array of machines under a load balancer waiting for incoming messages and produce those to the Kafka Cluster.
We maintain 3 replicas of each topic and in the producer Config we keep acks = -1 , which is very important for your fire-n-forget requirement. As per the doc
acks=all This means the leader will wait for the full set of in-sync
replicas to acknowledge the record. This guarantees that the record
will not be lost as long as at least one in-sync replica remains
alive. This is the strongest available guarantee. This is equivalent
to the acks=-1 setting
As I said producing is a blocking call, and it will return 2xx if the message is produced succesfully across all 3 replicas.
4xx, if message is doesn't meet the schema requirements
5xx, if the kafka broker threw some exception.
Consumer Array
This is a normal array of machines, running Kafka High level Consumers for the topic's consumer groups.
We are currently running this setup with few additional components for some other functional flows in production and it is basically fire-n-forget from the source point of view.
This system addresses all of your concerns.
box connection outage : Unless the source polling app gets 2xx,it
will produce again-again which may lead to duplicates.
box restart : Due to retry mechanism of the source , this shouldn't be a problem as well.
internal Kafka exceptions : Taken care by polling app, as producer array will reply with 5xx unable to produce, and will be further retried.
Acks = -1, also ensures that all the replicas are in-sync and have a copy of the message, so broker going down will not be a issue as well.