Is message deduplication essential on the Kafka consumer side? - apache-kafka

Kafka documentation states the following as the top scenario:
To process payments and financial transactions in real-time, such as
in stock exchanges, banks, and insurances
Also, regarding the main concepts, right at the very top:
Kafka provides various guarantees such as the ability to process
events exactly-once.
It’s funny the document says:
Many systems claim to provide "exactly once" delivery semantics, but
it is important to read the fine print, most of these claims are
misleading…
It seems obvious that payments/financial transactions must be processed „exactly-once“, but the rest of Kafka documentation doesn't make it obvious how this should be accomplished.
Let’s focus on the producer/publisher side:
If a producer attempts to publish a message and experiences a network
error it cannot be sure if this error happened before or after the
message was committed. This is similar to the semantics of inserting
into a database table with an autogenerated key. … Since 0.11.0.0, the
Kafka producer also supports an idempotent delivery option which
guarantees that resending will not result in duplicate entries in the
log.
KafkaProducer only ensures that it doesn’t incorrectly resubmit messages (resulting in duplicates) itself. Kafka cannot cover the case where client app code crashes (along with KafkaProducer) and it is not sure if it previously invoked send (or commitTransaction in case of transactional producer) which means that application-level retry will result in duplicate processing.
Exactly-once delivery for other destination systems generally
requires cooperation with such systems, but Kafka provides the offset
which makes implementing this feasible (see also Kafka Connect).
The above statement is only partially correct, meaning that while it exposes offsets on the Consumer side, it doesn’t make exactly-once feasible at all on the producer side.
Kafka consume-process-produce loop enables exactly-once processing leveraging sendOffsetsToTransaction, but again cannot cover the case of the possibility of duplicates on the first producer in the chain.
The provided official demo for EOS (Exactly once semantics) only provides an example for consume-process-produce EOS.
Solutions involving DB transaction log readers which read already committed transactions, also cannot be sure if they will produce duplicate messages in case they crash.
There is no support for a distributed transaction (XA) involving a database and the Kafka producer.
Does all of this mean that in order to ensure exactly once processing for payments and financial transactions (Kafka top use case!), we absolutely must perform business-level message deduplication on the consumer side, inspite of the Kafka transport-level “guarantees”/claims?
Note: I’m aware of:
Kafka Idempotent producer
but I would like a clear answer if deduplication is inevitable on the consumer side.

You must deduplicate on consumer side since rebalance on consumer side can really cause processing of events more than once in a consumer group based on fetch size and commit interval parameters.
If a consumer exits without acknowledging back to broker, Kafka will assign those events to another consumer in the group. Example if you are pulling a batch size of 5 events, if consumer dies or goes for a restart after processing first 3(If the external api/db fails OR the worse case your server runs out of memory and crashes), the current consumer dies abruptly without making a commit back/ack to broker. Hence the same batch gets assigned to another consumer from group(rebalance) where it starts supplies the same event batch again which will result in re-processing of same set of records resulting in duplication. A good read here : https://quarkus.io/blog/kafka-commit-strategies/
You can make use of internal state store of Kafka for deduplication. Here there is no offset/partition tracking, its kind of cache(persistent time bound on cluster).
In my case we push correlationId(a unique business identifier in incoming event) into it on successful processing of events, and all new events are checked against this before processing to make sure its not a duplicate event. Enabling state store will create more internal topics in Kafka cluster, just an FYI.
https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html#state-stores

Related

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to kafka.I have a Kafka Stream using java microservice that consumes the messages from kafka topic produced by producer and processes. The kafka commit interval has been set using the auto.commit.interval.ms . My question is, before commit if the microservice crashes , what will happen to the messages that got processed but didn't get committed? will there be duplicated records? and how to resolve this duplication, if happens?
Kafka has exactly-once-semantics which guarantees the records will get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics. These delivery semantics can be decided on the basis of your use-case you've implemented.
If you're concerned that your messages should not get lost by consumer service - you should go ahead with at-lease once delivery semantic.
Now answering your question on the basis of at-least once delivery semantics:
If your consumer service crashes before committing the Kafka message, it will re-stream the message once your consumer service is up and running. This is because the offset for a partition was not committed. Once the message is processed by the consumer, committing an offset for a partition happens. In simple words, it says that the offset has been processed and Kafka will not send the committed message for the same partition.
at-least once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example - with a unique key in each message, a message can be rejected when writing duplicate data to the database.
There are mainly three types of delivery semantics,
At most once-
Offsets are committed as soon as the message is received at consumer.
It's a bit risky as if the processing goes wrong the message will be lost.
At least once-
Offsets are committed after the messages processed so it's usually the preferred one.
If the processing goes wrong the message will be read again as its not been committed.
The problem with this is duplicate processing of message so make sure your processing is idempotent. (Yes your application should handle duplicates, Kafka won't help here)
Means in case of processing again will not impact your system.
Exactly once-
Can be achieved for kafka to kafka communication using kafka streams API.
Its not your case.
You can choose semantics from above as per your requirement.

How to manage Kafka transactional producer objects in request oriented applications

What is the best practice for managing Kafka producer objects in request oriented (e.g. http or RPC servers) applications, when configured as transactional producers? Specifically, how to share producer objects among serving threads, and how to define the transactional.id configuration value for those objects?
In non-transactional usage, producer objects are thread safe and it is common to share one object among all request serving threads. It is also straightforward to setup transactional producer objects to be used by kafka consumer threads, just instantiating one object for each consumer thread works well.
Combining transactional producers with request oriented applications appears to be more complicated, as the life-cycle of serving threads is usually dynamically controlled by a thread pool. I can think of a few options, all with downsides:
Share a single object, protected against concurrency by some kind of mutex. Contention under load would probably be a serious problem.
Instantiate a producer object for each request coming in. KafkaProducer objects are slow to initialize, as they maintain network connections, threads, and other heavyweight objects; paying this cost for each request seems impractical.
Maintain a pool of producer objects, and lease one for each request. The main downside I can see is the amount of machinery required. It is also unclear how to configure transactional.id for these objects, as their lifecycle does not map cleanly to a shard identifier in a partitioned, stateful, application as the documentation says.
Are there other options? Is there an optimal approach?
TL;DR
The transactional id is for preventing duplicates caused by zombie processes in the read-process-write pattern where you read from and produce to kafka topics. For request oriented applications, e.g. messages being produced by an incoming http request, transactional id doesn't bring any benefit (of course you still need to assign one if you want to use transactions and shouldn't be repeated between producers in the same process or different processes in your cluster)
Long answer
As the docs say, transactional producers are not thread safe
As is hinted at in the example, there can be only one open transaction per producer. All messages sent between the beginTransaction() and commitTransaction() calls will be part of a single transaction
so as you correctly explained there can't be concurrent access to the producer so we must pick one of the three options you described.
For this answer I'm going to assume that request oriented applications corresponds to http requests as the mechanism is triggering a message being produced with a transaction (actually, more than one message, otherwise will be enough with idempotent producers and transactions won't be needed)
In terms of correctness all of them are ok as, option 1 would work but depending on your application throughput it could have a high contention, option 2 will also work but you will pay the price of a higher latency and won't be very efficient.
IMHO I think option 3 could be the best since is a compromise between of the two previous options, although of course requires a more careful implementation than just opening a new producer each time.
Transactional id
The question that remains is how to assign a transactional id to the producer, specially in the last case (although both options 1 and 3 share the same concern, since in both cases we are reusing a producer with the same transactional id to handle different requests).
To answer this we first need to understand that the goal of transactional.id is to protect us from having duplicate message being produced caused by zombie processes (a process that hangs for a while, e.g. bc of a long gc pause, and is considered dead but after a while comes back and continues), this is called zombie fencing.
An important detail to understand the need of zombie fencing is understanding in which use case it could happen and this is the read-process-write pattern where you read from a topic, process the element and write to an output topic and the offset topic, which give us atomicity and Exactly-once semantics (if you are not doing any side effects on the process step).
Idempotent producers prevent us from having duplicates caused by producer retries (where the message was persisted by the broker but the ack wasn't received by the producer) and two-phase commit within kafka (where we are not only writing to the output but also marked the message as consumed by also producing to the offset topic) prevent us from having duplicates caused by consuming the message more than once (if the process crashes after producing to the output topic but before committing the offset).
There is still a subtle case where a duplicate can be introduced and it is a zombie producer, which is fenced by monotonically increasing an epoch each time a producer calls initTransactions that will be send with every message the producer sends.
So, for a producer to be fenced, another producer should have being started with the same transaction id, the key here is explained by Jason Gustafson in this talk
"what we are looking for is a guarantee that for each input partition there is only a single write that is responsible for reading that data and writing the output"
This means the transactional.id is assigned in terms of the partition is being consumed in the "read-process-write" pattern.
So if a process that has assigned partition 0 of topic A is considered dead, a rebalance will kick off and the new process that is assigned should create a producer with the same transactional.id, that's why it should be something like this <prefix><group>.<topic>.<partition> as described in this answer, where the partition is part of the transactional.id. This also means a producer per partition assigned, which could also represent an overhead depending on how many topics and partitions your consumers are being assigned.
This slides from the talk clarifies this situation
Transactional id before process crash
Transactional id reassigned to other process after crash
Transactional id in http requests
Going back to your original question, http requests won't follow the read-process-write pattern where zombies can introduce duplicates, because each http request will be unique, even if you introduce a unique identifier it will be a different message from the point of view of the transactional producer.
In this case I would argue that you may still have value using the transactional producer if you want the atomicity of writing to two different topics, but you can choose a random transactional id for option 2, or reuse it for options 1 and 3.
UPDATE
My answer is outdated since is based in an old version of kafka.
The overhead of having one producer per partition described before was a concern that was tackled in KIP-447
This architecture does not scale well as the number of input partitions increases. Every producer come with separate memory buffers, a separate thread, separate network connections. This limits the performance of the producer since we cannot effectively use the output of multiple tasks to improve batching. It also causes unneeded load on brokers since there are more concurrent transactions and more redundant metadata management.
This is the main difference as explained in this post
When the partition assignment is finalized after a consumer group rebalance, the first step for the consumer is to always get the next offset to begin fetching data. With this observation, the OffsetFetch protocol protection is enhanced, such that when a consumer group has pending transactional offsets associated with one partition, the OffsetFetch call can be blocked until the associated transaction completes. Previously, the “outdated” offset data would be returned and the application allowed to continue immediately.
Whit this new feature, the use of transactional.id is no longer clear to me.
Although it is still unclear why fencing requires both blocking the poll if there are pending transactions while it seems to me that the sending the consumer group metadata should be enough (I assume a zombie producer will be fenced by commiting with an old generation.id for that group.id, the generation.id being bumped with each rebalance) it seems the transactional.id doesn't play a major role anymore. e.g. spring docs says
With mode V1, the producer is "fenced" if another instance with the same transactional.id is started. Spring manages this by using a Producer for each group.id/topic/partition; when a rebalance occurs a new instance will use the same transactional.id and the old producer is fenced.
With mode V2, it is not necessary to have a producer for each group.id/topic/partition because consumer metadata is sent along with the offsets to the transaction and the broker can determine if the producer is fenced using that information instead.

When to use Kafka transactional API?

I was trying to understand Kafka's transactional API. This link defines atomic read-process-write cycle as follows:
First, let’s consider what an atomic read-process-write cycle means. In a nutshell, it means that if an application consumes a message A at offset X of some topic-partition tp0, and writes message B to topic-partition tp1 after doing some processing on message A such that B = F(A), then the read-process-write cycle is atomic only if messages A and B are considered successfully consumed and published together, or not at all.
It further says says following:
Using vanilla Kafka producers and consumers configured for at-least-once delivery semantics, a stream processing application could lose exactly once processing semantics in the following ways:
The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus of the rest of this post.
We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.
Finally, in distributed environments, applications will crash or—worse!—temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of “zombie instances.”
We designed transaction APIs in Kafka to solve the second and third problems. Transactions enable exactly-once processing in read-process-write cycles by making these cycles atomic and by facilitating zombie fencing.
Doubts:
Points 2 and 3 above describe when message duplication can occur which are dealt with using transactional API. Does transactional API also help to avoid message loss in any scenario?
Most online (for example, here and here) examples of Kafka transactional API involve:
while (true)
{
ConsumerRecords records = consumer.poll(Long.MAX_VALUE);
producer.beginTransaction();
for (ConsumerRecord record : records)
producer.send(producerRecord(“outputTopic”, record));
producer.sendOffsetsToTransaction(currentOffsets(consumer), group);
producer.commitTransaction();
}
This is basically read-process-write loop. So does transactional API useful only in read-process-write loop?
This article gives example of transactional API in non read-process-write scenario:
producer.initTransactions();
try {
producer.beginTransaction();
producer.send(record1);
producer.send(record2);
producer.commitTransaction();
} catch(ProducerFencedException e) {
producer.close();
} catch(KafkaException e) {
producer.abortTransaction();
}
It says:
This allows a producer to send a batch of messages to multiple partitions such that either all messages in the batch are eventually visible to any consumer or none are ever visible to consumers.
Is this example correct and shows another way to use transactional API different from read-process-write loop? (Note that it also does not commit offset to transaction.)
In my application, I simply consume messages from kafka, do processing and log them to the database. That is my whole pipeline.
a. So, I guess this is not read-process-write cycle. Is Kafka transactional API of any use to my scenario?
b. Also I need to ensure that each message is processed exactly once. I guess setting idempotent=true in producer will suffice and I dont need transactional API, right?
c. I may run multiple instances of pipeline, but I am not writing processing output to Kafka. So I guess this will never involve zombies (duplicate producers writing to kafka). So, I guess transactional API wont help me to avoid duplicate processing scenario, right? (I might have to persist both offset along with processing output to the database in the same database transaction and read the offset during producer restart to avoid duplicate processing.)
a. So, I guess this is not read-process-write cycle. Is Kafka
transactional API of any use to my scenario?
It is a read-process-write, except you are writing to a database instead of Kafka. Kafka has its own transaction manager and thus writing inside a transaction with idempotency would enable exactly once processing, assuming you can resume the state of your consumer-write processor correctly. You cannot do that with a DB because the DB's transaction manager doesn't sync with Kafka's. What you can do instead is make sure that even if kafka transactions are not atomic with respect to your database, they are still eventually consistent.
Let's assume your consumer reads, writes to the DB and then acks. If the DB fails you don't ack and you can resume normally based on the offset. If the ack fails you will process twice and save to the DB twice. If you can make this operation idempotent, then you are safe. This means that your processor must be pure and the DB has to dedupe: processing the same message twice should always lead to the same result on the DB.
b. Also I need to ensure that each message is processed exactly once.
I guess setting idempotent=true in producer will suffice and I dont
need transactional API, right?
Assuming that you respect the requirements from point a, exactly once processing with persistence on a different store also requires that between your initial write and the duplicate no other change has happened to the objects that you are saving. Imagine having a value written as X, then some other actor changes it to Y, then the message is reprocessed and changes it back to X. This can be avoided for example, by making your database table be a log, similar to a kafka topic.
c. I may run multiple instances of pipeline, but I am not writing processing output to Kafka. So I guess this will never involve zombies (duplicate producers writing to kafka). So, I guess transactional API wont help me to avoid duplicate processing scenario, right? (I might have to persist both offset along with processing output to the database in the same database transaction and read the offset during producer restart to avoid duplicate processing.)
It is the producer which writes to the topic you consume from that may create zombie messages. That producer needs to play nice with kafka so that zombies are ignored. The transactional API together with your consumer will make sure that this producer writes atomically and your consumer reads committed messages, albeit not atomically. If you want exactly once idempotency is enough. If the messages are supposed to be atomically written you need transactions too. Either way your read-write/consume-produce processor needs to be pure and you have to dedupe. Your DB is also part of this processor since the DB is the one that actually persists.
I've looked for a bit on the internet, maybe this link helps you: processing guarantees
The links you posted: exactly once semantics and transactions in kafka are great.

How to handle various failure conditions in Kafka

Issue we were facing:
In our system we were logging a ticket in database with status NEW and also putting it in the kafka queue for further processing. The processors pick those tickets from kafka queue, do processing and update the status accordingly. We found that some tickets are left in NEW state forever. So we were guessing whether tickets are failing to get produced in the queue or are no getting consumed.
Message loss / duplication scenarios (and some other related points):
So I started to dig exhaustively to know in what all ways we can face message loss and duplication in Kafka. Below I have listed all possible message loss and duplication scenarios that I can find in this post:
How data loss can occur in different approaches to handle all replicas down
Handle by waiting for leader to come online
Messages sent between all replica down and leader comes online are lost.
Handle by electing new broker as a leader once it comes online
If new broker is out of sync from previous leader, all data written between the
time where this broker went down and when it was elected the new leader will be
lost. As additional brokers come back up, they will see that they have committed
messages that do not exist on the new leader and drop those messages.
How data loss can occur when leader goes down, while other replicas may be up
In this case, the Kafka controller will detect the loss of the leader and elect a new leader from the pool of in sync replicas. This may take a few seconds and result in LeaderNotAvailable errors from the client. However, no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.
When a consumer may miss to consume a message
If Kafka is configured to keep messages for a day and a consumer is down for a period of longer than a day, the consumer will lose messages.
Evaluating different approaches to consumer consistency
Message might not be processed when consumer is configured to receive each message at most once
Message might be duplicated / processed twice when consumer is configured to receive each message at least once
No message is processed multiple times or left unprocessed if consumer is configured to receive each message exactly once.
Kafka provides below guarantees as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers.
Kafka makes the following guarantees about data consistency and availability:
Messages sent to a topic partition will be appended to the commit log in the order they are sent,
a single consumer instance will see messages in the order they appear in the log,
a message is ‘committed’ when all in sync replicas have applied it to their log, and
any committed message will not be lost, as long as at least one in sync replica is alive.
Approach I came up with:
After reading several articles, I felt I should do following:
If message is not enqueued, producer should resend
For this producer should listen for acknowledgement for each message sent. If no ackowledement is received, it can retry sending message
Producer should be async with callback:
As explained in last example here
How to avoid duplicates in case of producer retries sending
To avoid duplicates in queue, set enable.idempotence=true in producer configs. This will make producer ensure that exactly one copy of each message is sent. This requires following properties set on producer:
max.in.flight.requests.per.connection<=5
retries>0
acks=all (Obtain ack when all brokers has committed message)
Producer should be transactional
As explained here.
Set transactional id to unique id:
producerProps.put("transactional.id", "prod-1");
Because we've enabled idempotence, Kafka will use this transaction id as part of its algorithm to deduplicate any message this producer sends, ensuring idempotency.
Use transactions semantics: init, begin, commit, close
As explained here:
producer.initTransactions();
try {
producer.beginTransaction();
producer.send(record1);
producer.send(record2);
producer.commitTransaction();
} catch(ProducerFencedException e) {
producer.close();
} catch(KafkaException e) {
producer.abortTransaction();
}
Consumer should be transactional
consumerProps.put("isolation.level", "read_committed");
This ensures that consumer don't read any transactional messages before the transaction completes.
Manually commit offset in consumer
As explained here
Process record and save offsets atomically
Say by atomically saving both record processing output and offsets to any database. For this we need to set auto commit of database connection to false and manually commit after persisting both processing output and offset. This also requires setting enable.auto.commit to false.
Read initial offset (say to read after recovery from cache) from database
Seek consumer to this offset and then read from that position.
Doubts I have:
(Some doubts might be primary and can be resolved by implementing code. But I want words from experienced kafka developer.)
Does the consumer need to read the offset from database only for initial (/ first after consumer recovery) read or for all reads? I feel it needs to read offset from database only on restarts, as explained here
Do we have to opt for manual partitioning? Does this approach works only with auto partitioning off? I have this doubt because this example explains storing offset in MySQL by specifying partitions explicitly.
Do we need both: Producer side kafka transactions and consumer side database transactions (for storing offset and processing records atomically)? I feel for producer idempotence, we need producer to have unique transaction id and for that we need to use kafka transactional api (init, begin, commit). And as a counterpart, consumer also need to set isolation.level to read_committed. However can we ensure no message loss and duplicate processing without using kafka transactions? Or they are absolutely necessary?
Should we persist offset to external db as explained above and here
or send offset to transaction as explained here (also I didnt get what does it exactly mean by sending offset to transaction)
or follow sync async commit combo explained here.
I feel message loss / duplication scenarios 1 and 2 are handled by points 1 to 4 of approach I explained above.
I feel message loss / duplication scenario 3 is handled by point 6 of approach I explained above.
How do we implement different consumer consistency approaches as stated in message loss / duplication scenario 4? Is their any configuration or it needs to be implemented inside custom logic inside consumer?
Message loss / duplication scenario 5 says: "Kafka provides below guarantees as long as you are producing to one partition and consuming from one partition."? Is it something to concern about while building correct system?
Is any consideration unnecessary/redundant in the approach I came up with above? Also did I miss any necessary consideration? Did I miss any message loss / duplication scenarios?
Is their any other standard / recommended / preferable approach to ensure no message loss and duplicate processing than what I have thought above?
Do I have to actually code above approach using kafka APIs? or is there any high level API built atop kafka API which allows to easily ensure no message loss and duplicate processing?
Looking at issue we were facing (as stated at very beginning), we were thinking if we can recover any lost/unprocessed messages from files in which kafka stores messages. However that isnt correct, right?
(Extremely sorry for such an exhaustive post but wanted to write question which will ask all related question at one place allowing to build big picture of how to build system around kafka.)

What atomicity guarantees - if any - does Kafka have regarding batch writes?

We're now moving one of our services from pushing data through legacy communication tech to Apache Kafka.
The current logic is to send a message to IBM MQ and retry if errors occur. I want to repeat that, but I don't have any idea about what guarantees the broker provide in that scenario.
Let's say I send 100 messages in a batch via producer via Java client library. Assuming it reaches the cluster, is there a possibility only part of it be accepted (e.g. a disk is full, or some partitions I touch in my write are under-replicated)? Can I detect that problem from my producer and retry only those messages that weren't accepted?
I searched for kafka atomicity guarantee but came up empty, may be there's a well-known term for it
When you say you send 100 messages in one batch, you mean, you want to control this number of messages or be ok letting the producer batch a certain amount of messages and then send the batch ?
Because not sure you can control the number of produced messages in one producer batch, the API will queue them and batch them for you, but without guarantee of batch them all together ( I'll check that though).
If you're ok with letting the API batch a certain amount of messages for you, here is some clues about how they are acknowledged.
When dealing with producer, Kafka comes with some kind of reliability regarding writes ( also "batch writes")
As stated in this slideshare post :
https://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign (83)
The original list of messages is partitioned (randomly if the default partitioner is used) based on their destination partitions/topics, i.e. split into smaller batches.
Each post-split batch is sent to the respective leader broker/ISR (the individual send()’s happen sequentially), and each is acked by its respective leader broker according to request.required.acks
So regarding atomicity.. Not sure the whole batch will be seen as atomic regarding the above behavior. Maybe you can assure to send your batch of message using the same key for each message as they will go to the same partition, and thus maybe become atomic
If you need more clarity about acknowlegment rules when producing, here how it works As stated here https://docs.confluent.io/current/clients/producer.html :
You can control the durability of messages written to Kafka through the acks setting.
The default value of "1" requires an explicit acknowledgement from the partition leader that the write succeeded.
The strongest guarantee that Kafka provides is with "acks=all", which guarantees that not only did the partition leader accept the write, but it was successfully replicated to all of the in-sync replicas.
You can also look around producer enable.idempotence behavior if you aim having no duplicates while producing.
Yannick