Difference between kafka idempotent and transactional producer setup? - apache-kafka

When setting up a kafka producer to use idempotent behaviour, and transactional behaviour:
I understand that for idempotency we set:
enable.idempotence=true
and that by changing this one flag on our producer, we are guaranteed exactly-once event delivery?
And for transactions, we must go further and set transactional.id=<some value>,
but does setting this value also set idempotence to true?
Also, by setting one or both of the above, the producer will also set acks=all.
With the above, should I be able to add 'exactly-once delivery' simply by changing the enable.idempotence setting? If I wanted to go further and enable transactional support, would I only need to change one setting on the consumer side, isolation.level=read_committed? Does this image reflect how to set up the producer in terms of EOS?

Yes you understood the main concepts.
By enabling idempotence, the producer automatically sets acks to all and guarantees that each message is written exactly once, in order, to its partition for the lifetime of the producer instance, even when sends are retried.
By enabling transactions, the producer automatically enables idempotence (and acks=all). Transactions let you group produce requests and offset commits and ensure that either all or none of them are committed to Kafka.
When using transactions, you can configure whether consumers should only see records from committed transactions by setting isolation.level to read_committed; otherwise, by default, they see all records, including those from aborted transactions.
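As a rough sketch of the settings discussed above, the three configurations could be built as plain Properties objects. The class and method names are made up for illustration, serializer settings are omitted, and the producer-side key is spelled transactional.id. The objects would be passed to the KafkaProducer or KafkaConsumer constructor of the real client.

```java
import java.util.Properties;

// Hypothetical helper class; only the config keys and values are Kafka's own.
final class ProducerConfigSketch {

    // Idempotent producer: one flag. acks=all is implied when idempotence is on.
    static Properties idempotentProducer(String bootstrapServers) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", bootstrapServers);
        p.setProperty("enable.idempotence", "true");
        return p;
    }

    // Transactional producer: idempotence plus a transactional.id.
    static Properties transactionalProducer(String bootstrapServers, String transactionalId) {
        Properties p = idempotentProducer(bootstrapServers);
        p.setProperty("transactional.id", transactionalId);
        return p;
    }

    // Consumer that only sees records from committed transactions.
    static Properties readCommittedConsumer(String bootstrapServers, String groupId) {
        Properties c = new Properties();
        c.setProperty("bootstrap.servers", bootstrapServers);
        c.setProperty("group.id", groupId);
        c.setProperty("isolation.level", "read_committed");
        // Offsets are expected to be committed through the transaction instead.
        c.setProperty("enable.auto.commit", "false");
        return c;
    }
}
```

Setting enable.idempotence explicitly on the transactional producer is redundant, since a transactional.id implies it, but it does no harm and makes the intent visible.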

Actually, idempotence by itself does not always guarantee exactly-once event delivery. Say you have a consumer that consumes an event, processes it, and produces an event. Somewhere in this process the offset that the consumer uses must be advanced and persisted. Without a transactional producer, if that happens before the producer sends its message, the message might never be sent, and you get at-most-once delivery. If you do it after the message is sent, you might fail to persist the offset; the consumer would then read the same message again and the producer would send a duplicate, so you get at-least-once delivery. The all-or-nothing mechanism of a transactional producer prevents this scenario provided that you store your offsets in Kafka: the new message and the advance of the consumer's offset become one atomic action.
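The three orderings described above can be compared with a small in-memory simulation. This is only a sketch: it does not use the Kafka client, the class name and the single simulated crash are assumptions, and the ATOMIC mode models what a read_committed consumer would observe downstream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of a consume-process-produce loop with one crash.
final class DeliverySimulation {
    enum Mode { COMMIT_FIRST, SEND_FIRST, ATOMIC }

    // Processes every input record, crashing once between the two steps
    // while handling the record at crashIndex, then restarting from the
    // committed offset. Returns what ends up visible in the output.
    static List<String> run(List<String> input, Mode mode, int crashIndex) {
        List<String> output = new ArrayList<>();
        int committed = 0;        // next offset to read, as persisted
        boolean crashed = false;  // the crash happens only once
        while (committed < input.size()) {
            int offset = committed;  // (re)start from the committed position
            String result = input.get(offset).toUpperCase();
            boolean crashNow = !crashed && offset == crashIndex;
            switch (mode) {
                case COMMIT_FIRST:
                    committed = offset + 1;
                    if (crashNow) { crashed = true; continue; } // output never sent
                    output.add(result);
                    break;
                case SEND_FIRST:
                    output.add(result);
                    if (crashNow) { crashed = true; continue; } // offset never persisted
                    committed = offset + 1;
                    break;
                case ATOMIC:
                    if (crashNow) { crashed = true; continue; } // aborted: neither applied
                    output.add(result);
                    committed = offset + 1;
                    break;
            }
        }
        return output;
    }
}
```

With input a, b, c and a crash on the second record, COMMIT_FIRST loses B, SEND_FIRST emits B twice, and ATOMIC emits each record once.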

Related

Kafka exactly-once delivery semantics on error scenarios

The documentation states:
From Kafka 0.11, the KafkaProducer supports two additional modes: the idempotent producer
and the transactional producer. The idempotent producer strengthens Kafka's delivery
semantics from at least once to exactly once delivery.
...
To take advantage of the idempotent producer, it is imperative to avoid application
level re-sends since these cannot be de-duplicated. As such, if an application enables
idempotence, it is recommended to leave the retries config unset, as it will be defaulted
to Integer.MAX_VALUE. Additionally, if a send(ProducerRecord) returns an error even with
infinite retries (for instance if the message expires in the buffer before being sent),
then it is recommended to shut down the producer and check the contents of the last
produced message to ensure that it is not duplicated.
Finally, the producer can only guarantee idempotence for messages sent within a single session.
I don't exactly understand how to avoid application-level re-sends in failure scenarios, particularly when ACKs are lost due to a network error in combination with the producer app being down.
Could you point me to the strategies used to ensure exactly-once delivery?

How does Kafka guarantee consumers don't read a single message twice?

How does Kafka guarantee consumers don't read a single message twice?
Or is the above scenario possible?
Could the same message be read twice by single or by multiple consumers?
There are several scenarios that cause a consumer to consume a duplicate message:
The producer published the message successfully but did not receive the acknowledgement, which causes it to retry the same message.
The producer published a batch of messages but the batch was only partially written. In that case it retries and resends the same batch, which causes duplicates.
Consumers receive a batch of messages from Kafka and manually commit their offsets (enable.auto.commit=false).
If a consumer fails before committing to Kafka, it will consume the same records again the next time, which produces duplicates on the consumer side.
To guarantee that duplicate messages are not consumed, the job's execution and the offset commit must be atomic; this is what gives exactly-once delivery semantics on the consumer side.
You can use the parameters below to achieve exactly-once semantics. Please be aware that this comes at a cost in performance.
Enable idempotence on the producer side, which guarantees that the same message is not published twice:
enable.idempotence=true
Set the transaction isolation level (isolation.level) to read_committed:
isolation.level=read_committed
In Kafka Streams, the above settings can be achieved by enabling the exactly-once
processing guarantee, which makes the work a single transactional unit.
Idempotent
Idempotent delivery enables producers to write messages to Kafka exactly once to a particular partition of a topic during the lifetime of a single producer, without data loss and with ordering preserved per partition.
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
Rather than just writing messages to Kafka, the producer wraps its sends in beginTransaction, commitTransaction, and abortTransaction (in case of failure). The consumer uses isolation.level, set to either read_committed or read_uncommitted.
read_committed: Consumers will always read committed data only.
read_uncommitted: Read all messages in offset order without waiting
for transactions to be committed
For more detail, please refer to the reference.
It is absolutely possible if you don't make your consume process idempotent.
For example, say you are implementing at-least-once delivery semantics, so you first process messages and then commit offsets. The offset commit may fail because of a server failure or a rebalance (maybe your consumer's partitions are revoked at that time). So when you poll again, you will get the same messages a second time.
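One way to make the consume process idempotent is to remember which messages have already been applied and skip redeliveries. The sketch below is a minimal in-memory version; the class name and the idea of a string message id (for example built from topic, partition, and offset) are assumptions, and a real system would need to store the ids durably and update them together with the side effect.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical idempotent handler: a redelivered message id is ignored.
final class IdempotentHandler {
    private final Set<String> processedIds = new HashSet<>();
    private final List<String> sideEffects = new ArrayList<>();

    // Returns true if the message was applied, false if it was a redelivery.
    boolean handle(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // id already recorded, so skip the side effect
        }
        sideEffects.add(payload); // stands in for the real processing step
        return true;
    }

    List<String> sideEffects() {
        return sideEffects;
    }
}
```

If the same record is polled again after a failed offset commit, handle returns false and the side effect is not repeated.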
To be precise, this is what Kafka guarantees:
Kafka provides order guarantee of messages in a partition
Produced messages are considered "committed" when they were written to the partition on all its in-sync replicas
Messages that are committed will not be lost as long as at least one replica remains alive
Consumers can only read messages that are committed
Regarding consuming messages, the consumers keep track of their progress in a partition by saving the last offset read in an internal compacted Kafka topic.
Kafka consumers can automatically commit offsets if enable.auto.commit is enabled. However, an offset may then be committed before its message has been fully processed, which can give "at most once" semantics. Hence, the flag is usually disabled and the developer commits the offset explicitly once processing is complete.

Difference between idempotence and exactly-once in Kafka Stream

I was going through the documentation, and what I understood is that we can achieve exactly-once delivery by setting enable.idempotence=true:
idempotence: The Idempotent producer enables exactly once for a
producer against a single topic. Basically each single message send
has stronger guarantees and will not be duplicated in case there's an
error
So if we already have idempotence, why do we need another property, exactly-once, in Kafka Streams? What exactly is the difference between idempotence and exactly-once?
Why is the exactly-once property not available in the normal Kafka producer?
In a distributed environment, failure is a very common scenario that can happen at any time. In a Kafka environment, a broker can crash, the network can fail, processing can fail, and publishing or consuming messages can fail.
These different scenarios introduce different kinds of data loss and duplication.
Failure scenarios
A (Ack failed): The producer published a message successfully with retries > 1 but did not receive the acknowledgement due to a failure. In that case the producer retries the same message, which may introduce a duplicate.
B (Producer process failed during a batch): The producer was sending a batch of messages and failed after only some were published successfully. Once the producer restarts, it republishes all messages from the batch, which introduces duplicates in Kafka.
C (Fire-and-forget failed): The producer published a message with retries=0 (fire and forget). In case of failure the producer is not aware of it and sends the next message, so the message is lost.
D (Consumer failed during a batch): A consumer receives a batch of messages from Kafka and manually commits its offsets (enable.auto.commit=false). If the consumer fails before committing to Kafka, it will consume the same records again the next time, which produces duplicates on the consumer side.
Exactly-Once semantics
In this case, even if a producer tries to resend a message, the
message will be published and consumed by consumers exactly once.
To achieve exactly-once semantics, Kafka uses the 3 properties below:
enable.idempotence=true (address a, b & c)
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5 (the producer will have at most five in-flight requests per connection)
isolation.level=read_committed (address d )
Enable Idempotent(enable.idempotence=true)
Idempotent delivery enables the producer to write a message to Kafka exactly
once to a particular partition of a topic during the lifetime of a
single producer, without data loss and with ordering preserved per partition.
"Note that enabling idempotence requires MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION to be less than or equal to 5, RETRIES_CONFIG to be greater than 0 and ACKS_CONFIG be 'all'. If these values are not explicitly set by the user, suitable values will be chosen. If incompatible values are set, a ConfigException will be thrown"
To achieve idempotence, Kafka uses a unique id, called the producer id (PID), together with a sequence number when producing messages. The producer increments the sequence number on each message it publishes, and the broker tracks it against that PID. The broker always compares the incoming sequence number with the previous one and rejects the message unless the new number is exactly one greater: a number that is not greater indicates a duplicate, while a gap of more than one indicates lost messages.
So in a failure scenario, when a retried message arrives with a sequence number the broker has already seen, the broker rejects it instead of appending it again.
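The sequence-number check can be illustrated with a simplified in-memory model of a single partition. This is a sketch rather than the broker's actual implementation: the class and enum names are made up, and the real broker tracks sequences per batch and also uses a producer epoch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of one partition that deduplicates by (PID, sequence).
final class PartitionLogSketch {
    enum Result { APPENDED, DUPLICATE, OUT_OF_ORDER }

    private final Map<Long, Integer> lastSequenceByPid = new HashMap<>();
    private final List<String> log = new ArrayList<>();

    Result append(long producerId, int sequence, String value) {
        int last = lastSequenceByPid.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return Result.DUPLICATE;     // a retry of something already written
        }
        if (sequence != last + 1) {
            return Result.OUT_OF_ORDER;  // a gap: earlier messages are missing
        }
        log.add(value);
        lastSequenceByPid.put(producerId, sequence);
        return Result.APPENDED;
    }

    List<String> log() {
        return log;
    }
}
```

A retry of sequence 0 after a lost acknowledgement is reported as a duplicate and the log still contains the value only once.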
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
Rather than just writing messages to Kafka, the producer wraps its sends in beginTransaction, commitTransaction, and abortTransaction (in case of failure).
The consumer uses isolation.level, set to either read_committed or read_uncommitted.
read_committed: Consumers will always read committed data only.
read_uncommitted: Read all messages in offset order without waiting
for transactions to be committed
If a consumer with isolation.level=read_committed reaches a control message for a transaction that has not completed, it will not deliver any more messages from this partition until the producer commits or aborts the transaction or a transaction timeout occurs. The transaction timeout is determined by the producer using the configuration transaction.timeout.ms (default 1 minute).
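The difference between the two isolation levels can be sketched with a toy partition log in which every entry belongs to a transaction. The class and method names are assumptions, and real brokers implement this with control records and a last stable offset rather than a lookup map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory partition where every entry is part of a transaction.
final class IsolationSketch {
    enum TxState { OPEN, COMMITTED, ABORTED }

    private static final class Entry {
        final String value;
        final String txId;
        Entry(String value, String txId) { this.value = value; this.txId = txId; }
    }

    private final List<Entry> log = new ArrayList<>();
    private final Map<String, TxState> txStates = new HashMap<>();

    void append(String value, String txId) {
        log.add(new Entry(value, txId));
        txStates.putIfAbsent(txId, TxState.OPEN);
    }

    void commit(String txId) { txStates.put(txId, TxState.COMMITTED); }

    void abort(String txId) { txStates.put(txId, TxState.ABORTED); }

    // readCommitted=false returns everything in offset order.
    // readCommitted=true stops at the first entry of a still-open transaction
    // and skips entries of aborted transactions.
    List<String> read(boolean readCommitted) {
        List<String> visible = new ArrayList<>();
        for (Entry e : log) {
            TxState state = txStates.get(e.txId);
            if (readCommitted && state == TxState.OPEN) break;
            if (readCommitted && state == TxState.ABORTED) continue;
            visible.add(e.value);
        }
        return visible;
    }
}
```

In this model an open transaction blocks everything after its first record for a read_committed reader, which mirrors the waiting behaviour described above.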
Exactly-Once in Producer & Consumer
Under normal conditions we have separate producers and consumers. The producer has to be idempotent and at the same time manage transactions, so that consumers can set isolation.level to read_committed and the whole process behaves as an atomic operation.
This guarantees that the producer always stays in sync with the source system. Even if the producer crashes or a transaction is aborted, it remains consistent and publishes a message or a batch of messages as a unit, once.
Likewise, the consumer will receive a message or a batch of messages as a unit, once.
With exactly-once semantics, the producer together with the consumer appears as
an atomic operation that operates as one unit: messages are either published and
consumed exactly once, or aborted.
Exactly Once in Kafka Stream
Kafka Streams consumes messages from topic A, processes them, and publishes messages to topic B; once published, it uses a commit (commits mostly run under the hood) to flush all state store data to disk.
Exactly-once in Kafka Streams is a read-process-write pattern that guarantees this operation is treated as atomic. Since Kafka Streams covers the producer, the consumer, and the transaction all together, it comes with a special parameter, processing.guarantee, which can be exactly_once or at_least_once; this makes life easier because you do not have to handle all the parameters separately.
Kafka Streams atomically updates consumer offsets, local state stores,
state store changelog topics, and production to output topics all
together. If any one of these steps fails, all of the changes are
rolled back.
Setting processing.guarantee to exactly_once automatically provides the parameters below; you do not need to set them explicitly:
isolation.level=read_committed
enable.idempotence=true
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5
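For a Kafka Streams application, the single setting might look like the sketch below. The helper class is hypothetical, and the value exactly_once is the one named in this answer; newer Kafka releases also offer exactly_once_v2, so it is worth checking the documentation for the version in use.

```java
import java.util.Properties;

// Hypothetical helper; the keys are Kafka Streams configuration names.
final class StreamsConfigSketch {
    static Properties exactlyOnce(String applicationId, String bootstrapServers) {
        Properties p = new Properties();
        p.setProperty("application.id", applicationId);
        p.setProperty("bootstrap.servers", bootstrapServers);
        // One switch instead of configuring producer and consumer separately.
        p.setProperty("processing.guarantee", "exactly_once");
        return p;
    }
}
```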
Kafka stream offers the exactly-once semantic from the end-to-end point of view (consumes from one topic, processes that message, then produces to another topic). However, you mentioned only the producer's idempotent attribute. That is only a small part of the full picture.
Let me rephrase the question:
Why do we need the exactly-once delivery semantic at the consumer side
while we already have guaranteed the exactly-once delivery semantic at the
producer side?
Answer: Because exactly-once delivery semantics concern not only the producing step but the full processing flow. To achieve exactly-once delivery, some conditions must be satisfied on both the producing and the consuming side.
This is the generic scenario: Process A produces messages to the topic T. At the same time, process B tries to consume messages from the topic T. We want to make sure process B never processes one message twice.
Producer part: We must make sure that producers never produce a message twice. We can use Kafka Idempotent Producer
Consumer part:
Here is the basic workflow for the consumer:
Step 1: The consumer pulls the message M successfully from the Kafka topic.
Step 2: The consumer tries to execute the job and the job returns successfully.
Step 3: The consumer commits the message's offset to the Kafka brokers.
The steps above are just the happy path. Many issues arise in reality.
Scenario 1: The job in step 2 executes successfully, but then the consumer crashes. Because of this unexpected circumstance, the consumer has not yet committed the message's offset. When the consumer restarts, the message will be consumed a second time.
Scenario 2: While the consumer is committing the offset in step 3, it crashes due to a hardware failure (e.g. CPU, memory violation, ...). When it restarts, the consumer has no way to know whether it committed the offset successfully or not.
Because many problems can occur, the job's execution and the offset commit must be atomic to guarantee exactly-once delivery semantics on the consumer side. That doesn't mean we cannot do it ourselves, but it takes a lot of effort to get exactly-once delivery right. Kafka Streams takes on that work for engineers.
Note that Kafka Streams offers "exactly-once stream processing". This refers to consuming from a topic, materializing intermediate state in a Kafka topic, and producing to one. If our application depends on other external services (databases, services, ...), we must make sure those external dependencies can guarantee exactly-once in those cases.
TL;DR: exactly-once for the full flow needs cooperation between producers and consumers.
References:
Exactly-once semantics and how Apache Kafka does it
Transactions in Apache Kafka
Enabling exactly once Kafka streams

Idempotent and Transactions

I am exploring Transactions in Kafka, and I want to understand all the details.
I noticed in Spring-Kafka that idempotence is enabled when you provide a transactionalId.
public void setTransactionIdPrefix(String transactionIdPrefix) {
    Assert.notNull(transactionIdPrefix, "'transactionIdPrefix' cannot be null");
    this.transactionIdPrefix = transactionIdPrefix;
    enableIdempotentBehaviour();
}
At first glance, I assumed Spring-Kafka enabled idempotence for transactions because it is "good to have", to ensure exactly-once semantics in transactions.
I did a bit more digging and discovered that idempotence is required for transactions to work. This is mentioned in KIP-98:
Note that enable.idempotence must be enabled if a TransactionalId is
configured.
Kafka idempotence is a feature that avoids duplicated messages, for example those caused by a network error after the message has been sent.
My understanding is that Kafka transactions basically write to an internal topic, and idempotence has to be enabled to avoid duplicates there.
Idempotence enables exactly-once semantics for producers.
Transactions enable exactly-once semantics transitively: consume -> produce.
Is my understanding correct?
What enables exactly-once for consumers only? Committing offsets, idempotence, or transactions?
The idempotent producer enables exactly once for a producer against a single topic. Basically, each single message send has stronger guarantees and will not be duplicated in case there's an error.
The transactional producer, on the other hand, lets you group a number of sends (which can span many partitions) together and have all of them (or none) applied. Transactions can also contain offset commits (in the end, committing offsets is the same as writing to a topic).
Because consumers fetch data from Kafka, consumption is in a sense already exactly once. When the consumer asks Kafka for messages from offset N and does not receive them, it will just retry; there can't be any duplication. The only exactly-once need for consumers is in committing offsets, and that can be done by the transactional producer (the consumer needs to pass its current offsets to the producer).

Making Kafka consumers consume existing messages before subscription

Given a publisher and N consumers, if the consumers use auto.offset.reset=latest then they miss all the messages that were published to a topic before they subscribed to it ... It is a known fact that a consumer with auto.offset.reset=latest doesn't replay messages that existed in the topic before it subscribed...
So I would need either:
Make the publisher wait until all subscribers start consuming messages and only then start publishing. Dunno how to do that without leveraging Zookeeper, for instance. Does Kafka provide a means to do that?
Another way would be to keep the auto.offset.reset=latest consumers and make them explicitly consume all existing messages first whenever they are about to subscribe to a topic that already has messages...
What is the best practice for this case?
I guess that the consumer must check the topic for existing messages, consume them if there are any, and then start auto.offset.reset=latest consumption. That sounds like the best way to me ...
When a high-level consumer starts, it does the following:
look for committed offsets for its consumer group
a. if valid offsets are found, resume from there
b. if no valid offsets are found, set offsets according to auto.offset.reset
Thus, auto.offset.reset only triggers if no valid offset was committed. This behavior is intended and necessary to provide at-least-once processing guarantees in case of failure.
Thus, if you want to read a topic from its beginning, you can either use a new consumer group.id and set auto.offset.reset = earliest, or you can explicitly modify the offsets on startup using seekToBeginning() before you start your poll() loop.
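The startup decision described above can be written down as a small pure function. This is only an illustration of the rule, not the client's code: the class name, the parameters, and the use of IllegalStateException for any other setting are assumptions.

```java
import java.util.OptionalLong;

// Hypothetical model of how a consumer picks its starting position.
final class StartPositionSketch {

    // committed: the group's committed offset for the partition, if any.
    // logStart and logEnd: first available offset and the next offset to be written.
    static long resolve(OptionalLong committed, String autoOffsetReset,
                        long logStart, long logEnd) {
        if (committed.isPresent()) {
            long c = committed.getAsLong();
            if (c >= logStart && c <= logEnd) {
                return c; // a valid committed offset always wins
            }
        }
        // Only reached when there is no valid committed offset.
        switch (autoOffsetReset) {
            case "earliest":
                return logStart;
            case "latest":
                return logEnd;
            default:
                throw new IllegalStateException(
                        "no valid offset and auto.offset.reset=" + autoOffsetReset);
        }
    }
}
```

A brand-new group.id corresponds to the empty case, which is why pairing it with earliest replays the topic from the start.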
We went with option (1), using a service discovery feature provided by Eureka (any other service discovery app would do the job) plus aliasing. Basically, a publisher does not register itself (or start processing requests or publishing notifications) until at least one subscriber is available.