How does Kafka guarantee consumers doesn't read a single message twice? - apache-kafka

How does Kafka guarantee consumers doesn't read a single message twice?
Or is the above scenario possible?
Could the same message be read twice by single or by multiple consumers?

There are many scenarios which cause Consumer to consume the duplicate message
Producer published the message successfully but failed to acknowledge which cause to retry the same message
Producer publishing a batch of the message but failed partially published messages. In that case, it will retry and resent the same batch again which will cause duplicate
Consumers receive a batch of messages from Kafka and manually commit their offset (enable.auto.commit=false).
If consumers failed before committing to Kafka, next time Consumers will consume the same records again which reproduce duplicate on the consumer side.
To guarantee not to consume duplicate messages the job's execution and the committing offset must be atomic to guarantee exactly-once delivery semantic at the consumer side.
You can use the below parameter to achieve exactly one semantic. But please you have understood this comes with a compromise with performance.
enable idempotence on the producer side which will guarantee not to publish the same message twice
enable.idempotence=true
Defined Transaction (isolation.level) is read_committed
isolation.level=read_committed
In Kafka Stream above setting can be achieved by setting Exactly-Once
semantic true to make it as unit transaction
Idempotent
Idempotent delivery enables producers to write messages to Kafka exactly once to a particular partition of a topic during the lifetime of a single producer without data loss and order per partition.
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
The producer doesn't wait to write a message to Kafka whereas the Producer uses beginTransaction, commitTransaction, and abortTransaction(in case of failure) Consumer uses isolation. level either read_committed or read_uncommitted
read_committed: Consumers will always read committed data only.
read_uncommitted: Read all messages in offset order without waiting
for transactions to be committed
Please refer more in detail refrence

It is absolutely possible if you don't make your consume process idempotent.
For example; you are implementing at-least-one delivery semantic and firstly process messages and then commit offsets. It is possible to couldn't commit offsets because of server failure or rebalance. (maybe your consumer is revoked at that time) So when you poll you will get same messages twice.

To be precise, this is what Kafka guarantees:
Kafka provides order guarantee of messages in a partition
Produced messages are considered "committed" when they were written to the partition on all its in-sync replicas
Messages that are committed will not be losts as long as at least one replica remains alive
Consumers can only read messages that are committed
Regarding consuming messages, the consumers keep track of their progress in a partition by saving the last offset read in an internal compacted Kafka topic.
Kafka consumers can automatically commit the offset if enable.auto.commit is enabled. However, that will give "at most once" semantics. Hence, usually the flag is disabled and the developer commits the offset explicitly once the processing is complete.

Related

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to kafka.I have a Kafka Stream using java microservice that consumes the messages from kafka topic produced by producer and processes. The kafka commit interval has been set using the auto.commit.interval.ms . My question is, before commit if the microservice crashes , what will happen to the messages that got processed but didn't get committed? will there be duplicated records? and how to resolve this duplication, if happens?
Kafka has exactly-once-semantics which guarantees the records will get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics. These delivery semantics can be decided on the basis of your use-case you've implemented.
If you're concerned that your messages should not get lost by consumer service - you should go ahead with at-lease once delivery semantic.
Now answering your question on the basis of at-least once delivery semantics:
If your consumer service crashes before committing the Kafka message, it will re-stream the message once your consumer service is up and running. This is because the offset for a partition was not committed. Once the message is processed by the consumer, committing an offset for a partition happens. In simple words, it says that the offset has been processed and Kafka will not send the committed message for the same partition.
at-least once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example - with a unique key in each message, a message can be rejected when writing duplicate data to the database.
There are mainly three types of delivery semantics,
At most once-
Offsets are committed as soon as the message is received at consumer.
It's a bit risky as if the processing goes wrong the message will be lost.
At least once-
Offsets are committed after the messages processed so it's usually the preferred one.
If the processing goes wrong the message will be read again as its not been committed.
The problem with this is duplicate processing of message so make sure your processing is idempotent. (Yes your application should handle duplicates, Kafka won't help here)
Means in case of processing again will not impact your system.
Exactly once-
Can be achieved for kafka to kafka communication using kafka streams API.
Its not your case.
You can choose semantics from above as per your requirement.

Can we apply Kafka exactly-once semantics in read-process scenario?

How can we make sure Kafka exactly-once semantics in read-process scenario. read means we are reading from Kafka topic and doing some processing and then we are trying to commit the offset.
Lets suppose, we processed the messages but could not able to commit and before commit the process crashed. after restart, again trying to consume the same message. so how to handle such scenarios? Can this be handled with Kafka Transaction APIs?
There is similar question but not able to understand it properly and left few comments there as well. Just wanted to confirm my understanding.
Confused about Kafka exactly-once semantics
Kafka Transaction offers EOS for consume-process-produce scenarios. This exactly once process works by committing the offsets by producers instead of consumer. i.e., the produce of result to kafka and committing the consumed messages all are done by kafka producer (instead of separate kafka consumer and producer) which brings the exactly once. The EOS in kafka transaction ensures that for each consumed message we have exactly one result (the result may contain multiple messages) on the kafka, but the message could be processed multiple times in failure scenarios.
So you cannot achieve exactly once in read-process. The only solution you can use is to make your messages idempotence and change your business logic somehow that duplicate messages do not have side effect. e.g.:
-Using deduplicate process if you use database and check the duplicate value before insert or process and drop the incoming message.
-In some scenarios that duplicates affect you database, we can commit the offsets to database and by that make the data insertions and offset commits in one transaction.

Kafka Commit: Producer vs Consumer

It's been a couple of months that I am learning Kafka, and
I keep seeing the word "commit" come up in both producer as well as consumer contexts. It was confusing to me for a long time, but I think I have a better understanding now.
Would be great if someone could validate my understanding, or correct me if I am wrong/missing something in my below understanding:
commit in Producer:
Commit comes up in a producer context only when we are dealing with transactions. Here a commit means that a transactional producer has been able to successfully write a message to a partition in a topic.
commit in Consumer:
Kafka does not itself automatically track which consumer has read which message. A consumer needs to notify the broker that it has read a particular message in a topic. This acknowledgment process, by which a consumer notifies which message/partition in a topic it has read successfully (so that other consumers don't re-read that again) is known a "commit".
From the book Kafka: The definitive guide:
How does a consumer commit an offset? It produces a message to Kafka,
to a special __consumer_offsets topic, with the committed offset for
each partition
Also, another area where the word "commit" comes up in a Consumer setting is the "isolation.level" of a consumer, ie isolation.level=read_committed. This is however only in a transactional setting. When we are using a Transactional Producer, this isolation.level of the consumer will specify if it will read messages after they are "committed" by the producer or not. More details here
Again, would be great if someone could validate my understanding.
If my understanding is correct, a Producer commit does not only mean that a transactional producer has been able to successfully write records to a topic, but it also means that the consumer involved in the transaction (in other words, in the atomic read-process-write cycle) has also been able to successfully consume records from a topic.
Before calling producer.commitTransaction(), one should call producer.sendOffsetsToTransaction(Map<TopicPartition,OffsetAndMetadata> offsets, String consumerGroupId) so that the broker knows which records were consumed by the consumer during an atomic read-process-write cycle.
The following code shows a common processing pattern (read-process-write loop) when dealing with transactions :
while (true) {
ConsumerRecords records = consumer.poll(timeout);
producer.beginTransaction();
for (ConsumerRecord record : records)
producer.send(producerRecord(“outputTopic”, record));
producer.sendOffsetsToTransaction(currentOffsets(consumer), group);
producer.commitTransaction();
}
In short, in a transactional context, it is also the duty of the producer to "commit for the consumer".
More information about transactions can be found in this article

Kafka excatly-once producer consumer

I am implementing Exactly-once semantics for a simple data pipeline, with Kafka as message broker. I can force Kafka producer to write each produced record exactly once by setting set enable.idempotence=true.
However, on the consumption front I need to guarantee that the consumer reads each record exactly once (I am not interested in storing the consumed records to external system or to another Kafka topic just processing). To achieve this, I have to ensure that polled records are processed and their offsets are committed to __consumer_offsets topic atomically/transactionally (both succeed/fail together).
In such case do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop, where inside the transaction I perform: (1) processing of the consumed records and (2) committing their offsets, before closing the transaction. Would the normal commitSync/commitAsync serve in such case?
"on the consumption front I need to guarantee that the consumer reads each record exactly once"
The answer from Gopinath explains well how you can achieve exactly-once between a KafkaProducer and KafkaConsumer. These configurations (together with the application of Transaction API in the KafkaProducer) guarantees that all data send by the producer will be stored in Kafka exactly once. However, it does not guarantee that the Consumer is reading the data exactly once. This, of course, depends on your offset management.
Anyway, I understand your question that you want to know how the Consumer itself is processing a consumed message exactly once.
For this you need to manage your offsets on your own in a atomic way. That means, you need build your own "transaction" around
fetching data from Kafka,
processing data, and
storing processed offsets externally.
The methods commitSync and commitAsync will not get you far here as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial that your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
"do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop"
The Transaction API is only available for KafkaProducers and as far as I am aware cannot be used for your offset management.
'Exactly-once' functionality in Kafka can be achieved by a combination of these 3 settings:
isolation.level = read_committed
transactional.id = <unique_id>
processing.guarantee = exactly_once
More information on enabling the exactly-once functionality:
https://www.confluent.io/blog/simplified-robust-exactly-one-semantics-in-kafka-2-5/
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

Difference between idempotence and exactly-once in Kafka Stream

I was going through document what I understood we can achieve exactly-once transaction with enabling idempotence=true
idempotence: The Idempotent producer enables exactly once for a
producer against a single topic. Basically each single message send
has stonger guarantees and will not be duplicated in case there's an
error
So if already we have idempotence then why we need another property exactly-once in Kafka Stream? What exactly different between idempotence vs exactly-once
Why exactly-once property not available in normal Kafka Producer?
In a distributed environment failure is a very common scenario that can be happened any time. In the Kafka environment, the broker can crash, network failure, failure in processing, failure while publishing message or failure to consume messages, etc.
These different scenarios introduced different kinds of data loss and duplication.
Failure scenarios
A(Ack Failed): Producer published message successfully with retry>1 but could not receive acknowledge due to failure. In that case, the Producer will retry the same message that might introduce duplicate.
B(Producer process failed in batch messages): Producer sending a batch of messages it failed with few published success. In that case and once the producer will restart it will again republish all messages from the batch which will introduce duplicate in Kafka.
C(Fire & Forget Failed) Producer published message with retry=0(fire and forget). In case of failure published will not aware and send the next message this will cause the message lost.
D(Consumer failed in batch message) A consumer receives a batch of messages from Kafka and manually commit their offset (enable.auto.commit=false). If consumers failed before committing to Kafka, next time Consumers will consume the same records again which reproduce duplicate on the consumer side.
Exactly-Once semantics
In this case, even if a producer tries to resend a message, it leads
to the message will be published and consumed by consumers exactly once.
To achieve Exactly-Once semantic in Kafka, it uses below 3 property
enable.idempotence=true (address a, b & c)
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5(Producer will always have one in-flight request per connection)
isolation.level=read_committed (address d )
Enable Idempotent(enable.idempotence=true)
Idempotent delivery enables the producer to write a message to Kafka exactly
once to a particular partition of a topic during the lifetime of a
single producer without data loss and order per partition.
"Note that enabling idempotence requires MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION to be less than or equal to 5, RETRIES_CONFIG to be greater than 0 and ACKS_CONFIG be 'all'. If these values are not explicitly set by the user, suitable values will be chosen. If incompatible values are set, a ConfigException will be thrown"
To achieve idempotence Kafka uses a unique id which is called product id or PID and sequence number while producing messages. The producer keeps incrementing the sequence number on each message published which map with unique PID. The broker always compare the current sequence number with the previous one and it rejects if the new one is not +1 greater than the previous one which avoids duplication and same time if more than greater show lost in messages
In a failure scenario broker will compare the sequence numbers with the previous one and if the sequence not increased +1 will reject the message.
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
The producer doesn't wait to write a message to Kafka whereas the Producer uses beginTransaction, commitTransaction, and abortTransaction(in case of failure)
Consumer uses isolation.level either read_committed or read_uncommitted
read_committed: Consumers will always read committed data only.
read_uncommitted: Read all messages in offset order without waiting
for transactions to be committed
If a consumer with isolation.level=read_committed reaches a control message for a transaction that has not completed, it will not deliver any more messages from this partition until the producer commits or aborts the transaction or a transaction timeout occurs. The transaction timeout is determined by the producer using the configuration transaction.timeout.ms(default 1 minute).
Exactly-Once in Producer & Consumer
In normal conditions where we have separate producers and consumers. The producer has to idempotent and same time manage transactions so consumers can use isolation.level to read-only read_committed to make the whole process as an atomic operation.
This makes a guarantee that the producer will always sync with the source system. Even producer crash or a transaction aborted, it always is consistent and publishes a message or batch of the message as a unit once.
The same consumer will either receive a message or batch of the message as a unit once.
In Exactly-Once semantic Producer along with Consumer will appear as
atomic operation which will operate as one unit. Either publish and
get consumed once at all or aborted.
Exactly Once in Kafka Stream
Kafka Stream consumes messages from topic A, process and publish a message to Topic B and once publish use commit(commit mostly run undercover) to flush all state store data to disk.
Exactly-once in Kafka Stream is a read-process-write pattern that guarantees that this operation will be treated as an atomic operation. Since Kafka Stream caters producer, consumer and transaction all together Kafka Stream comes special parameter processing.guarantee which could exactly_once or at_least_once which make life easy not to handle all parameters separately.
Kafka Streams atomically updates consumer offsets, local state stores,
state store changelog topics, and production to output topics all
together. If anyone of these steps fails, all of the changes are
rolled back.
processing.guarantee: exactly_once automatically provide below parameters you no need to set explicitly
isolation.level=read_committed
enable.idempotence=true
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5
Kafka stream offers the exactly-once semantic from the end-to-end point of view (consumes from one topic, processes that message, then produces to another topic). However, you mentioned only the producer's idempotent attribute. That is only a small part of the full picture.
Let me rephrase the question:
Why do we need the exactly-once delivery semantic at the consumer side
while we already have guaranteed the exactly-once delivery semantic at the
producer side?
Answer: Since the exactly-once delivery semantic is not only at the producing step but the full flow of processing. To achieve the exactly-once delivery semantically, there are some conditions must be satisfied with the producing and consuming.
This is the generic scenario: Process A produces messages to the topic T. At the same time, process B tries to consume messages from the topic T. We want to make sure process B never processes one message twice.
Producer part: We must make sure that producers never produce a message twice. We can use Kafka Idempotent Producer
Consumer part:
Here is the basic workflow for the consumer:
Step 1: The consumer pulls the message M successfully from the Kafka's topic.
Step 2: The consumer tries to execute the job and the job returns successfully.
Step 3: The consumer commits the message's offset to the Kafka brokers.
The above steps are just a happy path. There are many issues arises in reality.
Scenario 1: The job on step 2 executes successfully but then the consumer is crashed. Since this unexpected circumstance, the consumer has not committed the message's offset yet. When the consumer restarts, the message will be consumed twice.
Scenario 2: While the consumer commits the offset at step 3, it crashes due to hardware failures (e.g: CPU, memory violation, ...) When restarting, the consumer no way to know it has committed the offset successfully or not.
Because there are many problems might be happened, the job's execution and the committing offset must be atomic to guarantee exactly-once delivery semantic at the consumer side. It doesn't mean we cannot but it takes a lot of effort to make sure the exactly-once delivery semantic. Kafka Stream upholds the work for engineers.
Noted that: Kafka Stream offers "exactly-once stream processing". It refers to consuming from a topic, materializing intermediate state in a Kafka topic and producing to one. If our application depends on some other external services (database, services...), we must make sure our external dependencies can guarantee exactly-once in those cases.
TL,DR: exactly-once for the full flow needs the cooperation between producers and consumers.
References:
Exactly-once semantics and how Apache Kafka does it
Transactions in Apache Kafka
Enabling exactly once Kafka streams