Kafka excatly-once producer consumer - apache-kafka

I am implementing Exactly-once semantics for a simple data pipeline, with Kafka as message broker. I can force Kafka producer to write each produced record exactly once by setting set enable.idempotence=true.
However, on the consumption front I need to guarantee that the consumer reads each record exactly once (I am not interested in storing the consumed records to external system or to another Kafka topic just processing). To achieve this, I have to ensure that polled records are processed and their offsets are committed to __consumer_offsets topic atomically/transactionally (both succeed/fail together).
In such case do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop, where inside the transaction I perform: (1) processing of the consumed records and (2) committing their offsets, before closing the transaction. Would the normal commitSync/commitAsync serve in such case?

"on the consumption front I need to guarantee that the consumer reads each record exactly once"
The answer from Gopinath explains well how you can achieve exactly-once between a KafkaProducer and KafkaConsumer. These configurations (together with the application of Transaction API in the KafkaProducer) guarantees that all data send by the producer will be stored in Kafka exactly once. However, it does not guarantee that the Consumer is reading the data exactly once. This, of course, depends on your offset management.
Anyway, I understand your question that you want to know how the Consumer itself is processing a consumed message exactly once.
For this you need to manage your offsets on your own in a atomic way. That means, you need build your own "transaction" around
fetching data from Kafka,
processing data, and
storing processed offsets externally.
The methods commitSync and commitAsync will not get you far here as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial that your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
"do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop"
The Transaction API is only available for KafkaProducers and as far as I am aware cannot be used for your offset management.

'Exactly-once' functionality in Kafka can be achieved by a combination of these 3 settings:
isolation.level = read_committed
transactional.id = <unique_id>
processing.guarantee = exactly_once
More information on enabling the exactly-once functionality:
https://www.confluent.io/blog/simplified-robust-exactly-one-semantics-in-kafka-2-5/
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

Related

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to kafka.I have a Kafka Stream using java microservice that consumes the messages from kafka topic produced by producer and processes. The kafka commit interval has been set using the auto.commit.interval.ms . My question is, before commit if the microservice crashes , what will happen to the messages that got processed but didn't get committed? will there be duplicated records? and how to resolve this duplication, if happens?
Kafka has exactly-once-semantics which guarantees the records will get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics. These delivery semantics can be decided on the basis of your use-case you've implemented.
If you're concerned that your messages should not get lost by consumer service - you should go ahead with at-lease once delivery semantic.
Now answering your question on the basis of at-least once delivery semantics:
If your consumer service crashes before committing the Kafka message, it will re-stream the message once your consumer service is up and running. This is because the offset for a partition was not committed. Once the message is processed by the consumer, committing an offset for a partition happens. In simple words, it says that the offset has been processed and Kafka will not send the committed message for the same partition.
at-least once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example - with a unique key in each message, a message can be rejected when writing duplicate data to the database.
There are mainly three types of delivery semantics,
At most once-
Offsets are committed as soon as the message is received at consumer.
It's a bit risky as if the processing goes wrong the message will be lost.
At least once-
Offsets are committed after the messages processed so it's usually the preferred one.
If the processing goes wrong the message will be read again as its not been committed.
The problem with this is duplicate processing of message so make sure your processing is idempotent. (Yes your application should handle duplicates, Kafka won't help here)
Means in case of processing again will not impact your system.
Exactly once-
Can be achieved for kafka to kafka communication using kafka streams API.
Its not your case.
You can choose semantics from above as per your requirement.

Can we apply Kafka exactly-once semantics in read-process scenario?

How can we make sure Kafka exactly-once semantics in read-process scenario. read means we are reading from Kafka topic and doing some processing and then we are trying to commit the offset.
Lets suppose, we processed the messages but could not able to commit and before commit the process crashed. after restart, again trying to consume the same message. so how to handle such scenarios? Can this be handled with Kafka Transaction APIs?
There is similar question but not able to understand it properly and left few comments there as well. Just wanted to confirm my understanding.
Confused about Kafka exactly-once semantics
Kafka Transaction offers EOS for consume-process-produce scenarios. This exactly once process works by committing the offsets by producers instead of consumer. i.e., the produce of result to kafka and committing the consumed messages all are done by kafka producer (instead of separate kafka consumer and producer) which brings the exactly once. The EOS in kafka transaction ensures that for each consumed message we have exactly one result (the result may contain multiple messages) on the kafka, but the message could be processed multiple times in failure scenarios.
So you cannot achieve exactly once in read-process. The only solution you can use is to make your messages idempotence and change your business logic somehow that duplicate messages do not have side effect. e.g.:
-Using deduplicate process if you use database and check the duplicate value before insert or process and drop the incoming message.
-In some scenarios that duplicates affect you database, we can commit the offsets to database and by that make the data insertions and offset commits in one transaction.

Is message deduplication essential on the Kafka consumer side?

Kafka documentation states the following as the top scenario:
To process payments and financial transactions in real-time, such as
in stock exchanges, banks, and insurances
Also, regarding the main concepts, right at the very top:
Kafka provides various guarantees such as the ability to process
events exactly-once.
It’s funny the document says:
Many systems claim to provide "exactly once" delivery semantics, but
it is important to read the fine print, most of these claims are
misleading…
It seems obvious that payments/financial transactions must be processed „exactly-once“, but the rest of Kafka documentation doesn't make it obvious how this should be accomplished.
Let’s focus on the producer/publisher side:
If a producer attempts to publish a message and experiences a network
error it cannot be sure if this error happened before or after the
message was committed. This is similar to the semantics of inserting
into a database table with an autogenerated key. … Since 0.11.0.0, the
Kafka producer also supports an idempotent delivery option which
guarantees that resending will not result in duplicate entries in the
log.
KafkaProducer only ensures that it doesn’t incorrectly resubmit messages (resulting in duplicates) itself. Kafka cannot cover the case where client app code crashes (along with KafkaProducer) and it is not sure if it previously invoked send (or commitTransaction in case of transactional producer) which means that application-level retry will result in duplicate processing.
Exactly-once delivery for other destination systems generally
requires cooperation with such systems, but Kafka provides the offset
which makes implementing this feasible (see also Kafka Connect).
The above statement is only partially correct, meaning that while it exposes offsets on the Consumer side, it doesn’t make exactly-once feasible at all on the producer side.
Kafka consume-process-produce loop enables exactly-once processing leveraging sendOffsetsToTransaction, but again cannot cover the case of the possibility of duplicates on the first producer in the chain.
The provided official demo for EOS (Exactly once semantics) only provides an example for consume-process-produce EOS.
Solutions involving DB transaction log readers which read already committed transactions, also cannot be sure if they will produce duplicate messages in case they crash.
There is no support for a distributed transaction (XA) involving a database and the Kafka producer.
Does all of this mean that in order to ensure exactly once processing for payments and financial transactions (Kafka top use case!), we absolutely must perform business-level message deduplication on the consumer side, inspite of the Kafka transport-level “guarantees”/claims?
Note: I’m aware of:
Kafka Idempotent producer
but I would like a clear answer if deduplication is inevitable on the consumer side.
You must deduplicate on consumer side since rebalance on consumer side can really cause processing of events more than once in a consumer group based on fetch size and commit interval parameters.
If a consumer exits without acknowledging back to broker, Kafka will assign those events to another consumer in the group. Example if you are pulling a batch size of 5 events, if consumer dies or goes for a restart after processing first 3(If the external api/db fails OR the worse case your server runs out of memory and crashes), the current consumer dies abruptly without making a commit back/ack to broker. Hence the same batch gets assigned to another consumer from group(rebalance) where it starts supplies the same event batch again which will result in re-processing of same set of records resulting in duplication. A good read here : https://quarkus.io/blog/kafka-commit-strategies/
You can make use of internal state store of Kafka for deduplication. Here there is no offset/partition tracking, its kind of cache(persistent time bound on cluster).
In my case we push correlationId(a unique business identifier in incoming event) into it on successful processing of events, and all new events are checked against this before processing to make sure its not a duplicate event. Enabling state store will create more internal topics in Kafka cluster, just an FYI.
https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html#state-stores

Difference between idempotence and exactly-once in Kafka Stream

I was going through document what I understood we can achieve exactly-once transaction with enabling idempotence=true
idempotence: The Idempotent producer enables exactly once for a
producer against a single topic. Basically each single message send
has stonger guarantees and will not be duplicated in case there's an
error
So if already we have idempotence then why we need another property exactly-once in Kafka Stream? What exactly different between idempotence vs exactly-once
Why exactly-once property not available in normal Kafka Producer?
In a distributed environment failure is a very common scenario that can be happened any time. In the Kafka environment, the broker can crash, network failure, failure in processing, failure while publishing message or failure to consume messages, etc.
These different scenarios introduced different kinds of data loss and duplication.
Failure scenarios
A(Ack Failed): Producer published message successfully with retry>1 but could not receive acknowledge due to failure. In that case, the Producer will retry the same message that might introduce duplicate.
B(Producer process failed in batch messages): Producer sending a batch of messages it failed with few published success. In that case and once the producer will restart it will again republish all messages from the batch which will introduce duplicate in Kafka.
C(Fire & Forget Failed) Producer published message with retry=0(fire and forget). In case of failure published will not aware and send the next message this will cause the message lost.
D(Consumer failed in batch message) A consumer receives a batch of messages from Kafka and manually commit their offset (enable.auto.commit=false). If consumers failed before committing to Kafka, next time Consumers will consume the same records again which reproduce duplicate on the consumer side.
Exactly-Once semantics
In this case, even if a producer tries to resend a message, it leads
to the message will be published and consumed by consumers exactly once.
To achieve Exactly-Once semantic in Kafka, it uses below 3 property
enable.idempotence=true (address a, b & c)
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5(Producer will always have one in-flight request per connection)
isolation.level=read_committed (address d )
Enable Idempotent(enable.idempotence=true)
Idempotent delivery enables the producer to write a message to Kafka exactly
once to a particular partition of a topic during the lifetime of a
single producer without data loss and order per partition.
"Note that enabling idempotence requires MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION to be less than or equal to 5, RETRIES_CONFIG to be greater than 0 and ACKS_CONFIG be 'all'. If these values are not explicitly set by the user, suitable values will be chosen. If incompatible values are set, a ConfigException will be thrown"
To achieve idempotence Kafka uses a unique id which is called product id or PID and sequence number while producing messages. The producer keeps incrementing the sequence number on each message published which map with unique PID. The broker always compare the current sequence number with the previous one and it rejects if the new one is not +1 greater than the previous one which avoids duplication and same time if more than greater show lost in messages
In a failure scenario broker will compare the sequence numbers with the previous one and if the sequence not increased +1 will reject the message.
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
The producer doesn't wait to write a message to Kafka whereas the Producer uses beginTransaction, commitTransaction, and abortTransaction(in case of failure)
Consumer uses isolation.level either read_committed or read_uncommitted
read_committed: Consumers will always read committed data only.
read_uncommitted: Read all messages in offset order without waiting
for transactions to be committed
If a consumer with isolation.level=read_committed reaches a control message for a transaction that has not completed, it will not deliver any more messages from this partition until the producer commits or aborts the transaction or a transaction timeout occurs. The transaction timeout is determined by the producer using the configuration transaction.timeout.ms(default 1 minute).
Exactly-Once in Producer & Consumer
In normal conditions where we have separate producers and consumers. The producer has to idempotent and same time manage transactions so consumers can use isolation.level to read-only read_committed to make the whole process as an atomic operation.
This makes a guarantee that the producer will always sync with the source system. Even producer crash or a transaction aborted, it always is consistent and publishes a message or batch of the message as a unit once.
The same consumer will either receive a message or batch of the message as a unit once.
In Exactly-Once semantic Producer along with Consumer will appear as
atomic operation which will operate as one unit. Either publish and
get consumed once at all or aborted.
Exactly Once in Kafka Stream
Kafka Stream consumes messages from topic A, process and publish a message to Topic B and once publish use commit(commit mostly run undercover) to flush all state store data to disk.
Exactly-once in Kafka Stream is a read-process-write pattern that guarantees that this operation will be treated as an atomic operation. Since Kafka Stream caters producer, consumer and transaction all together Kafka Stream comes special parameter processing.guarantee which could exactly_once or at_least_once which make life easy not to handle all parameters separately.
Kafka Streams atomically updates consumer offsets, local state stores,
state store changelog topics, and production to output topics all
together. If anyone of these steps fails, all of the changes are
rolled back.
processing.guarantee: exactly_once automatically provide below parameters you no need to set explicitly
isolation.level=read_committed
enable.idempotence=true
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5
Kafka stream offers the exactly-once semantic from the end-to-end point of view (consumes from one topic, processes that message, then produces to another topic). However, you mentioned only the producer's idempotent attribute. That is only a small part of the full picture.
Let me rephrase the question:
Why do we need the exactly-once delivery semantic at the consumer side
while we already have guaranteed the exactly-once delivery semantic at the
producer side?
Answer: Since the exactly-once delivery semantic is not only at the producing step but the full flow of processing. To achieve the exactly-once delivery semantically, there are some conditions must be satisfied with the producing and consuming.
This is the generic scenario: Process A produces messages to the topic T. At the same time, process B tries to consume messages from the topic T. We want to make sure process B never processes one message twice.
Producer part: We must make sure that producers never produce a message twice. We can use Kafka Idempotent Producer
Consumer part:
Here is the basic workflow for the consumer:
Step 1: The consumer pulls the message M successfully from the Kafka's topic.
Step 2: The consumer tries to execute the job and the job returns successfully.
Step 3: The consumer commits the message's offset to the Kafka brokers.
The above steps are just a happy path. There are many issues arises in reality.
Scenario 1: The job on step 2 executes successfully but then the consumer is crashed. Since this unexpected circumstance, the consumer has not committed the message's offset yet. When the consumer restarts, the message will be consumed twice.
Scenario 2: While the consumer commits the offset at step 3, it crashes due to hardware failures (e.g: CPU, memory violation, ...) When restarting, the consumer no way to know it has committed the offset successfully or not.
Because there are many problems might be happened, the job's execution and the committing offset must be atomic to guarantee exactly-once delivery semantic at the consumer side. It doesn't mean we cannot but it takes a lot of effort to make sure the exactly-once delivery semantic. Kafka Stream upholds the work for engineers.
Noted that: Kafka Stream offers "exactly-once stream processing". It refers to consuming from a topic, materializing intermediate state in a Kafka topic and producing to one. If our application depends on some other external services (database, services...), we must make sure our external dependencies can guarantee exactly-once in those cases.
TL,DR: exactly-once for the full flow needs the cooperation between producers and consumers.
References:
Exactly-once semantics and how Apache Kafka does it
Transactions in Apache Kafka
Enabling exactly once Kafka streams

What is the difference between pulsar and kafka in regards to consumption?

In order to consume data from Kafka, we can have multiple consumers on a topic, totally decoupled. Then, what is meant by no shared consumption on the page(https://streaml.io/blog/pulsar-streaming-queuing) which shares differences between kafka and pulsar?
In his blog, Sijie is referring to shared messaging as queuing. With queuing messaging, multiple consumers are created to receive messages from a single topic. Which consumer gets the message is completely random.
The issue with implementing the messaging pattern with Kafka lies in way that Kafka consumers mark that they’ve consumed a message. Kafka consumers use what’s called a high watermark for consumer offsets. That means that a consumer can only say, “I’ve processed up to this point” rather than, “I’ve processed this message.”
Consider the scenario in which multiple Kafka consumers from the same consumer group were processing from the same topic partition and one of the consumers fails due to an exception while the other succeed. Because Kafka does not a have a built-in way to only acknowledge a single message, and only uses a high-water mark, the failed message would be erronously marked as consumed when in fact it failed and needs to be either reprocessed or published to an error queue, etc.
In order to avoid this situation, you would need to have just a single consumer per partition which limits the comsumption throughput of the topic. Which in turn requires you to increase the number of partitions in order to meet your throughput needs.
There is a detailed explanation in this blog post