Is it necessary to use transactions explicitly in Kafka Streams to get "effectively once" behaviour?

A Confluent article states:
Stream processing applications written in the Kafka Streams library can turn on exactly-once semantics by simply making a single config change, to set the config named “processing.guarantee” to “exactly_once” (default value is “at_least_once”), with no code change required.
But as transactions are said to be used, I would like to know: Are transactions used implicitly by Kafka Streams, or do I have to use them explicitly?
In other words, do I have to call something like .beginTransaction() and .commitTransaction(), or is all of this really taken care of under the hood, so that all that remains for me to do is fine-tune commit.interval.ms and cache.max.bytes.buffering?

Kafka Streams uses the transactions API implicitly to achieve exactly-once semantics, so you neither have to call any transaction methods yourself nor set any other configuration.
If you continue reading the blog it says:
"More specifically, when processing.guarantee is configured to exactly_once, Kafka Streams sets the internal embedded producer client with a transaction id to enable the idempotence and transactional messaging features, and also sets its consumer client with the read-committed mode to only fetch messages from committed transactions from the upstream producers."
More details can be found in KIP-129: Streams Exactly-Once Semantics
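To make the "single config change" concrete, here is a minimal sketch of a Streams application with exactly-once enabled. The application id, broker address and topic names are placeholders chosen for illustration; note that no beginTransaction()/commitTransaction() calls appear anywhere in the application code:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class EosStreamsApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-eos-app");          // placeholder id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            // The single config change: exactly-once processing
            // (newer broker/client versions also offer "exactly_once_v2").
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic");                      // placeholder topics

            // The embedded producer and consumer handle transactions under the hood.
            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }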

Related

Is message deduplication essential on the Kafka consumer side?

Kafka documentation states the following as the top scenario:
To process payments and financial transactions in real-time, such as
in stock exchanges, banks, and insurances
Also, regarding the main concepts, right at the very top:
Kafka provides various guarantees such as the ability to process
events exactly-once.
It’s funny the document says:
Many systems claim to provide "exactly once" delivery semantics, but
it is important to read the fine print, most of these claims are
misleading…
It seems obvious that payments/financial transactions must be processed "exactly-once", but the rest of the Kafka documentation doesn't make it obvious how this should be accomplished.
Let’s focus on the producer/publisher side:
If a producer attempts to publish a message and experiences a network
error it cannot be sure if this error happened before or after the
message was committed. This is similar to the semantics of inserting
into a database table with an autogenerated key. … Since 0.11.0.0, the
Kafka producer also supports an idempotent delivery option which
guarantees that resending will not result in duplicate entries in the
log.
The KafkaProducer only ensures that it does not itself incorrectly resubmit messages (resulting in duplicates). Kafka cannot cover the case where the client application crashes (along with the KafkaProducer) without knowing whether it had already invoked send (or commitTransaction, in the case of a transactional producer), which means that an application-level retry will result in duplicate processing.
Exactly-once delivery for other destination systems generally
requires cooperation with such systems, but Kafka provides the offset
which makes implementing this feasible (see also Kafka Connect).
The above statement is only partially correct: while Kafka exposes offsets on the consumer side, it does not make exactly-once feasible at all on the producer side.
The Kafka consume-process-produce loop enables exactly-once processing by leveraging sendOffsetsToTransaction, but again it cannot cover the possibility of duplicates from the first producer in the chain.
The official demo for EOS (exactly-once semantics) also only covers the consume-process-produce case (a sketch of that pattern is shown below).
Solutions involving DB transaction-log readers, which read already-committed transactions, also cannot be sure whether they will produce duplicate messages if they crash.
There is no support for a distributed transaction (XA) involving a database and the Kafka producer.
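For reference, a minimal sketch of the consume-process-produce pattern referred to above, using plain clients. The group id, transactional.id, topic names and the toUpperCase "processing" step are placeholders, and error handling/fencing is omitted. It illustrates sendOffsetsToTransaction; as noted above, it still does not protect against duplicates from the very first producer that feeds the input topic:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ConsumeProcessProduce {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            c.put(ConsumerConfig.GROUP_ID_CONFIG, "etl-group");                    // placeholder
            c.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");              // offsets go into the transaction
            c.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
            c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "etl-tx-1");             // stable and unique per instance
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                consumer.subscribe(List.of("input-topic"));                        // placeholder topics
                producer.initTransactions();

                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    if (records.isEmpty()) continue;

                    producer.beginTransaction();
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        // "process" and produce as part of the transaction
                        producer.send(new ProducerRecord<>("output-topic", r.key(), r.value().toUpperCase()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                    new OffsetAndMetadata(r.offset() + 1));
                    }
                    // consumed offsets are committed atomically with the produced records
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                    // production code must also abort the transaction and handle fencing on errors
                }
            }
        }
    }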
Does all of this mean that, in order to ensure exactly-once processing for payments and financial transactions (Kafka's top use case!), we absolutely must perform business-level message deduplication on the consumer side, despite the Kafka transport-level "guarantees"/claims?
Note: I’m aware of:
Kafka Idempotent producer
but I would like a clear answer as to whether deduplication is inevitable on the consumer side.
You must deduplicate on the consumer side, since a rebalance can cause events to be processed more than once within a consumer group, depending on the fetch size and commit interval parameters.
If a consumer exits without acknowledging back to the broker, Kafka assigns those events to another consumer in the group. For example, if you pull a batch of 5 events and the consumer dies or restarts after processing the first 3 (because an external API/DB call fails, or in the worst case the server runs out of memory and crashes), it exits abruptly without committing/acknowledging back to the broker. The same batch therefore gets assigned to another consumer in the group (rebalance), which is supplied the same event batch again, so the same set of records is re-processed, resulting in duplication. A good read here: https://quarkus.io/blog/kafka-commit-strategies/
You can make use of Kafka's internal state store for deduplication. There is no offset/partition tracking involved; it is a kind of cache (persistent and time-bound, on the cluster).
In our case we push the correlationId (a unique business identifier in the incoming event) into it on successful processing of an event, and every new event is checked against it before processing to make sure it is not a duplicate. Just an FYI: enabling a state store will create additional internal topics in the Kafka cluster.
https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html#state-stores
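A rough sketch of the state-store deduplication described above, assuming the correlationId is the record key. The store, topic and class names are made up, and a production version would also expire old entries (e.g. with a windowed store or a punctuator):

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KeyValue;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.Produced;
    import org.apache.kafka.streams.kstream.Transformer;
    import org.apache.kafka.streams.processor.ProcessorContext;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.Stores;

    public class DedupExample {

        // Drops events whose correlationId (used as the record key) has already been seen.
        static class DedupTransformer implements Transformer<String, String, KeyValue<String, String>> {
            private KeyValueStore<String, Long> seen;

            @Override
            @SuppressWarnings("unchecked")
            public void init(ProcessorContext context) {
                seen = (KeyValueStore<String, Long>) context.getStateStore("dedup-store");
            }

            @Override
            public KeyValue<String, String> transform(String correlationId, String value) {
                if (seen.get(correlationId) != null) {
                    return null;                                   // duplicate: drop it
                }
                seen.put(correlationId, System.currentTimeMillis());
                return KeyValue.pair(correlationId, value);
            }

            @Override
            public void close() { }
        }

        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();
            // Persistent store, backed by an internal changelog topic on the cluster.
            builder.addStateStore(Stores.keyValueStoreBuilder(
                    Stores.persistentKeyValueStore("dedup-store"), Serdes.String(), Serdes.Long()));
            KStream<String, String> input =
                    builder.stream("payments", Consumed.with(Serdes.String(), Serdes.String()));
            input.transform(DedupTransformer::new, "dedup-store")
                 .to("payments-deduped", Produced.with(Serdes.String(), Serdes.String()));
            return builder.build();
        }
    }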

Is it possible to configure/code a Kafka consumer application for "Exactly Once" failure recovery w/o calling Producer methods?

Is it possible to configure/code a Kafka consumer application to unilaterally implement "Exactly Once Semantics" to handle failure recovery (i.e., resume where left off after a comm failure, etc) independent of producer code (calling KafkaProducer methods, etc)?
After some googling, it appears all the "Exactly Once Semantics" (EOS) demos I've found (at least so far) involve calling methods on both producer and consumer instances within the same application to accomplish this.
Here's an example: https://www.baeldung.com/kafka-exactly-once
Can an independent consumer/client application be configured for EOS failure recovery/resume - independent of producer code (i.e., calling KafkaProducer methods, etc)?
If so, can you point me to an example?
No, an independent consumer can not be configured to consume messages from Kafka exactly-once.
You can either have it as "at-most-once" or "at-least-once". Making it exactly-once highly depends on what the consumer is doing with the data and how and when you commit the messages back to Kafka.
You would have to implement this on your own. As an example, you could have a look at the implementation of Spark Structured Streaming (see also the spark-sql-kafka library), which makes use of write-ahead logs in order to ensure exactly-once semantics.
Although the other answer is correct, I would briefly state this in a slightly different fashion:
the target/sink needs to be idempotent (a KV store, or an UPSERT into something like Kudu),
and the source needs to be replayable.
Quoting from this blog, which explains it well imho: https://www.waitingforcode.com/apache-spark-structured-streaming/fault-tolerance-apache-spark-structured-streaming/read:
"...
Indeed, neither the replayable source nor commit log don't guarantee
exactly-once processing itself. What if the batch commit fails ? As
told previously, the engine will detect the last committed offsets as
offsets to reprocess and output once again the processed data to the
sink. It'll obviously lead to a duplicated output. But it'd be the
case only when the writes and the sink aren't idempotent.
An idempotent write is the one that generates the same written data
for given input. The idempotent sink is the one that writes given
generated row only once, even if it's sent multiple times. A good
example of such sink are key-value data stores. Now, if the writer is
idempotent, obviously it generates the same keys every time and since
the row identification is key-based, the whole process is idempotent.
Together with replayable source it guarantees exactly-once end-2-end
processing.
..."
As a native English speaker I'm not 100% sure the "don't" is correct, but I think we get the drift.
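To make the quoted idea concrete, here is a toy sketch of an idempotent, key-based sink, with a plain map standing in for a KV store or an UPSERT into something like Kudu. The broker address, group id and topic name are made up:

    import java.time.Duration;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class IdempotentSinkExample {
        // Stand-in for a key-value sink: writing the same key/value twice
        // leaves the sink in the same state.
        static final Map<String, String> sink = new ConcurrentHashMap<>();

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "idempotent-sink");          // placeholder
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events"));                             // placeholder topic
                while (true) {
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                        // Key-based upsert: a replay after a crash overwrites the same entry
                        // instead of creating a duplicate, so at-least-once delivery plus an
                        // idempotent sink behaves like exactly-once end to end.
                        sink.put(r.key(), r.value());
                    }
                }
            }
        }
    }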

Use transactional API and exactly-once with regular Producers and Consumers

The Confluent documents I was able to find all focus on Kafka Streams applications when it comes to exactly-once/transactions/idempotence.
However, the transaction APIs were introduced at the level of the "regular" Producer/Consumer, and all the explanations and diagrams focus on them.
I was wondering whether it's OK to use those APIs directly, without Kafka Streams.
I do understand the consequences of Kafka's processing boundaries and guarantees, and I'm OK with violating them. I don't need a 100% exactly-once guarantee; it's OK to have a duplicate once in a while, for example when I read from/write to external systems.
The problem I'm facing is that I need to create an ETL pipeline for a Big Data project, and we are getting a lot of duplicates when the apps are restarted/relocated to different hosts automatically by Kubernetes.
In general it's not a problem to have some duplicates, as it's a pipeline for analytics where duplicates are acceptable, but if the issue can be mitigated at least on the Kafka side, that would be great. Will using the transactional API guarantee exactly-once for Kafka at least (to make sure that re-processing doesn't happen when reassignments/shut-downs/scaling activities occur)?
Switching to Kafka Streams is not an option because we are quite late in the project.
Exactly-once semantics is achievable with regular producers and consumers as well; Kafka Streams is itself built on top of these clients.
We can use an idempotent producer to achieve this.
When dealing with external systems, it is important to ensure that we don't produce the same message again and again using producer.send(). Idempotence applies to internal retries by Kafka clients but doesn't take care of duplicate calls to send().
When we produce messages that arrive from a source, we need to ensure that the source doesn't produce a duplicate message. For example, if it is a database, use a WAL, maintain the last read offset for that WAL, and restart from that point. Debezium, for example, does that; you may check whether it supports your data source.
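As a small illustration of the idempotent-producer point (topic, key and value are placeholders): the broker de-duplicates the client's own internal retries, but nothing here protects against the application calling send() again for the same logical message after a restart, which is exactly the gap described above.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class IdempotentProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);   // broker de-duplicates internal client retries
            props.put(ProducerConfig.ACKS_CONFIG, "all");                // required for idempotence
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Protected: the client transparently retrying this send after a transient error.
                // NOT protected: the application itself calling send() again after its own restart.
                producer.send(new ProducerRecord<>("events", "order-42", "payload"));   // placeholder topic/key/value
            }
        }
    }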

Kafka exactly-once producer consumer

I am implementing exactly-once semantics for a simple data pipeline with Kafka as the message broker. I can force the Kafka producer to write each produced record exactly once by setting enable.idempotence=true.
However, on the consumption front I need to guarantee that the consumer reads each record exactly once (I am not interested in storing the consumed records to an external system or to another Kafka topic, just in processing them). To achieve this, I have to ensure that polled records are processed and their offsets are committed to the __consumer_offsets topic atomically/transactionally (both succeed or fail together).
In such a case, do I need to resort to the Kafka transaction APIs and create a transactional producer in the consumer polling loop, where inside the transaction I (1) process the consumed records and (2) commit their offsets before closing the transaction? Or would the normal commitSync/commitAsync serve in such a case?
"on the consumption front I need to guarantee that the consumer reads each record exactly once"
The answer from Gopinath explains well how you can achieve exactly-once between a KafkaProducer and a KafkaConsumer. These configurations (together with the use of the Transaction API in the KafkaProducer) guarantee that all data sent by the producer is stored in Kafka exactly once. However, they do not guarantee that the Consumer reads the data exactly once. That, of course, depends on your offset management.
Anyway, I understand your question as asking how the Consumer itself can process a consumed message exactly once.
For this you need to manage your offsets on your own, in an atomic way. That means you need to build your own "transaction" around
fetching data from Kafka,
processing data, and
storing processed offsets externally.
The methods commitSync and commitAsync will not get you far here, as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial if your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
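A condensed sketch of that approach: offsets are stored outside Kafka, atomically together with the processing result, and a ConsumerRebalanceListener seeks back to them on assignment. The OffsetStore interface, topic name and the toUpperCase "processing" step are hypothetical placeholders:

    import java.time.Duration;
    import java.util.Collection;
    import java.util.List;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ExternalOffsetConsumer {
        // Hypothetical store that persists offsets atomically with the processed results
        // (e.g. a local file, or a database row updated in the same DB transaction).
        interface OffsetStore {
            long load(TopicPartition tp);                              // -1 if unknown
            void saveWithResult(TopicPartition tp, long offset, String result);
        }

        static void run(KafkaConsumer<String, String> consumer, OffsetStore store) {
            consumer.subscribe(List.of("events"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // nothing to do: offsets are persisted together with each processed record
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // resume exactly where our own store says we left off, ignoring __consumer_offsets
                    for (TopicPartition tp : partitions) {
                        long offset = store.load(tp);
                        if (offset >= 0) consumer.seek(tp, offset + 1);
                    }
                }
            });

            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                    String result = r.value().toUpperCase();           // stand-in for real processing
                    // The "transaction": result and offset are stored in one atomic step.
                    store.saveWithResult(new TopicPartition(r.topic(), r.partition()), r.offset(), result);
                }
            }
        }
    }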
"do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop"
The Transaction API is only available for the KafkaProducer. It can commit consumer offsets via sendOffsetsToTransaction, but as far as I am aware that only helps when your processing results are written back to Kafka within the same transaction, so it cannot replace your own offset management here.
'Exactly-once' functionality in Kafka can be achieved by a combination of these 3 settings:
isolation.level = read_committed (consumer)
transactional.id = <unique_id> (producer)
processing.guarantee = exactly_once (Kafka Streams)
More information on enabling the exactly-once functionality:
https://www.confluent.io/blog/simplified-robust-exactly-one-semantics-in-kafka-2-5/
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

What is Kafka message tweaking?

From https://data-flair.training/blogs/advantages-and-disadvantages-of-kafka/:
As we know, the broker uses certain system calls to deliver messages to the consumer. However, Kafka’s performance reduces significantly if the message needs some tweaking. So, it can perform quite well if the message is unchanged because it uses the capabilities of the system.
How can a message be tweaked? If I want to demonstrate that a message can be tweaked, what do I have to do?
The suspected concern with Kafka performance is mentioned in this statement:
Suspicion:
Kafka’s performance reduces significantly if the message needs some
tweaking. So, it can perform quite well if the message is unchanged
Clarification:
As a user of Kafka for several years, I have found Kafka's guarantee
that a message cannot be altered while it is inside a topic to be one
of its best features.
Kafka does not allow consumers to directly alter in-flight messages in the queue (topic).
Message content can only be altered either before it is published to a topic or after it is consumed from one.
The business requirement that needs messages to be modified can be implemented using multiple topics, in this pattern:
(message) --> Topic 1 --> (consume & modify message) --> Topic 2
Using multiple topics for implementing the message modification functionality will only increase the storage requirement. It will not have any impact on performance.
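If it helps to visualise that pattern, here is a minimal sketch of the consume-modify-produce step between the two topics. The broker address, group id, topic names and the trim/toLowerCase "modification" are placeholders, and offsets are auto-committed here, so this sketch is at-least-once:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ModifyAndRepublish {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            c.put(ConsumerConfig.GROUP_ID_CONFIG, "modifier");                     // placeholder
            c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                consumer.subscribe(List.of("topic-1"));                            // original, untouched messages
                while (true) {
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
                        String modified = r.value().trim().toLowerCase();          // the "tweak"
                        producer.send(new ProducerRecord<>("topic-2", r.key(), modified));
                    }
                }
            }
        }
    }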
Kafka's design of not allowing in-flight modification of messages provides 'Data integrity'. This is one of the driving factors behind the widespread use of Kafka in financial processing applications.
It's not clear what "tweaking" means here. The term is not used in the official Kafka documentation, and messages are immutable once sent by a producer.