Adding to a Kafka topic exactly once

Since 0.11, Kafka Streams offers exactly-once guarantees, but their definition of "end" in end-to-end seems to be "a Kafka topic".
For real-time applications, the first "end" however is generally not a Kafka topic, but some kind of application that outputs data - perhaps going through multiple tiers and networks - to a Kafka topic.
So does Kafka offer something to add to a topic exactly-once, in the face of network failures and application crashes and restarts? Or do I have to use Kafka's at-least-once semantics and deduplicate that topic with potential duplicates into another exactly-once topic, by means of some unique identifier?
Edit: Due to popular demand, here's a specific use case. I have a client C that creates messages and sends them to a server S, which uses a KafkaProducer to add those messages to a Kafka topic T.
How can I guarantee, in the face of
crashes of C, S, and members of the Kafka cluster
temporary network problems
that all messages that C creates end up in T, exactly once (and - per partition - in the correct order)?
I would of course make C resend all messages for which it did not get an ack from S -> at-least-once. But to make it exactly once, the messages that C sends would need to contain some kind of ID so that deduplication can be performed. I don't know how to do that with Kafka.

Kafka's exactly-once feature, in particular the "idempotent producer", can help you with server crashes and network issues.
You can enable idempotency via the producer config enable.idempotence=true, which you pass in like any other config. This ensures that every message is written exactly once and in the correct order even if a broker crashes or there are transient network issues.
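For illustration, here is a minimal sketch of enabling the idempotent producer on server S. The broker address and String serialization are assumptions, not from the question; the topic name T is taken from the question.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Enable the idempotent producer: the broker deduplicates the producer's internal retries
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required by idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("T", "key", "message from C"));
            producer.flush();
        }
    }
}

Note that this only protects against the producer's own retries; it does not cover the hop from C to S, which is exactly the gap described below.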
Kafka's exactly-once feature does not help if the producer application itself crashes. For that case, you would need to write manual code to figure out which messages were successfully appended to the topic before the crash (by using a consumer) and resume sending where you left off. Alternatively, you can still deduplicate on the consumer side, as you mentioned.

You might want to have a look at Kafka's log compaction feature. It will deduplicate messages for you, provided you have a unique key for all the duplicate messages.
https://kafka.apache.org/documentation/#compaction
Update:
Log compaction is not very reliable out of the box, but you can change some settings so it works as expected.
A more efficient way is to use Kafka Streams; you can achieve this using KTables.
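As a rough sketch of the KTable idea: reading the duplicated topic as a KTable keeps only the latest record per key, which acts as deduplication when the duplicates share the same unique key. The topic names, application id, and broker address below are placeholders.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class DedupByKeyExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-by-key");      // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Reading the topic as a KTable keeps only the latest value per key,
        // so re-sent messages with the same key collapse into a single record.
        KTable<String, String> deduped = builder.table("input-with-duplicates"); // hypothetical topic
        deduped.toStream().to("deduplicated-output");                            // hypothetical topic

        new KafkaStreams(builder.build(), props).start();
    }
}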

Related

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to Kafka. I have a Java microservice using Kafka Streams that consumes messages from a Kafka topic produced by a producer and processes them. The commit interval has been set using auto.commit.interval.ms. My question is: if the microservice crashes before a commit, what will happen to the messages that got processed but weren't committed? Will there be duplicate records? And how do I resolve this duplication, if it happens?
Kafka has exactly-once semantics, which guarantees that records will get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics, and which one to choose depends on the use case you've implemented.
If you're concerned that your messages should not get lost by the consumer service, you should go ahead with the at-least-once delivery semantic.
Now answering your question on the basis of at-least once delivery semantics:
If your consumer service crashes before committing the Kafka message, the message will be redelivered once your consumer service is up and running again. This is because the offset for the partition was not committed. An offset for a partition is committed once the message has been processed; in simple terms, the commit tells Kafka that the offset has been processed, and Kafka will not resend committed messages for that partition.
At-least-once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example, with a unique key in each message, a duplicate message can be rejected when writing to the database.
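As a small sketch of that unique-key idea, assuming a hypothetical events table with a unique constraint on message_id and PostgreSQL's ON CONFLICT syntax, a replayed message becomes a harmless no-op:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class IdempotentWriter {
    // Relies on a UNIQUE / PRIMARY KEY constraint on message_id, so replays are no-ops.
    // Table and column names are hypothetical; ON CONFLICT is PostgreSQL syntax.
    private static final String UPSERT =
        "INSERT INTO events (message_id, payload) VALUES (?, ?) ON CONFLICT (message_id) DO NOTHING";

    public static void write(Connection conn, String messageId, String payload) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT)) {
            ps.setString(1, messageId);
            ps.setString(2, payload);
            ps.executeUpdate(); // returns 0 for duplicates, 1 for new rows
        }
    }
}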
There are mainly three types of delivery semantics:
At most once-
Offsets are committed as soon as the message is received by the consumer.
This is a bit risky: if the processing goes wrong, the message will be lost.
At least once-
Offsets are committed after the message is processed, so this is usually the preferred one.
If the processing goes wrong, the message will be read again, since it has not been committed.
The problem with this is duplicate processing of messages, so make sure your processing is idempotent (yes, your application has to handle duplicates; Kafka won't help here).
Idempotent means that processing the same message again will not impact your system. A minimal consumer sketch for this semantic follows below the list.
Exactly once-
Can be achieved for Kafka-to-Kafka communication using the Kafka Streams API.
It's not your case.
You can choose semantics from above as per your requirement.
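For the at-least-once case, here is a minimal consumer sketch with manual commits; the broker address, group id, and topic name are placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // must be idempotent: a crash before commitSync() replays the batch
                }
                consumer.commitSync(); // commit only AFTER processing -> at-least-once
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.key() + " -> " + record.value());
    }
}

Committing before the processing loop instead would give at-most-once: no duplicates, but messages can be lost on a crash.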

Is message deduplication essential on the Kafka consumer side?

Kafka documentation states the following as the top scenario:
To process payments and financial transactions in real-time, such as
in stock exchanges, banks, and insurances
Also, regarding the main concepts, right at the very top:
Kafka provides various guarantees such as the ability to process
events exactly-once.
It’s funny the document says:
Many systems claim to provide "exactly once" delivery semantics, but
it is important to read the fine print, most of these claims are
misleading…
It seems obvious that payments/financial transactions must be processed „exactly-once“, but the rest of Kafka documentation doesn't make it obvious how this should be accomplished.
Let’s focus on the producer/publisher side:
If a producer attempts to publish a message and experiences a network
error it cannot be sure if this error happened before or after the
message was committed. This is similar to the semantics of inserting
into a database table with an autogenerated key. … Since 0.11.0.0, the
Kafka producer also supports an idempotent delivery option which
guarantees that resending will not result in duplicate entries in the
log.
KafkaProducer only ensures that it doesn’t incorrectly resubmit messages (resulting in duplicates) itself. Kafka cannot cover the case where client app code crashes (along with KafkaProducer) and it is not sure if it previously invoked send (or commitTransaction in case of transactional producer) which means that application-level retry will result in duplicate processing.
Exactly-once delivery for other destination systems generally
requires cooperation with such systems, but Kafka provides the offset
which makes implementing this feasible (see also Kafka Connect).
The above statement is only partially correct: while Kafka exposes offsets on the consumer side, it doesn't make exactly-once feasible on the producer side at all.
Kafka's consume-process-produce loop enables exactly-once processing by leveraging sendOffsetsToTransaction, but again it cannot cover the possibility of duplicates from the first producer in the chain.
The provided official demo for EOS (Exactly once semantics) only provides an example for consume-process-produce EOS.
Solutions involving DB transaction log readers which read already committed transactions, also cannot be sure if they will produce duplicate messages in case they crash.
There is no support for a distributed transaction (XA) involving a database and the Kafka producer.
Does all of this mean that, in order to ensure exactly-once processing for payments and financial transactions (Kafka's top use case!), we absolutely must perform business-level message deduplication on the consumer side, in spite of Kafka's transport-level "guarantees"/claims?
Note: I’m aware of:
Kafka Idempotent producer
but I would like a clear answer if deduplication is inevitable on the consumer side.
You must deduplicate on the consumer side, since a rebalance can cause events to be processed more than once within a consumer group, depending on the fetch size and commit interval parameters.
If a consumer exits without acknowledging back to the broker, Kafka will assign those events to another consumer in the group. For example, if you are pulling a batch of 5 events and the consumer dies or restarts after processing the first 3 (because an external API/DB fails, or worse, the server runs out of memory and crashes), it dies abruptly without committing back to the broker. The same batch then gets assigned to another consumer in the group (rebalance), which processes the same batch again, resulting in re-processing of the same set of records and hence duplication. A good read here: https://quarkus.io/blog/kafka-commit-strategies/
You can make use of Kafka Streams' internal state store for deduplication. There is no offset/partition tracking here; it is a kind of cache (persistent and time-bound on the cluster).
In my case we push the correlationId (a unique business identifier in the incoming event) into it on successful processing, and all new events are checked against it before processing to make sure they are not duplicates. Enabling a state store will create more internal topics in the Kafka cluster, just an FYI.
https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html#state-stores
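A rough sketch of that state-store deduplication, assuming the record key carries the correlationId and using placeholder topic/store/application names; it uses the older transform()/Transformer API (newer Streams versions offer process() instead):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class CorrelationIdDeduplicator {

    static final String STORE_NAME = "seen-correlation-ids"; // hypothetical store name

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "dedup-app");          // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StoreBuilder<KeyValueStore<String, Long>> storeBuilder =
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(STORE_NAME),
                Serdes.String(), Serdes.Long());

        StreamsBuilder builder = new StreamsBuilder();
        builder.addStateStore(storeBuilder);

        builder.<String, String>stream("incoming-events")                     // hypothetical topic
               .transform(DedupTransformer::new, STORE_NAME)
               .to("deduplicated-events");                                    // hypothetical topic

        new KafkaStreams(builder.build(), props).start();
    }

    static class DedupTransformer implements Transformer<String, String, KeyValue<String, String>> {
        private KeyValueStore<String, Long> seen;
        private ProcessorContext context;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            this.context = context;
            this.seen = (KeyValueStore<String, Long>) context.getStateStore(STORE_NAME);
        }

        @Override
        public KeyValue<String, String> transform(String correlationId, String value) {
            if (seen.get(correlationId) != null) {
                return null;                           // already processed -> duplicate, emit nothing
            }
            seen.put(correlationId, context.timestamp()); // remember the correlationId
            return KeyValue.pair(correlationId, value);
        }

        @Override
        public void close() { }
    }
}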

Kafka Stream delivery semantic for a simple forwarder

I have a stateless Kafka Streams application that consumes from a topic and publishes into a different queue (Cloud Pub/Sub) within a forEach. The topology does not end by producing into a new Kafka topic.
How do I know which delivery semantic I can guarantee? Knowing that it's just a message forwarder and no deserialisation or any other transformation or whatsoever is applied: are there any cases in which I could have duplicates or missed messages?
I'm thinking about the following scenarios and their impact on how offsets are committed:
Sudden application crash
Error occurring on publish
Thanks guys
If you consider the Kafka-to-Kafka loop that a Kafka Streams application usually creates, setting the property
processing.guarantee=exactly_once
is enough to get exactly-once semantics, also in failure scenarios.
Under the hood, Kafka uses a transaction to guarantee that the consume-process-produce-commit-offsets cycle is executed with an all-or-nothing guarantee.
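A minimal configuration sketch for that setting; the application id and broker address are placeholders, and newer Kafka versions use "exactly_once_v2" instead of "exactly_once":

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class EosStreamsConfig {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "forwarder-app");     // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        // Turns the consume-process-produce-commit cycle into an atomic transaction,
        // but only for the Kafka-to-Kafka part of the topology.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once"); // "exactly_once_v2" on newer versions
        return props;
    }
}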
Writing a sink connector with exactly-once semantics from Kafka to Google Pub/Sub would mean solving the same issues Kafka already solves for the Kafka-to-Kafka scenario.
The producer.send() could result in duplicate writes of message B due to internal retries. This is addressed by the idempotent producer and is not the focus of the rest of this post.
We may reprocess the input message A, resulting in duplicate B messages being written to the output, violating the exactly once processing semantics. Reprocessing may happen if the stream processing application crashes after writing B but before marking A as consumed. Thus when it resumes, it will consume A again and write B again, causing a duplicate.
Finally, in distributed environments, applications will crash or, worse, temporarily lose connectivity to the rest of the system. Typically, new instances are automatically started to replace the ones which were deemed lost. Through this process, we may have multiple instances processing the same input topics and writing to the same output topics, causing duplicate outputs and violating the exactly once processing semantics. We call this the problem of "zombie instances."
Assuming your producer logic to Cloud PubSub does not suffer from problem 1, just like Kafka producers when using enable.idempotence=true, you are still left with problems 2 and 3.
Without solving these issues, your processing semantics will be the delivery semantics your consumer is using: at-least-once, if you commit the offset manually after publishing.
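A sketch of that manual-commit forwarder, with a placeholder publishToPubSub() standing in for the real Cloud Pub/Sub publisher and placeholder broker/group/topic names:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PubSubForwarder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "pubsub-forwarder");        // hypothetical group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("source-topic"));    // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    publishToPubSub(record.value()); // hypothetical Pub/Sub publish call
                }
                // Offsets are committed only after the external publish succeeded.
                // A crash between publish and commit replays the records -> at-least-once,
                // so the Pub/Sub side may still see duplicates.
                consumer.commitSync();
            }
        }
    }

    private static void publishToPubSub(String payload) {
        // Placeholder for the real Cloud Pub/Sub publisher call.
    }
}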

What is Kafka message tweaking?

From https://data-flair.training/blogs/advantages-and-disadvantages-of-kafka/:
As we know, the broker uses certain system calls to deliver messages to the consumer. However, Kafka’s performance reduces significantly if the message needs some tweaking. So, it can perform quite well if the message is unchanged because it uses the capabilities of the system.
How can a message be tweaked? If I want to demonstrate that a message can be tweaked, what do I have to do?
The suspected concern with Kafka performance is mentioned in this statement:
Suspicion:
Kafka’s performance reduces significantly if the message needs some
tweaking. So, it can perform quite well if the message is unchanged
Clarification:
As a user of Kafka for several years, I have found Kafka's guarantee that a message cannot be altered while it is inside a queue (topic) to be one of its best features.
Kafka does not allow consumers to directly alter in-flight messages in the queue (topic).
Message content can be altered only before it is published to a topic or after it is consumed from a topic.
The business requirement that needs messages to be modified can be implemented using multiple topics, in this pattern:
(message) --> Topic 1 --> (consume & modify message) --> Topic 2
Using multiple topics for implementing the message modification functionality will only increase the storage requirement. It will not have any impact on performance.
Kafka's design of not allowing in-flight modification of messages provides 'Data integrity'. This is one of the driving factors behind the widespread use of Kafka in financial processing applications.
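A minimal sketch of that two-topic pattern with Kafka Streams; the topic names and the transformation (upper-casing the value) are placeholders:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ModifyViaSecondTopic {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "message-modifier");   // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // "Modify" a message by consuming it from topic-1, transforming the value,
        // and writing the new version to topic-2; the original record stays immutable.
        builder.<String, String>stream("topic-1")
               .mapValues(value -> value.toUpperCase())   // stand-in for the real modification
               .to("topic-2");

        new KafkaStreams(builder.build(), props).start();
    }
}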
It's not clear what that means. That term is not used in official Kafka documentation, and messages are immutable once sent by a producer.

Kafka KStream OutOfOrderSequenceException

Our application intermittently encounters an OutOfOrderSequenceException in our Streams code, which causes the stream thread to stop.
The implementation is simple: two KStreams are joined and the result is output to another topic.
While searching for a solution to this OutOfOrderSequenceException,
I found the documentation below on Confluent:
https://docs.confluent.io/current/streams/concepts.html#out-of-order-handling
But I could not find which settings, configs, or trade-offs are being referred to here.
How do I manually do the bookkeeping?
If users want to handle such out-of-order data, generally they need to
allow their applications to wait for longer time while bookkeeping
their states during the wait time, i.e. making trade-off decisions
between latency, cost, and correctness. In Kafka Streams, users can
configure their window operators for windowed aggregations to achieve
such trade-offs (details can be found in the Developer Guide).
From the JavaDocs of OutOfOrderSequenceException:
This exception indicates that the broker received an unexpected sequence number from the producer, which means that data may have been lost. If the producer is configured for idempotence only (i.e. if enable.idempotence is set and no transactional.id is configured), it is possible to continue sending with the same producer instance, but doing so risks reordering of sent records. For transactional producers, this is a fatal error and you should close the producer.
Sequence numbers are assigned internally to each message that is written to a topic.
Because it is an internal error, it's hard to tell what the root cause could be, though.
Update:
After updating the Kafka brokers and the Kafka Streams version, the issue seems to have subsided.
Also, as per the resiliency recommendations at
https://kafka.apache.org/10/documentation/streams/developer-guide/config-streams.html#recommended-configuration-parameters-for-resiliency
I have updated acks to all. The replication factor was already 3.
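For reference, a sketch of how those resiliency settings can be applied in a Streams configuration; the application id and broker address are placeholders:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ResilientStreamsConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-app");           // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        // Resiliency settings recommended in the linked guide:
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);                // internal topics
        props.put(StreamsConfig.producerPrefix("acks"), "all");               // producer acks=all
        return props;
    }
}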