Can I make sure Kafka doesn't accept two copies of the same message? - apache-kafka

I'm writing messages along with timestamps to Kafka. If I retry, the timestamp might change, and so might the producer that's writing, but the message content and message id stay the same. The message id is a uuid that is generated before the message gets here.
How can I make sure Kafka doesn't accept the second copy if the first one was successfully written to the topic but the ack got lost, so the service up the chain retries? The consumers must never see the duplicate message.

In general there are two cases in which the same message can be sent to Kafka:
1. During normal operation your application intentionally sends messages with the same uuid to Kafka and you want Kafka to do deduplication.
2. While you are sending a message to Kafka your code or the Kafka brokers fail and you want to make sure the message you try to send again isn't duplicated, and also isn't lost.
I assume you are interested in case 2. The Kafka developers call case 2 exactly-once delivery. Recent versions of Kafka support transactions in order to enable exactly-once delivery. A complete explanation of how Kafka does this, along with a code snippet, can be found in this article by Confluent (the Kafka company).
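As a rough illustration of why a retry from further up the chain is harder than a retry inside the producer, here is a toy Python model of sequence-number deduplication, which is the idea behind Kafka's idempotent producer. It is not Kafka code: ToyPartition, producer_id and seq are made-up names, and the real broker tracks more state (such as producer epochs) than this sketch does.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartition:
    """Toy model of one partition's log with sequence-number deduplication."""
    log: list = field(default_factory=list)
    last_seq: dict = field(default_factory=dict)  # producer_id -> last sequence appended

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        """Append value unless this producer already wrote this sequence number."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # retry from the same producer session: dropped
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return True


partition = ToyPartition()

# Producer 1 sends the message; the write succeeds but the ack is lost.
partition.append(producer_id=1, seq=0, value="uuid-123")
# The producer's own retry reuses the same sequence number, so it is dropped.
retry_accepted = partition.append(producer_id=1, seq=0, value="uuid-123")
# An upstream retry that arrives through a different producer carries a new
# id and sequence, so the toy broker cannot recognise it as a duplicate.
upstream_accepted = partition.append(producer_id=2, seq=0, value="uuid-123")
```

In this model the deduplication key is the producer session and sequence number, not the uuid inside the message, which is why the second producer's copy ends up in the log.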

Related

What happens to the kafka messages if the microservice crashes before kafka commit?

I am new to Kafka. I have a Java microservice using Kafka Streams that consumes messages from a Kafka topic and processes them. The commit interval has been set using auto.commit.interval.ms. My question is: if the microservice crashes before a commit, what will happen to the messages that were processed but not committed? Will there be duplicate records? And how do I resolve this duplication if it happens?
Kafka has exactly-once-semantics which guarantees the records will get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics. Which one to choose depends on the use case you've implemented.
If your concern is that messages must not get lost by the consumer service, you should go with at-least-once delivery semantics.
Now, answering your question on the basis of at-least-once delivery semantics:
If your consumer service crashes before committing the offset of a Kafka message, the message is re-streamed once your consumer service is up and running again. This is because the offset for that partition was not committed. The offset for a partition is committed once the message has been processed by the consumer. In simple words, a committed offset says that everything up to it has been processed, and Kafka will not send those messages again for the same partition.
At-least-once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example, with a unique key in each message, a duplicate can be rejected when writing to the database.
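The unique-key idea can be sketched as follows, using SQLite from the Python standard library as a stand-in for the real database. The table name events, the column names and the store_once helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The primary key on message_id is what rejects a redelivered message.
conn.execute("CREATE TABLE events (message_id TEXT PRIMARY KEY, payload TEXT)")


def store_once(message_id: str, payload: str) -> bool:
    """Insert the record; return False if message_id was already stored."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO events (message_id, payload) VALUES (?, ?)",
            (message_id, payload),
        )
    return cur.rowcount == 1


first = store_once("a1", "order created")
second = store_once("a1", "order created")  # same message delivered again
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

With this in place the consumer can simply process every delivery; a redelivered message changes nothing in the table.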
There are mainly three types of delivery semantics:
At most once:
Offsets are committed as soon as the message is received by the consumer.
It's a bit risky: if the processing goes wrong, the message is lost.
At least once:
Offsets are committed after the messages are processed, so it's usually the preferred one.
If the processing goes wrong, the message will be read again because it has not been committed.
The problem with this is duplicate processing of messages, so make sure your processing is idempotent, meaning that processing a message again does not impact your system. (Yes, your application should handle duplicates; Kafka won't help here.)
Exactly once:
Can be achieved for Kafka-to-Kafka communication using the Kafka Streams API.
It's not your case.
You can choose from the semantics above as per your requirement.
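The difference between the first two semantics comes down to whether the commit happens before or after processing. A small simulation in plain Python (no Kafka involved; run_consumer and run_with_restart are hypothetical helpers) shows what a crash does in each case:

```python
def run_consumer(messages, commit_first, crash_at):
    """Simulate one consumer run that crashes while handling index crash_at.

    Returns (processed, committed), where committed is the index of the
    next message a restarted consumer would read.
    """
    processed, committed = [], 0
    for offset, msg in enumerate(messages):
        if commit_first:
            committed = offset + 1      # at most once: commit, then process
            if offset == crash_at:
                break                   # crash before processing: message lost
            processed.append(msg)
        else:
            processed.append(msg)       # at least once: process, then commit
            if offset == crash_at:
                break                   # crash before commit: message re-read
            committed = offset + 1
    return processed, committed


def run_with_restart(messages, commit_first, crash_at):
    """Crash once, then restart from the committed offset and finish."""
    processed, committed = run_consumer(messages, commit_first, crash_at)
    return processed + messages[committed:]


msgs = ["m0", "m1", "m2"]
at_most_once = run_with_restart(msgs, commit_first=True, crash_at=1)
at_least_once = run_with_restart(msgs, commit_first=False, crash_at=1)
```

In the at-most-once run m1 is never processed, while in the at-least-once run m1 is processed twice, which is why the processing has to be idempotent.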

Is message deduplication essential on the Kafka consumer side?

Kafka documentation states the following as the top scenario:
To process payments and financial transactions in real-time, such as
in stock exchanges, banks, and insurances
Also, regarding the main concepts, right at the very top:
Kafka provides various guarantees such as the ability to process
events exactly-once.
It’s funny the document says:
Many systems claim to provide "exactly once" delivery semantics, but
it is important to read the fine print, most of these claims are
misleading…
It seems obvious that payments/financial transactions must be processed "exactly-once", but the rest of Kafka documentation doesn't make it obvious how this should be accomplished.
Let’s focus on the producer/publisher side:
If a producer attempts to publish a message and experiences a network
error it cannot be sure if this error happened before or after the
message was committed. This is similar to the semantics of inserting
into a database table with an autogenerated key. … Since 0.11.0.0, the
Kafka producer also supports an idempotent delivery option which
guarantees that resending will not result in duplicate entries in the
log.
KafkaProducer only ensures that it doesn’t incorrectly resubmit messages (resulting in duplicates) itself. Kafka cannot cover the case where client app code crashes (along with KafkaProducer) and it is not sure if it previously invoked send (or commitTransaction in case of transactional producer) which means that application-level retry will result in duplicate processing.
Exactly-once delivery for other destination systems generally
requires cooperation with such systems, but Kafka provides the offset
which makes implementing this feasible (see also Kafka Connect).
The above statement is only partially correct: while Kafka exposes offsets on the consumer side, that does nothing to make exactly-once feasible on the producer side.
The Kafka consume-process-produce loop enables exactly-once processing by leveraging sendOffsetsToTransaction, but again it cannot rule out duplicates from the first producer in the chain.
The provided official demo for EOS (Exactly once semantics) only provides an example for consume-process-produce EOS.
Solutions involving DB transaction log readers which read already committed transactions, also cannot be sure if they will produce duplicate messages in case they crash.
There is no support for a distributed transaction (XA) involving a database and the Kafka producer.
Does all of this mean that in order to ensure exactly-once processing for payments and financial transactions (Kafka's top use case!), we absolutely must perform business-level message deduplication on the consumer side, in spite of the Kafka transport-level “guarantees”/claims?
Note: I’m aware of:
Kafka Idempotent producer
but I would like a clear answer if deduplication is inevitable on the consumer side.
You must deduplicate on the consumer side, because a rebalance can cause events to be processed more than once within a consumer group, depending on the fetch size and commit interval parameters.
If a consumer exits without acknowledging back to the broker, Kafka will assign those events to another consumer in the group. For example, say you are pulling batches of 5 events and the consumer dies or restarts after processing the first 3 (because the external API/DB fails or, in the worst case, your server runs out of memory and crashes). The consumer dies abruptly without committing back to the broker. Hence the same batch gets assigned to another consumer in the group (rebalance), which is supplied the same event batch again and re-processes the same set of records, resulting in duplication. A good read here: https://quarkus.io/blog/kafka-commit-strategies/
You can make use of the internal state store of Kafka Streams for deduplication. There is no offset/partition tracking here; it's a kind of cache (persistent and time-bound, on the cluster).
In my case we push the correlationId (a unique business identifier in the incoming event) into it on successful processing of an event, and all new events are checked against it before processing to make sure they are not duplicates. Enabling a state store will create more internal topics in the Kafka cluster, just an FYI.
https://kafka.apache.org/10/documentation/streams/developer-guide/processor-api.html#state-stores
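The check-then-record pattern described above might look roughly like this. It is a plain Python sketch in which an in-memory dictionary stands in for the state store; SeenIds, handle and the event fields other than correlationId are invented for the example and are not part of any Kafka API.

```python
import time


class SeenIds:
    """In-memory stand-in for a time-bound store of processed ids."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.seen = {}  # correlation_id -> time it was recorded

    def is_duplicate(self, correlation_id: str) -> bool:
        recorded = self.seen.get(correlation_id)
        return recorded is not None and self.clock() - recorded < self.ttl

    def mark_processed(self, correlation_id: str) -> None:
        self.seen[correlation_id] = self.clock()


def handle(event: dict, store: SeenIds, sink: list) -> None:
    """Process the event unless its correlationId was handled recently."""
    if store.is_duplicate(event["correlationId"]):
        return
    sink.append(event["payload"])                 # the actual processing
    store.mark_processed(event["correlationId"])  # record only after success


store, sink = SeenIds(ttl_seconds=3600), []
batch = [{"correlationId": "c1", "payload": "pay 10"},
         {"correlationId": "c2", "payload": "pay 20"}]
for event in batch + batch:  # the same batch redelivered after a rebalance
    handle(event, store, sink)
```

Note that a crash between the processing step and mark_processed would still let one duplicate through, so the processing itself should tolerate that or be made atomic with the store update.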

Logs and statements with Snowflake and Kafka Connector

Is there a possibility that we may lose some messages if we use the Snowflake Kafka connector? For example, if the Kafka connector reads a message and commits the offset before the message is written to the variant table, we will lose that message. Is this a scenario that can happen if we use Kafka Connect?
If you have any examples, these are welcome as well, thank you!
According to the documentation from Snowflake:
Both Kafka and the Kafka connector are fault-tolerant. Messages are neither duplicated nor silently dropped. Messages are delivered exactly once, or an error message will be generated. If an error is detected while loading a record (for example, the record was expected to be a well-formed JSON or Avro record, but wasn’t well-formed), then the record is not loaded; instead, an error message is returned.
Limitations are listed as well. Arguably, nothing is impossible, but if you don't trust Kafka I'd not use Kafka at all.
How and where you could lose messages depends on your overall architecture too, for example how records are written into the Kafka topics you're consuming and how you partition them.

How to make restart-able producer?

The latest versions of Kafka support exactly-once semantics (EoS). To support this, extra details are added to each message. This means that if you print the offsets of messages at your consumer, they won't necessarily be sequential. This makes it harder to poll a topic to read the last committed message.
In my case, consumer printed something like this
Offset-0 0
Offset-2 1
Offset-4 2
Problem: In order to write a restartable producer, I poll the topic and read the content of the last message. In this case the last message would be offset #5, which is not a valid consumer record. Hence, I see errors in my code.
I can use the solution provided at: Getting the last message sent to a kafka topic. The only problem is that instead of using consumer.seek(partition, last_offset-1) I would use consumer.seek(partition, last_offset-2). This would immediately resolve my issue, but it's not an ideal solution.
What would be the most reliable and best solution to get last committed message for a consumer written in Java? OR
Is it possible to use local state-store for a partition? OR
What is the most recommended way to store last message to withstand producer-failure? OR
Are kafka connectors restartable? Is there any specific API that I can use to make producers restartable?
FYI- I am not looking for quick fix
In my case, multiple producers push data to one big topic. Therefore, reading the entire topic would be a nightmare.
The solution that I found is to maintain another topic, e.g. "P1_Track", where the producer can store metadata. Within a single transaction, a producer sends data to both the one big topic and P1_Track.
When I restart a producer, it will read P1_Track and figure out where to start from.
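A toy model of this tracking-topic approach, in plain Python with lists standing in for topics. ToyTransaction, produce, resume_position and the topic name big_topic are made up for the sketch; only P1_Track comes from the description above, and a real implementation would use the transactional producer API instead.

```python
class ToyTransaction:
    """Buffers writes to several topics and applies them all or not at all."""

    def __init__(self, topics: dict):
        self.topics = topics
        self.pending = []

    def send(self, topic: str, value) -> None:
        self.pending.append((topic, value))

    def commit(self) -> None:
        for topic, value in self.pending:
            self.topics[topic].append(value)
        self.pending = []


def produce(topics, source, start, crash_before_commit_at=None):
    """Send source[start:] one record per transaction, tracking progress."""
    for index in range(start, len(source)):
        txn = ToyTransaction(topics)
        txn.send("big_topic", source[index])
        txn.send("P1_Track", index)  # metadata: last source index sent
        if index == crash_before_commit_at:
            return                   # crash: the open transaction is lost
        txn.commit()


def resume_position(topics) -> int:
    """Read the tracking topic to find where a restarted producer continues."""
    track = topics["P1_Track"]
    return track[-1] + 1 if track else 0


topics = {"big_topic": [], "P1_Track": []}
source = ["r0", "r1", "r2", "r3"]
produce(topics, source, start=0, crash_before_commit_at=2)  # first run crashes
produce(topics, source, start=resume_position(topics))      # restarted run
```

Because the data record and the tracking record are committed together, the restarted producer neither skips r2 nor writes it twice.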
Another option I'm thinking about is storing the last committed message in a database and using it when the producer process restarts.

Adding to a Kafka topic exactly once

Since 0.11, Kafka Streams offers exactly-once guarantees, but their definition of "end" in end-to-end seems to be "a Kafka topic".
For real-time applications, the first "end" however is generally not a Kafka topic, but some kind of application that outputs data - perhaps going through multiple tiers and networks - to a Kafka topic.
So does Kafka offer something to add to a topic exactly-once, in the face of network failures and application crashes and restarts? Or do I have to use Kafka's at-least-once semantics and deduplicate that topic with potential duplicates into another exactly-once topic, by means of some unique identifier?
Edit Due to popular demand, here's a specific use case. I have a client C that creates messages and sends them to a server S, which uses a KafkaProducer to add those messages to Kafka topic T.
How can I guarantee, in the face of
crashes of C, S, and members of the Kafka cluster
temporary network problems
that all messages that C creates end up in T, exactly once (and - per partition - in the correct order)?
I would of course make C resend all messages for which it did not get an ack from S -> at-least-once. But to make it exactly once, the messages that C sends would need to contain some kind of ID, so that deduplication can be performed. I don't know how to do that with Kafka.
Kafka's exactly-once feature, in particular the "idempotent producer", can help you with server crashes and network issues.
You can enable idempotency via the producer config enable.idempotence=true, which you pass in like any other config. This ensures that every message is written exactly once and in the correct order if the server crashes or if there are any network issues.
Kafka's exactly-once feature does not provide support if the producer crashes. For this case, you would need to write manual code to figure out which messages got appended to the topic successfully before the crash (by using a consumer) and resume sending where you left off. As an alternative, you can still deduplicate on the consumer side, as you mentioned already.
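The manual recovery step could be sketched like this in plain Python. It assumes every message carries an id field and that the producer keeps a durable outbox of what it intended to send; both are assumptions of this example, as is the helper name remaining_to_send.

```python
def remaining_to_send(topic_records, outbox):
    """Return the outbox entries whose ids are not yet in the topic."""
    already_written = {record["id"] for record in topic_records}
    return [msg for msg in outbox if msg["id"] not in already_written]


outbox = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
# What a consumer reads back from the topic after the producer crashed.
topic = [{"id": "a"}, {"id": "b"}]
to_send = remaining_to_send(topic, outbox)
```

On a large topic reading everything back is expensive, so in practice the scan would have to be limited to the recent part of the log.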
You might want to have a look at Kafka's log compaction feature. It will deduplicate messages for you, provided all the duplicates share the same unique key.
https://kafka.apache.org/documentation/#compaction
Update:
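Log compaction is not very reliable for this, however: it runs in the background, so duplicates remain visible to consumers until the log cleaner has processed them. You can change some settings to make it work as expected.
A more efficient way is to use Kafka Streams. You can achieve this using KTables.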
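Both compaction and a KTable boil down to keeping the latest value per key. A minimal Python sketch of that semantics (the compact function and the sample records are illustrative only, not Kafka code):

```python
def compact(log):
    """Keep only the latest record for each key, preserving log order."""
    latest_index = {key: i for i, (key, _) in enumerate(log)}
    return [rec for i, rec in enumerate(log) if latest_index[rec[0]] == i]


log = [("id-1", "v1"), ("id-2", "v1"), ("id-1", "v1 retry")]
compacted = compact(log)
```

A consumer that reads the log before it has been compacted still sees all three records, so this only helps readers that work from the compacted view.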