Read Uncommitted in Kafka

I have some confusion about isolation.level=read_uncommitted in Kafka.
The blog post at https://www.confluent.io/blog/transactions-apache-kafka/ has an explanation of read_uncommitted:
In short: Kafka guarantees that a consumer will eventually deliver only non-transactional messages or committed transactional messages. It will withhold messages from open transactions and filter out messages from aborted transactions.
But the official Kafka documentation at https://kafka.apache.org/documentation/#consumerconfigs_isolation.level explains:
If set to read_uncommitted (the default), consumer.poll() will return all messages, even transactional messages which have been aborted. Non-transactional messages will be returned unconditionally in either mode.
So which one is correct?

Both are factually correct. The summary copied from the Confluent blog is being misunderstood.
Please note that the text you copied is under the section Reading Transactional Messages. It is explaining that when the data is read by transactional, i.e. read_committed, consumers:
messages from aborted transactions are filtered out
messages from open transactions are withheld (they will eventually be delivered to the application if the transaction is committed, or filtered out if it is aborted)
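To make the two modes concrete, here is a minimal consumer sketch (the broker address, group id, and topic name "payments" are placeholders of mine, not from the question); the only line that changes the behaviour described above is isolation.level:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IsolationLevelDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "isolation-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        // Default is read_uncommitted: poll() also returns records from aborted
        // (and still open) transactions. Switching to read_committed filters out
        // aborted records and withholds open transactions until they commit.
        props.put("isolation.level", "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}

Running the same code with the property left at its read_uncommitted default is enough to observe the difference the two quotes describe.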

Related

Kafka stream with exactly_once enabled generates several duplicated messages

A Kafka Streams application with exactly_once enabled generates several duplicated messages (with a not-committed transaction status).
I did a test on my PC:
without "exactly_once":
for 100_000 messages, I got 100_000 on the target topic.
with props.put(PROCESSING_GUARANTEE_CONFIG, "exactly_once"):
for 100_000 messages, I got 100_554 on the target topic.
In the latter case, consuming the target topic with "read_committed" allows reading only 100_000 messages.
But the remaining 554 pollute the flow monitoring.
Is there a reason to get 554 extra messages when activating the "exactly_once" option?
Thank you.
The 554 messages are most likely the transaction markers that are needed to provide exactly-once delivery semantics.
When you use exactly-once, Kafka Streams uses Kafka transactions to write records to the output topics. Kafka transactions use transaction markers to mark whether records were part of a committed or an aborted transaction.
A consumer with isolation level read_committed interprets the transaction markers to decide which records to skip because they were part of an aborted transaction and which records to return in calls to poll() because they were part of a committed transaction.
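As an illustration, here is a minimal sketch of the Streams setup described in the question (the topic names "source" and "target" and the application id are placeholders of mine); the single PROCESSING_GUARANTEE_CONFIG line is what makes Streams write through Kafka transactions, which is where the transaction markers in the log come from:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceCopy {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "exactly-once-copy");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // "exactly_once" as in the question; newer Streams versions prefer "exactly_once_v2".
        // Records are now written to "target" inside Kafka transactions, so the log also
        // contains commit/abort markers, which can account for an offset gap like the 554
        // observed in the question.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("source").to("target");   // simple copy, no transformation

        new KafkaStreams(builder.build(), props).start();
    }
}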

What happens to the Kafka messages if the microservice crashes before the Kafka commit?

I am new to Kafka. I have a Java microservice using Kafka Streams that consumes messages from a Kafka topic produced by a producer and processes them. The commit interval has been set using auto.commit.interval.ms. My question is: if the microservice crashes before the commit, what will happen to the messages that got processed but didn't get committed? Will there be duplicated records? And how do I resolve this duplication, if it happens?
Kafka has exactly-once semantics, which guarantee the records will get processed only once. Take a look at this section of Spring Kafka's docs for more details on the Spring support for that. Also, see this section for the support for transactions.
Kafka provides various delivery semantics; which one to use depends on the use case you've implemented.
If you're concerned about messages not getting lost by the consumer service, you should go with at-least-once delivery semantics.
Now, answering your question on the basis of at-least-once delivery semantics:
If your consumer service crashes before committing the Kafka message, it will re-read the message once your consumer service is up and running again, because the offset for the partition was not committed. The offset for a partition is committed once the consumer has processed the message; in simple words, the commit tells Kafka that this offset has been processed, and Kafka will not send the committed message for the same partition again.
At-least-once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example, with a unique key in each message, a duplicate can be rejected when writing it to the database.
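A minimal sketch of that consumer-side deduplication idea (the in-memory set and the save method are stand-ins for a real database with a unique constraint, and the assumption that the record key is a unique business key is mine):

import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class Deduplicator {
    // In practice this would be a unique index in the database, not an in-memory set.
    private final Set<String> processedKeys = new HashSet<>();

    public void handle(ConsumerRecord<String, String> record) {
        String uniqueKey = record.key();      // assumption: the key uniquely identifies the event
        if (!processedKeys.add(uniqueKey)) {
            return;                           // duplicate delivered by at-least-once: skip it
        }
        save(record.value());                 // hypothetical write to the downstream store
    }

    private void save(String value) {
        // write to the database; the unique constraint keeps the write idempotent
    }
}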
There are mainly three types of delivery semantics.
At most once:
Offsets are committed as soon as the message is received by the consumer.
It's a bit risky, because if the processing goes wrong, the message is lost.
At least once:
Offsets are committed after the messages are processed, so it's usually the preferred one (see the commit-ordering sketch after this list).
If the processing goes wrong, the message will be read again, as its offset has not been committed.
The problem with this is duplicate processing of messages, so make sure your processing is idempotent. (Yes, your application should handle duplicates; Kafka won't help here.)
Idempotent means that processing the message again will not impact your system.
Exactly once:
Can be achieved for Kafka-to-Kafka communication using the Kafka Streams API.
It's not your case.
You can choose the semantics from above as per your requirements.
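A minimal sketch of the difference between the first two, assuming auto-commit is disabled and process() stands in for the application's own logic:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitOrdering {

    // At-most-once: commit first, then process. A crash after the commit but
    // before processing finishes loses those records.
    static void atMostOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        consumer.commitSync();
        records.forEach(CommitOrdering::process);
    }

    // At-least-once: process first, then commit. A crash before the commit means
    // the records are redelivered, so process() must be idempotent.
    static void atLeastOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        records.forEach(CommitOrdering::process);
        consumer.commitSync();
    }

    private static void process(ConsumerRecord<String, String> record) {
        // application-specific processing
    }
}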

Aborting a Kafka transaction with a compacted topic

I have a use case where I need to send 3 messages to 3 different topics within a single transaction.
The issue is that one of the topics is compacted, and I'm pretty new to Kafka transactions, so I'm not really sure how transaction cancelling works.
My question is: what actually happens if the transaction fails or aborts (application crash, exception, etc.)? Will the records of the aborted transaction eventually be removed from the compacted topic's log (like null records)? Is it the same with non-compacted topics?
Thanks.
It's a bit late, but in my experience Kafka keeps messages from aborted transactions in the log and marks them with a special marker, so that a consumer using the read_committed isolation level filters them out while a consumer using read_uncommitted still sees them.
This can lead to interesting issues like this one.
Since aborted messages are filtered at the consumer level, I think they should fall under the general compaction logic and would be compacted away if there's a message with the same key further down the topic, but an aborted message shouldn't override the last committed message with the same key, because that would violate the guarantees Kafka provides for transactions.
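For reference, a minimal sketch of the three-topic transaction from the question (the topic names, the transactional.id, and the assumption that "topic-compacted" has cleanup.policy=compact are placeholders of mine); on failure the records stay in all three logs but are marked aborted, hidden from read_committed consumers, and left to the normal compaction/retention clean-up:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MultiTopicTransaction {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("transactional.id", "multi-topic-tx-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("topic-a", "k1", "v1"));
                producer.send(new ProducerRecord<>("topic-b", "k1", "v2"));
                producer.send(new ProducerRecord<>("topic-compacted", "k1", "v3"));
                producer.commitTransaction();
            } catch (Exception e) {
                // The records stay in all three logs but are marked as aborted;
                // read_committed consumers never see them. (In real code, fatal
                // exceptions such as ProducerFencedException require closing the
                // producer instead of aborting.)
                producer.abortTransaction();
            }
        }
    }
}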

How does Kafka guarantee consumers don't read a single message twice?

How does Kafka guarantee consumers don't read a single message twice?
Or is the above scenario possible?
Could the same message be read twice by a single consumer or by multiple consumers?
There are many scenarios that cause a consumer to consume duplicate messages:
The producer published the message successfully but failed to receive the acknowledgement, which causes it to retry the same message.
The producer published a batch of messages but some of them failed to be published. In that case, it will retry and resend the same batch again, which will cause duplicates.
Consumers receive a batch of messages from Kafka and manually commit their offsets (enable.auto.commit=false).
If a consumer fails before committing to Kafka, the next time it will consume the same records again, which reproduces duplicates on the consumer side.
To guarantee that no duplicate messages are consumed, the job's execution and the offset commit must be atomic, which gives exactly-once delivery semantics on the consumer side.
You can use the settings below to achieve exactly-once semantics (they are tied together in the sketch at the end of this answer), but please understand that this comes with a performance trade-off.
Enable idempotence on the producer side, which guarantees the same message is not published twice:
enable.idempotence=true
Set the transaction isolation level to read_committed:
isolation.level=read_committed
In Kafka Streams, the above settings can be achieved by enabling the exactly-once semantics, which makes the processing a unit transaction.
Idempotent
Idempotent delivery enables the producer to write a message to Kafka exactly once to a particular partition of a topic during the lifetime of a single producer, without data loss and while preserving order per partition.
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
Rather than fire-and-forget writes, the producer uses beginTransaction, commitTransaction, and abortTransaction (in case of failure), while the consumer uses isolation.level set to either read_committed or read_uncommitted:
read_committed: consumers will always read committed data only.
read_uncommitted: read all messages in offset order without waiting for transactions to be committed.
Please refer to the reference for more detail.
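Here is a minimal sketch that ties those settings together in a consume-transform-produce loop (the broker address, topic names "input"/"output", group id, transactional.id, and the uppercase transform are placeholders of mine; the groupMetadata() overload requires Kafka clients 2.5+):

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOncePipeline {
    public static void main(String[] args) {
        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("enable.idempotence", "true");          // no duplicates from producer retries
        pp.put("transactional.id", "pipeline-tx-1");   // enables transactions

        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "pipeline");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("enable.auto.commit", "false");
        cp.put("isolation.level", "read_committed");   // only read committed input

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(pp);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp)) {
            consumer.subscribe(List.of("input"));
            producer.initTransactions();
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> r : records) {
                    producer.send(new ProducerRecord<>("output", r.key(), r.value().toUpperCase()));
                    offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                }
                // Output records and consumer offsets commit (or abort) atomically.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            }
        }
    }
}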
It is absolutely possible if you don't make your consuming process idempotent.
For example, say you are implementing at-least-once delivery semantics: you first process the messages and then commit the offsets. It is possible that you can't commit the offsets because of a server failure or a rebalance (maybe your consumer's partitions were revoked at that time), so when you poll again you will get the same messages twice.
To be precise, this is what Kafka guarantees:
Kafka provides an ordering guarantee for messages within a partition.
Produced messages are considered "committed" once they have been written to the partition on all of its in-sync replicas.
Messages that are committed will not be lost as long as at least one replica remains alive.
Consumers can only read messages that are committed
Regarding consuming messages, consumers keep track of their progress in a partition by saving the last offset read in an internal, compacted Kafka topic.
Kafka consumers can automatically commit offsets if enable.auto.commit is enabled. However, that can give "at most once" semantics, since an offset may be committed before the corresponding records have been fully processed. Hence, the flag is usually disabled and the developer commits the offset explicitly once the processing is complete.

Kafka: isolation level implications

I have a use case where I need 100% reliability, idempotency (no duplicate messages), and order preservation in my Kafka partitions. I'm trying to set up a proof of concept using the transactional API to achieve this. There is a setting called isolation.level that I'm struggling to understand.
In this article, they talk about the difference between the two options:
There are now two new isolation levels in the Kafka consumer:
read_committed: Read both kinds of messages, those that are not part of a transaction and those that are, after the transaction is committed. A read_committed consumer uses the end offset of a partition, instead of client-side buffering. This offset is the first message in the partition belonging to an open transaction. It is also known as the "Last Stable Offset" (LSO). A read_committed consumer will only read up to the LSO and filter out any transactional messages which have been aborted.
read_uncommitted: Read all messages in offset order without waiting for transactions to be committed. This option is similar to the current semantics of a Kafka consumer.
The performance implication here is obvious, but I'm honestly struggling to read between the lines and understand the functional implications/risks of each choice. It seems like read_committed is 'safer', but I want to understand why.
First, the isolation.level setting only has an impact on the consumer if the topics it's consuming from contain records written by a transactional producer.
If so, and if it's set to read_uncommitted, the consumer will simply read everything, including records from aborted transactions. That is the default.
When set to read_committed, the consumer will only be able to read records from committed transactions (in addition to records that are not part of any transaction). It also means that, in order to keep ordering, if a transaction is in flight the consumer will not be able to consume records that are part of that transaction. Basically, the broker will only allow the consumer to read up to the Last Stable Offset (LSO). When the transaction is committed (or aborted), the broker updates the LSO and the consumer receives the new records.
If you don't tolerate duplicates or records from aborted transactions, then you should use read_committed. As you hinted, this creates a small delay in consuming, as records only become visible once transactions are committed. The impact mostly depends on the size of your transactions, i.e. how often you commit.
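To make that last point concrete, here is a minimal producer sketch (the topic name, batch size, and transactional.id are placeholders of mine); a read_committed consumer sees nothing from a batch until commitTransaction() advances the LSO, so larger or less frequent transactions mean a longer visibility delay:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchedTransactions {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("transactional.id", "batched-tx-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            for (int batch = 0; batch < 10; batch++) {
                producer.beginTransaction();
                for (int i = 0; i < 1000; i++) {          // larger batches mean a longer delay
                    producer.send(new ProducerRecord<>("events", "k" + i, "v" + i));
                }
                producer.commitTransaction();             // LSO advances; records become visible
            }
        }
    }
}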
If you are not using transactions in your producer, the isolation level does not matter. If you are, then you must use read_committed if you want the consumers to honor the transactional nature. Here are some additional references:
https://www.confluent.io/blog/transactions-apache-kafka/
https://docs.google.com/document/d/11Jqy_GjUGtdXJK94XGsEIK7CP1SnQGdp2eF0wSw9ra8/edit
If so, and if it's set to read_uncommitted, the consumer will simply read everything, including records from aborted transactions. That is the default.
To clarify things a bit for readers: this is the default only in the Java Kafka client. It was done to avoid changing the semantics when transactions were introduced back in the day.
It's the opposite in librdkafka, which sets the isolation.level configuration to read_committed by default. As a result, all libraries built on top of librdkafka consume only committed messages by default: confluent-kafka-python, confluent-kafka-dotnet, rdkafka-ruby.
KafkaJS consumers also use read committed by default (readUncommitted set to false).