I'm using a Kafka Producer and my application sends individual ProducerRecords all with the same key into a single partition, and these ProducerRecords are then batched (using batch.size and linger.ms parameters) before being sent to the brokers. I have enable.idempotence=true and acks=all.
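For concreteness, a minimal sketch of the producer setup described above; the bootstrap servers, topic name, key, and batch/linger values are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingIdempotentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true");
        props.put("acks", "all");
        props.put("batch.size", 16384);  // example value, in bytes
        props.put("linger.ms", 10);      // example value: wait up to 10 ms to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // All records share the same key, so they go to the same partition
            // and are subject to that partition's ordering guarantee.
            producer.send(new ProducerRecord<>("my-topic", "same-key", "value-1"));
            producer.send(new ProducerRecord<>("my-topic", "same-key", "value-2"));
        }
    }
}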
If one record in the middle of a batch fails to be written, for example if a host crashes or a network failure or disk failure occurs or the record failed to be acked by the minimum replicas, does Kafka guarantee that all subsequent records will also not be written? Or is there a possibility that a record in the middle of a batch could be missing?
If one record in the middle of a batch fails to be written, for example if a host crashes or a network failure or disk failure occurs or the record failed to be acked by the minimum replicas, does Kafka guarantee that all subsequent records will also not be written?
Yes, if any message within a batch fails, then all messages in the same batch fail. So none of the messages within the batch will be written to the broker's disk.
Or is there a possibility that a record in the middle of a batch could be missing?
No. Either all of the messages in the batch are written to the broker or none of them are.
This is made possible by the separation between the producer's client thread and the local buffer that queues and batches records before they are physically sent to the broker.
Since your records are all going to the same partition, you can safely assume all previous records will also be there.
Kafka guarantees ordering in a given partition, so if you are sending messages m1 and m2 (in order) to the partition, the batch and linger logic will not override the ordering. In other words, if you see the message m2 at your consumer, you can safely assume that m1 was delivered safely as well.
Related
I was going through the documentation, and what I understood is that we can achieve an exactly-once transaction by enabling idempotence=true
idempotence: The idempotent producer enables exactly-once for a
producer against a single topic. Basically, each single message send
has stronger guarantees and will not be duplicated in case there's an
error
So if we already have idempotence, why do we need another exactly-once property in Kafka Streams? What exactly is the difference between idempotence and exactly-once?
Why is the exactly-once property not available in the normal Kafka producer?
In a distributed environment, failure is a very common scenario that can happen at any time. In a Kafka environment, the broker can crash, there can be a network failure, a failure in processing, a failure while publishing a message, a failure to consume messages, and so on.
These different scenarios introduce different kinds of data loss and duplication.
Failure scenarios
A (Ack failed): The producer publishes a message successfully with retries > 1 but cannot receive the acknowledgement due to a failure. In that case, the producer retries the same message, which might introduce a duplicate.
B (Producer process failed while batching messages): The producer sends a batch of messages and fails with only a few of them published successfully. In that case, once the producer restarts, it republishes all messages of the batch again, which introduces duplicates in Kafka.
C (Fire & forget failed): The producer publishes a message with retries=0 (fire and forget). In case of failure, the producer is not aware of it and sends the next message, so the failed message is lost.
D (Consumer failed on a batch of messages): A consumer receives a batch of messages from Kafka and commits its offsets manually (enable.auto.commit=false). If the consumer fails before committing to Kafka, next time it will consume the same records again, which reproduces duplicates on the consumer side.
Exactly-Once semantics
In this case, even if a producer tries to resend a message, the
message is published and consumed by consumers exactly once.
To achieve exactly-once semantics in Kafka, the following three properties are used (a configuration sketch follows the list):
enable.idempotence=true (addresses A, B & C)
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5 (the producer may have at most five in-flight requests per connection)
isolation.level=read_committed (addresses D)
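A minimal sketch of where these three settings go on the producer and consumer side; the bootstrap servers, group id, and (de)serializers are placeholders:

import java.util.Properties;

public class ExactlyOnceConfigSketch {
    public static void main(String[] args) {
        // Producer side: idempotence plus bounded in-flight requests (addresses A, B & C)
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");  // placeholder
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("enable.idempotence", "true");
        producerProps.put("max.in.flight.requests.per.connection", "5");

        // Consumer side: only read committed data (addresses D)
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");  // placeholder
        consumerProps.put("group.id", "my-group");                 // placeholder
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("isolation.level", "read_committed");

        // These Properties objects would then be passed to
        // new KafkaProducer<>(producerProps) and new KafkaConsumer<>(consumerProps).
    }
}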
Enable idempotence (enable.idempotence=true)
Idempotent delivery enables the producer to write a message to Kafka exactly
once to a particular partition of a topic during the lifetime of a
single producer, with no data loss and with order preserved per partition.
"Note that enabling idempotence requires MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION to be less than or equal to 5, RETRIES_CONFIG to be greater than 0 and ACKS_CONFIG be 'all'. If these values are not explicitly set by the user, suitable values will be chosen. If incompatible values are set, a ConfigException will be thrown"
To achieve idempotence, Kafka uses a unique id called the producer id, or PID, together with a sequence number when producing messages. The producer keeps incrementing the sequence number on each message it publishes, and the sequence numbers are mapped to the producer's unique PID. The broker always compares the incoming sequence number with the previous one it accepted, and rejects the message if the new number is not exactly one greater than the previous one; this avoids duplication, while a gap larger than one indicates lost messages.
So in a failure scenario, the broker compares the sequence number with the last one it accepted and rejects the message if the sequence has not increased by exactly one.
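A conceptual sketch of that sequence check; this is illustrative logic only, not actual Kafka broker code, and the values are hypothetical:

public class SequenceCheckSketch {
    // Illustrative only: the broker tracks the last sequence number per (PID, partition).
    static String check(long lastAcceptedSeq, long incomingSeq) {
        if (incomingSeq == lastAcceptedSeq + 1) {
            return "append";                  // expected next message: write it
        } else if (incomingSeq <= lastAcceptedSeq) {
            return "reject-duplicate";        // already written: drop it, no duplicate
        } else {
            return "reject-out-of-sequence";  // gap larger than one: possible lost messages
        }
    }

    public static void main(String[] args) {
        System.out.println(check(4, 5));  // append
        System.out.println(check(5, 5));  // reject-duplicate
        System.out.println(check(5, 8));  // reject-out-of-sequence
    }
}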
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
Instead of writing each message to Kafka independently, the transactional producer brackets its writes with beginTransaction, commitTransaction, and abortTransaction (in case of failure).
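A minimal sketch of this transactional producer API; the bootstrap servers, topic names, and transactional.id are placeholders, and the error handling is simplified:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("transactional.id", "my-transactional-id");  // placeholder; required for transactions
        // enable.idempotence is implied when transactional.id is set.

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();  // register the producer with the transaction coordinator
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("topic-A", "key", "value-1"));
            producer.send(new ProducerRecord<>("topic-B", "key", "value-2"));
            producer.commitTransaction();  // both records become visible atomically
        } catch (KafkaException e) {
            // Simplified: for unrecoverable errors (e.g. fencing) the producer should be
            // closed instead; for recoverable ones the transaction is aborted and retried.
            producer.abortTransaction();   // read_committed consumers never see the aborted records
        } finally {
            producer.close();
        }
    }
}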
The consumer uses isolation.level, which is either read_committed or read_uncommitted:
read_committed: Consumers will always read committed data only.
read_uncommitted: Read all messages in offset order without waiting
for transactions to be committed
If a consumer with isolation.level=read_committed reaches a control message for a transaction that has not completed, it will not deliver any more messages from this partition until the producer commits or aborts the transaction or a transaction timeout occurs. The transaction timeout is determined by the producer using the configuration transaction.timeout.ms (default 1 minute).
Exactly-Once in Producer & Consumer
Under normal conditions, where we have separate producers and consumers, the producer has to be idempotent and at the same time manage transactions, so that consumers can use isolation.level=read_committed to make the whole process an atomic operation.
This guarantees that the producer always stays in sync with the source system. Even if the producer crashes or a transaction is aborted, the result is consistent: a message or a batch of messages is published as a unit, exactly once.
The consumer, in turn, receives a message or a batch of messages as a unit exactly once, or not at all.
With exactly-once semantics, the producer together with the consumer appears as
one atomic operation working as a single unit: a message is either published and
consumed exactly once, or the whole operation is aborted.
Exactly Once in Kafka Stream
Kafka Streams consumes messages from topic A, processes them, publishes the results to topic B, and once the publish succeeds, commits (the commit mostly runs under the covers), flushing all state store data to disk.
Exactly-once in Kafka Streams is a read-process-write pattern that guarantees this whole operation is treated as an atomic operation. Since Kafka Streams handles the producer, the consumer, and the transaction all together, it has a special parameter, processing.guarantee, which can be exactly_once or at_least_once, making life easier by not having to handle all of the individual parameters separately.
Kafka Streams atomically updates consumer offsets, local state stores,
state store changelog topics, and production to output topics all
together. If anyone of these steps fails, all of the changes are
rolled back.
Setting processing.guarantee=exactly_once automatically provides the parameters below, so you do not need to set them explicitly (a Streams configuration sketch follows the list):
isolation.level=read_committed
enable.idempotence=true
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5
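A minimal sketch of enabling this in a Streams application; the application id, bootstrap servers, topic names, and serdes are placeholder choices:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceStreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // This one setting turns on idempotence, transactions and read_committed internally.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("topic-A").to("topic-B");  // read-process-write; placeholder topics

        new KafkaStreams(builder.build(), props).start();
    }
}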
Kafka Streams offers exactly-once semantics from an end-to-end point of view (it consumes from one topic, processes that message, then produces to another topic). However, you mentioned only the producer's idempotence attribute, and that is only a small part of the full picture.
Let me rephrase the question:
Why do we need the exactly-once delivery semantic at the consumer side
while we already have guaranteed the exactly-once delivery semantic at the
producer side?
Answer: The exactly-once delivery semantic applies not only to the producing step but to the full flow of processing. To achieve exactly-once delivery semantics, some conditions must be satisfied on both the producing and the consuming side.
This is the generic scenario: Process A produces messages to the topic T. At the same time, process B tries to consume messages from the topic T. We want to make sure process B never processes one message twice.
Producer part: We must make sure that producers never produce a message twice. We can use the Kafka idempotent producer.
Consumer part:
Here is the basic workflow for the consumer (a sketch follows the steps):
Step 1: The consumer pulls the message M successfully from the Kafka's topic.
Step 2: The consumer tries to execute the job and the job returns successfully.
Step 3: The consumer commits the message's offset to the Kafka brokers.
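For illustration, a minimal sketch of this workflow with manual commits; the topic, group id, and processRecord step are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("group.id", "my-group");                  // placeholder
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");           // offsets are committed manually (step 3)

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));  // step 1
                for (ConsumerRecord<String, String> record : records) {
                    processRecord(record);  // step 2: a crash here means the record is redelivered later
                }
                consumer.commitSync();      // step 3: a crash before this line also causes redelivery
            }
        }
    }

    private static void processRecord(ConsumerRecord<String, String> record) {
        // placeholder for the actual job
        System.out.println(record.value());
    }
}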
The steps above are just the happy path. In reality, many issues arise.
Scenario 1: The job in step 2 executes successfully, but then the consumer crashes. Because of this unexpected failure, the consumer has not committed the message's offset yet. When the consumer restarts, the message is consumed twice.
Scenario 2: The consumer crashes while committing the offset in step 3, for example due to hardware failures (CPU, memory violation, ...). When it restarts, the consumer has no way to know whether it committed the offset successfully or not.
Because so many problems can happen, the job's execution and the offset commit must be atomic to guarantee exactly-once delivery semantics on the consumer side. It is not impossible to do yourself, but it takes a lot of effort to get exactly-once delivery right. Kafka Streams takes that work off the engineers' shoulders.
Note that Kafka Streams offers "exactly-once stream processing". It refers to consuming from a topic, materializing intermediate state in a Kafka topic, and producing to another topic. If our application depends on other external services (databases, services, ...), we must make sure those external dependencies can guarantee exactly-once in such cases.
TL;DR: exactly-once for the full flow needs cooperation between producers and consumers.
References:
Exactly-once semantics and how Apache Kafka does it
Transactions in Apache Kafka
Enabling exactly once Kafka streams
Read this article about message ordering in a topic partition: https://blog.softwaremill.com/does-kafka-really-guarantee-the-order-of-messages-3ca849fd19d2
Allowing retries without setting max.in.flight.requests.per.connection
to 1 will potentially change the ordering of records because if two
batches are sent to a single partition, and the first fails and is
retried but the second succeeds, then the records in the second batch
may appear first.
According to it, there are two possible producer configurations for achieving an ordering guarantee:
max.in.flight.requests.per.connection=1 // can impact producer throughput
or alternative
enable.idempotence=true
max.in.flight.requests.per.connection //to be less than or equal to 5
retries // to be greater than 0
acks=all
Can anybody explain how the second configuration achieves the ordering guarantee? Also, with the second configuration, exactly-once semantics are enabled.
Idempotence (exactly-once, in-order semantics per partition)
Idempotent delivery enables the producer to write a message to Kafka exactly
once to a particular partition of a topic during the lifetime of a
single producer, with no data loss and with order preserved per partition.
Idempotence is one of the key features needed to achieve exactly-once semantics in Kafka. Setting enable.idempotence=true eventually gives exactly-once semantics per partition, meaning no duplicates and no data loss for a particular partition. If an error occurs and the producer sends a message multiple times, it still gets written to Kafka only once.
The Kafka producer uses the concepts of a PID and sequence numbers to achieve idempotence, as explained below:
PID and Sequence Number
Idempotent producers use a producer id (PID) and a sequence number when producing messages. The producer keeps incrementing the sequence number on each message it publishes, and the sequence numbers are mapped to the producer's unique PID. The broker always compares the incoming sequence number with the previous one it accepted, and rejects the message if the new number is not exactly one greater than the previous one; this avoids duplication, while a gap larger than one indicates lost messages.
In a failure scenario the broker still tracks the sequence numbers and thereby avoids duplication.
Note: When the producer restarts, a new PID gets assigned, so idempotence is guaranteed only within a single producer session.
If you are using enable.idempotence=true, you can keep max.in.flight.requests.per.connection as high as 5 and still get the ordering guarantee, which brings better parallelism and improves performance.
The idempotence feature was introduced in Kafka 0.11+. Before that, we could achieve some level of guarantee by combining max.in.flight.requests.per.connection with the retries and acks settings:
max.in.flight.requests.per.connection to 1
retries set to a large number
acks=all
max.in.flight.requests.per.connection=1: makes sure that while messages are being retried, additional messages will not be sent.
This guarantees at-least-once delivery, but comes at a cost in performance and throughput, and that encouraged the introduction of the enable.idempotence feature, which improves performance while at the same time guaranteeing ordering.
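A sketch contrasting the older configuration with the idempotent one; everything apart from the listed settings (bootstrap servers, serializers, and so on) is omitted or a placeholder:

import java.util.Properties;

public class OrderingConfigsSketch {
    public static void main(String[] args) {
        // Pre-0.11 approach: strict ordering at the cost of throughput
        Properties legacy = new Properties();
        legacy.put("acks", "all");
        legacy.put("retries", Integer.MAX_VALUE);                  // example: a large retry count
        legacy.put("max.in.flight.requests.per.connection", "1");  // only one outstanding request at a time

        // Kafka 0.11+ approach: ordering preserved even with several in-flight requests
        Properties idempotent = new Properties();
        idempotent.put("acks", "all");
        idempotent.put("retries", Integer.MAX_VALUE);
        idempotent.put("enable.idempotence", "true");
        idempotent.put("max.in.flight.requests.per.connection", "5");  // may be up to 5

        // Either Properties object would be completed with bootstrap servers and serializers
        // and passed to new KafkaProducer<>(...).
    }
}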
exactly_once: To achieve exactly-once on top of idempotence, we also need to set the consumer isolation level to read_committed; the following parameters are then fixed and cannot be overridden:
isolation.level=read_committed (consumers will always read committed
data only)
enable.idempotence=true (the producer will always have idempotence enabled)
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5 (the producer may have
at most five in-flight requests per connection)
enable.idempotence is a newer setting that was introduced as part of KIP-98 (implemented in Kafka 0.11+). Before it, users would have to set max.in.flight.requests.per.connection to 1.
The way it works (abbreviated) is that producers now put sequence numbers on outgoing produce batches, and brokers keep track of these sequence numbers per producer connected to them. If a broker receives a batch out of order (say batch 3 after 1), it rejects it and expects to see batch 2 (which the producer will retransmit). For complete details you should read KIP-98.
We're now moving one of our services from pushing data through legacy communication tech to Apache Kafka.
The current logic is to send a message to IBM MQ and retry if errors occur. I want to repeat that, but I don't have any idea about what guarantees the broker provide in that scenario.
Let's say I send 100 messages in a batch via producer via Java client library. Assuming it reaches the cluster, is there a possibility only part of it be accepted (e.g. a disk is full, or some partitions I touch in my write are under-replicated)? Can I detect that problem from my producer and retry only those messages that weren't accepted?
I searched for "kafka atomicity guarantee" but came up empty; maybe there's a well-known term for it.
When you say you send 100 messages in one batch, do you mean that you want to control this number of messages, or are you OK with letting the producer batch a certain number of messages and then send the batch?
I'm not sure you can control the number of produced messages in one producer batch; the API will queue them and batch them for you, but without a guarantee that they are all batched together (I'll check that, though).
If you're OK with letting the API batch a certain number of messages for you, here are some clues about how they are acknowledged.
When dealing with the producer, Kafka comes with some reliability guarantees regarding writes (including "batch writes").
As stated in this slideshare post:
https://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign (83)
The original list of messages is partitioned (randomly if the default partitioner is used) based on their destination partitions/topics, i.e. split into smaller batches.
Each post-split batch is sent to the respective leader broker/ISR (the individual send()’s happen sequentially), and each is acked by its respective leader broker according to request.required.acks
So regarding atomicity: I'm not sure the whole batch will be seen as atomic given the behavior above. Maybe you can make sure your batch of messages is sent with the same key for each message, so they all go to the same partition and thus maybe become atomic.
If you need more clarity about acknowledgement rules when producing, here is how it works. As stated here https://docs.confluent.io/current/clients/producer.html :
You can control the durability of messages written to Kafka through the acks setting.
The default value of "1" requires an explicit acknowledgement from the partition leader that the write succeeded.
The strongest guarantee that Kafka provides is with "acks=all", which guarantees that not only did the partition leader accept the write, but it was successfully replicated to all of the in-sync replicas.
You can also look at the producer's enable.idempotence behavior if you aim to have no duplicates while producing.
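If you want to detect per-record failures from the producer, a sketch using the send callback; the topic, keys, values, and retry handling are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CallbackProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");
        props.put("enable.idempotence", "true");  // client-side retries will not introduce duplicates

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("my-topic", "key-" + i, "value-" + i);
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        // This record was not accepted after the client's own retries were exhausted;
                        // you could log it here and re-send it with your own retry policy.
                        System.err.println("Failed: " + record.key() + " -> " + exception.getMessage());
                    }
                });
            }
            producer.flush();  // wait until all outstanding sends have completed or failed
        }
    }
}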
Yannick
We've met a strange problem with flume-kafka-sink: the Kafka broker failed multiple times, and duplicate messages were produced (every 50 records are the same). However, the settings include producer.sinks.r.request.required.acks = 1, and quoting the Kafka documentation, "This option provides the lowest latency but the weakest durability guarantees (some data will be lost when a server fails)". Shouldn't that mean it can't produce duplicate data? Does that mean the problem is caused by Flume or by flume-kafka-sink?
Flume-Kafka-Sink produces messages batch by batch and retries after a failed write. While some brokers are down, some partition leaders can't be reached. When a batch write happens, some partitions succeed but some fail; when Flume-Kafka-Sink retries, the successful part is duplicated.
If I am using the Kafka async producer, assume there are X messages in the buffer.
When they are actually processed on the client, and if the broker or a specific partition is down for some time, the Kafka client will retry. If a message fails, will it mark that specific message as failed and move on to the next message (which could lead to out-of-order messages)? Or will it fail the remaining messages in the batch in order to preserve order?
I need to maintain the ordering, so I would ideally want Kafka to fail the batch from the point where it failed, so I can retry from the failure point. How would I achieve that?
As the Kafka documentation says about retries:
Setting a value greater than zero will cause the client to resend any
record whose send fails with a potentially transient error. Note that
this retry is no different than if the client resent the record upon
receiving the error. Allowing retries will potentially change the
ordering of records because if two records are sent to a single
partition, and the first fails and is retried but the second succeeds,
then the second record may appear first.
So, answering your title question: no, Kafka doesn't have ordering guarantees under async sends.
I am updating the answer based on Peter Davis's question.
I think that if you want to send in batch mode, the only way to secure it would be to set max.in.flight.requests.per.connection=1, but as the documentation says:
Note that if this setting is set to be greater than 1 and there are
failed sends, there is a risk of message re-ordering due to retries
(i.e., if retries are enabled).
Starting with Kafka 0.11.0, there is the enable.idempotence setting, as documented.
enable.idempotence: When set to true, the producer will ensure that
exactly one copy of each message is written in the stream. If false,
producer retries due to broker failures, etc., may write duplicates of
the retried message in the stream. Note that enabling idempotence
requires max.in.flight.requests.per.connection to be less than or
equal to 5, retries to be greater than 0 and acks must be all. If
these values are not explicitly set by the user, suitable values will
be chosen. If incompatible values are set, a ConfigException will be
thrown.
Type: boolean Default: false
This will guarantee that messages are ordered and that no loss occurs for the duration of the producer session. Unfortunately, the producer cannot set the sequence id, so Kafka can make these guarantees only per producer session.
Have a look at Apache Pulsar if you need to set the sequence id, which would allow you to use an external sequence id, which would guarantee ordered and exactly-once messaging across both broker and producer failovers.