This is my understanding of sending messages to Kafka asynchronously and implementing a callback function is:
Scenario: Producer is going to receive 4 batches of messages to send (to same partition, for the sake of simplicity).
Producer sends batch A.
Producer sends batch B.
Producers receives reply from broker and implements call back - batch A was unsuccessful and retriable, batch B was successful, so producer sends batch A again.
Won't this disturb the message ordering as now A is received by Kafka after B?
If you need message ordering within partition you can use idempotent producer:
enable.idempotence=true
akcs=all
max.in.flight.requests.per.connection<=5
retries>0
This will resolve potential duplicates from producer and maintain the ordering.
If you don't want to use idempotent producer then it is enough to set max.in.flight.requests.per.connection=1. This is the number of unacknowledged batches on the producer side. It means that batch B will not be sent before acknowledge for A is received.
Related
Imagine we need to send a sequence of the messages to Kafka using Spring Boot KafkaTemplete. We use a message loop like this:
list.forEach(value -> kafkaTemplate.send(topic, value));
Kafka producer is configured with acks=1 to ensure every message is received by the broker (let it be for the sake of simplicity that there is only one broker).
Imagine that on the middle of the loop, starting from nth message, for some reason the broker start failing to receive messages (stop sending success acknowledgements to the producer). Later, via the k iteration over the loop when sending n+kth message, the broker gets healthy and is able again to receive messages.
How can KafkaTemplete handle such a situation? Will Kafka producer remember that n to n+k-1 messages were not sent successfully and will buffer them in the memory, and somehow will implicitly try to send them before sending n+k messages? Or the unsent messages will be lost?
I have a use case where I want to implement synchronous request / response on top of kafka. For example when the user sends an HTTP request, I want to produce a message on a specific kafka input topic that triggers a dataflow eventually resulting in a response produced on an output topic. I want then to consume the message from the output topic and return the response to the caller.
The workflow is:
HTTP Request -> produce message on input topic -> (consume message from input topic -> app logic -> produce message on output topic) -> consume message from output topic -> HTTP Response.
To implement this case, upon receiving the first HTTP request I want to be able to create on the fly a consumer that will consume from the output topic, before producing a message on the input topic. Otherwise there is a possibility that messages on the output topic are "lost". Consumers in my case have a random group.id and have auto.offset.reset = latest for application reasons.
My question is how I can make sure that the consumer is ready before producing messages. I make sure that I call SubscribeTopics before producing messages. but in my tests so far when there are no committed offsets and kafka is resetting offsets to latest, there is a possibility that messages are lost and never read by my consumer because kafka sometimes thinks that the consumer registered after the messages have been produced.
My workaround so far is to sleep for a bit after I create the consumer to allow kafka to proceed with the commit reset workflow before I produce messages.
I have also tried to implement logic in a rebalance call back (triggered by a consumer subscribing to a topic), in which I am calling assign with offset = latest for the topic partition, but this doesn't seem to have fixed my issue.
Hopefully there is a better solution out there than sleep.
Most HTTP client libraries have an implicit timeout. There's no guarantee your consumer will ever consume an event or that a downstream producer will send data to the "response topic".
Instead, have your initial request immediately return 201 Accepted status (or 400, for example, if you do request validation) with some tracking ID. Then require polling GET requests by-id for status updates either with 404 status or 200 + some status field within the request body.
You'll need a database to store intermediate state.
When a consumer drops from a group and a rebalance is triggered, I understand no messages are consumed -
But does an in-flight request for messages stay queued passed the max wait time?
Or does Kafka send any payload back during the rebalance?
UPDATE
For clarification, I'm referring specifically to the consumer polling process.
From my understanding, when one of the consumers drop from the consumer group, a rebalance of the partitions to consumers is performed.
During the rebalance, will an error be sent back to the consumer if it's already polled and waiting for max time to pass?
Or does Kafka wait the max time and send an empty payload?
Or does Kafka queue the request passed max wait time until the rebalance is complete?
Bottom line - I'm trying to explain periodic timeouts from consumers.
This may be in the docs, but I'm not sure where to find it.
Kafka producers doesn't directly send messages to their consumers, rather they send them to the brokers.
The inflight requests corresponds to the producer and not to the consumer.
Whether the consumer leaves a group and a rebalance is triggered or not is quite immaterial to the behaviour of the producer.
Producer messages are queued in the buffer, batched, optionally compressed and sent to the Kafka broker as per the configuration.
In-flight requests are the maximum number of unacknowledged requests
the client will send on a single connection before blocking.
Note that when we say ack, it is acknowledgement by the broker and not by the consumer.
Does Kafka send any payload back during the rebalance?
Kafka broker doesn't notify of any rebalance to its producers.
I was going through document what I understood we can achieve exactly-once transaction with enabling idempotence=true
idempotence: The Idempotent producer enables exactly once for a
producer against a single topic. Basically each single message send
has stonger guarantees and will not be duplicated in case there's an
error
So if already we have idempotence then why we need another property exactly-once in Kafka Stream? What exactly different between idempotence vs exactly-once
Why exactly-once property not available in normal Kafka Producer?
In a distributed environment failure is a very common scenario that can be happened any time. In the Kafka environment, the broker can crash, network failure, failure in processing, failure while publishing message or failure to consume messages, etc.
These different scenarios introduced different kinds of data loss and duplication.
Failure scenarios
A(Ack Failed): Producer published message successfully with retry>1 but could not receive acknowledge due to failure. In that case, the Producer will retry the same message that might introduce duplicate.
B(Producer process failed in batch messages): Producer sending a batch of messages it failed with few published success. In that case and once the producer will restart it will again republish all messages from the batch which will introduce duplicate in Kafka.
C(Fire & Forget Failed) Producer published message with retry=0(fire and forget). In case of failure published will not aware and send the next message this will cause the message lost.
D(Consumer failed in batch message) A consumer receives a batch of messages from Kafka and manually commit their offset (enable.auto.commit=false). If consumers failed before committing to Kafka, next time Consumers will consume the same records again which reproduce duplicate on the consumer side.
Exactly-Once semantics
In this case, even if a producer tries to resend a message, it leads
to the message will be published and consumed by consumers exactly once.
To achieve Exactly-Once semantic in Kafka, it uses below 3 property
enable.idempotence=true (address a, b & c)
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5(Producer will always have one in-flight request per connection)
isolation.level=read_committed (address d )
Enable Idempotent(enable.idempotence=true)
Idempotent delivery enables the producer to write a message to Kafka exactly
once to a particular partition of a topic during the lifetime of a
single producer without data loss and order per partition.
"Note that enabling idempotence requires MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION to be less than or equal to 5, RETRIES_CONFIG to be greater than 0 and ACKS_CONFIG be 'all'. If these values are not explicitly set by the user, suitable values will be chosen. If incompatible values are set, a ConfigException will be thrown"
To achieve idempotence Kafka uses a unique id which is called product id or PID and sequence number while producing messages. The producer keeps incrementing the sequence number on each message published which map with unique PID. The broker always compare the current sequence number with the previous one and it rejects if the new one is not +1 greater than the previous one which avoids duplication and same time if more than greater show lost in messages
In a failure scenario broker will compare the sequence numbers with the previous one and if the sequence not increased +1 will reject the message.
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
The producer doesn't wait to write a message to Kafka whereas the Producer uses beginTransaction, commitTransaction, and abortTransaction(in case of failure)
Consumer uses isolation.level either read_committed or read_uncommitted
read_committed: Consumers will always read committed data only.
read_uncommitted: Read all messages in offset order without waiting
for transactions to be committed
If a consumer with isolation.level=read_committed reaches a control message for a transaction that has not completed, it will not deliver any more messages from this partition until the producer commits or aborts the transaction or a transaction timeout occurs. The transaction timeout is determined by the producer using the configuration transaction.timeout.ms(default 1 minute).
Exactly-Once in Producer & Consumer
In normal conditions where we have separate producers and consumers. The producer has to idempotent and same time manage transactions so consumers can use isolation.level to read-only read_committed to make the whole process as an atomic operation.
This makes a guarantee that the producer will always sync with the source system. Even producer crash or a transaction aborted, it always is consistent and publishes a message or batch of the message as a unit once.
The same consumer will either receive a message or batch of the message as a unit once.
In Exactly-Once semantic Producer along with Consumer will appear as
atomic operation which will operate as one unit. Either publish and
get consumed once at all or aborted.
Exactly Once in Kafka Stream
Kafka Stream consumes messages from topic A, process and publish a message to Topic B and once publish use commit(commit mostly run undercover) to flush all state store data to disk.
Exactly-once in Kafka Stream is a read-process-write pattern that guarantees that this operation will be treated as an atomic operation. Since Kafka Stream caters producer, consumer and transaction all together Kafka Stream comes special parameter processing.guarantee which could exactly_once or at_least_once which make life easy not to handle all parameters separately.
Kafka Streams atomically updates consumer offsets, local state stores,
state store changelog topics, and production to output topics all
together. If anyone of these steps fails, all of the changes are
rolled back.
processing.guarantee: exactly_once automatically provide below parameters you no need to set explicitly
isolation.level=read_committed
enable.idempotence=true
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5
Kafka stream offers the exactly-once semantic from the end-to-end point of view (consumes from one topic, processes that message, then produces to another topic). However, you mentioned only the producer's idempotent attribute. That is only a small part of the full picture.
Let me rephrase the question:
Why do we need the exactly-once delivery semantic at the consumer side
while we already have guaranteed the exactly-once delivery semantic at the
producer side?
Answer: Since the exactly-once delivery semantic is not only at the producing step but the full flow of processing. To achieve the exactly-once delivery semantically, there are some conditions must be satisfied with the producing and consuming.
This is the generic scenario: Process A produces messages to the topic T. At the same time, process B tries to consume messages from the topic T. We want to make sure process B never processes one message twice.
Producer part: We must make sure that producers never produce a message twice. We can use Kafka Idempotent Producer
Consumer part:
Here is the basic workflow for the consumer:
Step 1: The consumer pulls the message M successfully from the Kafka's topic.
Step 2: The consumer tries to execute the job and the job returns successfully.
Step 3: The consumer commits the message's offset to the Kafka brokers.
The above steps are just a happy path. There are many issues arises in reality.
Scenario 1: The job on step 2 executes successfully but then the consumer is crashed. Since this unexpected circumstance, the consumer has not committed the message's offset yet. When the consumer restarts, the message will be consumed twice.
Scenario 2: While the consumer commits the offset at step 3, it crashes due to hardware failures (e.g: CPU, memory violation, ...) When restarting, the consumer no way to know it has committed the offset successfully or not.
Because there are many problems might be happened, the job's execution and the committing offset must be atomic to guarantee exactly-once delivery semantic at the consumer side. It doesn't mean we cannot but it takes a lot of effort to make sure the exactly-once delivery semantic. Kafka Stream upholds the work for engineers.
Noted that: Kafka Stream offers "exactly-once stream processing". It refers to consuming from a topic, materializing intermediate state in a Kafka topic and producing to one. If our application depends on some other external services (database, services...), we must make sure our external dependencies can guarantee exactly-once in those cases.
TL,DR: exactly-once for the full flow needs the cooperation between producers and consumers.
References:
Exactly-once semantics and how Apache Kafka does it
Transactions in Apache Kafka
Enabling exactly once Kafka streams
Ok so I understand that you only get order guarantee per partition.
Just random thought/question.
Assuming that the partition strategy is correct and the messages are grouped correctly to the proper partition (or even say we are using 1 partition)
I suppose that the producing application must send each message 1 by 1 to kafka and make sure that each message has been acked before sending the next one right?
Yes, you are correct that the order the producing application sends the message dictates the order they are stored in the partition.
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
http://kafka.apache.org/documentation.html#intro_guarantees
However, if you have multiple messages in flight simultaneously I am not sure how order is determined.
You might want to think about the acks config for your producer as well. There are failure conditions where a message may be missing if the leader goes down after M1 is published and a new leader receives M2. In this case you won't have an out of order condition, but a missing message so it's slightly orthogonal to your original question but something to consider if message guarantees and order are critical to your application.
http://kafka.apache.org/documentation.html#producerconfigs
Overall, designing a system where small differences in order are not that important can really simplify things.
sync send message one by one(definitely slow!),
or async send message in batch with max.in.flight.requests.per.connection = 1
Yes, the Producer should be single threaded. If one uses multiple Producer threads to produce to the same partition, ordering guarantee on the Consumer will still be lost.So, ordering guarantee on the same partition implicitly also means a single Producer thread.
There are two strategies for sending messages in kafka : synchronous and asynchronous.
For synchronous type, it is intuitively that a producer send message one by one to the target partition, thus the message order is guaranteed.
For asynchronous type, messages are send using batching method, that is to say, if M1 is send prior to M2, then M1 is accumulated in the memory first, then the same with M2. So When producer sends batches of messages in a single request, the messages order thus will be guaranteed.