Kakfa Producer Message Delivery with acks=all and flush - apache-kafka

Creating Kafka Producer with "acks=all" config.
Is their any significance of calling flush with above config ?
Will it wait for flush to be invoked before being sent to broker.
AS
acks=all This means the leader will wait for the full set of in-sync
replicas to acknowledge the record. This guarantees that the record
will not be lost as long as at least one in-sync replica remains
alive. This is the strongest available guarantee. This is equivalent
to the acks=-1 setting.

As per documentation
flush():
Invoking this method makes all buffered records immediately available
to send (even if linger_ms is greater than 0) and blocks on the
completion of the requests associated with these records. The
post-condition of flush() is that any previously sent record will have
completed (e.g. Future.is_done() == True). A request is considered
completed when either it is successfully acknowledged according to the
‘acks’ configuration for the producer, or it results in an error.
Other threads can continue sending messages while one thread is
blocked waiting for a flush call to complete; however, no guarantee is
made about the completion of messages sent after the flush call
begins.
flush() will still block the client application until all messages are sent even with ack=0. The only thing is that it won't wait for an ack, the block is only until the buffer is sent out.
flush() with ack=all guarantees that the messages have been sent and has been replicated on the cluster with required replication factor.
Finally, to answer your question: Will it wait for flush to be invoked before being sent to broker?
Answer: Not necessarily. The producer keeps sending messages at an interval or by batch size (The buffer.memory controls the total amount of memory available to the producer for buffering). But, it's always good to flush() to make sure you send all messages.
Refer to this link for more information.

Let me first try and call out the distinction between flush() and acks before I get to the 2 questions.
flush() - This is a method to be invoked in the producer to push the messages to the brokers from the buffer (configurable) maintained on the producer side. You would either invoke this method or close() to send the messages through to the brokers from the producer buffer. This gets invoked automatically if the buffer memory available to the producer gets full (as described by Manoj in his answer).
acks=ALL is however a responsibility of the broker i.e. to send an acknowledgement back to the producer after the messages have synchronously replicated to other brokers as per the setting requested in the producer. You would use this setting to tune your message delivery semantics. In this case, as soon as the messages are replicated to the designated in-sync replicas, the broker will send the acknowledgement to the producer saying - "I got your messages".
Now, on your questions i.e. If there is any significance of calling flush with the acks setting and whether or not the producer will wait for flush to be invoked before being sent to the broker.
Well, the asynchronous nature of the producer will ensure that the producer does not wait. If however, you invoke flush() explicitly or if it gets invoked on its own then any further sends will be blocked until the producer gets the acknowledgement from the broker. So, the relationship between these 2 is very subtle.
I hope this helps!

Related

max-delivery-attempts does not work for un-acknowledged messages

I noted strange behavior in Artemins. I'm not sure if this is a bug or if I don't understand something.
I use Artemis Core API. I set autoCommitAcks to false. I noted that If message is received in MessageHandler but message is not acknowledged and session is rollbacked then Artemis does not consider this message as undelivered, Artemis consider this message as not sent to consumer at all. Parameter max-delivery-attempts does not work in this case. Message is redelivered an infinite number of times. Method org.apache.activemq.artemis.api.core.client.ClientMessage#getDeliveryCount returns 1 each time. Message has false value in Redelivered column in web console. If message is acknowledged before session rollback then max-delivery-attempts works properly.
What exactly is the purpose of message acknowledge? Acknowledge means only that message was received or acknowledge means that message was received and processed successfully? Maybe I can use acknowledge in both ways and it only depends on my requirements?
By message acknowledge I mean calling org.apache.activemq.artemis.api.core.client.ClientMessage#acknowledge method.
The behavior you're seeing is expected.
Core clients actually consume messages from a local buffer which is filled with messages from the broker asynchronously. The amount of message data in this local buffer is controlled by the consumerWindowSize set on the client's URL. The broker may dispatch many thousands of messages to various clients that sit in these local buffers and are never actually seen in any capacity by the consumers. These messages are considered to be in delivery and are not available to other clients, but they are not considered to be delivered. Only when a message is acknowledged is it considered to be delivered to a client.
If the client is auto-committing acknowledgements then acknowledging a message will quickly remove it from its respective queue. Once the message is removed from the queue it can no longer be redelivered because it doesn't exist anymore on the broker. In short, you can't get configurable redelivery semantics if you auto-commit acknowledgements.
However, if the client is not auto-committing acknowledgements and the consumer closes (for any reason) without committing the acknowledgements or calls rollback() on its ClientSession then the acknowledged messages will be redelivered according to the configured redelivery semantics (including max-delivery-attempts).

what is the purpose of kafka ack?

kafka consumer saves offsets only when it commits.
Thus , when it rise after crash , it can start from previous offset.
But what is the purpose of ack? in case of crash, ack wouldn't help(if they weren't commited)
Acks = 0
No response is requested from the broker, so if the broker goes offline or an exception happens, we will not know and will lose data. Useful for data where it's fine to potentially lose messages as metric or log collection.
The best performance is this because the producer will not wait for any confirmation.
Acks = 1 (Default)
Leader response is requested, but replication is not guarantee, this happens in background. If an acknowledgement is not received, the producer could retry without duplicate data. If the leader broker goes offline but replicas haven't replicated the data yet, we have data lost.
Acks = all
The leader and the replicas are requested by acknowledgement, this means that the producer has to wait to receive confirmation of any replica of the broker before to continue, this add latency but ensures safety, this setting ensure no lose data.

Kafka Consumer - continue calling poll() while paused?

I read the docs on using the pause and resume methods for a kafka consumer, and they seem easy enough to implement. However, do I need another thread to continue calling the poll() method while paused to meet the heartbeat requirements and not trigger a rebalance?
My consumer is running SQL scripts after polling the topic and depending the messages returned, the scripts may take longer than the current session.timeout.ms interval (we have increased this value, but the length of time for the scripts to run can vary quiet a bit and regardless of the interval we will exceed it at times). I also want to avoid a rebalance as safe ordering and data integrity are more important than throughput and error detention.
From version 0.10.1.0 heartbeat is sent via a separate thread so pausing your process thread wouldn't affect heartbeat thread.
You can check this for more information.
yes, you need to continue calling poll() on the consumer, even if you pause all partitions, or it will be kicked out of any consumer group its a member of and its assigned partitions will transfer to another consumer. as to which thread ends up calling poll - that doesnt matter (so long as only a single thread interacts with the consumer at a time)
quoting from kip-62:
max.poll.interval.ms. This config sets the maximum delay between client calls to poll(). When the timeout expires, the consumer will stop sending heartbeats and send an explicit LeaveGroup request.

What is the impact of acknowledgement modes in asynchronous send in kafka?

Does acknowledgement modes ('0','1','all') have any impact, when we send messages using asynchronous send() call?
I have tried measuring the send call latency (that is, by recording time before and after call to send() method) and observed that both (asynchronous send, acks=1) and (asynchronous send, acks=all) took same time.
However, there is a clear difference in throughput numbers.
producer.send(record, new ProducerCallback(record));
I thought send call will be blocked when we use acks=all even in asynchronous mode. Can someone explain how acknowledgement modes ('0','1','all') work in asynchronous mode?
According to the docs:
public Future send(ProducerRecord record,
Callback callback)
Asynchronously send a record to a topic and invoke the provided
callback when the send has been acknowledged. The send is asynchronous
and this method will return immediately once the record has been
stored in the buffer of records waiting to be sent. This allows
sending many records in parallel without blocking to wait for the
response after each one.
So, one thing is certain that the asynchronous "send" does not really care about what the "acks" config is. All it does is push the message into a buffer. Once this buffer starts getting processed (controlled by linger.ms and batch.size properties), then "acks" is checked.
If acks=0 -> Just fire and forget. Producer won't wait for an acknowledgement.
If acks=1-> Acknowledgement is sent by the broker when message is successfully written on the leader.
If acks=all -> Acknowledgement is sent by the broker when message is successfully written on all replicas.
In case of 1 and all, this becomes a blocking call as the producer will wait for an acknowledgement but, you might not notice this as it happens on a parallel thread. In case of acks=all, it is expected that it will take a little longer for the ack to arrive than acks=1 (network latency and number of replicas being the obvious reasons).
Further, you should configure "retries" property in your async-producer so that, if an acknowledgement fails (due to any reason e.g. packet corrupted/lost), the producer knows how many times it should try again to send the message (increase the guarantee of delivery).
Lastly: "However, there is a clear difference in throughput numbers." -- That's true because of the acknowledgement latency from the broker to the producer thread.
Hope that helps! :)

Why is the kafka consumer consuming the same message hundreds of times?

I see from the logs that exact same message is consumed by the 665 times. Why does this happen?
I also see this in the logs
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies
that the poll loop is spending too much time message processing. You can address this either by increasing the session
timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
Consumer properties
group.id=someGroupId
bootstrap.servers=kafka:9092
enable.auto.commit=false
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
session.timeout.ms=30000
max.poll.records=20
PS: Is it possible to consume only a specific number of messages like 10 or 50 or 100 messages from the 1000 that are in the queue?
I was looking at 'fetch.max.bytes' config, but it seems like it is for a message size rather than number of messages.
Thanks
The answer lies in the understanding of the following concepts:
session.timeout.ms
heartbeats
max.poll.interval.ms
In your case, your consumer receives a message via poll() but is not able to complete the processing in max.poll.interval.ms time. Therefore, it is assumed hung by the Broker and re-balancing of partitions happen due to which this consumer loses the ownership of all partitions. It is marked dead and is no longer part of a consumer group.
Then when your consumer completes the processing and calls poll() again two things happen:
Commit fails as the consumer no longer owns the partitions.
Broker identifies that the consumer is up again and therefore a re-balance is triggered and the consumer again joins the Consumer Group, start owning partitions and request messages from the Broker. Since the earlier message was not marked as committed (refer #1 above, failed commit) and is pending processing, the broker delivers the same message to consumer again.
Consumer again takes a lot of time to process and since is unable to finish processing in less than max.poll.interval.ms, 1. and 2. keep repeating in a loop.
To fix the problem, you can increase the max.poll.interval.ms to a large enough value based on how much time your consumer needs for processing. Then your consumer will not get marked as dead and will not receive duplicate messages.
However, the real fix is to check your processing logic and try to reduce the processing time.
The fix is described in the message you pasted:
You can address this either by increasing the session timeout or by
reducing the maximum size of batches returned in poll() with
max.poll.records.
The reason is a timeout is reached before your consumer is able to process and commit the message. When your Kafka consumer "commits", it's basically acknowledging receipt of the previous message, advancing the offset, and therefore moving onto the next message. But if that timeout is passed (as is the case for you), the consumer's commit isn't effective because it's happening too late; then the next time the consumer asks for a message, it's given the same message
Some of your options are to:
Increase session.timeout.ms=30000, so the consumer has more time
process the messages
Decrease the max.poll.records=20 so the consumer has less messages it'll need to work on before the timeout occurs. But this doesn't really apply to you because your consumer is already only just working on a single message
Or turn on enable.auto.commit, which probably also isn't the best solution for you because it might result in dropping messages though, as mentioned below:
If we allowed offsets to auto commit as in the previous example
messages would be considered consumed after they were given out by the
consumer, and it would be possible that our process could fail after
we have read messages into our in-memory buffer but before they had
been inserted into the database.
Source: https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html