Does a Kafka Consumer receive a list of offsets first, before receiving the bytes/data? - apache-kafka

I'm quite new to Apache Kafka and I'm currently reading Learning Apache Kafka, 2ed, (2015). Chapter 3, paragraph Kafka Design fundamentals says the following:
Consumers always consume messages from a particular partition sequentially and also acknowledge the message offset. This acknowledgement implies that the consumer has consumed all prior messages. Consumers issue an asynchronous pull request containing the offset of the message to be consumed to the broker and get the buffer of bytes.
I'm a bit thrown off by the word 'acknowledge'. Do I understand it correctly that Kafka sends the offset first and then the consumer uses the list of offsets to pull request the data it has not consumed yet?
Thanks in advance,
Nick

On startup, KafkaConsumer issues a offset lookup request to the brokers for the specific consumer group that was configured on this consumer. If valid offsets are returned those are used. Otherwise, the consumer uses an initial offset according to auto.offset.reset parameter.
Afterwards, offsets are maintained mainly in-memory within the consumer. Each poll() sends the current offset to the broker and on broker reply consumer updates the in-memory offsets.
Additionally, in-memory offset are committed/acked to the broker from time to time. This can happen automatically within poll() if auto commit is enabled, or commit() must be called explicitly to send offsets to the broker for reliably storing them.

Related

If my service consumes Kafka messages, can kafka somehow lose my offsets?

If I have a service that connects to kafka as a message consumer, and every message I read I send a commit to that message offset, so that if my service shutsdown and restarts it will start reading from the last read message onwards. My understanding is that the committed offset will be maintained by kafka.
Now my question is, do I have to worry about the offset? Can kafka somehow lose that information and when the service restarts start reading messages from the beginning of the topic or the end of it depending on my initial offset config? Or if kafka loses my offset it will also have lost all messages in the topic so that it is alright to read from the beginning?
Note: I use spring-kafka on the service, but not sure if that is relevant to the question.
In most cases where you have an active consumer (with manual or auto-committing), you don't need to worry about it.
The cases where you do need to consider the behavior of auto.offset.reset setting is when the offsets.retention.minutes time on the broker has elapsed while your consumer group(s) are inactive. When this happens, Kafka compacts the __consumer_offsets topic and removes any offsets stored for those inactive groups
Losing offsets doesn't affect the source topic. Your client topic(s) have their own independent retention settings, and its message can be removed as well (or not), depending on how you've configured it.

Consume Kafka Message using poll mode

I am new to Kafka, and I am using Kafka 1.0.
I read the kafka messages using pull mode, that is, I periodically poll()ing the Kafka topic for new messages, but I didn't write the offset back to Kafka.
I would ask how kafka knows that which offsets I have consumed or what is the mechanism that Kafka remembers the progress(Kafka offset)
Every consumer group maintains its offset per topic partitions. Since v0.9 the information of committed offsets for every consumer group is stored in an internal topic called (by default) __consumer_offsets (prior to v0.9 this information was stored on Zookeeper). When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. Finally, the offset manager will send a successful offset commit response to the consumer, only when all the replicas of the offsets topic receive the offsets.

How does Kafka guarantee consumers doesn't read a single message twice?

How does Kafka guarantee consumers doesn't read a single message twice?
Or is the above scenario possible?
Could the same message be read twice by single or by multiple consumers?
There are many scenarios which cause Consumer to consume the duplicate message
Producer published the message successfully but failed to acknowledge which cause to retry the same message
Producer publishing a batch of the message but failed partially published messages. In that case, it will retry and resent the same batch again which will cause duplicate
Consumers receive a batch of messages from Kafka and manually commit their offset (enable.auto.commit=false).
If consumers failed before committing to Kafka, next time Consumers will consume the same records again which reproduce duplicate on the consumer side.
To guarantee not to consume duplicate messages the job's execution and the committing offset must be atomic to guarantee exactly-once delivery semantic at the consumer side.
You can use the below parameter to achieve exactly one semantic. But please you have understood this comes with a compromise with performance.
enable idempotence on the producer side which will guarantee not to publish the same message twice
enable.idempotence=true
Defined Transaction (isolation.level) is read_committed
isolation.level=read_committed
In Kafka Stream above setting can be achieved by setting Exactly-Once
semantic true to make it as unit transaction
Idempotent
Idempotent delivery enables producers to write messages to Kafka exactly once to a particular partition of a topic during the lifetime of a single producer without data loss and order per partition.
Transaction (isolation.level)
Transactions give us the ability to atomically update data in multiple topic partitions. All the records included in a transaction will be successfully saved, or none of them will be. It allows you to commit your consumer offsets in the same transaction along with the data you have processed, thereby allowing end-to-end exactly-once semantics.
The producer doesn't wait to write a message to Kafka whereas the Producer uses beginTransaction, commitTransaction, and abortTransaction(in case of failure) Consumer uses isolation. level either read_committed or read_uncommitted
read_committed: Consumers will always read committed data only.
read_uncommitted: Read all messages in offset order without waiting
for transactions to be committed
Please refer more in detail refrence
It is absolutely possible if you don't make your consume process idempotent.
For example; you are implementing at-least-one delivery semantic and firstly process messages and then commit offsets. It is possible to couldn't commit offsets because of server failure or rebalance. (maybe your consumer is revoked at that time) So when you poll you will get same messages twice.
To be precise, this is what Kafka guarantees:
Kafka provides order guarantee of messages in a partition
Produced messages are considered "committed" when they were written to the partition on all its in-sync replicas
Messages that are committed will not be losts as long as at least one replica remains alive
Consumers can only read messages that are committed
Regarding consuming messages, the consumers keep track of their progress in a partition by saving the last offset read in an internal compacted Kafka topic.
Kafka consumers can automatically commit the offset if enable.auto.commit is enabled. However, that will give "at most once" semantics. Hence, usually the flag is disabled and the developer commits the offset explicitly once the processing is complete.

Can we have retention period of zero in Kafka broker?

Does retention period of zero makes sense in kafka borker?
We want to quickly forward message from producer to consumer via kafka broker. From buffercache/pagecache on broker machine without flushing to disk. We do not need replication and assume our broker will never crash.
When a message is produced to a Kafka topic it is written to the disk. Once the message has been consumed, the offset of this message is committed by the consumer (if you are using the high-level consumer API) however, there is no functionality that deletes only the messages that have been consumed (many consumers may subscribe to the same topic and some of them might have consumed that message while some others might have not).
What I would suggest in your case is to set a short retention period (which by default is set to 7 days) but allow a reasonable amount of time in order to allow your consumer to consume the messages. To do this, you simply need to configure the following parameter in server.properties:
log.retention.ms=X
Note that there is no guarantee that the deleted message(s) have been successfully consumed by your consumer(s). For example, if you set the retention period to 2 seconds (i.e. log.retention.ms=2000) and your consumer crashes, then every message which is sent to the topic while the consumer is down will be lost.

Should Kafka consumers be started before producers?

When I have a kafka console producer message produce some messages and then start a consumer, I am not getting the messages.
However i am receiving message produced by the producer after a consumer has been started.Should Kafka consumers be started before producers?
--from- beginning seems to give all messages including ones that are consumed.
Please help me with this on both console level and java client example for starting producer first and consuming by starting a consumer.
Kafka stores messages for a configurable amount of time. Default is a week. Consumers do not need to be "available" to receive messages, but they do need to know where they should start reading from
The console consumer has the default option of looking at the latest offset for all partitions. So if you're not actively producing data you see nothing as a consumer. You can specify a group flag for the console consumer or a Java client, and that's what tracks what offsets are read within the Kafka protocol and where a read request will resume from if you stopped that consumer in a group
Otherwise, I think you can only give an offset along with a single partition to consume from