I am new to Kafka, and I am using Kafka 1.0.
I read the Kafka messages in pull mode, that is, I periodically poll() the Kafka topic for new messages, but I don't write the offset back to Kafka.
How does Kafka know which offsets I have consumed? What is the mechanism by which Kafka remembers the progress (the Kafka offset)?
Every consumer group maintains its offsets per topic partition. Since v0.9 the committed offsets for every consumer group are stored in an internal topic called (by default) __consumer_offsets (prior to v0.9 this information was stored in ZooKeeper). When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. The offset manager sends a successful offset commit response to the consumer only once all the replicas of the offsets topic have received the offsets.
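To make the commit path concrete, here is a toy model (not the broker's actual code) of how commits appended to the compacted __consumer_offsets topic resolve to one offset per (group, topic, partition) key: compaction keeps only the latest record for each key. All names and numbers are illustrative.

```python
offsets_log = []  # the internal topic: an append-only list of (key, value)

def commit(group, topic, partition, offset):
    """Append an OffsetCommitRequest-style record to the log."""
    offsets_log.append(((group, topic, partition), offset))

def compacted_view(log):
    """Log compaction: keep only the most recent value per key."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return latest

commit("my-group", "orders", 0, 100)
commit("my-group", "orders", 0, 250)   # later commit for the same key
commit("other-group", "orders", 0, 42)

view = compacted_view(offsets_log)
print(view[("my-group", "orders", 0)])  # 250 — only the latest commit survives
```

Because commits are just keyed appends, the broker never updates records in place; compaction gives the "one current offset per key" behavior.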
Earlier Kafka used to store consumer offsets in ZooKeeper, but since v0.9 Kafka stores consumer offsets in an internal topic.
As stated in this post:
Kafka brokers use an internal topic named __consumer_offsets that keeps track of what messages a given consumer group last successfully processed. As we know, each message in a Kafka topic has a partition ID and an offset ID attached to it.
But a topic is not like a DB table, which can be queried for data based on some input. So I am wondering how this is efficient at all, and how exactly Kafka retrieves the offsets for a particular partition for a particular consumer group.
With Kafka Streams or an in-memory hashtable, a compacted topic behaves very much like a key-value database store.
The consumer offsets topic is a compacted topic, keyed by group name. When you set a group.id in the client, the Group Coordinator broker can easily look up that name in the topic, by key, and return all currently committed offsets for all partitions of the group. The consumer then looks up the offsets for its assigned partitions in the returned map.
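A small sketch of that lookup, assuming the compacted view of the offsets topic has already been materialized as a dictionary (all names here are illustrative, not the real broker API): filter by group, then pick out the assigned partitions.

```python
# Materialized compacted view of __consumer_offsets (illustrative data):
compacted_offsets = {
    ("my-group", "orders", 0): 250,
    ("my-group", "orders", 1): 180,
    ("my-group", "payments", 0): 90,
}

def offsets_for_group(group_id):
    """What the coordinator returns: all committed offsets for one group."""
    return {
        (topic, partition): offset
        for (group, topic, partition), offset in compacted_offsets.items()
        if group == group_id
    }

# The consumer keeps only its assigned partitions from the returned map:
assigned = [("orders", 0), ("orders", 1)]
group_offsets = offsets_for_group("my-group")
resume_points = {tp: group_offsets[tp] for tp in assigned}
print(resume_points)  # {('orders', 0): 250, ('orders', 1): 180}
```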
It's not a question of "better". Removing the ZooKeeper dependency was always the goal, and it is finally production-ready (KRaft mode) as of Kafka 3.3.1.
I have a consumer polling from a subscribed topic. It consumes each message, does some processing (within seconds), pushes to a different topic, and commits the offset.
There are 5000 messages in total:
before restart - consumed 2900 messages and committed the offset
after restart - started consuming from offset 0.
Even though the consumer is created with the same consumer group, it started processing messages from offset 0.
kafka version (strimzi) > 2.0.0
kafka-python == 2.0.1
We don't know how many partitions your topic has, but when consumers are created within the same consumer group, they consume records from different partitions (two consumers in one consumer group can't consume from the same partition, and if you add a consumer, the group coordinator runs a rebalance to reassign each consumer to a specific partition).
I think the offset 0 comes from the property auto.offset.reset, which can be:
latest: start at the latest offset in the log.
earliest: start at the earliest record in the log.
none: throw an exception when there is no existing offset data.
But this property kicks in only if your consumer group doesn't have a valid offset committed.
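The resolution order described above can be sketched as a small function (illustrative, not the client's real code): a valid committed offset always wins, and only in its absence does the policy decide.

```python
def starting_offset(committed, policy, earliest, latest):
    """Model of auto.offset.reset: consulted only when no valid committed
    offset exists for the partition."""
    if committed is not None:
        return committed            # a valid committed offset always wins
    if policy == "earliest":
        return earliest
    if policy == "latest":
        return latest
    raise RuntimeError("no committed offset and auto.offset.reset=none")

print(starting_offset(2900, "latest", 0, 5000))    # 2900 — commit is honored
print(starting_offset(None, "earliest", 0, 5000))  # 0 — the question's symptom
```

Restarting at 0 despite auto.offset.reset therefore suggests the commit never reached the broker (or the group's offsets expired), not that the policy was ignored.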
N.B.: Records in a topic have a retention period (the log.retention.ms property), so your earliest messages could be deleted while you are still processing the first records in the log.
Question: since you consume messages from one topic, process the data, and write them to another topic, why didn't you use Kafka Streams?
I know that every message in a Kafka topic partition has its own offset number, and that the partition maintains the sequence of offsets.
But if I have a Kafka consumer group (or a single Kafka consumer) reading a particular topic partition, how does it keep track of the offset up to which messages have been read, and who maintains this offset counter?
If the consumer goes down, how will a new consumer start reading from the next unread (or unacknowledged) offset?
The information about consumer groups is all stored in the internal Kafka topic __consumer_offsets. Whenever a new group tries to read data from a topic, it checks its offset position in that internal topic, which has its cleanup policy set to compact. Compaction keeps this topic small.
Kafka comes with a command line tool kafka-consumer-groups.sh that helps you understand which information is stored for each consumer group.
More information is given in the Kafka Documentation on offset tracking.
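The per-group, per-partition information that kafka-consumer-groups.sh reports can be sketched roughly like this (a toy model with made-up numbers, not the tool's real output): for each partition it shows the current committed offset, the log-end offset, and their difference, the lag.

```python
def describe(group_state):
    """Model of a --describe report: one row per partition with
    current offset, log-end offset, and lag (log-end minus current)."""
    rows = []
    for (topic, partition), (current, log_end) in group_state.items():
        rows.append((topic, partition, current, log_end, log_end - current))
    return rows

# Illustrative state: (committed offset, log-end offset) per partition
state = {("orders", 0): (250, 300), ("orders", 1): (180, 180)}
for topic, part, cur, end, lag in describe(state):
    print(f"{topic}-{part}: current={cur} log-end={end} lag={lag}")
```

Lag is the most useful column in practice: a growing lag means the group is falling behind the producers on that partition.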
I have a use case where I have two consumers in different consumer groups (cg1 and cg2) subscribing to the same topic (Topic A) with 4 partitions.
What happens if both consumers are reading from the same partition and one of them fails while the other one commits the offset?
In Kafka, offset management is done per consumer group per partition.
If you have two consumer groups reading the same topic, and even the same partition, a commit from one consumer group has no impact on the other consumer group. The consumer groups are completely decoupled.
Within a consumer group, a topic partition is assigned to at most one consumer, so two consumers of the same group never share a partition. A single consumer can, however, be assigned multiple partitions of a topic.
For example, consumer 1 of consumer group 1 might read partitions 0 and 1 while consumer 2 reads partitions 2 and 3.
Offset management is no longer done by ZooKeeper; committed offsets are stored in the internal offsets topic.
__consumer_offsets: every consumer group maintains its offsets per topic partition. Since v0.9 the committed offsets for every consumer group are stored in this internal topic (prior to v0.9 they were stored in ZooKeeper).
When the offset manager receives an OffsetCommitRequest, it appends the request to this special compacted topic, and it sends a successful offset commit response to the consumer only once all replicas of the offsets topic have received the offsets.
Two consumers from two different consumer groups (cg1 and cg2) can read data from the same topic simultaneously.
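The decoupling of the two groups follows directly from the keying: committed offsets are stored per (group, topic, partition), so cg1 and cg2 can commit independently for the very same partition. A minimal sketch (illustrative names and numbers):

```python
committed = {}  # (group, topic, partition) -> committed offset

def commit(group, topic, partition, offset):
    """Each group writes under its own key; other groups are untouched."""
    committed[(group, topic, partition)] = offset

commit("cg1", "topic-a", 2, 500)
commit("cg2", "topic-a", 2, 120)   # same partition, different group
commit("cg1", "topic-a", 2, 510)   # cg1 moves on; cg2's key is unaffected

print(committed[("cg1", "topic-a", 2)])  # 510
print(committed[("cg2", "topic-a", 2)])  # 120
```

So if the cg1 consumer fails, cg2 keeps reading and committing from its own position as if nothing happened.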
In older Kafka versions (before 0.9), offset management was handled by ZooKeeper.
In newer versions, each consumer group's offsets are stored in the __consumer_offsets topic.
Offsets keep track of each consumer's progress (how many records it has consumed). Say consumer-1 has consumed 10 records and consumer-2 has consumed 20 records, and consumer-1 suddenly dies: when consumer-1 comes back up, it will resume reading from the 11th record onward.
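A toy walk-through of that resume behavior (illustrative code, not the client API): Kafka's convention is to commit the next offset to read, so a consumer that committed after 10 records resumes at offset 10, i.e. the 11th record.

```python
records = list(range(100))   # a partition with offsets 0..99
committed_offset = None

def consume(n, start):
    """Read a batch and commit the next offset to read."""
    global committed_offset
    batch = records[start:start + n]
    committed_offset = start + len(batch)
    return batch

consume(10, 0)               # consumer-1 reads offsets 0..9, commits 10
# ... consumer-1 crashes here; on restart it fetches the committed offset
resume_at = committed_offset
print(resume_at)             # 10 — the 11th record, nothing re-read or skipped
```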
I'm quite new to Apache Kafka and I'm currently reading Learning Apache Kafka, 2nd ed. (2015). Chapter 3, in the paragraph "Kafka Design fundamentals", says the following:
Consumers always consume messages from a particular partition sequentially and also acknowledge the message offset. This acknowledgement implies that the consumer has consumed all prior messages. Consumers issue an asynchronous pull request containing the offset of the message to be consumed to the broker and get the buffer of bytes.
I'm a bit thrown off by the word 'acknowledge'. Do I understand it correctly that Kafka sends the offset first and then the consumer uses the list of offsets to pull request the data it has not consumed yet?
Thanks in advance,
Nick
On startup, the KafkaConsumer issues an offset lookup request to the brokers for the specific consumer group configured on this consumer. If valid offsets are returned, those are used. Otherwise, the consumer uses an initial offset according to the auto.offset.reset parameter.
Afterwards, offsets are maintained mainly in memory within the consumer. Each poll() sends the current offset to the broker, and on the broker's reply the consumer updates its in-memory offsets.
Additionally, the in-memory offsets are committed/acked to the broker from time to time. This happens automatically within poll() if auto-commit is enabled; otherwise commit() must be called explicitly to send the offsets to the broker for reliable storage.
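The lifecycle above can be condensed into a small model (class and field names are illustrative, not the real KafkaConsumer internals): (1) on startup ask the broker for a committed offset, (2) fall back to auto.offset.reset if there is none, (3) advance an in-memory position on each poll, and (4) commit that position back to the broker.

```python
class ToyConsumer:
    def __init__(self, broker_committed, auto_offset_reset="earliest", log_end=0):
        if broker_committed is not None:
            self.position = broker_committed                            # step 1
        else:
            self.position = 0 if auto_offset_reset == "earliest" else log_end  # step 2
        self.committed = broker_committed

    def poll(self, records):
        batch = records[self.position:self.position + 5]
        self.position += len(batch)      # step 3: advanced in memory only
        return batch

    def commit(self):
        self.committed = self.position   # step 4: made durable on the broker

records = list(range(20))
c = ToyConsumer(broker_committed=None, auto_offset_reset="earliest",
                log_end=len(records))
c.poll(records)          # reads offsets 0..4; position is now 5
c.commit()
print(c.committed)       # 5
```

Note the gap between steps 3 and 4: messages read but not yet committed will be re-delivered after a crash, which is exactly the at-least-once behavior the answer describes.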