Kafka assigning partitions, do you need to commit offsets - apache-kafka

I have an app running in several instances, and each instance needs to consume all messages from all partitions of a topic.
I am aware of two strategies:

1. Create a unique consumer group id for each app instance and subscribe and commit as usual. The downside is that Kafka still needs to maintain a consumer group on behalf of each instance.
2. Ask Kafka for all partitions of the topic and assign the consumer to all of them. As I understand it, no consumer group is then created on behalf of the consumer in Kafka. So the question is whether there is still a need to commit offsets, since there is no consumer group on the Kafka side to keep up to date. The consumer was created without assigning it a 'group.id'.

When you call consumer.assign() instead of consumer.subscribe(), no group.id property is required, which means that no group is created or maintained by Kafka.
Committing offsets is basically keeping track of what has been processed so that you don't process it again. This can just as well be done manually, for example by reading the polled messages and writing the offsets to a file once the messages have been processed.
In this case, your program is responsible for writing the offsets and also for resuming from the next offset upon restart using consumer.seek().
The only drawback is that if you want to move your consumer from one machine to another, you would need to copy this file as well.
You can also store the offsets in some database that is accessible from any machine in case you don't want the file to be copied (though writing to a file may be relatively simpler and faster).
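The offset bookkeeping itself needs nothing Kafka-specific. Here is a minimal sketch of such a file-backed store (the class and file name are illustrative, not from any library); in a real consumer you would call consumer.seek(partition, store.position(...)) after consumer.assign(...), and store.commit(...) after processing each batch:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Minimal file-backed offset store: maps "topic-partition" -> next offset to read.
// Illustrative sketch only; the real consumer would seek() to the loaded position.
public class OffsetStore {
    private final Path file;
    private final Properties offsets = new Properties();

    public OffsetStore(Path file) throws IOException {
        this.file = file;
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                offsets.load(in);
            }
        }
    }

    // Record the next offset to read for a partition, after processing a batch.
    public void commit(String topic, int partition, long nextOffset) throws IOException {
        offsets.setProperty(topic + "-" + partition, Long.toString(nextOffset));
        try (OutputStream out = Files.newOutputStream(file)) {
            offsets.store(out, "committed offsets");
        }
    }

    // Where to resume reading; 0 if nothing was ever committed for this partition.
    public long position(String topic, int partition) {
        return Long.parseLong(offsets.getProperty(topic + "-" + partition, "0"));
    }
}
```

Because the state is just a file, moving the consumer to another machine means copying the file along, exactly as described above.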
On the other hand, if there is a consumer group, then as long as your consumer has access to Kafka, Kafka will let your consumer automatically resume from the last committed offset.

There will always be a consumer group setting. If you don't set it, the consumer will use its default setting or Kafka will assign one.
Kafka will keep track of the offset of all consumers using the consumer group.
There is still a need to commit offsets. If no offsets are being committed, Kafka will have no idea what has been read already.
Here is the command to view all your consumer groups and their lag:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups

Related

Kafka: change in consumers number in a group

I understand that Kafka's semantics are that a consumer group must read a record only once. To achieve this, Kafka consumers maintain an offset, which is conveyed to brokers with read requests so that brokers can send data accordingly and already-read data is not resent. But how do the broker and consumers react when there is a change in the consumer group, like the addition of a new consumer or an existing consumer going down?
There are a few things which need to be considered here:

1. A consumer goes down; how is its offset information taken into account while assigning its partitions to active consumers?
2. A new consumer joins; how does the system ensure that it doesn't read data its consumer group has already read?
If consumers join/leave a group, there's a consumer group rebalance. All consumers in the group will temporarily be suspended, then partitions will be reassigned to consume from.
If those consumers were mid-processing, there's a good chance that they'll re-consume the same data.
If you use transactions, the chance of that happening can be reduced, as records will be consumed "exactly once". But this doesn't necessarily mean "successfully processed and offset committed" exactly once.
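On the consumer side, transactional reads are controlled by the standard isolation.level setting. A hedged sketch of the relevant consumer configuration (values illustrative):

```properties
# Only read records from committed transactions; records from aborted
# transactions are skipped.
isolation.level=read_committed
# Committing the consumed offset is still a separate step, so "consumed
# exactly once" does not by itself mean "processed and committed exactly once".
enable.auto.commit=false
```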

Does a Kafka Streams app commit offsets for a topic used to fill a Global KTable?

I'm observing my Kafka Streams app reporting consumer lag for topics used to fill Global KTables. Is it correct that offsets are not committed for such topics?
This would make sense, as the topic is read from the beginning on every startup, so keeping track of the offset inside the consumer would be sufficient.
It would however be useful to know for monitoring to exclude such consumer topic pairs.
Correct, offsets are not committed for "global topics" -- the main reason is that all KafkaStreams instances read all partitions, and committing multiple offsets is not possible.
You can still access the "global consumer" metrics, which also report their local lag.

CURRENT-OFFSET and LAG of kafka consumer group that has no active members

How are these two set? The behaviour I observe with kafka-consumer-groups.sh is that when a new message is appended to a certain partition, at first its LOG-END-OFFSET and LAG columns are incremented, and after some time the CURRENT-OFFSET column gets incremented and the LAG column decremented, although no offset was actually committed by any consumer, as there are no active consumers.
Am I right, and does this always happen with consumer groups that have no active members, or is there a possibility to turn off the second stage, which simulates committing of offsets by non-existing consumers? This is actually confusing: you have to take into account the information that there are no active members in the consumer group in order to have the right perspective on what the CURRENT-OFFSET and LAG columns actually mean (not much in that case).
OK, it seems that the consumer actually does continuously connect, poll messages, and commit offsets, but in a volatile fashion (disconnecting each time), so that kafka-consumer-groups.sh always reports as if there are no active members in the group.
This is a Flink job that acts this way. Is that possible?
If the retention policy kicks in and deletes old messages, the lag can decrease (if fewer logs are published than deleted), since CURRENT-OFFSET positions itself at the earliest available log.
I'd check what the retention policy for your topic is, since this may be due to deleted messages: the lag doesn't take purged messages into account, only active ones.
This has nothing to do with connecting to and disconnecting from the Kafka cluster; that would be way too slow and ineffective. It has to do with the way the Flink Kafka consumer is implemented, which is described here: Flink Kafka Connector
The committed offsets are only a means to expose the consumer’s
progress for monitoring purposes.
What it basically does is this: it does not subscribe to topics the way standard consumers do, using consumer groups with their standard coordinator and leader mechanisms, but directly assigns partitions, and only commits offsets to a consumer group for monitoring purposes (although it also has modes of using these offsets for continuation, see here). That is why these groups appear to Kafka as having no active members, yet still get offsets committed.

Is manipulating the "read-offset" as kafka consumer bad-practice?

We have an ongoing discussion about the correct (or intended) usage of Kafka for events.
The arguing point is the ability of a consumer to not only subscribe (or resubscribe) to a topic but also to modify its own read offset.
Am I right in saying that "a consumer should be designed in a way that it never modifies its own read offset"?
Reasoning behind this:
The consumer cannot know what events are actually stored inside a topic (log retention)
... so restoring a complete state from "delta" events is not possible.
The consumer has consumed an event once and confirmed this to the broker; why consume it again?
If your consumer instances belong to the same consumer group, the consumers need not keep the state of reading from the topic themselves. The state of reading is nothing but the offset in the topic up to which your consumer has read so far. If your topic has multiple partitions, consumers belonging to the same consumer group can distribute the workload among themselves. In case one of the consumers crashes or fails, the other consumers from the same consumer group will know from which partition offset to continue consuming records.

How does Zookeeper/Kafka retain offset for a consumer?

Is the offset a property of the topic/partition, or is it a property of a consumer?
If it's a property of a consumer, does that mean multiple consumers reading from the same partition could have different offsets?
Also what happens to a consumer if it goes down, how does Kafka know it's dealing with the same consumer when it comes back online? presumably a new client ID is generated so it wouldn't have the same ID as previously.
In most cases it is a property of a consumer group. When writing consumers, you normally specify the consumer group in the group.id parameter. This group ID is used to recover / store the latest offset from / in the special topic __consumer_offsets, where it is stored directly in the Kafka cluster itself. The consumer group is used not only for the offset but also to ensure that each partition will be consumed by only a single client per consumer group.
However Kafka gives you a lot of flexibility - so if you need you can store the offset somewhere else and you can do it based on whatever criteria you want. But in most cases following the consumer group concept and storing the offset inside Kafka is the best thing you can do.
Kafka identifies a consumer based on group.id, which is a consumer property that each consumer should have:
A unique string that identifies the consumer group this consumer belongs to. This property is required if the consumer uses either the group management functionality by using subscribe(topic) or the Kafka-based offset management strategy
And coming to the offset: it is tracked per consumer group and stored by the broker. Whenever the consumer consumes messages from a Kafka topic, it commits the offset (meaning: it has consumed messages up to this position), and next time it starts consuming from that committed offset. The offset can be committed manually or automatically via enable.auto.commit:
If true the consumer's offset will be periodically committed in the background.
And each consumer group has its own offsets; based on those, the Kafka server identifies whether a consumer is new or an old consumer that was restarted.
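The interplay of these properties can be seen in a typical consumer configuration (property names are from the standard Kafka consumer configuration; values are illustrative):

```properties
# Identifies the consumer group; required for subscribe() and for
# Kafka-based offset storage in __consumer_offsets.
group.id=my-app
# Commit the current position automatically in the background,
# at the given interval.
enable.auto.commit=true
auto.commit.interval.ms=5000
# Where to start when the group has no committed offset yet
# (e.g. a brand-new group.id): earliest or latest.
auto.offset.reset=earliest
```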