kafka subscribe commit offset manually - apache-kafka

I am using Kafka 9 and confused with the behavior of subscribe.
Why does it expects group.id with subscribe.
Do we need to commit the offset manually using commitSync. Even if don't do that I see that it always starts from the latest.
Is there a way a replay the messages from beginning.

Why does it expects group.id with subscribe?
The concept of consumer groups is used by Kafka to enable parallel consumption of topics - every message will be delivered once per consumer group, no matter how many consumers actually are in that group. This is why the group parameter is mandatory, without a group Kafka would not know how this consumer should be treated in relation to other consumers that might subscribe to the same topic.
Whenever you start a consumer it will join a consumer group, based on how many other consumers are in this consumer group it will then be assigned partitions to read from. For these partitions it then checks whether a list read offset is known, if one is found it will start reading messages from this point.
If no offset is found, the parameter auto.offset.reset controls whether reading starts at the earliest or latest message in the partition.
Do we need to commit the offset manually using commitSync? Even if
don't do that I see that it always starts from the latest.
Whether or not you need to commit the offset depends on the value you choose for the parameter enable.auto.commit. By default this is set to true, which means the consumer will automatically commit its offset regularly (how often is defined by auto.commit.interval.ms). If you set this to false, then you will need to commit the offsets yourself.
This default behavior is probably also what is causing your "problem" where your consumer always starts with the latest message. Since the offset was auto-committed it will use that offset.
Is there a way a replay the messages from beginning?
If you want to start reading from the beginning every time, you can call seekToBeginning, which will reset to the first message in all subscribed partitions if called without parameters, or just those partitions that you pass in.

Related

Is there a common offset value that spans across Kafka partitions?

I am just experimenting on Kafka as a SSE holder on the server side and I want "replay capability". Say each kafka topic is in the form events.<username> and it would have a delete items older than X time set.
Now what I want is an API that looks like
GET /events/offset=n
offset would be the last processed offset by the client if not specified it is the same as latest offset + 1 which means no new results. It can be earliest which represents the earliest possible entry. The offset needs to exist as a security-through-obscurity check.
My suspicion is for this to work correctly the topic must remain in ONE partition and cannot scale horizontally. Though because the topics are tied to a user name the distribution between brokers would be handled by the fact that the topics are different.
If you want to retain event sequence for each of the per-user topics, then yes, you have to use one partition per user only. Kafka cannot guarantee message delivery order with multiple partitions.
The earliest and latest options you mention are already supported in any basic Kafka consumer configuration. The specific offset one, you'd have to filter out manually by issuing a request for the given offset, and then returning nothing if the first message you receive does not match the requested offset.

Kafka assigning partitions, do you need to commit offsets

Having an app that is running in several instances and each instance needs to consume all messages from all partitions of a topic.
I have 2 strategies that I am aware of:
create a unique consumer group id for each app instance and subscribe and commit as usual,
downside is kafka still needs to maintain a consumer group on behalf of each consumer.
ask kafka for all partitions for the topic and assign the consumer to all of those. As I understand there is no longer any consumer group created on behalf of the consumer in Kafka. So the question is if there still is a need for committing offsets as there is no consumer group on the kafka side to keep up to date. The consumer was created without assigning it a 'group.id'.
ask kafka for all partitions for the topic and assign the consumer to
all of those. As I understand there is no longer any consumer group
created on behalf of the consumer in Kafka. So the question is if
there still is a need for committing offsets as there is no consumer
group on the kafka side to keep up to date. The consumer was created
without assigning it a 'group.id'.
When you call consumer.assign() instead of consumer.subscribe() no group.id property is required which means that no group is required or is maintained by Kafka.
Committing offsets is basically keeping track of what has been processed so that you dont process them again. This may as well be done manually also. For example, reading polled messages and writing the offsets to a file once after the messages have been processed.
In this case, your program is responsible for writing the offsets and also reading from the next offset upon restart using consumer.seek()
The only drawback is, if you want to move your consumer from one machine to another, you would need to copy this file also.
You can also store them in some database that is accessible from any machine in case you don't want the file to be copied (though writing to a file may be relatively simpler and faster).
On the other hand, if there is a consumer group, so long as your consumer has access to Kafka, your Kafka will let your consumer automatically read from the last committed offset.
There will always be a consumer group setting. If you're not setting it, whatever consumer you're running will use its default setting or Kafka will assign one.
Kafka will keep track of the offset of all consumers using the consumer group.
There is still a need to commit offsets. If no offsets are being committed, Kafka will have no idea what has been read already.
Here is the command to view all your consumer groups and their lag:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups

Kafka - timestamp order

Assume I'm using log.message.timestamp.type=LogAppendTime.
Also assume number of messages per topic/partition during first read:
topic0:partition0: 5
topic0:partition1: 0
topic0:partition2: 3
topic1:partition0: 2
topic1:partition1: 0
topic1:partition2: 4
and during second read:
topic0:partition0: 5
topic0:partition1: 2
topic0:partition2: 3
topic1:partition0: 2
topic1:partition1: 4
topic1:partition2: 4
If I read first message from each partition, does Kafka guarantee that reading again from each partition won't return a message that's older than those I read during first read?
Focus on topic0:partition1 and topic1:partition1 which didn't have any messages during first read, but have during second read.
Kafka guarantees message ordering at partition level, so your use case perfectly fits kafka's architecture.
There are some concepts to explain in here. First of all, you have the starting consumer position (when you first launch a new consumer group), defined by the auto.offset.reset parameter.
This will kick in only if there's no saved offset for that group, or if a saved offset is not valid anymore (f.e, if it was already deleted by retention policies). You should normally only worry for this if you launch a new consumer group (and you want to decide wether it starts from the oldest messages, or from the present - newest one).
Regarding your example, in normal conditions (there are no consumer shutdowns, etc), you have nothing to worry about. Consumers within a same consumer group will only read their messages once, no matter the number of partitions nor the number of consumers. These consumers remember their last read offset, and periodically save it in the _consumer_offsets topic.
There are 2 properties that define this periodical recording:
enable.auto.commit
Setting it to true (which is the default value) will allow the automatic commit to the _consumer_offsets topic.
auto.commit.interval.ms
Defines when the offsets are commited. For example, with a value of 10000, your consumer offsets will be stored every 10 seconds.
You can also set enable.auto.commit to false and store your offsets in your own way (f.e to a database, etc), but this is a more special use case.
The auto offset committing will allow you to stop your consumers, and start them again later without losing any message nor reprocessing already processed ones (it's like a mark in a book's page). If you don't stop your consumers (and without any errors from broker/zookeeper/consumers), even less worries for you.
For more info, you can take a look here: https://docs.confluent.io/current/clients/consumer.html#concepts
Hope it helps!

Kafka producer buffering

Suppose there is a producer which is running and I run a consumer a few minutes later. I noticed that the consumer will consume old messages that has been produced by the producer but I don't want that happens. How can I do that? Is there any config parameters in broker to be set and solve this problem?
It really depends on the use case, you didn't really provide much information about the architecture. For instance - once the consumer is up, is it a long running consumer, or does it just wake up for a short while and consumes new messages arriving?
You can take any of the following approaches:
Filter ConsumerRecord by timestamp, so you will automatically throw away messages that were produced over configurable time.
In my team we're using ephemeral groups. That is - each time the service goes up, we generate a new group id for the consumer group, setting auto.offset.reset to latest
Seek to timestamp - since kafka 0.10 you can seek to a certain position. Use consumer.offsetsForTimes to get the offset of each topic partition for the desired time, and then use consumer.seek to get to the given offset.
If you use a consumer group, but never commit to kafka, then each time the a consumer is assigned to a topic partition, it will start consuming according to auto.offset.reset policy...

FlinkKafkaConsumer08 offset control

I want to use FlinkKafkaConsumer08 to read a kafka topic. The messages are commands in terms of event-sourcing. I want to start from the end, not reading messages already in the topic.
I suppose there is a way to tell FlinkKafkaConsumer08 to start from the end.
How?
edit
I have tried setting "auto.offset.reset" property to "largest" with no result. I have tried enableCheckpoing too.
I have tried setting "auto.commit.interval.ms" to 1000. Then, at least, messages that have been previously processed are not processed again. This is a big improvements as, at least, commands are not executed twice, but it would be much better to discard old command messages. The solution I will adopt is to discard old messages based on date, and return error.
The auto.offset.reset property is only used if Kafka cannot find committed offsets in Kafka/ZooKeeper for the current consumer group. Thus, if you're reusing a consumer group, this property will most likely not be respected. However, starting the Kafka consumer in a new consumer group should do the trick.