I have a use case where I want the consumer to always start from the latest offset. I don't need to commit offsets for this consumer. This is not possible to achieve with spring-kafka, because for a new consumer group the container always commits the initial offsets of the newly assigned partitions. On subsequent starts of the program, the consumer then reads from those stored offsets, not from the latest. In other words, only the very first start with a new consumer group behaves correctly, i.e. consumes from the latest. The problem is in KafkaMessageListenerContainer$ListenerConsumer.onPartitionsAssigned().
For reference, I set the following in Spring Boot:
spring.kafka.listener.ack-mode=manual
spring.kafka.consumer.auto-offset-reset=latest
spring.kafka.consumer.enable-auto-commit=false
That code was added to solve some nasty race conditions when a repartition occurred while a new consumer group started consuming; it could cause lost or duplicate records, depending on configuration.
It was felt best to commit the initial offset to avoid these conditions.
I agree, though, that if the user takes complete responsibility for offsets (with a MANUAL ack mode) then we should probably not do that commit; it's up to the user code to deal with the race (in your case, you don't care about lost records).
Feel free to open a GitHub issue (contributions are welcome).
In the meantime, you can avoid the situation by having your listener implement ConsumerSeekAware and seek to the topic/partition ends during assignment.
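For illustration, here is a minimal sketch of such a listener (topic and group names are made up); whenever partitions are assigned it seeks to the end, so whatever offset was committed for the group no longer matters:

```java
import java.util.Map;

import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.listener.ConsumerSeekAware;
import org.springframework.stereotype.Component;

@Component
public class LatestOnlyListener implements ConsumerSeekAware {

    // Hypothetical topic/group names, for illustration only.
    @KafkaListener(topics = "my-topic", groupId = "my-group")
    public void listen(String value) {
        // process the record; with MANUAL ack mode and no acknowledgment, nothing is committed here
    }

    @Override
    public void onPartitionsAssigned(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        // Always start from the end of every assigned partition,
        // regardless of any offset that may have been committed.
        assignments.keySet().forEach(tp -> callback.seekToEnd(tp.topic(), tp.partition()));
    }

    @Override
    public void registerSeekCallback(ConsumerSeekCallback callback) {
        // not needed for this use case
    }

    @Override
    public void onIdleContainer(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        // not needed for this use case
    }
}
```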
Another alternative is to use a UUID for the group.id each time; and you will always start at the topic end.
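With the plain Java client, that could look like the sketch below (the "my-app" prefix and deserializers are just placeholders); with spring-kafka the same effect comes from generating the group id at application startup:

```java
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FreshGroupConsumer {

    // Builds a consumer whose group.id is new on every start, so no committed
    // offsets ever exist for it and auto.offset.reset=latest always takes effect.
    static KafkaConsumer<String, String> create(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-app-" + UUID.randomUUID()); // fresh group per run
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }
}
```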
Related
Recently, we had a production incident where Kafka consumers were repeatedly processing the same records and Kafka was rebalancing all the time. But I do not want to write about that issue here - we resolved it (by lowering max-poll-records) and it works fine now.
But the incident made me wonder - could we have lost some messages during this incident?
For instance: The documentation for auto-offset-reset says that this parameter applies "...if an offset is out of range". According to Kafka auto.offset.reset query it may happen e.g. "if the Consumer offset is less than the smallest offset". That is, if we had auto-offset-reset=latest and topic cleanup was triggered during the incident, we could have lost all the unprocessed data in the topic (because the offset would be set to the end of the topic, in this case). Therefore, IMO, it is never a good idea to have auto-offset-reset=latest if you need at-least-once delivery.
Actually, there are plenty of other situations where there is a threat of data loss in Kafka if not everything is set up correctly. For instance:
When the schema registry is not available, messages can get lost:
How to avoid losing messages with Kafka streams
After an application restart, unprocessed messages are skipped even though auto-offset-reset=earliest. We had this problem too, in one topic (not in every topic). Perhaps this is the same case.
etc.
Is there a cookbook on how to set up everything related to Kafka properly in order to make the application robust (with respect to Kafka) and prevent data loss? We've set up everything we consider important, but I'm not sure we haven't overlooked something, and I cannot imagine every possible failure in order to prevent it. For instance:
We have Kafka consumers with the same groupId running in different (geographically separated) networks. Does it matter? Nowadays probably not, but in the past it probably did, according to this answer.
I am consuming Kafka messages from a topic, but the issue is that every time the consumer restarts it reads older processed messages.
I have used auto.offset.reset=earliest. Will setting it manually using commit async help me overcome this issue?
I see that Kafka already has auto commit enabled (true) by default.
I have used auto.offset.reset=earliest. Will setting it manually using commit async help me overcome this issue?
When auto.offset.reset=earliest is set and there is no committed offset for the group, the consumer will read from the earliest available offset instead of from the latest. So, the first time you start your process with a new group.id and this property set to earliest, it will read from the starting offset.
Here is how the issue can be debugged:
If your consumer group.id is the same across restarts, you need to check whether the commit is actually happening.
Cross-check whether you are manually overriding enable.auto.commit to false anywhere.
Next, check the auto commit interval (auto.commit.interval.ms), which is 5 seconds by default, and see whether you have changed it to something higher and are restarting your process before the commit gets triggered.
You can also use commitAsync() or even commitSync() to trigger the commit manually. Use commitSync() (a blocking call) to test whether any exception is thrown while committing; a sketch follows the list below. A few possible errors during committing are (from the docs):
CommitFailedException - thrown when you are trying to commit to partitions that are no longer assigned to this consumer, for example because the consumer is no longer part of the group.
RebalanceInProgressException - if the consumer instance is in the middle of a rebalance, so it is not yet determined which partitions would be assigned to the consumer.
TimeoutException - if the timeout specified by default.api.timeout.ms expires before successful completion of the offset commit.
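As a minimal sketch of that test (broker address, group and topic names are placeholders), a blocking commitSync() makes any commit failure visible immediately:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.TimeoutException;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitDebug {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // assumed group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually instead
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));        // assumed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // process the record here
                }
                try {
                    // Blocking commit: any failure surfaces as an exception,
                    // which shows whether commits are actually going through.
                    consumer.commitSync();
                } catch (CommitFailedException | TimeoutException e) {
                    // e.g. the partitions were revoked from this consumer,
                    // or default.api.timeout.ms expired before the commit completed
                    e.printStackTrace();
                }
            }
        }
    }
}
```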
Apart from this:
Also check whether you are calling seek() or seekToBeginning() anywhere in your consumer code. If you are, calling poll() afterwards will likely return older messages as well.
If you are using Embedded Kafka for testing, the topic and the consumer groups will likely be created every time you restart your test, thereby reading from the start. Check whether it is a similar case.
Without looking at the code it is hard to tell what exactly the error is. This answer only provides some insight into debugging your scenario.
I have a C++ Kafka consumer which specifies its partitions with assign() rather than subscribe(), which I am fine with. Because of this, rebalancing doesn't take place, which I am also fine with.
Question 1:
I want to understand how autocommit works here. Let's say there are 2 consumers, both of which have the same groupId. Both of them will get all the updates, but could someone help me understand how the commit happens in this case? If there is only one consumer, the commit happens using the consumer group id, but how does it work with 2 consumers? I don't see any commit failures in these cases either.
Question 2:
How does rd_kafka_offsets_store work when I assign partitions? Does it go along well with assign(), or should I make use of subscribe() in these cases?
Two non-subscribing consumers with the same group.id will commit offsets for their assigned partitions without correlation or conflict resolution; if they're assigned the same partitions they will overwrite each other's commits.
Either use unique group.id's or subscribe to the topics.
rd_kafka_offsets_store() works the same way with assign() or subscribe(), namely by storing (in memory) the offset to commit on the next auto or manual commit.
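The same commit semantics apply to the Java client; here is a minimal sketch of the problematic setup (broker, group, and topic names are placeholders) - run two copies of this and each silently overwrites the other's committed offset for partition 0:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AssignedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "shared-group");            // same id in both processes
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manual assignment: no group membership, no rebalancing, no conflict detection.
            consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 0)));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                // ... process records ...
                // Both processes commit for "shared-group" / my-topic-0;
                // whichever commits last wins, overwriting the other's offset.
                consumer.commitSync();
            }
        }
    }
}
```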
I know about configuring kafka to read from earliest or latest message.
Is there an additional option for the case where I need to read from a previous offset?
The reason I need to do this is that the earlier messages which were read need to be processed again due to some mistake in the processing logic earlier.
In the Java Kafka client, there are methods on the consumer that can be used to specify the next consume position.
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the same partition more than once, the latest offset will be used on the next poll(). Note that you may lose data if this API is arbitrarily used in the middle of consumption, to reset the fetch offsets
This is enough, and there are also seekToBeginning and seekToEnd.
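A minimal sketch of reprocessing from a known offset (broker, topic, partition, and offset values are made up); note that with subscribe() the seek has to happen after the partitions are assigned, e.g. in a ConsumerRebalanceListener:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReprocessFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "reprocess-group");         // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("my-topic", 0);                // assumed topic/partition
        long replayFrom = 12345L;                                             // offset to reprocess from

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            // Override the fetch position before the first poll().
            consumer.seek(tp, replayFrom);
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
        }
    }
}
```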
I'm trying to answer a similar but not quite the same question so let's see if my information may help you.
First, I have been working from this other SO question/answer
In short, you want to commit your offsets (older clients stored them in ZooKeeper; newer clients store them in Kafka itself), so that if your consumer encounters an error or needs to shut down, it can resume where it left off.
I myself am working with an extremely high-volume stream, and my consumer (for a test) needs to start from the very tail each time. The documentation indicates I must use KafkaConsumer seek to declare my starting point.
I'll try to update my findings here once they are successful and reliable. For sure this is a solved problem.
I want to use FlinkKafkaConsumer08 to read a Kafka topic. The messages are commands in the event-sourcing sense. I want to start from the end, not read the messages already in the topic.
I suppose there is a way to tell FlinkKafkaConsumer08 to start from the end.
How?
edit
I have tried setting "auto.offset.reset" property to "largest" with no result. I have tried enableCheckpoing too.
I have tried setting "auto.commit.interval.ms" to 1000. Then, at least, messages that have been previously processed are not processed again. This is a big improvements as, at least, commands are not executed twice, but it would be much better to discard old command messages. The solution I will adopt is to discard old messages based on date, and return error.
The auto.offset.reset property is only used if Kafka cannot find committed offsets in Kafka/ZooKeeper for the current consumer group. Thus, if you're reusing a consumer group, this property will most likely not be respected. However, starting the Kafka consumer in a new consumer group should do the trick.
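A hedged sketch of that advice for the 0.8 connector (broker, ZooKeeper address, and topic name are assumptions): give each run a fresh group.id so no committed offsets exist, and let auto.offset.reset ("largest" for 0.8) start the job from the end of the topic.

```java
import java.util.Properties;
import java.util.UUID;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class StartFromEndJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");   // assumed broker
        props.setProperty("zookeeper.connect", "localhost:2181");   // required by the 0.8 consumer; assumed
        // A fresh group.id per run means there are no committed offsets,
        // so auto.offset.reset ("largest" for 0.8) decides the start position.
        props.setProperty("group.id", "commands-" + UUID.randomUUID());
        props.setProperty("auto.offset.reset", "largest");

        DataStream<String> commands = env.addSource(
                new FlinkKafkaConsumer08<String>("commands", new SimpleStringSchema(), props)); // assumed topic

        commands.print();
        env.execute("start-from-end");
    }
}
```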