Given a start and end offset, or start and end datetime timestamps (either is fine), I want a Kafka consumer to replay all messages within that window.
I have figured out how to use the kafka-consumer-groups.sh tool to reset the offset based on a datetime or offset, but how do I tell the consumer to stop after replaying, say, 10,000 messages or 10 minutes?
There is no configuration or API that stops a KafkaConsumer after a certain number of processed offsets or a certain amount of time.
You would need to do this programmatically, by checking the offset of each ConsumerRecord or by having a timer that stops the consumer after a certain time.
Instead of using the kafka-consumer-groups tool, you could also make use of the seek API of the KafkaConsumer to start at a particular offset for a partition.
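A minimal sketch combining both ideas, assuming a local broker at localhost:9092 and a hypothetical topic my-topic with the window given as offsets; the consumer seeks to the start offset and stops once it passes the end offset or exhausts a time budget:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class WindowedReplay {
    public static void main(String[] args) {
        // Hypothetical partition and window; adjust to your setup.
        TopicPartition tp = new TopicPartition("my-topic", 0);
        long startOffset = 1_000L;
        long endOffset = 11_000L;                                     // exclusive upper bound
        long deadline = System.currentTimeMillis() + 10 * 60 * 1000;  // optional 10-minute budget

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-group");
        props.put("enable.auto.commit", "false"); // a replay should not disturb committed offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp)); // manual assignment, no rebalancing
            consumer.seek(tp, startOffset);                 // next poll starts at startOffset

            boolean done = false;
            while (!done && System.currentTimeMillis() < deadline) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.offset() >= endOffset) { // stop condition checked per record
                        done = true;
                        break;
                    }
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}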
Can a new consumer group be created, with a consumer assigned to an existing topic, that is somehow set to consume backward, i.e. the offset moves from the latest message at that moment to the earliest in every partition?
Kafka topics are meant to be consumed sequentially in the order of appearance within the topic partitions.
However, I see two options to solve your issue:
You can steer which data the consumer polls from the topic partition: have your consumer seek to the latest offset, consume it, then seek to the latest offset minus one and read only one record. Then seek to the previous offset, and so on. Although I have never seen it done, this should be possible with consumer.seek and the consumer configuration max.poll.records (see the sketch after this list).
You could use any kind of state store and order it descending by the offset for each partition. Then have another consumer reading the state store in the desired order.
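A rough sketch of the first option, assuming a single manually assigned partition of a hypothetical topic my-topic on a local broker, a non-compacted topic (so every offset exists), and max.poll.records=1:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class BackwardReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "backward-reader");  // hypothetical group id
        props.put("max.poll.records", "1");        // fetch a single record per poll
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("my-topic", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);

            // Walk the partition from the newest record down to offset 0,
            // seeking one offset back before every poll.
            for (long offset = end - 1; offset >= 0; offset--) {
                consumer.seek(tp, offset);
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}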
Suppose there is a producer which is running, and I start a consumer a few minutes later. I noticed that the consumer will consume old messages that were produced by the producer, but I don't want that to happen. How can I avoid it? Are there any config parameters in the broker to be set to solve this problem?
It really depends on the use case; you didn't provide much information about the architecture. For instance: once the consumer is up, is it a long-running consumer, or does it just wake up for a short while and consume new messages arriving?
You can take any of the following approaches:
Filter each ConsumerRecord by its timestamp, so you automatically throw away messages that were produced more than a configurable time ago.
In my team we're using ephemeral groups. That is, each time the service goes up, we generate a new group id for the consumer group and set auto.offset.reset to latest.
Seek to timestamp: since Kafka 0.10 you can seek to a certain position. Use consumer.offsetsForTimes to get the offset of each topic partition for the desired time, and then use consumer.seek to get to the given offset (see the sketch after this list).
If you use a consumer group but never commit to Kafka, then each time a consumer is assigned to a topic partition, it will start consuming according to the auto.offset.reset policy...
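A sketch of the seek-to-timestamp approach, assuming a local broker, a hypothetical topic my-topic, and a five-minute cutoff; the seek is done in a rebalance listener so it also applies whenever partitions are (re)assigned:

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class SkipOldMessages {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "fresh-only");  // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        long cutoff = System.currentTimeMillis() - 5 * 60 * 1000; // ignore anything older than 5 minutes

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                Map<TopicPartition, Long> query = new HashMap<>();
                for (TopicPartition tp : partitions) {
                    query.put(tp, cutoff);
                }
                // offsetsForTimes returns the first offset with timestamp >= cutoff,
                // or null if every message in the partition is older than the cutoff.
                for (Map.Entry<TopicPartition, OffsetAndTimestamp> e
                        : consumer.offsetsForTimes(query).entrySet()) {
                    if (e.getValue() != null) {
                        consumer.seek(e.getKey(), e.getValue().offset());
                    }
                }
            }
        });

        while (true) { // normal poll loop; only records newer than the cutoff arrive
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}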
I'm using one topic, one partition, one consumer, Kafka client version is 0.10.
I got two different results:
If I pause the partition first, then produce a message and invoke the resume method, the KafkaConsumer can poll the uncommitted message successfully.
But if I produce the message first without committing its offset, then pause the partition and, after several seconds, invoke the resume method, the KafkaConsumer does not receive the uncommitted message. I checked on the Kafka server using kafka-consumer-groups.sh; it shows LOG-END-OFFSET minus CURRENT-OFFSET = LAG = 1.
I have been trying to figure this out for two days. I have repeated such tests many times and the results are always the same. I need some suggestions, or someone to tell me whether this is Kafka's intended mechanism.
For your observation #2: if you restart the application, it will supply all records from the uncommitted offset, i.e. the missing record, and if your consumer again does not commit, the record will be sent again when the application registers the consumer with Kafka upon restart. This is expected.
I am assuming you are using consumer.poll(), which creates a hybrid streaming interface, i.e. it accumulates data coming into Kafka for the duration mentioned and provides it to the consumer for processing once the duration is finished. This continuous accumulation happens in the background and does not depend on whether you have committed an offset or not.
From the KafkaConsumer documentation:
The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(long).
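In other words, resume continues from the consumer's current position, not from the committed offset. A minimal sketch of the pause/resume flow that can be used to reproduce such tests, assuming a local broker and a hypothetical topic my-topic (written against a newer client, where pause/resume take a Collection and poll takes a Duration):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class PauseResumeDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "pause-resume-demo");  // hypothetical group id
        props.put("enable.auto.commit", "false");    // nothing is committed, as in the question
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("my-topic", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));

            // Polling advances the in-memory position past every record returned,
            // whether or not its offset is ever committed.
            ConsumerRecords<String, String> before = consumer.poll(Duration.ofSeconds(1));
            System.out.println("records before pause: " + before.count()
                    + ", position: " + consumer.position(tp));

            consumer.pause(Collections.singletonList(tp));  // polls now return nothing for tp
            consumer.poll(Duration.ofSeconds(1));
            consumer.resume(Collections.singletonList(tp)); // fetching continues from the current
                                                            // position, not the committed offset
            ConsumerRecords<String, String> after = consumer.poll(Duration.ofSeconds(1));
            System.out.println("records after resume: " + after.count()
                    + ", position: " + consumer.position(tp));
        }
    }
}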
I have multiple consumers subscribed to one topic, and all of them are in the same group. Sometimes it is necessary to force all the consumers to re-process some data which they have already processed. In this case I have the exact time to set the consumer offsets to.
Question: Is there any mechanism in Kafka to set all consumer offsets to a specified time? Here I mean that the consumers must not be re-started; their offsets should just be magically set to a new value, and the next poll requests would start fetching from the new offset.
If it is impossible with standard Kafka, are there any ready libraries providing such a mechanism?
You are probably looking for the seek method:
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout)...
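So the offsets can be moved without restarting anything, as long as each consumer calls seek from its own polling thread. One possible way to wire that up, sketched with a rewind signal whose propagation (REST endpoint, control topic, etc.) is left open, and with consumer.offsetsForTimes translating the target time into per-partition offsets; all names here are hypothetical:

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class RewindableConsumer {
    // Epoch millis to rewind to; -1 means "no rewind requested". How this flag
    // gets set (REST endpoint, control topic, JMX) is up to your architecture,
    // and every consumer in the group needs to receive the signal.
    private final AtomicLong rewindTo = new AtomicLong(-1);

    public void requestRewind(long epochMillis) {
        rewindTo.set(epochMillis);
    }

    // The consumer is expected to be subscribed already; seek is only legal
    // from the thread that also calls poll.
    public void run(KafkaConsumer<String, String> consumer) {
        while (true) {
            long ts = rewindTo.getAndSet(-1);
            if (ts >= 0) {
                Map<TopicPartition, Long> query = new HashMap<>();
                for (TopicPartition tp : consumer.assignment()) {
                    query.put(tp, ts);
                }
                // Resolve the first offset at or after the timestamp, then seek.
                for (Map.Entry<TopicPartition, OffsetAndTimestamp> e
                        : consumer.offsetsForTimes(query).entrySet()) {
                    if (e.getValue() != null) {
                        consumer.seek(e.getKey(), e.getValue().offset()); // next poll starts here
                    }
                }
            }
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}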
I hope I am not making a mistake, but I remember reading in the Kafka documentation that with the high-level APIs you can't start reading messages from a specific offset, though it was mentioned that this would change.
Is it possible now, using the high-level APIs, to read messages from a specific partition and a specific offset? Could you please give me an example of how to do it?
I am using kafka 0.8.1.1.
Thanks in advance.
You can do that with kafka 0.9:
http://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the same partition more than once, the latest offset will be used on the next poll(). Note that you may lose data if this API is arbitrarily used in the middle of consumption, to reset the fetch offsets.
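Put together, reading a specific partition from a specific offset could look like the sketch below, written against the 0.9-era API (which still used poll(long)); the broker address, topic, and offset are hypothetical:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReadFromOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "offset-reader");  // hypothetical group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Hypothetical partition and offset; adjust to your topic.
        TopicPartition tp = new TopicPartition("topicname", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp)); // manual assignment, no subscribe()
            consumer.seek(tp, 10L);                         // next poll starts at offset 10

            ConsumerRecords<String, String> records = consumer.poll(1000); // poll(Duration) on newer clients
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}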
Kafka 0.8.1.1 can use ZooKeeper to store offsets for each consumer group. If you configure your consumer to commit offsets to ZooKeeper, then you just need to manually set the starting offset for the topic and partition under ZooKeeper for your consumer group.
You need to connect to ZooKeeper and use the set command:
set /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset)
E.g., setting offset 10 for partition 0 of topicname for the spark-app consumer group:
set /consumers/spark-app/offsets/topicname/0 10
When a consumer starts to consume messages from Kafka, it always starts from the last committed offset. If this last committed offset is not valid for any reason, the consumer applies the logic of the configuration property auto.offset.reset.
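For the old (0.8.x) high-level consumer, that property takes the values smallest or largest; a minimal configuration sketch (ZooKeeper address and group id are placeholders):

import java.util.Properties;

// Configuration for the old ZooKeeper-based high-level consumer.
Properties props = new Properties();
props.put("zookeeper.connect", "localhost:2181"); // placeholder address
props.put("group.id", "spark-app");
// Applied only when no valid committed offset exists for the group:
// "smallest" restarts from the beginning of the log, "largest" (the default)
// jumps to the end. The newer consumer calls these "earliest"/"latest".
props.put("auto.offset.reset", "smallest");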
Hope this helps.