Kafka Consumer Committing With Auto-Commit Disabled - apache-kafka

I'm missing events when reading from a Kafka topic because the consumer is advancing the offset without an explicit commit, even though enable_auto_commit is disabled.
from kafka import KafkaConsumer

topic_name = "my_topic"
consumer = KafkaConsumer(topic_name, group_id="me", enable_auto_commit=False)

# Read up to the sixth message, committing the first five, then stop without committing
for i, message in enumerate(consumer):
    if i == 5:
        expected = message.offset
        print(expected)
        break
    else:
        consumer.commit()

# Read one more message and commit it
for message in consumer:
    value = message.offset
    print(value)
    consumer.commit()
    break
produces
>>> 50078
>>> 50079
If I don't commit the read, shouldn't it be read again the next time I start consuming messages?

It looks like you're using kafka-python, am I correct? If so, there's a GitHub issue that describes a workaround, although I don't see a solution from the maintainer. So, while I can't explain why you're encountering this, based on that issue I'd recommend committing the message at least once.

Related

Why do the offsets of the consumer-group (app-id) of my Kafka Streams Application get reset after application restart?

I have a Kafka Streams application for which, whenever I restart it, the offsets for the topic it is consuming get reset. Hence, for all partitions, the lags increase and the app needs to reprocess all the data.
UPDATE:
After the app restarts, the output topic receives a burst of events that were already processed; it is not that the input topic offsets are getting reset, as I said in the previous paragraph. However, the internal topic (KTABLE-SUPPRESS-STATE-STORE) offsets are getting reset, see comments below.
I have ensured the lag is 1 for every partition before the restart (this is for the output topic).
All consumers that belong to that consumer-group-id (app-id) are active.
The restart is immediate, it takes around 30 secs.
The app is using exactly-once as the processing guarantee.
I have read this answer: How does an offset expire for an Apache Kafka consumer group?
I have tried with auto.offset.reset = latest and auto.offset.reset = earliest.
It seems like the offsets for these topics are not effectively committed (but I am not sure about this).
I assume that after the restart the app should pick up from the latest committed offset for that consumer group.
UPDATE:
I assume this for the internal topic (KTABLE-SUPPRESS-STATE-STORE)
Does the Kafka Streams API ensure that all consumed offsets are committed before shutting down (after calling streams.close())?
I would really appreciate any clue about this.
UPDATE:
This is the code the app executes:
final StreamsBuilder builder = new StreamsBuilder();

final KStream<..., ...> events = builder
    .stream(inputTopicNames, Consumed.with(..., ...)
        .withTimestampExtractor(...));

events
    .filter((k, v) -> ...)
    .flatMapValues(v -> ...)
    .flatMapValues(v -> ...)
    .selectKey((k, v) -> v)
    .groupByKey(Grouped.with(..., ...))
    .windowedBy(
        TimeWindows.of(Duration.ofSeconds(windowSizeInSecs))
            .advanceBy(Duration.ofSeconds(windowSizeInSecs))
            .grace(Duration.ofSeconds(windowSizeGraceInSecs)))
    .reduce((agg, newValue) -> {
        ...
        return agg;
    })
    .suppress(Suppressed.untilWindowCloses(
        Suppressed.BufferConfig.unbounded()))
    .toStream()
    .to(outPutTopicNameOfGroupedData, Produced.with(..., ...));
The offset reset always, and only, happens (after restarting) with the KTABLE-SUPPRESS-STATE-STORE internal topic created by the Kafka Streams API.
I have tried with both the exactly-once and at-least-once processing guarantees.
Once again, I will really appreciate any clue about this.
UPDATE:
This has been solved in release 2.2.1 (https://issues.apache.org/jira/browse/KAFKA-7895).
The offset reset just and always happens (after restarting) with the KTABLE-SUPPRESS-STATE-STORE internal topic created by the Kafka Stream API.
This is currently (version 2.1) expected behavior, because the suppress() operator works in-memory only. Thus, on restart, the suppress buffer must be recreated from the changelog topic before processing can start.
Note, it is planned to let suppress() write to disk in future releases (cf. https://issues.apache.org/jira/browse/KAFKA-7224). This will avoid the overhead of recreating the buffer from the changelog topic.
I think @Matthias J. Sax's reply covers most of the internals of suppress. One thing I need to clarify though: when you say "restart the application", what exactly did you do? Did you shut down the whole application gracefully, and then restart it?
Commit frequency is controlled by the parameter commit.interval.ms. Check whether your offsets are indeed committed. By default, offsets are committed every 30 seconds, or every 100 ms when the exactly-once processing guarantee is enabled.
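For reference, a minimal sketch of where that property is set, assuming a hypothetical application id and bootstrap servers (the rest of the topology configuration stays as it is):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsCommitConfig {

    public static Properties buildConfig() {
        final Properties props = new Properties();
        // Hypothetical application id and broker address, for illustration only
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Commit offsets (and flush state) more often than the 30 s default
        props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);
        // With exactly_once, the default commit interval drops to 100 ms
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        return props;
    }
}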

Commit issue with Kafka

I am working on a module where the requirement is: there is a producer, and we are using Kafka as the queue to produce data and feed it to the consumer.
Now in the consumer, I am trying to implement an at-least-once messaging scenario.
For this I have to poll the messages from Kafka and then consume them. After consuming, I call consumer.commitAsync(offset, callback).
I want to know what will happen in the following cases:
Case 1) commitAsync() is never called (suppose there was an exception just before calling this API). In my case, I was assuming the message would be pumped to the consumer again, but that is not happening; the consumer never gets that data again.
Case 2) The consumer reboots.
Below is the code snippet of properties set with the consumer
private Properties getConsumerProperties() {
    final Properties props = new Properties();
    props.put(BOOTSTRAP_SERVERS_CONFIG, "server");
    props.put(GROUP_ID_CONFIG, "groupName");
    props.put(ENABLE_AUTO_COMMIT_CONFIG, false);
    props.put(HEARTBEAT_INTERVAL_MS_CONFIG, heartBeatinterval);
    props.put(METADATA_MAX_AGE_CONFIG, metaDataMaxAge);
    props.put(SESSION_TIMEOUT_MS_CONFIG, sessionTimeout);
    props.put(AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
    props.put(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    return props;
}
Now in the consumer, based on some property set, I have 3 topics and create 3 consumers, one for each topic (as there are 3 partitions and 3 Kafka brokers).
For consumption of the data, I identify the packet based on some property when it is received from Kafka and pass it to the relevant topic's handler (I have separate thread pools for the different topics, create tasks based on the property in the packet, and submit them to the thread pool). In the tasks, after processing, I call consumer.commitAsync(offset, callback).
I was expecting the same message to be pulled again from Kafka in case commitAsync is not called for some packet, but to my surprise it is not coming back. Am I missing something? Is there any sort of setting we need to do in Apache Kafka as well for at-least-once?
Please suggest.
There are a couple of things to be addressed in your question.
Before I get to the suggestions on how to achieve at-least-once behavior, I'll try and address the 2 cases:
Case 1) when commitAsync() api is never called (suppose there was an exception just before calling this api). In my case, I was supposing the message will be pumped again to consumer; but it is not happening. Consumer never gets that data again.
The reason why your consumer does not get the old data could be the enable.auto.commit property; it is set to true by default and commits the offsets regularly in the background. Due to this, the consumer on subsequent runs will find an offset to work with and will just wait for new data/messages to arrive.
Case 2). if the consumer reboots.
This would also be similar, i.e. if the consumer finds a committed offset to work with after rebooting, it will start consuming from that offset, whether the offset was committed automatically (enable.auto.commit set to true) or by invoking commitAsync()/commitSync() explicitly.
Now, moving to the part on how to achieve at-least-once behavior - I could think of the following 2 ways:
If you want to take control of committing offsets, then set the "enable.auto.commit" property to false and then invoke commitSync() or commitAsync() with retries handled in the Callback function.
Note: The choice of Synchronous vs Asynchronous commit will depend on your latency budget and any other requirements. So, not going too much into those details here.
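For what it's worth, a minimal sketch of that first option, assuming a hypothetical topic name and broker address, with the processing logic reduced to a placeholder:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "groupName");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);           // take control of commits
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my_topic"));        // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // if this throws, the commit below never runs,
                                     // so these records are redelivered after a restart
                }
                // Commit only after the whole batch was processed; log failures in the callback
                consumer.commitAsync((offsets, exception) -> {
                    if (exception != null) {
                        System.err.println("Async commit failed for " + offsets + ": " + exception);
                    }
                });
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // placeholder for the application's real processing logic
        System.out.println(record.offset() + ": " + record.value());
    }
}

Whether you retry failed asynchronous commits or fall back to a blocking commitSync() on shutdown is exactly the latency/durability trade-off mentioned in the note above.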
The other option is to utilise the automatic offset commit feature i.e. setting enable.auto.commit to true and auto.commit.interval.ms to an acceptable number (again, based on your requirements on how often would you like to commit offsets).
I think the default behaviour of Kafka is centered around at-least-once semantics, so it should be fairly straightforward.
I hope this helps!

Is there a way to stop Kafka consumer at a specific offset?

I can seek to a specific offset. Is there a way to stop the consumer at a specific offset? In other words, consume till my given offset. As far as I know, Kafka does not offer such a function. Please correct me if I am wrong.
E.g. the partition has offsets 1-10. I only want to consume from 3-8. After consuming the 8th message, the program should exit.
You're right, Kafka does not offer this function, but you can achieve it in your consumer code. You could try using commitSync() to control this.
public void commitSync(Map offsets)
Commit the specified offsets for the specified list of topics and partitions.
This commits offsets to Kafka. The offsets committed using this API will be used on the first fetch after every rebalance and also on startup. As such, if you need to store offsets in anything other than Kafka, this API should not be used. The committed offset should be the next message your application will consume, i.e. lastProcessedMessageOffset + 1.
This is a synchronous commit and will block until either the commit succeeds or an unrecoverable error is encountered (in which case it is thrown to the caller).
Something like this:
boolean goAhead = true;
while (goAhead) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        if (record.offset() > OFFSET_BOUND) {
            consumer.commitSync(Collections.singletonMap(
                new TopicPartition(record.topic(), record.partition()),
                new OffsetAndMetadata(record.offset())));
            goAhead = false;
            break;
        }
        process(record);
    }
}
You should set enable.auto.commit to false in the code above. In your case, OFFSET_BOUND could be set to 8. Because the committed offset is then 9, the consumer will fetch from that position next time.
Assuming that partition offsets are continuous (i.e. not log compacted), you could configure your consumer (using the max.poll.records config) so it reads a certain number of records in each poll. This would let you stop at the offset you want, as in the sketch below.
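For illustration, a minimal sketch of that approach, assuming a hypothetical topic/partition and the 3-8 range from the example (seek to offset 3, poll one record at a time, stop after offset 8):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BoundedRead {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "bounded-reader");          // hypothetical group
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);                 // at most one record per poll
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("my_topic", 0);                // hypothetical topic/partition
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 3);                                             // start at offset 3
            long lastWantedOffset = 8;
            boolean done = false;
            while (!done) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.println(record.offset() + ": " + record.value());
                    if (record.offset() >= lastWantedOffset) {
                        done = true;                                          // stop after consuming offset 8
                        break;
                    }
                }
            }
        }
    }
}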
As far as I know, max.poll.records is a client-side feature; the Kafka fetch protocol only has byte-based limits (https://kafka.apache.org/protocol#The_Messages_Fetch), so in general you will read more messages under the hood.

Apache Kafka : commitSync after pause

In our code, we plan to manually commit the offsets. Our processing of the data is long-running, and hence we follow the pattern suggested before:
Read the records
Process the records in its own thread
pause the consumer
continue polling paused consumer so that it is alive
When the records are processed, commit the offsets
When commit done, then resume the consumer
The code somewhat looks like this:
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(kafkaConfig.getTopicPolling());
    if (!records.isEmpty()) {
        task = pool.submit(new ProcessorTask(processor, createRecordsList(records)));
    }
    if (shouldPause(task)) {
        consumer.pause(listener.getPartitions());
    }
    if (isDoneProcessing(task)) {
        consumer.commitSync();
        consumer.resume(listener.getPartitions());
    }
}
If you notice, we commit using commitSync() (without any parameters).
Since the consumer is paused, in the next iteration we would get no records, but commitSync() would happen later. In that case, which offsets would it try to commit? I have read the Definitive Guide and googled but cannot find any information about it.
I think we should explicitly save the offsets. But I am not sure if the current code would be an issue.
Any information would be helpful.
Thanks,
Prateek
If you call consumer.commitSync() with no parameters it should commit the latest offset that your consumer has received. Since you can receive many messages in a single poll() you might want to have finer control over the commit and explicitly commit a specific offset such as the latest message that your consumer has successfully processed. This can be done by calling commitSync(Map<TopicPartition,OffsetAndMetadata> offsets)
You can see the syntax for the two ways to call commitSync here in the Consumer Javadoc http://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#commitSync()
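For illustration, a minimal sketch of that finer-grained variant, assuming a hypothetical helper that receives the last fully processed record (note the +1: the committed offset is the next message your application will consume):

import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public final class ExplicitCommit {

    // Commit up to (and including) the given record; the +1 marks the next offset to consume
    static void commitProcessed(KafkaConsumer<String, String> consumer,
                                ConsumerRecord<String, String> lastProcessed) {
        Map<TopicPartition, OffsetAndMetadata> toCommit = Collections.singletonMap(
            new TopicPartition(lastProcessed.topic(), lastProcessed.partition()),
            new OffsetAndMetadata(lastProcessed.offset() + 1));
        consumer.commitSync(toCommit);
    }
}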

Kafka Consumer - Poll behaviour

I'm facing some serious problems trying to implement a solution for my needs, regarding KafkaConsumer (>=0.9).
Let's imagine I have a function that has to read just n messages from a kafka topic.
For example: getMsgs(5) --> gets next 5 kafka messages in topic.
So, I have a loop that looks like this. Edited with the actual correct parameters: in this case, the consumer's max.poll.records param was set to 1, so the actual loop only iterated once. Different consumers (some of them iterated through many messages) shared an abstract parent class (this one); that's why it's coded that way. The numMss part was ad hoc for this consumer.
int numMss = 0;
for (boolean exit = false; !exit; ) {
    ConsumerRecords<String, String> records = consumer.poll(config.pollTime);
    for (ConsumerRecord<String, String> r : records) {
        processRecord(r); // do my things
        numMss++;
        if (numMss == maximum) { // maximum = 5
            exit = true;
            break;
        }
    }
}
Taking this into account, the problem is that the poll() method could get more than 5 messages. For example, if it gets 10 messages, my code will forget forever those other 5 messages, since Kafka will think they're already consumed.
I tried committing the offset, but it doesn't seem to work:
consumer.commitSync(Collections.singletonMap(partition,
        new OffsetAndMetadata(record.offset() + 1)));
Even with that offset commit, whenever I launch the consumer again, it won't start from the 6th message (remember, I just wanted 5 messages), but from the 11th (since the first poll consumed 10 messages).
Is there any solution for this, or maybe (most surely) am I missing something?
Thanks in advance!!
You can set max.poll.records to whatever number you like such that at most you will get that many records on each poll.
For the use case you stated in this problem, you don't have to commit offsets explicitly yourself. You can just set enable.auto.commit to true and set auto.offset.reset to earliest, so that it kicks in when there are no committed offsets for your group.id (in other words, when you are about to start reading from a partition for the very first time). Once you have a group.id and some consumer offsets stored in Kafka, and your Kafka consumer process dies, it will continue from the last committed offset, since that is the default behavior: when a consumer starts, it first looks for committed offsets and, if any exist, continues from the last committed offset, and auto.offset.reset won't kick in.
Have you disabled auto commit by setting enable.auto.commit to false? You need to disable it if you want to manually commit the offset. Otherwise, the next call to poll() will automatically commit the latest offset of the messages you received from the previous poll().
Since Kafka 0.9, the auto.offset.reset parameter values have changed:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
Set the auto.offset.reset property to "earliest". Then try consuming; you will get records starting from the committed offset.
Or you can use the consumer.seek(TopicPartition, offset) API before polling, as sketched below.
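A minimal sketch of that seek() variant, assuming a hypothetical topic name, partition 0, and an already-configured consumer:

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

class SeekBeforePoll {

    // Assumes 'consumer' is an already-configured KafkaConsumer<String, String>
    // and that we want to (re)read partition 0 of "my_topic" from startOffset
    static void readFrom(KafkaConsumer<String, String> consumer, long startOffset) {
        TopicPartition tp = new TopicPartition("my_topic", 0);   // hypothetical topic/partition
        consumer.assign(Collections.singletonList(tp));
        consumer.seek(tp, startOffset);                          // position explicitly, ignoring committed offsets
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            System.out.println(record.offset() + ": " + record.value());
        }
    }
}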