Kafka Streaming application reading only latest message after connection with Kafka - apache-kafka

We are using the Kafka Streams library to build a real-time notification system for incoming messages on a Kafka topic. While the streaming app is running, it processes all incoming messages on the topic in real time and sends a notification whenever it encounters certain pre-defined messages.
If the streaming app goes down and is started again, we need it to process only messages that arrive after it has been initialized, to avoid processing old records that piled up while the app was down. By default, the streaming app resumes processing from the last committed offset. Is there any setting in a Kafka Streams app that allows processing only the most recent messages?

KafkaConsumer's default for 'auto.offset.reset' is 'latest', but you want to use Kafka Streams, where the default is 'earliest'.
Reference: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/StreamsConfig.java#L634
Therefore, setting auto.offset.reset to 'latest' will give you what you want.
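For illustration, a minimal sketch of a Streams configuration with that override (the application id and bootstrap servers are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class LatestOffsetStreamsConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "notification-app");  // placeholder id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder brokers
        // Override the Streams default ("earliest") so a group with no committed
        // offsets starts from the newest messages instead of the start of the topic.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        return props;
    }
}
```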

Your assumption is correct. Even if you set auto.offset.reset to latest, your app already has committed consumer offsets, and auto.offset.reset only applies when there is no committed offset.
So you will have to reset the offsets to latest with the kafka-consumer-groups command, using the options --reset-offsets --to-latest --execute.
Check the different reset scenarios; you can even reset to a particular datetime, by period, from a file, etc.
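For example, a reset to latest might look like the following (group and topic names are placeholders; for a Streams app the group id equals its application.id, and the app must be stopped while the group's offsets are reset):

```
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group my-streams-app \
  --topic my-input-topic \
  --reset-offsets --to-latest --execute
```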

Related

How to get only latest message when re-connecting to an existing consumer group in Kafka

In my case, it is entirely possible that a consumer is offline for an extended period. During that offline period, events are still published to the topic.
When the consumer comes back online, it will re-use its existing consumer group, which has been lagging. Is it possible to skip forward to the latest message only? That is, ignore all earlier messages. In other words, I want to alter the offset to the latest message prior to consuming.
There is the spring.kafka.consumer.auto-offset-reset property, but as far as I understand, it only applies to new consumer groups (i.e. groups without committed offsets). Here, I am re-using an existing consumer group when the consumer comes back online. That said, if there were a way to automatically prune a consumer group when its consumer goes offline, this property could work, but I am not sure such functionality exists.
I am working with the Spring Boot Kafka integration.
You can use the consumer's seek method: look up the end offset of each assigned partition, seek to it (or to the end offset minus one if you also want the very last record), commit, and start polling.
Otherwise, simply don't use the same group. Generate a unique group id and/or disable auto-commit; then you are guaranteed to always fall back to the auto.offset.reset config, and lag is meaningless across app restarts.
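A minimal sketch of the first option, assuming a plain KafkaConsumer and using a ConsumerRebalanceListener so the seek happens right after partitions are assigned (broker, group, and topic values are placeholders):

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SkipToLatestConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder brokers
        props.put("group.id", "my-existing-group");         // the lagging, re-used group (placeholder)
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Skip everything that accumulated while the consumer was offline.
                    // Seek to end - 1 instead (guarding against empty partitions) if you
                    // also want to read the very last existing record.
                    Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
                    consumer.endOffsets(partitions).forEach((tp, end) -> {
                        consumer.seek(tp, end);
                        toCommit.put(tp, new OffsetAndMetadata(end));
                    });
                    consumer.commitSync(toCommit); // persist the new position for the existing group
                }
            });

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.println(r.value()));
            }
        }
    }
}
```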

Topic messages disappear from Confluent topics after refreshing page

Topic messages are disappearing from a topic when using the Confluent client. The only ones I can see (while not reloading the page) are messages I create using the "Produce" option on the same page. The Kafka configuration looks fine (I think), but I still don't understand what is wrong.
Looks like you are producing and consuming messages through a web browser.
Consumers typically subscribe to a topic and commit the offsets which have been consumed. The subsequent polls do not return the older messages (unless you do a seek operation) but only the newly produced messages.
The term disappearing may be applicable in two contexts:
As said above, the consumer has already consumed that message and doesn't consume it again (because it has already polled past it).
Your topic retention policy could be deleting older messages. You can check this by using built-in tools like kafka-console-consumer or kafka-avro-console-consumer with the --from-beginning flag. If the messages are there, the issue is with your consumer.
If you are calling consumer.poll() on every reload, then you will only get the messages produced after the previous call to poll (i.e. produced after the last reload). If you want all the messages that have been in the topic since the beginning, or since some timestamp or offset, you need to seek to that position. See seek in KafkaConsumer.
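To illustrate that seek operation, a minimal sketch that rewinds a consumer (which must already have the partition assigned) either to the beginning or to a given timestamp; the helper method names are hypothetical:

```java
import java.time.Instant;
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class RewindExamples {

    // Re-read everything still retained in the partition.
    static void seekToBeginning(KafkaConsumer<String, String> consumer, TopicPartition tp) {
        consumer.seekToBeginning(Collections.singletonList(tp));
    }

    // Re-read everything produced at or after a given point in time.
    static void seekToTimestamp(KafkaConsumer<String, String> consumer, TopicPartition tp, Instant since) {
        Map<TopicPartition, OffsetAndTimestamp> offsets =
                consumer.offsetsForTimes(Collections.singletonMap(tp, since.toEpochMilli()));
        OffsetAndTimestamp found = offsets.get(tp);
        if (found != null) {                 // null if no record exists at or after that timestamp
            consumer.seek(tp, found.offset());
        }
    }
}
```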

How to get notified about expired kafka events

Is there any mechanism to get notified (by a specific log file entry or similar) when an event in a Kafka topic expires due to retention policies? (I know this should be avoided by design, but still.)
I know about consumer lag monitoring tools that track offset discrepancies between published events and the related consumer groups, but as far as I know they only provide numbers (the offset difference).
In other words: how can we find out whether Kafka events were never consumed and therefore expired?
The log cleaner thread will output deletion events to the broker logs, but they reflect file segments, not particular messages.
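Beyond reading broker logs, one way to approximate "expired before it was ever consumed" programmatically (this is not from the answer above, just an illustrative sketch) is to compare the group's committed offset with the partition's earliest retained offset: if the committed offset is behind the current start of the log, records were deleted without ever being consumed. Broker, group, and topic values are placeholders, and committed(Set) requires Kafka clients 2.4+:

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ExpiredUnconsumedCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder brokers
        props.put("group.id", "my-group");                  // the group to inspect (placeholder)
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // All partitions of the topic to check (placeholder topic name).
            Set<TopicPartition> partitions = consumer.partitionsFor("my-topic").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toSet());

            Map<TopicPartition, Long> earliestRetained = consumer.beginningOffsets(partitions);
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);

            partitions.forEach(tp -> {
                OffsetAndMetadata c = committed.get(tp);
                long start = earliestRetained.get(tp);
                if (c != null && c.offset() < start) {
                    // Everything between the committed offset and the current start of the
                    // log was removed by retention before the group ever read it.
                    System.out.printf("%s: %d event(s) expired unconsumed%n", tp, start - c.offset());
                }
            });
        }
    }
}
```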

Trigger Kafka Consumer on receiving data

I have a producer application which sends data to a Kafka topic, but only once in a while, as and when it receives it from a source. I also have a consumer application (Spark) which keeps running all the time and receives data from Kafka whenever the producer sends it.
Since the consumer keeps running all the time, resources are wasted at times. Given that my producer sends data only once in a while, is there any way to trigger the consumer application only when the Kafka topic receives data?
It sounds like you shouldn't be using Spark and would rather run some serverless solution that can be triggered to run code on Kafka events.
Otherwise, run a cron job that looks at consumer lag. Define a threshold at which to submit your code, and only then do a batch read from Kafka.
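A rough sketch of that lag check, which a cron job could run periodically; the group id, threshold, and the submit step are placeholders, not anything from the answer:

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagTrigger {
    private static final long LAG_THRESHOLD = 100;  // placeholder threshold

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder brokers

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the consuming group (placeholder group id).
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("spark-batch-group")
                         .partitionsToOffsetAndMetadata().get();

            // Latest offsets of the same partitions.
            Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(request).all().get();

            long totalLag = committed.entrySet().stream()
                    .filter(e -> e.getValue() != null)
                    .mapToLong(e -> latest.get(e.getKey()).offset() - e.getValue().offset())
                    .sum();

            if (totalLag >= LAG_THRESHOLD) {
                // Placeholder for whatever actually launches the batch job,
                // e.g. invoking spark-submit or calling a REST trigger.
                System.out.println("Lag is " + totalLag + ", submitting batch read");
            }
        }
    }
}
```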

Kafka Consumer Re-reading Messages

I've seen an issue where all the messages in my topic get re-read by my consumer. I only have one consumer, and I turn it on and off while I'm developing/testing. I notice that sometimes, after days of not running the consumer, when I turn it on again it suddenly re-reads all my messages.
The client id and group id stay the same throughout. I explicitly call commitSync, since my enable.auto.commit=false. I do set auto.offset.reset=earliest, but to my understanding that should only kick in if the offset is deleted on the server. I'm using IBM Bluemix's MessageHub service, so maybe that's automatically deleting an offset?
Does anyone have any clues/ideas?
Thanks.
Yes, offsets are automatically deleted if you don't commit for 24 hours (this is governed by the broker setting offsets.retention.minutes).
This is the default setting with Kafka, and we've not changed it.