In Kafka, is it possible to read messages which are 1 min old? - apache-kafka

I’m trying to work on a use case which requires messages to be processed from a Kafka topic only once they are 1 minute old.
Is there a way in Kafka to only read messages which are 1 minute old?
Thanks in advance.

The short answer is no.
Kafka consumers consume based on either getting the latest message in the topic or the earliest one.
See the docs (search for auto.offset.reset).
I think what you should do is hold a buffer of messages in your consuming application. Let the buffer hold one minute's worth of messages, and process (and drop) entries as soon as they become older than one minute. That way the oldest message in your buffer is never more than one minute old.
That's how I would do it.
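A minimal sketch of that buffering idea, assuming a String-serialized topic named events and group id delayed-processor (both placeholders). Note that auto-committed offsets run ahead of the buffer, so a crash can lose records that were buffered but not yet processed:
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DelayedBuffer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "delayed-processor");  // placeholder group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Oldest records sit at the head of the deque, newest at the tail.
        Deque<ConsumerRecord<String, String>> buffer = new ArrayDeque<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));  // placeholder topic
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(buffer::addLast);
                long cutoff = System.currentTimeMillis() - 60_000L;
                // Process head records once their broker timestamp is at least one minute old.
                while (!buffer.isEmpty() && buffer.peekFirst().timestamp() <= cutoff) {
                    ConsumerRecord<String, String> r = buffer.pollFirst();
                    System.out.printf("processing %s (age %d ms)%n",
                            r.value(), System.currentTimeMillis() - r.timestamp());
                }
            }
        }
    }
}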

You might be able to leverage the offset reset tooling introduced in 0.11.0.0. One issue is that it is a command-line tool and there is no programmatic API for it (yet), but you might be able to sync your application with the tool (or invoke the tool from inside your application) to reset the offset of a partition to 1 minute ago and consume from there:
$ bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --group test.group --topic foo:0,1 --by-duration PT1M
This resets the offsets of partitions 0 and 1 of topic foo to the first message in each partition with a timestamp no older than 1 minute (PT1M is the ISO-8601 duration for one minute; P1M would mean one month). You can then check the timestamp of each message to decide whether it qualifies for processing according to your use case.
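While the reset tool itself has no programmatic API, KafkaConsumer.offsetsForTimes() (added in 0.10.1) gives you equivalent seek-by-time behavior in application code. A minimal sketch, assuming a String-serialized topic named foo (placeholder):
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class SeekOneMinuteBack {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("foo").stream()
                    .map(p -> new TopicPartition("foo", p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            // Find, per partition, the first offset with a timestamp >= one minute ago.
            long oneMinuteAgo = System.currentTimeMillis() - 60_000L;
            Map<TopicPartition, Long> query = partitions.stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> oneMinuteAgo));
            Map<TopicPartition, OffsetAndTimestamp> result = consumer.offsetsForTimes(query);
            result.forEach((tp, oat) -> {
                if (oat != null) {
                    consumer.seek(tp, oat.offset());
                } else {
                    // No message newer than the timestamp; start at the end.
                    consumer.seekToEnd(Collections.singletonList(tp));
                }
            });
            // consumer.poll(...) now starts from roughly one minute back.
        }
    }
}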

You can also do this using Kafka Streams, state stores, and processors. The solution below helps you process messages after a 1-minute delay, but you will still be consuming the messages immediately.
Create a state store and add it to a stream builder. Create a stream using that builder and attach a processor backed by that state store. In the processor, save every incoming message to the state store in process(). Schedule the punctuation to run every 60000 ms and have it fetch the messages that have passed the 1-minute delay and process them.
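A minimal sketch of that topology, with the store name delay-store and topic name input-topic as placeholders. For brevity it keys the store by arrival time in milliseconds, so two messages arriving in the same millisecond would collide; a real implementation would use a composite key:
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

// Buffers each record in a state store and handles it once it is at least one minute old.
public class DelayProcessor implements Processor<String, String> {
    private KeyValueStore<Long, String> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.store = (KeyValueStore<Long, String>) context.getStateStore("delay-store");
        // Check once per minute for messages that have passed the delay.
        context.schedule(Duration.ofSeconds(60), PunctuationType.WALL_CLOCK_TIME, this::punctuate);
    }

    @Override
    public void process(String key, String value) {
        store.put(System.currentTimeMillis(), value);  // remember the arrival time
    }

    private void punctuate(long now) {
        long cutoff = now - 60_000L;
        try (KeyValueIterator<Long, String> it = store.range(0L, cutoff)) {
            while (it.hasNext()) {
                KeyValue<Long, String> entry = it.next();
                // This message is now at least one minute old; handle it here.
                System.out.println("processing delayed message: " + entry.value);
                store.delete(entry.key);
            }
        }
    }

    @Override
    public void close() {}

    // Topology wiring: register the store and attach the processor to the stream.
    public static StreamsBuilder topology() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("delay-store"), Serdes.Long(), Serdes.String()));
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
                .process(DelayProcessor::new, "delay-store");
        return builder;
    }
}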
Hope this helps.

Related

Use case: To construct a message that disappears after some time in Kafka

I have a scenario where I want to send a message to an alert service that would process the message and send it to HipChat.
But I want the message to be active for only a minute. If HipChat is down (hypothetically), the message should not be sent to HipChat.
I am using Kafka, so one of the services sends the message to Kafka, and the message is then consumed by the alert service (a Kafka consumer that polls the topic). While processing, it checks that the difference between the current time and the message's time is not greater than one minute. If it isn't, it sends the message to HipChat asynchronously.
Enhancement:
I want a way to construct a self-destructing message so that it automatically disappears after one minute. Is there a way to do that with Kafka? Or is there a better alternative than Kafka (Flink/SQS)? If yes, how?
You can make use of the Kafka topic configurations retention.ms and delete.retention.ms, as described in the Topic-Level Configs.
retention.ms should be set to 1 minute (60000 ms) and delete.retention.ms should be set to 0 in your case. That way, messages stay in the Kafka topic for about one minute before they get deleted (retention is enforced lazily per log segment, so messages may survive somewhat longer than the configured minute). However, it also means that you might lose messages if your consumer takes more than one minute to consume all messages (especially when reading a topic from the beginning).
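For example, with the config tool that ships with Kafka (the topic name alerts is a placeholder; on newer versions --bootstrap-server replaces --zookeeper):
$ bin/kafka-configs.sh --zookeeper localhost:2181 --alter --entity-type topics --entity-name alerts --add-config retention.ms=60000,delete.retention.ms=0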
Details on those configurations are:
delete.retention.ms: The amount of time to retain delete tombstone markers for log compacted topics. This setting also gives a bound on the time in which a consumer must complete a read if they begin from offset 0 to ensure that they get a valid snapshot of the final stage (otherwise delete tombstones may be collected before they complete their scan).
retention.ms: This configuration controls the maximum time we will retain a log before we will discard old log segments to free up space if we are using the "delete" retention policy. This represents an SLA on how soon consumers must read their data. If set to -1, no time limit is applied.

Stream processing from a specific offset to an end offset

Is it possible to do Kafka Streams processing from a specific offset of an input topic to an end offset?
I have a Kafka Streams application which consumes an input topic, but for some reason it failed. I fixed the issue and started it again, but it started consuming from the latest offset of the input topic. I know the offset of the input topic up to which the application had processed. Now, how can I process the input topic from one offset to another? I am using Confluent Platform 5.1.2.
Kafka Streams supports only two possible values for auto.offset.reset: either "earliest" or "latest". You can't set it to a specific offset in your application code.
There is an option during an application reset, though. If you use the application reset script, you can pass the --to-offset parameter with a specific offset, and it will reset the application's input topics to that point.
<path-to-confluent>/bin/kafka-streams-application-reset --application-id app1 --input-topics a,b --to-offset 1000
You can find the details in the documentation:
https://docs.confluent.io/5.1.2/streams/developer-guide/app-reset-tool.html
If you are fixing a bug, it is usually better to reset to the earliest state if possible.
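Note that the reset tool only controls where the application starts; Kafka Streams itself has no built-in way to stop at an end offset. If Streams semantics are not strictly required for the replay, a plain consumer can read exactly the missed range. A minimal sketch, with topic name, partition, and offsets as placeholders:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class BoundedReplay {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        long startOffset = 1000L;  // last processed offset + 1 (placeholder)
        long endOffset = 5000L;    // stop once this offset is reached (placeholder)
        TopicPartition tp = new TopicPartition("input-topic", 0);
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, startOffset);
            boolean done = false;
            while (!done) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    if (r.offset() >= endOffset) {
                        done = true;  // reached the end of the missed range
                        break;
                    }
                    // Reprocess the missed record here.
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
                }
            }
        }
    }
}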

Kafka Wheel Timer

Good Day,
I would like to find out if a Kafka queue can hold data for a few seconds and then release it.
I receive a message from a Kafka topic.
After parsing the data, I hold it in memory for some time (10 seconds), with each message having its own timer (the buffer builds up as unique messages come through). I want Kafka to tell me that a message has expired (after 10 seconds) so that I can continue with other tasks.
But since Flink/Kafka is event-driven, I was hoping Kafka has some sort of timing wheel that can reproduce the key of a message to the consumer after 10 seconds.
Any idea how I can achieve this using Flink windowing or Kafka features?
Regards
Regarding your initial problem:
"I would like to find out if a Kafka queue can hold data for a few seconds and then release it"
You can set log.cleanup.policy to delete (this is the default) and change retention.ms from the default 604800000 (1 week) to 10000.
Can you explain again what else you want to check, and what you meant after the "Regards" part?
You could look closer at the Kafka Streams library: https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html, https://kafka.apache.org/21/documentation/streams/developer-guide/processor-api.html.
Using Kafka Streams you can do a lot of complex event-processing work. The Processor API is a lower-level API and gives you more flexibility, e.g. put each incoming message in a state store (a Kafka Streams abstraction that is replicated to a changelog topic) and then, with a Punctuator, check whether the message has expired.
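A minimal sketch of that Punctuator idea, following the same pattern as the DelayProcessor sketch in the first answer. The store name arrivals-store is a placeholder, and the store must be registered on the topology and passed to process() as shown there:
import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

// Remembers when each key arrived and reports the key once 10 seconds have passed.
public class ExpiryProcessor implements Processor<String, String> {
    private KeyValueStore<String, Long> arrivals;  // key -> arrival wall-clock time

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.arrivals = (KeyValueStore<String, Long>) context.getStateStore("arrivals-store");
        // Scan for expired keys once per second.
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, this::expire);
    }

    @Override
    public void process(String key, String value) {
        arrivals.putIfAbsent(key, System.currentTimeMillis());
    }

    private void expire(long now) {
        try (KeyValueIterator<String, Long> it = arrivals.all()) {
            while (it.hasNext()) {
                KeyValue<String, Long> entry = it.next();
                if (now - entry.value >= 10_000L) {
                    // The 10-second timer for this key has elapsed; continue other tasks here.
                    System.out.println("expired key: " + entry.key);
                    arrivals.delete(entry.key);
                }
            }
        }
    }

    @Override
    public void close() {}
}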

Is it possible to consume Kafka messages after arrival?

I would like to consume events from a Kafka topic after the time they arrive. The time at which I want the event to be consumed is in the payload of the message. Is it possible to achieve something like that in Kafka? What are the drawbacks of it?
Practical example: a message M is produced at 12:10, arrives in my Kafka topic at 12:11, and I want the consumer to poll it at 12:41 (30 minutes after arrival).
Kafka has a default retention period of 7 days for all topics. You can therefore consume up to a week's worth of data at any moment, the drawback being network saturation if you are constantly doing this.
If you want to consume data that is not at the latest offset, then for any new consumer group you would set auto.offset.reset=earliest. Otherwise, for existing groups, you would need to use the kafka-consumer-groups --reset-offsets command in order to re-consume an already consumed record.
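For example (group and topic names are placeholders, and the consumer group must be inactive while you reset it):
$ bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-group --topic my-topic --reset-offsets --to-datetime 2019-06-01T12:11:00.000 --execute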
Sometimes you may want to start from the beginning of a topic, for example if you have a compacted topic, in order to rebuild the "deltas" of the data within it - look up the "Stream / Table Duality".
The time on which I want the event to be consumed is in the payload of the message
Since KIP-32, every message has a timestamp outside the payload, by the way.
I want the consumer to poll it ... (30 minutes after arrival)
Sure, you can start a consumer whenever you like; as long as the data is within the retention window, you will get that event.
There isn't a way to finely control when that happens other than actually starting your consumer at that time, for example 30 minutes later. You could play with max.poll.records and max.poll.interval.ms, but I find anything larger than a few seconds really isn't a use case for Kafka.
For example, you could instead have a TimerTask around a consumer thread, or Spark or MapReduce scheduled with an Oozie/Airflow task that reads a maximum amount of records.
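A minimal sketch of the TimerTask variant, with group id and topic name as placeholders. Each run reads one batch of the accumulated backlog; loop on poll() if you need to drain it fully:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.Timer;
import java.util.TimerTask;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ScheduledDrain {
    public static void main(String[] args) {
        Timer timer = new Timer();
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                drainOnce();  // fires every 30 minutes
            }
        }, 0, 30 * 60 * 1000L);
    }

    static void drainOnce() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "delayed-reader");  // placeholder group id
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));  // placeholder topic
            // The first poll joins the group and fetches a batch of the backlog.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
            records.forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            consumer.commitSync();  // the group offset tracks progress between runs
        }
    }
}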

Verify if Kafka queue is empty

We have pushed messages onto Kafka queues 2 days back, and the retention is set to 2 days, so they are going to expire today. Is there any way of knowing exactly when the Kafka queues are empty / no longer hold any data?
I am a beginner in the Hadoop ecosystem, so I don't know if there is a command or an easy way to verify that the queues are empty.
You can write your own little tool using KafkaConsumer, leveraging seekToEnd(), seekToBeginning(), and position() to get the min and max offsets per partition. If both match for all partitions, the topic is empty.
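A minimal sketch of such a tool, assuming a topic named my-topic (placeholder):
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class TopicEmptyCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        String topic = "my-topic";  // placeholder topic name
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(topic, p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);  // jump to the min offset of each partition
            Map<TopicPartition, Long> minOffsets = new HashMap<>();
            for (TopicPartition tp : partitions) {
                minOffsets.put(tp, consumer.position(tp));  // position() resolves the real offset
            }
            consumer.seekToEnd(partitions);  // jump to the max offset of each partition
            boolean empty = true;
            for (TopicPartition tp : partitions) {
                if (consumer.position(tp) != minOffsets.get(tp)) {
                    empty = false;  // at least one partition still holds data
                }
            }
            System.out.println(empty ? "Topic is empty" : "Topic still contains data");
        }
    }
}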