Is it possible to consume Kafka messages after arrival? - apache-kafka

I would like to consume events from a Kafka topic some time after they arrive. The time at which I want the event to be consumed is in the payload of the message. Is it possible to achieve something like that in Kafka? What are the drawbacks of it?
Practical example: a message M is produced at 12:10, arrives in my Kafka topic at 12:11, and I want the consumer to poll it at 12:41 (30 minutes after arrival).

Kafka has a default retention period of 7 days for all topics. You can therefore consume up to a week's worth of data at any moment, the drawback being network saturation if you are constantly doing this.
If you want to consume data that is not at the latest offset, then for any new consumer group you would set auto.offset.reset=earliest. Otherwise, for existing groups, you would need to use the kafka-consumer-groups --reset-offsets command in order to re-consume an already consumed record.
Sometimes you may want to start from the beginning of a topic, for example if you have a compacted topic, in order to rebuild the "deltas" of the data within it - look up the "Stream / Table Duality".
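For a brand-new group, that can be as simple as the following consumer setup (a minimal sketch; the group and topic names are illustrative):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-group");      // new group => no committed offsets yet
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // so we start from the oldest retained record
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));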
The time at which I want the event to be consumed is in the payload of the message
Since KIP-32, every message has a timestamp outside the payload, by the way.
I want the consumer to poll it ... (30 minutes after arrival)
Sure, you can start a consumer whenever you like; as long as the data is within the retention window, you will get that event.
There isn't a way to finely control when that happens other than actually starting your consumer at that time, for example 30 minutes later. You could play with max.poll.records and max.poll.interval.ms, but I find that anything larger than a few seconds really isn't a use case for Kafka.
For example, you could instead wrap a consumer thread in a TimerTask, or have a Spark or MapReduce job scheduled with Oozie/Airflow that reads a maximum amount of records. Alternatively, filter on the record timestamp inside the consumer itself, as sketched below.
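A minimal sketch of that consumer-side filter, reusing the consumer from above. It assumes the topic is configured with message.timestamp.type=LogAppendTime, so that record.timestamp() is the broker arrival time; the topic name and the process() handler are made up:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

long delayMs = Duration.ofMinutes(30).toMillis();
consumer.subscribe(Collections.singletonList("events"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    for (TopicPartition tp : records.partitions()) {
        for (ConsumerRecord<String, String> record : records.records(tp)) {
            if (record.timestamp() + delayMs > System.currentTimeMillis()) {
                // Not due yet. With LogAppendTime every later record in this
                // partition is at least as new, so rewind to it and stop here.
                consumer.seek(tp, record.offset());
                break;
            }
            process(record); // hypothetical handler
        }
    }
}

Note that seek only moves the fetch position, so offsets for records that are not yet due are never committed past.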

Related

Kafka Streams: reprocessing old data when windowing

I have a Kafka Streams application that performs windowing (using original event time, not wall-clock time) via stream joins of e.g. 1 day.
If I bring up this topology and reprocess the data from the start (as in a lambda-style architecture), will this window keep that old data there?
For example: if today is 2022-01-09, and I'm receiving data from 2021-03-01, will this old data enter the table, or will it be rejected from the start?
In that case - what strategies can be done to reprocess this data?
UPDATE: Using Kafka Streams 2.5.0
Updated answer for Kafka Streams version 2.5:
When using event time, Kafka Streams will behave independently of the wall-clock time, as long as no events contain the wall-clock time. You should not have configured a WallclockTimestampExtractor as your timestamp extractor.
Kafka Streams will assign your input topic partitions to stream tasks, which will consume the partitions one event at a time. On any given topic, at most one partition will be assigned to a stream task. Time-windowed aggregations are carried out for each stream task separately. Kafka Streams uses an internal timestamp called "observedStreamTime" for each aggregation to keep track of the maximum timestamp seen so far. Incoming records have their timestamps checked against the observedStreamTime. If they are older than the retention + grace period of the configured time window store, they will be dropped. Otherwise, they will be aggregated according to the configuration. The implementation can be found at https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamWindowAggregate.java#L108-L175
This processing will always yield the same result if the Kafka Streams application is reset; it is independent of the execution time of the processing. If events are dropped, the corresponding metrics are changed.
There is one caveat with this approach when multiple topics are consumed. The observedStreamTime will reflect the highest timestamp of all partitions read by the stream task. If you have two topics (maybe because you want to join them) and one contains considerably younger data than the other (maybe because the latter received no new data), the observedStreamTime will be dominated by the younger topic. Events from the older topic might be dropped if the time window configuration does not have a long enough retention or grace period. See the JavaDoc of TimeWindows for the configuration options: https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindows.java
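For illustration, this is roughly what a 1-day windowed aggregation with a generous grace period looks like against the 2.5 API. The topic name, serdes and the 300-day grace value are made up; pick a grace period that covers the lateness you actually expect, since it also grows the store retention:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

StreamsBuilder builder = new StreamsBuilder();
TimeWindows windows = TimeWindows.of(Duration.ofDays(1)) // 1-day windows, as in the question
        .grace(Duration.ofDays(300));                    // accept records up to ~10 months behind stream time

KTable<Windowed<String>, Long> counts = builder
        .stream("events", Consumed.with(Serdes.String(), Serdes.String()))
        .groupByKey()
        .windowedBy(windows)
        .count();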
In your example, the old data will be accepted as long as the stream time has not progressed too far. Reprocessing the whole data set should work, since it will progress linearly through your topic. If the old data falls into a time window whose window size + grace period has already been exceeded by the stream time, Kafka Streams will reject the record. In that case Kafka Streams will also issue an error message and adjust its metrics accordingly, so this behaviour should be easy to pick up.
I suggest trying out this reprocessing if feasible, and watching the logs and metrics.

Get Kafka consume time for each group

I want to get the consume time for each message by each group in Kafka, but I don't want the commit time that is persisted in the __consumer_offsets topic, because it is possible that a message is consumed but not committed, and I want the consume time, not the commit time.
In a nutshell, I need the consume time for each message by each group.
Is there a way to get this time?

Use case: To construct a message that disappears after some time in Kafka

I have a scenario where I want to send a message to an alert service that would process the message and send it to HipChat.
But I want the message to be active only for a minute. If HipChat is down (hypothetically), then the message should not be sent to HipChat.
I am using Kafka, so one of the services sends the message to Kafka, and the message is then consumed by the alert service (a Kafka consumer that polls the topic). While processing, it checks that the difference between the current time and the message's time is not greater than one minute; if it is within the limit, it sends the message to HipChat asynchronously.
Enhancement:
I want a way to construct a self-destructing message that automatically disappears after one minute. Is there a way to do it with Kafka? Or is there a better alternative than Kafka (Flink/SQS)? If yes, how?
You can make use of the Kafka topic configurations retention.ms and delete.retention.ms, as described in the Topic-Level Configs documentation.
The retention.ms should be set to 1 minute (60000 ms) and the delete.retention.ms should be set to 0 in your case. That way, the messages will stay in the Kafka topic for one minute before they get deleted. However, that also means that you might lose messages if your consumer takes more than one minute to consume all messages (especially when reading a topic from the beginning).
Details on those configurations are:
delete.retention.ms: The amount of time to retain delete tombstone markers for log compacted topics. This setting also gives a bound on the time in which a consumer must complete a read if they begin from offset 0 to ensure that they get a valid snapshot of the final stage (otherwise delete tombstones may be collected before they complete their scan).
retention.ms: This configuration controls the maximum time we will retain a log before we will discard old log segments to free up space if we are using the "delete" retention policy. This represents an SLA on how soon consumers must read their data. If set to -1, no time limit is applied.
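For completeness, a sketch of applying those settings with the AdminClient (the topic name "alerts" is made up; error handling omitted). Bear in mind that deletion happens at log-segment granularity and only when the broker's retention check runs, so records may outlive the one-minute limit slightly; the consumer-side timestamp check is still worth keeping:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
try (AdminClient admin = AdminClient.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "alerts");
    List<AlterConfigOp> ops = Arrays.asList(
            new AlterConfigOp(new ConfigEntry("retention.ms", "60000"), AlterConfigOp.OpType.SET),
            new AlterConfigOp(new ConfigEntry("delete.retention.ms", "0"), AlterConfigOp.OpType.SET));
    admin.incrementalAlterConfigs(Collections.singletonMap(topic, ops)).all().get();
}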

Kafka Wheel Timer

Good Day,
I would like to find out if a Kafka queue can hold data for a few seconds and then release the data.
I receive a message from a Kafka topic. After parsing the data, I hold it in memory for some time (10 seconds); this builds up as unique messages come through, with each message having its own timer. I want Kafka to tell me that a message has expired (after 10 seconds) so that I can continue with other tasks.
But since Flink/Kafka is event driven, I was hoping Kafka has some sort of timing wheel that can replay the key of a message to the consumer after 10 seconds.
Any idea on how I can achieve this using Flink windowing or Kafka features?
Regards
Regarding your initial problem:
I would like to find out if a Kafka queue can hold data for a few seconds and then release the data
You can set log.cleanup.policy to delete (this is the default) and change retention.ms from the default 604800000 (1 week) to 10000.
Can you explain again what else you want to check, and what you meant after the Regards part?
You could take a closer look at the Kafka Streams library: https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html, https://kafka.apache.org/21/documentation/streams/developer-guide/processor-api.html
Using Kafka Streams you can do a lot of complex event-processing work. The Processor API is a lower-level API and gives you more flexibility: for example, put each incoming message in a state store (a Kafka Streams abstraction that is replicated to a changelog topic) and then, with a Punctuator, check whether the message has expired, as sketched below.
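A rough sketch of that Punctuator approach against the 2.1-era Processor API. The store name "pending", the 10-second TTL, and the String types are all assumptions; the store itself would be registered on the topology separately:

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class ExpiryProcessor implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, Long>) context.getStateStore("pending");
        // Fire once per second on wall-clock time and emit every key older than 10 s.
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (KeyValueIterator<String, Long> it = store.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    if (now - entry.value >= 10_000L) {
                        context.forward(entry.key, "expired"); // downstream learns the key timed out
                        store.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public void process(String key, String value) {
        store.put(key, System.currentTimeMillis()); // remember wall-clock arrival time
    }

    @Override
    public void close() { }
}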

Kafka producer buffering

Suppose there is a producer that is running, and I start a consumer a few minutes later. I noticed that the consumer will consume old messages that were produced by the producer, but I don't want that to happen. How can I do that? Are there any config parameters in the broker to be set to solve this problem?
It really depends on the use case; you didn't really provide much information about the architecture. For instance - once the consumer is up, is it a long-running consumer, or does it just wake up for a short while and consume newly arriving messages?
You can take any of the following approaches:
Filter each ConsumerRecord by timestamp, so you automatically throw away messages that were produced more than a configurable amount of time ago.
In my team we're using ephemeral groups. That is - each time the service goes up, we generate a new group id for the consumer group and set auto.offset.reset to latest.
Seek to timestamp - since Kafka 0.10 you can seek to a certain position. Use consumer.offsetsForTimes to get the offset of each topic partition for the desired time, and then use consumer.seek to jump to the given offset (see the sketch after this list).
If you use a consumer group but never commit to Kafka, then each time a consumer is assigned to a topic partition, it will start consuming according to the auto.offset.reset policy...
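A minimal sketch of the seek-to-timestamp approach (the topic and group names are made up; error handling omitted):

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "fresh-only");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("events"));
    while (consumer.assignment().isEmpty()) {
        consumer.poll(Duration.ofMillis(100)); // wait for assignment; anything returned here is skipped by the seek below
    }

    // Ask the broker which offset corresponds to "now" in every assigned partition...
    long startFrom = System.currentTimeMillis();
    Map<TopicPartition, Long> query = new HashMap<>();
    for (TopicPartition tp : consumer.assignment()) {
        query.put(tp, startFrom);
    }
    // ...and jump there, skipping everything older.
    for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
        if (e.getValue() != null) { // null = no record at or after the timestamp
            consumer.seek(e.getKey(), e.getValue().offset());
        }
    }
    // regular poll loop from here on
}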