Kafka - consuming messages based on timestamp

I'm kind of new to Kafka but need to implement logic for a consumer to consume from a particular topic based on timestamp. Another use case is to be able to consume a particular time range (for example from 10:00 to 10:20). The range will always be divisible by 5 minutes - meaning I won't need to consume from, for example, 10:00 to 10:04. The logic I was thinking of would be as follows:
create a table where I store timestamp and Kafka message offset (timestamp | offset)
create a console/service which does the following every 5 minutes:
Get all partitions for a topic
Query all partitions for the min offset value (a starting point)
Store the offset and timestamp in the table
Now if everything is alright I should have something like this in the table:
10:00 | 0
10:05 | 100
10:10 | 200
HH:mm | (some number)
Now having this I could start the consumer at any time and knowing the offsets I should be able to consume just what I need.
Does it look right or have I made a flaw somewhere? Or maybe there is a better way of achieving the required result? Any thoughts or suggestions would be highly appreciated.
P.S.: one of my colleagues suggested using partitions and working with each partition separately... Meaning if I have a topic and the replica count is, for example, 5 - then I'd need to save offsets 5 times for my topic for every interval (once per partition). The consumer would then also need to account for the partitions and consume based on the offsets I have for each partition. But this adds extra complexity, which I'm trying to avoid...
Thanks in advance!
BR,
Mike

No need for tables.
You can use the offsetsForTimes and seek methods of a Consumer instance to find, for each partition, the offset of the first record at or after a given timestamp, and move that partition there.
Partitioning might work... 12 partitions for 5-minute message intervals (one per 5-minute slot of an hour)
I don't think replication addresses your problem.
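For illustration, here's a minimal sketch of that approach (topic name, bootstrap server, and the stop condition are placeholders of mine): offsetsForTimes resolves the starting offset per partition for the range start, and the loop polls until records pass the range end.

```java
import java.time.Duration;
import java.util.*;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TimeRangeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        long startTs = 1600000000000L;            // epoch millis of the range start (e.g. 10:00)
        long endTs   = startTs + 20 * 60 * 1000L; // range end (e.g. 10:20)

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // All partitions of the topic.
            List<TopicPartition> partitions = consumer.partitionsFor("my-topic").stream()
                    .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            // Ask the broker for the first offset at or after startTs, per partition.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, startTs));
            consumer.offsetsForTimes(query).forEach((tp, offsetAndTs) -> {
                if (offsetAndTs != null) consumer.seek(tp, offsetAndTs.offset());
            });

            // Poll and keep records until their timestamps pass endTs.
            // (Simplified stop condition: an empty poll also ends the loop.)
            boolean more = true;
            while (more) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                more = false;
                for (ConsumerRecord<String, String> rec : records) {
                    if (rec.timestamp() < endTs) {
                        more = true;
                        System.out.printf("%d %s%n", rec.timestamp(), rec.value());
                    }
                }
            }
        }
    }
}
```

Because the lookup is done per partition, this also handles the multi-partition case from the P.S. without keeping any offset table of your own.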

Related

Is there any way I can maintain ordering (based on an attribute not by the message time) in a single partition of a kafka topic?

Let's say this is a one-partition topic, and while consuming I want to read the messages in a sequence based on one of the attributes (let's assume attr1) in each message.
Message 1 ('attr1'="01") was posted to that partition at 9:50 pm.
Message 2 ('attr1'="03") was posted to that partition at 9:55 pm.
Message 3 ('attr1'="02") was posted to that partition at 10:55 pm.
I want to consume them in sequence based on the attr1 value, so Message 1, Message 3, and Message 2 should be my consuming order.
No, that is not possible.
A fundamental thing to remember about Kafka is the offset. When you write a message to a partition, it always gets an incrementing offset.
In your example, if
message 1 gets offset 1
message 2 will get offset 2
message 3 will get offset 3
On the consumer side as well, messages are always read in order of increasing offset. You can tell your consumer to start reading from a particular offset, but once it starts reading, it will always receive messages in order of increasing offset.
You can use alternative tools such as ksqlDB or Kafka Streams to first read the entire topic, then sort based on custom attributes, or use the Punctuator class to delay processing based on time windows.
Otherwise, Kafka Connect can dump to a database, where you can query/sort based on columns/fields/attributes, etc.
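To make the client-side option concrete, here is a rough sketch (my own, with a placeholder topic name and a crude attribute parser standing in for real JSON handling): drain a small, bounded topic into memory, then sort by attr1 before processing. This only works if the whole topic fits in memory.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SortByAttribute {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "attr-sorter");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        List<ConsumerRecord<String, String>> buffer = new ArrayList<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // hypothetical topic name
            // Drain until a poll comes back empty; fine for a small bounded topic,
            // a real job would track end offsets instead.
            ConsumerRecords<String, String> batch;
            while (!(batch = consumer.poll(Duration.ofSeconds(2))).isEmpty()) {
                batch.forEach(buffer::add);
            }
        }
        // Sort by attr1, then process in that order (here: just print).
        buffer.sort(Comparator.comparing(SortByAttribute::extractAttr1));
        buffer.forEach(rec -> System.out.println(extractAttr1(rec) + " -> " + rec.value()));
    }

    // Crude stand-in for however attr1 is actually parsed from the payload.
    private static String extractAttr1(ConsumerRecord<String, String> rec) {
        return rec.value().replaceAll(".*\"attr1\"\\s*:\\s*\"([^\"]+)\".*", "$1");
    }
}
```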

Bucketizing Kafka Data with Partitions

I have a situation where I'm loading data into Kafka. I would like to process the records in discrete 10-minute buckets. But bear in mind that the record timestamps come from the producers, so they may not be perfectly in order; I can't simply use the standard Kafka consumer approach, since that would result in records outside of my discrete bucket.
Is it possible to use partitions for this? I could look at the timestamp of each record before placing it in the topic, using that to select the appropriate partition. But I don't know if Kafka supports ad hoc named partitions.
They aren't "named" partitions. Sure, you could define a topic with 6 partitions (10 minute "buckets", ignoring hours and days) and a Partitioner subclass that computes which partition the record timestamp will go into with a simple math function, however, this is really only useful for ordering and doesn't address that you need to consume from two partitions for every non-exact 10 minute interval. E.g. records at minute 11 (partition 1) would need to consume records with minute 1-9 (partition 0).
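For what it's worth, a sketch of such a Partitioner subclass might look like this. Note the Partitioner API doesn't see the record timestamp, so this assumes (my assumption) that the producer puts the epoch-millis timestamp in the key:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class TenMinuteBucketPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Assumes the producer sends the record's epoch-millis timestamp as the key.
        long ts = (Long) key;
        int minuteOfHour = (int) ((ts / 60_000) % 60);
        return minuteOfHour / 10; // partitions 0..5, one per 10-minute bucket
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}
```

The producer would register it with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, TenMinuteBucketPartitioner.class.getName()).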
Overall, sounds like you want sliding/hopping windowing features of Kafka Streams, not the plain Consumer API. And this will work without writing custom Producer Partitioners with any number of partitions.
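And a minimal Kafka Streams sketch of the windowing approach (application id, topic name, and window sizes are placeholders of mine); the grace period is what absorbs the slightly out-of-order producer timestamps:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.*;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class TenMinuteBuckets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ten-minute-buckets");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events"); // hypothetical input topic
        events.groupByKey()
              // 10-minute windows; the 1-minute grace period tolerates records
              // whose producer timestamps arrive slightly out of order.
              .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofMinutes(10), Duration.ofMinutes(1)))
              .count()
              .toStream()
              .foreach((windowedKey, count) ->
                  System.out.println(windowedKey.window().startTime() + " -> " + count));

        new KafkaStreams(builder.build(), props).start();
    }
}
```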

Kafka partitioning with respect to timestamp to process only messages in the current hour's partition

I am working on a messaging system where messages are generated by users and have a time parameter.
I have a consumer which runs a job every hour, looks for messages whose time matches the current hour, and sends those messages.
I want to partition the topic "message" based on this time/timestamp, in groups of one hour per partition, so that my consumer will only process one partition every hour instead of going through all messages every hour.
As of now I have a producer that produces messages as key-value pairs, where the key is the time rounded off to the hour.
I have two questions:
I can create 2-3 partitions by specifying this in the Kafka settings, but how can I have a partition for every hour slot?
Should I rather create a new topic for every hour and have the consumer only listen to the topic of the current hour?
For example, I create a topic named, say, "2020-7-23-10" which contains all messages that need to be delivered between 10 and 11 AM on 23 July 2020, so I can just subscribe to this topic and process them.
Or I can create a topic named "messages", partition it based on the time, and force my consumer to only process a particular partition.
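For reference, the producer described above might be sketched like this (topic name and serializers are my placeholders). Note that with the default partitioner, two different hour keys can still hash to the same partition, which is why question 1 has no clean answer with plain partitions:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Properties;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;

public class HourKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key is the delivery time rounded down to the hour, so all messages
            // for the same hour hash to the same partition.
            String hourKey = Instant.now().truncatedTo(ChronoUnit.HOURS).toString();
            producer.send(new ProducerRecord<>("message", hourKey, "hello"));
        }
    }
}
```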

Unique messages from a Kafka topic within a time interval

I have a Kafka topic to which 1500 messages/sec are produced by different producers, with each message having two fixed keys, RID and Date (there are other keys too, which vary for each message).
Is there a way to introduce a delay of 1 min in the topic and consume only unique messages in the 1 min window?
Example - in a minute there could be around 90K messages, among which there could be 1000 (random value) messages with RID as 1 and Date as 1st Jan 2020.
{"RID": "1" , "Date": "2020-01-01", ....}
I would like to consume only 1 message among the 1000 (any one of them, at random) after the 1 minute is completed.
Note: There are 3 partitions for the topic.
What you want does not seem to be possible. The brokers cannot deliver messages based on your business logic; they can only deliver all messages.
However, you could implement a client side cache to "de-duplicate" the messages accordingly, and only process a sub-set of the messages after "de-duplication".
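A rough sketch of such a client-side cache (mine; the topic name and the JSON field parsing are placeholders): buffer one minute of records, keep the first per (RID, Date), then process the survivors.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MinuteDeduplicator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "minute-dedup");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // hypothetical topic name
            while (true) {
                // Buffer one minute's worth of records, keeping only the first
                // message seen per (RID, Date) pair.
                Map<String, ConsumerRecord<String, String>> unique = new HashMap<>();
                long deadline = System.currentTimeMillis() + 60_000;
                while (System.currentTimeMillis() < deadline) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                        String key = field(rec.value(), "RID") + "|" + field(rec.value(), "Date");
                        unique.putIfAbsent(key, rec);
                    }
                }
                // Process the de-duplicated batch (here: just print).
                unique.values().forEach(rec -> System.out.println(rec.value()));
            }
        }
    }

    // Crude stand-in for real JSON parsing of a field like "RID" or "Date".
    private static String field(String json, String name) {
        return json.replaceAll(".*\"" + name + "\"\\s*:\\s*\"([^\"]+)\".*", "$1");
    }
}
```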
I'm not completely sure about your question, but it seems you need log compaction.
Compaction removes older messages that share a key with a newer one, so only the latest record per key is retained; you only need to enable compaction for the topic and use the RID as the record key.
Hope this can help you
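If you go that route, compaction is a topic-level config; here's a sketch using the AdminClient (topic name and sizing are placeholders), keeping in mind that producers must also send the RID as the record key for compaction to deduplicate on it:

```java
import java.util.*;
import org.apache.kafka.clients.admin.*;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("my-topic", 3, (short) 1)
                    // Keep only the latest record per key instead of deleting by age.
                    .configs(Map.of("cleanup.policy", "compact"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```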

Is it possible to consume Kafka messages after arrival?

I would like to consume events from a Kafka topic some time after they arrive. The time at which I want the event to be consumed is in the payload of the message. Is it possible to achieve something like that in Kafka? What are the drawbacks of it?
Practical example: a message M is produced at 12:10, arrives in my Kafka topic at 12:11, and I want the consumer to poll it at 12:41 (30 minutes after arrival).
Kafka has a default retention period of 7 days for all topics. You can therefore consume up to a week's data at any moment, the drawback being network saturation if you are constantly doing this.
If you want to consume data that is not at the latest offset, then for any new consumer group you would set auto.offset.reset=earliest. Otherwise, for existing groups, you would need to use the kafka-consumer-groups --reset-offsets command in order to re-consume an already consumed record.
Sometimes you may want to start from the beginning of a topic, for example if you have a compacted topic, in order to rebuild the "deltas" of the data within it - look up the "Stream / Table Duality".
The time on which I want the event to be consumed is in the payload of the message
Since KIP-32, every message has a timestamp outside the payload, by the way.
I want the consumer to poll it ... (30 minutes after arrival)
Sure, you can start a consumer whenever you like; as long as the data is within the retention window, you will get that event.
There isn't a way to finely control when that happens other than actually running your consumer at that time, for example 30 minutes later. You could play with max.poll.records and max.poll.interval.ms, but I find that anything larger than a few seconds really isn't a use case for Kafka.
Instead, you could have a TimerTask around a consumer thread, or a Spark or MapReduce job scheduled with an Oozie/Airflow task, that reads a max amount of records.
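As a sketch of that idea (mine; topic name and delay handling are placeholders): read the record's KIP-32 timestamp and sleep until the record is 30 minutes old. Note that max.poll.interval.ms has to be raised, or the group coordinator will evict the consumer while it sleeps.

```java
import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DelayedConsumer {
    private static final long DELAY_MS = 30 * 60 * 1000L; // consume 30 min after arrival

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "delayed-reader");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Processing may stall for up to 30 minutes between polls.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, String.valueOf(2 * DELAY_MS));

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // hypothetical topic name
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    // Broker- or producer-assigned timestamp, per KIP-32.
                    long eligibleAt = rec.timestamp() + DELAY_MS;
                    long wait = eligibleAt - System.currentTimeMillis();
                    if (wait > 0) Thread.sleep(wait); // crude; a real job would pause()/resume()
                    System.out.println(rec.value());
                }
            }
        }
    }
}
```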