Kafka partitioning with respect to timestamp to process only messages in the partition of the current hour - apache-kafka

I am working on a messaging system where messages are generated by users and carry a time parameter.
I have a consumer that runs a job every hour, looks for messages whose time matches the current hour, and sends those messages.
I want to partition the topic "message" based on this time/timestamp in groups of one hour per partition, so that my consumer only processes one partition every hour instead of going through all messages every hour.
As of now I have a producer that produces messages as key-value pairs, where the key is the time rounded off to the hour.
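For illustration, a minimal sketch of one way such an hour-rounded key could be computed (the function name is hypothetical; with Kafka's default partitioner, equal keys hash to the same partition):

```python
from datetime import datetime, timezone

def hour_bucket_key(ts: datetime) -> str:
    """Round a timestamp down to the hour to use as a message key."""
    return ts.replace(minute=0, second=0, microsecond=0).strftime("%Y-%m-%d-%H")

# Both messages fall in the 10:00-11:00 slot, so they share a key and
# the default partitioner will hash them to the same partition.
print(hour_bucket_key(datetime(2020, 7, 23, 10, 5, tzinfo=timezone.utc)))   # 2020-07-23-10
print(hour_bucket_key(datetime(2020, 7, 23, 10, 59, tzinfo=timezone.utc)))  # 2020-07-23-10
```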
I have two questions:
I can create 2-3 partitions by specifying them in the Kafka settings, but how can I have a separate partition for every hour slot?
Should I instead create a new topic for every hour and have the consumer listen only to the topic for the current hour?
For example, I could create a topic named "2020-7-23-10" which would contain all messages that need to be delivered between 10-11AM on 23 July 2020, so I can just subscribe to this topic and process them.
Or I could create a topic named "messages", partition it based on time, and force my consumer to only process a particular partition.

Related

Is there any way I can maintain ordering (based on an attribute not by the message time) in a single partition of a kafka topic?

Let's say, this is a one-partition topic, and while consuming the message I want to read it in a sequence based on one of the attributes (Let's assume attr1) in the message.
Message 1 ('attr1'="01") was posted to that partition at 9:50 pm.
Message 2 ('attr1'="03") was posted to that partition at 9:55 pm.
Message 3 ('attr1'="02") was posted to that partition at 10:55 pm.
I want to consume it in the sequence based on the attr1 value, so Message1, Message3, and Message2 should be my consuming order.
No, that is not possible.
A fundamental thing to remember about Kafka is the offset. When you write a message to a partition, it always gets a monotonically increasing offset.
In your example, if
message 1 gets offset 1
message 2 will get offset 2
message 3 will get offset 3
On the consumer side as well, messages will always be read in order of increasing offsets. You can tell your consumer to start reading from a particular offset, but once it starts reading, it will always receive messages in order of increasing offset.
You can use alternative tools such as ksqlDB or Kafka Streams to first read the entire topic and then sort based on custom attributes, or use Kafka Streams' Punctuator to delay processing based on time windows.
Otherwise, Kafka Connect can dump to a database, where you can query/sort based on columns/fields/attributes, etc.
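Whichever tool you use, the reordering has to happen after consumption, since Kafka itself only delivers by offset. A minimal sketch of that idea, assuming the records have already been polled into a list (the record layout here is hypothetical):

```python
# Records as (attr1, payload) pairs, in the order Kafka delivered them
# (increasing offset). Sorting by attr1 can only happen after consumption.
records = [("01", "Message 1"), ("03", "Message 2"), ("02", "Message 3")]

ordered = sorted(records, key=lambda r: r[0])
print([payload for _, payload in ordered])  # ['Message 1', 'Message 3', 'Message 2']
```

Note this only works once you have a bounded batch in hand; for an unbounded stream you would need windowing, which is what the Kafka Streams / ksqlDB suggestion above provides.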

Copy messages from one Kafka topic to another using offsets/timestamps

For some data processing, we need to reprocess all the messages between two timestamps,
say between 1st Jan and 15th Jan.
To control the upper bound, we are planning to create a new topic that will hold these messages, so that once this task is complete we can delete the topic too.
The new topic will have data from a particular offsets of source topic
partition 1 - from offset 100
partition 2 - from offset 2400...
and so on
What is the most suitable solution for this? Approximately 10 lakh (1 million) messages fall in this range.
Create a consumer from the source topic.
Call .assign for the partitions you want to copy
Call .seek for each starting offset of those partitions. You can use offsetsForTimes method to get them for a specific timestamp; then you can pass those on to the seek method.
Create a Producer
Start a poll loop (ideally one thread per partition, each thread holding a reference to the created producer).
While polling, check the timestamp of each record.
If the record timestamp exceeds the end date you're reading up to, stop the poll loop / thread.
Else, send that data via the producer to your output topic
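The heart of those steps is the bounded poll loop. A minimal sketch of that part, with the Kafka client calls abstracted away (the function name and the `(timestamp, value)` record shape are hypothetical; in practice `records` would come from `consumer.poll()` after `assign`/`offsetsForTimes`/`seek`, and `send` would be `producer.send` to the output topic):

```python
def copy_until(records, end_ts_ms, send):
    """Forward records to `send` until one's timestamp reaches the upper bound.

    `records` yields (timestamp_ms, value) pairs in offset order, as a poll
    loop over one partition would; returns the number of records copied.
    """
    copied = 0
    for ts, value in records:
        if ts >= end_ts_ms:
            break  # reached the upper-bound timestamp: stop this partition's loop
        send(value)
        copied += 1
    return copied

# Simulated partition contents with increasing timestamps; cutoff at 2500 ms.
out = []
n = copy_until(iter([(1000, b"a"), (2000, b"b"), (3000, b"c")]), 2500, out.append)
print(n, out)  # 2 [b'a', b'b']
```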

Capturing data which is just about to be discarded from a Kafka topic?

Kafka topic's retention period is 7 days. But I need to push data which is expiring because of retention period to new kafka topic or some other storage.
So is there any method by which I can access the data that is going to be deleted after 7 days, just before it gets deleted? Or a way to set up a process that automatically pushes the soon-to-be-deleted data somewhere else?
Since Kafka 0.10, each message has a timestamp. Simply set up a consumer group that starts every hour, processes each topic partition from the initial offset (auto.offset.reset=earliest), and pushes to the new topic the messages whose timestamps fall within the upcoming expiration window (one hour wide); then the consumer group stops and is restarted one hour later.
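The filter in that hourly job is just timestamp arithmetic. A sketch of the predicate, assuming the default 7-day retention and an hour-wide window (the function name is hypothetical):

```python
DAY_MS = 24 * 3600 * 1000

def expiring_within(record_ts_ms, now_ms,
                    retention_ms=7 * DAY_MS, window_ms=3600 * 1000):
    """True if retention will delete this record within the next window."""
    expiry = record_ts_ms + retention_ms
    return now_ms <= expiry < now_ms + window_ms

# Record written almost 7 days ago: due for deletion within the hour.
print(expiring_within(0, 7 * DAY_MS - 30 * 60 * 1000))  # True
# Record written only 1 day ago: safe for now.
print(expiring_within(0, 1 * DAY_MS))                   # False
```

Records where the predicate is true get re-produced to the new topic before the broker's log cleaner removes them.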

Number of new messages received in a kafka topic per unit time

I am trying to chart number of new messages received per unit time (minute or hour) in a given Kafka topic.
I have seen posts about finding the number of current messages in a topic. As a potential solution, I could query this number at each time interval; however, that doesn't account for expired messages (removed due to retention time).
Is there a way to get the number of new messages received in a kafka topic per unit time?
Among the JMX metrics you can find kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec, which indicates the incoming message rate. You can store it in Prometheus or another time-series database and query it over time.
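If JMX isn't available, one alternative sketch: periodically sample the topic's end offsets (summed over partitions) and diff consecutive samples. End offsets only ever increase as messages arrive, so the deltas count new messages even while old ones expire from retention. The function name and sample format below are hypothetical:

```python
def new_messages(samples):
    """Given [(time, sum_of_end_offsets)] samples, return per-interval counts.

    End offsets grow monotonically with incoming messages, so differencing
    them is unaffected by retention deleting old records.
    """
    return [(t2, o2 - o1) for (t1, o1), (t2, o2) in zip(samples, samples[1:])]

# Sampled once per minute: 60 new messages in the first minute, 30 in the next.
print(new_messages([(0, 100), (60, 160), (120, 190)]))  # [(60, 60), (120, 30)]
```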

Is it possible to consume Kafka messages some time after arrival?

I would like to consume events from a kafka topic after the time they arrive. The time on which I want the event to be consumed is in the payload of the message. Is it possible to achieve something like that in Kafka? What are the drawbacks of it?
Practical example: a message M is produced at 12:10, arrives to my kafka topic at 12:11 and I want the consumer to poll it at 12:41 (30 minutes after arrival)
Kafka has a default retention period of 7 days for all topics. You can therefore consume up to a week's worth of data at any moment, the drawback being network saturation if you are constantly doing this.
If you want to consume data that is not at the latest offset, then for any new consumer group you would set auto.offset.reset=earliest. Otherwise, for existing groups, you would need to use the kafka-consumer-groups --reset-offsets command in order to re-consume an already consumed record.
Sometimes you may want to start from the beginning of a topic, for example if you have a compacted topic, in order to rebuild the "deltas" of the data within it - look up "Stream/Table Duality".
The time on which I want the event to be consumed is in the payload of the message
Since KIP-32, every message has a timestamp outside the payload, by the way.
I want the consumer to poll it ... (30 minutes after arrival)
Sure, you can start a consumer whenever, as long as the data is within the retention window, you will get that event.
There isn't a way to finely control when that happens other than actually starting your consumer at that time, for example 30 minutes later. You could play with max.poll.records and max.poll.interval.ms, but I find anything larger than a few seconds really isn't a use case for Kafka.
For example, you could instead have a TimerTask around a consumer thread, or a Spark or MapReduce job scheduled with an Oozie/Airflow task that reads a maximum number of records.
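If you do run a long-lived consumer instead, the delay logic reduces to comparing the record's broker timestamp (available since KIP-32) against the wall clock. A sketch of that computation, with hypothetical function names; in a real poll loop you would sleep or pause the partition for the returned duration before processing:

```python
def ready_at(record_ts_ms, delay_ms=30 * 60 * 1000):
    """Epoch-ms at which a record becomes eligible, e.g. 30 min after arrival."""
    return record_ts_ms + delay_ms

def wait_for(record_ts_ms, now_ms, delay_ms=30 * 60 * 1000):
    """Seconds the consumer loop should wait before processing this record."""
    return max(0, ready_at(record_ts_ms, delay_ms) - now_ms) / 1000.0

# Record arrived at t=0; it's now 9 minutes later: wait another 21 minutes.
print(wait_for(0, 9 * 60 * 1000))   # 1260.0
# 40 minutes after arrival: no wait needed.
print(wait_for(0, 40 * 60 * 1000))  # 0.0
```

The caveat from the answer above still applies: sleeping inside a poll loop for longer than max.poll.interval.ms will get the consumer kicked out of its group, which is why scheduling the consumer externally is usually the cleaner design.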