Unique messages from a Kafka topic within a time interval - apache-kafka

I have a Kafka topic to which 1,500 messages/sec are produced by different producers, with each message having two fixed keys, RID and Date (there are other keys, which vary for each message).
Is there a way to introduce a delay of 1 minute in the topic and consume only unique messages within that 1-minute window?
Example: in a minute there could be around 90K messages, of which around 1000 (random value) messages have RID "1" and Date 1st Jan 2020.
{"RID": "1" , "Date": "2020-01-01", ....}
I would like to consume only 1 of those 1000 messages (any one of the 1000, at random) after the minute has completed.
Note: There are 3 partitions for the topic.

What you want does not seem to be possible. The brokers cannot deliver messages based on your business logic; they can only deliver all messages.
However, you could implement a client-side cache to "de-duplicate" the messages accordingly, and only process a subset of the messages after de-duplication.
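A minimal sketch of such a client-side de-duplication cache, assuming String JSON values, a topic named rid-events, and hypothetical extractRid/extractDate/process helpers (none of these names come from the question):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class DedupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "dedup-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Map<String, String> window = new HashMap<>(); // (RID|Date) -> one representative message
        long windowStart = System.currentTimeMillis();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("rid-events")); // topic name is an assumption
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    // Keep only the first message seen for each (RID, Date) pair in this window.
                    window.putIfAbsent(extractRid(r.value()) + "|" + extractDate(r.value()), r.value());
                }
                if (System.currentTimeMillis() - windowStart >= 60_000) {
                    window.values().forEach(DedupConsumer::process); // one message per unique (RID, Date)
                    window.clear();
                    windowStart = System.currentTimeMillis();
                }
            }
        }
    }

    // Hypothetical helpers: parse the RID/Date fields out of the JSON payload with a JSON library,
    // and hand off the de-duplicated message. Placeholders only.
    static String extractRid(String json)  { return json; }
    static String extractDate(String json) { return json; }
    static void process(String message)    { System.out.println(message); }
}

Since the question says the record keys vary, the same (RID, Date) pair may land on any of the 3 partitions, so the cache either needs to live in a single consumer that reads all partitions or be shared between the consumers in the group.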

I'm not completely sure about your question, but it sounds like you need log compaction.
It will eventually remove older messages that share the same key from the topic; you only need to enable compaction for the topic and use the RID as the record key.
Hope this can help you.
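If you do try compaction, keep in mind that it works on the Kafka record key (so RID would have to be the message key, not just a field in the JSON value), and that it only guarantees that eventually at most the latest record per key is retained; it is not a fixed time window. A sketch of enabling it with the Java AdminClient, assuming a topic named rid-events:

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class EnableCompaction {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "rid-events");
            AlterConfigOp setCompact = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT),
                    AlterConfigOp.OpType.SET);
            // Switch the topic's cleanup policy from delete (time/size retention) to compact.
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setCompact))).all().get();
        }
    }
}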

Related

Is there any way I can maintain ordering (based on an attribute, not on the message time) in a single partition of a Kafka topic?

Let's say this is a one-partition topic, and while consuming I want to read the messages in a sequence based on one of the attributes (let's assume attr1) in the message.
Message 1 ('attr1'="01") was posted to that partition at 9:50 pm.
Message 2 ('attr1'="03") was posted to that partition at 9:55 pm.
Message 3 ('attr1'="02") was posted to that partition at 10:55 pm.
I want to consume it in the sequence based on the attr1 value, so Message1, Message3, and Message2 should be my consuming order.
No, that is not possible.
A fundamental thing to remember about Kafka is the offset. When you write a message to a partition, it always gets an incrementing offset.
In your example, if
message 1 gets offset 1
message 2 will get offset 2
message 3 will get offset 3
On the consumer side as well, messages will always be read in the sequence of increasing offsets. You can tell your consumer to start reading from a particular offset, but once it starts reading, it will always get messages in order of increasing offset.
You can use alternative tools such as ksqlDB or Kafka Streams to first read the entire topic and then sort based on custom attributes (sketched below), or use the Punctuator class to delay processing based on time windows.
Otherwise, Kafka Connect can dump to a database, where you can query/sort based on columns/fields/attributes, etc.
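A minimal sketch of the "read the whole topic, then sort" option with a plain consumer, assuming the topic is small enough to hold in memory and using a hypothetical extractAttr1 helper to pull attr1 out of the payload:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Properties;

public class SortByAttr1 {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "attr1-sorter");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        List<String> buffer = new ArrayList<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("one-partition-topic")); // topic name is an assumption
            int emptyPolls = 0;
            while (emptyPolls < 5) { // crude "end of topic" detection for a bounded test topic
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) { emptyPolls++; continue; }
                emptyPolls = 0;
                for (ConsumerRecord<String, String> r : records) buffer.add(r.value());
            }
        }

        // Re-order by the attr1 field instead of by offset, then process.
        buffer.sort(Comparator.comparing(SortByAttr1::extractAttr1));
        buffer.forEach(System.out::println);
    }

    // Hypothetical helper: pull "attr1" out of the message payload (e.g. with a JSON library).
    static String extractAttr1(String value) { return value; }
}

This only makes sense for a bounded topic that you replay; for a continuously growing stream there is no total order to sort into, and you would fall back to windowed reordering (e.g. the Punctuator approach) instead.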

LOG-END-OFFSET adds up to twice the number of messages in the topic

According to this, the log-end-offset values for the consumers in a consumer group for a topic should add up to the number of messages in that topic. I have a case where the log-end-offset adds up to twice the number of messages in the topic (log-end-offset adds up to 28, whereas there are only 14 messages in the topic). What are some potential explanations for this?
The current issue I am facing with this JDBC sink connector is that there is a bad message at offset 0, i.e. if the connector tries to process it, it will fail due to violating a DB constraint. We have been able to work around this by manually moving the connector's consumer offset so that it skips over the bad message. Then, seemingly at random months later, it tried to go back and process it even though nobody manually asked it to. The two issues seem related: it looks like something is tricking the connector into thinking it needs to reprocess all of the messages in the topic, hence why the log-end-offsets add up to twice the number of messages in the topic.
We are on Confluent 5.3.3.
Hi, the offset always moves forward, even when old messages get pruned/deleted from the topic once the configurable retention time/size is exceeded. So in your case, having a log-end-offset of 28 while only 14 messages are actually inside the topic at the moment is a valid situation.
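You can see this from a client by comparing each partition's beginning and end offsets: the difference is roughly how many records are still retained, while the end offset alone keeps growing forever. A small sketch, assuming a topic named my-topic:

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class OffsetCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-check");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("my-topic").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());

            Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);

            for (TopicPartition tp : partitions) {
                // end - begin ~= records still on disk; end alone counts everything ever written,
                // including records already removed by retention or compaction.
                System.out.printf("%s: begin=%d end=%d retained~=%d%n",
                        tp, begin.get(tp), end.get(tp), end.get(tp) - begin.get(tp));
            }
        }
    }
}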

Consumer with multiple partitions does not interleave messages

I am trying to run the simple example shown in https://projectreactor.io/docs/kafka/release/reference/#_sample_consumer. I see the output that is described in the link; however, I am confused whether this is the expected output. Specifically, the link says:
"The 20 messages published by the Producer sample should appear on the console. As shown in the output above, messages are consumed in order for each partition, but messages from different partitions may be interleaved."
The output in the link is what I seem to be getting too. However, everything in partition 1 is consumed first, followed by partition 0. What I actually expected was one message from partition 0, a couple from partition 1, then a couple or so from partition 0, and so on (although within each partition the messages are ordered, as expected).
When I run it locally I get the same output too. Is there something I am missing?
What you're seeing is expected behavior for a very small number of messages. The consumer will interleave when consuming from multiple partitions, but only with a larger quantity of messages.
What happens is that Kafka consumers work in "batches". They poll every so often, and if the 10 or so messages in one partition are small enough to fit in one poll request or "batch", then the consumer will simply consume them all at once, before even getting to the next partition. That's why you're not seeing the interleaving effect with only 20 messages.
If you retry your test with 20K messages, you should see the interleaving behavior much more clearly.
+1 to @mjuarez's answer. Just wanted to add that you may also be able to reproduce interleaved messages if you reduce max.poll.records for your consumer to 1 (the default is 500), thus forcing it to process one message at a time.
From Kafka Reference:
NAME: max.poll.records
DESCRIPTION: The maximum number of records returned in a single call to poll().
TYPE: int
DEFAULT: 500
VALID VALUES: [1,...]
IMPORTANCE: medium
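For reference, a minimal sketch of a consumer configured that way (broker address and group id are placeholders; with the Reactor sample, the same max.poll.records entry just goes into the consumer properties you build the ReceiverOptions from):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.Properties;

public class OneRecordPerPoll {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "interleave-test");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Return at most one record per poll(), so batching no longer hides the interleaving.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // ... subscribe and poll as usual; each poll now yields a single record.
    }
}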

Is it possible to consume Kafka messages some time after arrival?

I would like to consume events from a Kafka topic some time after they arrive. The time at which I want each event to be consumed is in the payload of the message. Is it possible to achieve something like that in Kafka? What are the drawbacks?
Practical example: a message M is produced at 12:10, arrives at my Kafka topic at 12:11, and I want the consumer to poll it at 12:41 (30 minutes after arrival).
Kafka has a default retention period of 7 days for all topics. You can therefore consume up to a week's worth of data at any moment, the drawback being network saturation if you are constantly doing this.
If you want to consume data that is not at the latest offset, then for any new consumer group you would set auto.offset.reset=earliest. Otherwise, for existing groups, you would need to use the kafka-consumer-groups --reset-offsets command in order to re-consume an already consumed record.
Sometimes you may want to start from the beginning of a topic, for example if you have a compacted topic, in order to rebuild the "deltas" of the data within the topic - look up the "Stream / Table Duality".
The time on which I want the event to be consumed is in the payload of the message
Since KIP-32 every message has a timestamp outside the payload, by the way
I want the consumer to poll it ... (30 minutes after arrival)
Sure, you can start a consumer whenever you like; as long as the data is within the retention window, you will get that event.
There isn't a way to finely control when that happens other than actually starting your consumer at that time, for example 30 minutes later. You could play with max.poll.records and max.poll.interval.ms, but I find that anything larger than a few seconds really isn't a use case for Kafka.
For example, you could instead have a TimerTask around a consumer thread, or a Spark or MapReduce job scheduled with an Oozie/Airflow task that reads a maximum amount of records.
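A sketch of the "consume it later" idea that enforces the 30-minute delay by checking record timestamps: records younger than 30 minutes cause the partition to be rewound so they are seen again on a later poll. The topic name and delay are assumptions, and note that by default record timestamps are producer CreateTime; set message.timestamp.type=LogAppendTime on the topic if you want broker arrival time.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class DelayedReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "delayed-reader");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        long delayMs = Duration.ofMinutes(30).toMillis();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // topic name is an assumption
            while (true) {
                long cutoff = System.currentTimeMillis() - delayMs;
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (TopicPartition tp : records.partitions()) {
                    for (ConsumerRecord<String, String> r : records.records(tp)) {
                        if (r.timestamp() > cutoff) {
                            // Too young: rewind this partition so we see this record again later.
                            consumer.seek(tp, r.offset());
                            break;
                        }
                        System.out.println("processing " + r.value()); // record is >= 30 minutes old
                    }
                }
                consumer.commitSync(); // commits the current (possibly rewound) positions
            }
        }
    }
}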

In Kafka, is it possible to read messages which are 1 min old?

I’m trying to work on a use case which requires messages from a Kafka topic to be processed only once they are 1 minute old.
Is there a way in Kafka to only read messages which are at least 1 minute old?
Thanks in advance.
The short answer is no.
Kafka consumers consume based on either starting from the latest message in the topic or from the earliest message.
See the docs
(Search for auto.offset.reset)
I think what you should do is hold a buffer of messages in your consuming application. Make your buffer hold only 1 minute's worth of messages and drain messages once they are older than 1 minute. That way the oldest message in your buffer is always about 1 minute old.
That's how I would do it.
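A rough sketch of that buffering idea, assuming String messages and a 1-minute hold time (topic name and broker address are placeholders): the consumer keeps filling the buffer, and messages are only handed off once they have sat in it for a minute.

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Properties;

public class BufferingConsumer {
    // Simple holder: the message plus the time we received it.
    record Buffered(String value, long receivedAt) {}

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "buffering-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Deque<Buffered> buffer = new ArrayDeque<>();
        long holdMs = Duration.ofMinutes(1).toMillis();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("my-topic")); // topic name is an assumption
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                long now = System.currentTimeMillis();
                for (ConsumerRecord<String, String> r : records) {
                    buffer.addLast(new Buffered(r.value(), now));
                }
                // Drain everything that has been sitting in the buffer for at least 1 minute.
                while (!buffer.isEmpty() && now - buffer.peekFirst().receivedAt() >= holdMs) {
                    System.out.println("processing " + buffer.pollFirst().value());
                }
            }
        }
    }
}

Note that with auto-commit enabled (the default) the offsets of buffered-but-not-yet-processed messages get committed, so a crash can drop up to a minute of messages; turning auto-commit off and committing only what has been drained would be safer.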
You might be able to leverage the offset reset tooling introduced in 0.11.0.0. One issue is that it is a command-line tool and there is no programmatic API for it (yet). But you might be able to sync your application with the tool (or run the tool from inside your application) to reset the offset of a partition to 1 minute ago and consume from there:
$ bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --reset-offsets --group test.group --topic foo:0,1 --by-duration PT1M --execute
This resets the offsets of partitions 0 and 1 of topic foo to the first message in each partition with a timestamp after 1 minute ago. You can check the timestamp of each message to decide whether it qualifies for processing (according to your use case) or not.
You will be able to do this using Kafka Streams, state stores and processors. The solution below helps you process the messages after 1 minute, but you will still be consuming the messages immediately.
Create a state store and add it to the topology. Attach a processor that uses that state store. In the processor, save every incoming message to the state store in process(). Schedule a punctuation every 60,000 ms and, in the punctuator, fetch the messages that have passed the 1-minute delay and process them.
Hope this helps.
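A minimal sketch of that approach with the Kafka Streams Processor API, assuming String keys/values and made-up topic/store names: process() parks each record in a key-value store, and a wall-clock punctuation every 60 seconds forwards whatever has waited at least a minute.

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class DelayedStreams {

    static class DelayProcessor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;
        private KeyValueStore<String, String> store;

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            this.store = context.getStateStore("delay-store");
            // Every 60 s, forward everything that has waited at least 1 minute.
            context.schedule(Duration.ofSeconds(60), PunctuationType.WALL_CLOCK_TIME, now -> {
                List<String> done = new ArrayList<>();
                try (KeyValueIterator<String, String> it = store.all()) {
                    while (it.hasNext()) {
                        KeyValue<String, String> entry = it.next();
                        String[] parts = entry.key.split("\\|", 2); // "arrivalMillis|originalKey"
                        if (now - Long.parseLong(parts[0]) >= 60_000) {
                            context.forward(new Record<>(parts[1], entry.value, now));
                            done.add(entry.key);
                        }
                    }
                }
                done.forEach(store::delete); // delete outside the open iterator
            });
        }

        @Override
        public void process(Record<String, String> record) {
            // Park the record instead of forwarding it; arrival time is encoded in the store key.
            store.put(System.currentTimeMillis() + "|" + record.key(), record.value());
        }
    }

    public static void main(String[] args) {
        Topology topology = new Topology()
                .addSource("Source", Serdes.String().deserializer(), Serdes.String().deserializer(), "input-topic")
                .addProcessor("Delay", DelayProcessor::new, "Source")
                .addStateStore(Stores.keyValueStoreBuilder(
                        Stores.persistentKeyValueStore("delay-store"),
                        Serdes.String(), Serdes.String()), "Delay")
                .addSink("Sink", "output-topic",
                        Serdes.String().serializer(), Serdes.String().serializer(), "Delay");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "delayed-processing");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(topology, props).start();
    }
}

Encoding the arrival time in the store key is just one way to remember when each record arrived; a production version would also need to handle key collisions and restart semantics.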