Is there any way I can maintain ordering (based on an attribute, not the message time) in a single partition of a Kafka topic? - apache-kafka

Let's say this is a one-partition topic, and while consuming I want to read the messages in a sequence based on one of the attributes (let's assume attr1) in the message.
Message 1 ('attr1'="01") was posted to that partition at 9:50 pm.
Message 2 ('attr1'="03") was posted to that partition at 9:55 pm.
Message 3 ('attr1'="02") was posted to that partition at 10:55 pm.
I want to consume them in sequence based on the attr1 value, so Message 1, Message 3, Message 2 should be my consumption order.

No, that is not possible.
A fundamental thing to remember about Kafka is the offset. When you write a message to a partition, it always gets the next incremental offset.
In your example:
Message 1 gets offset 1
Message 2 gets offset 2
Message 3 gets offset 3
On the consumer side as well, messages are always read in order of increasing offset. You can tell your consumer to start reading from a particular offset, but once it starts reading, it will always receive messages in increasing-offset order.
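For illustration, here is a minimal consumer loop (broker address and topic name are placeholders) that prints the offset of each record; within a partition, offsets always come back in increasing order:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class OffsetOrderDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("group.id", "offset-order-demo");
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic")); // hypothetical topic name
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // Offsets within a partition are strictly increasing.
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }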

You can use alternative tools such as ksqlDB or Kafka Streams to first read the entire topic and then sort it on custom attributes, or use the Kafka Streams Punctuator class to delay processing based on time windows (a rough sketch follows below).
Otherwise, Kafka Connect can dump to a database, where you can query/sort based on columns/fields/attributes, etc.
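As a rough illustration of the Kafka Streams option, here is a minimal sketch of a processor that buffers records in memory and flushes them sorted by attr1 once per minute via a wall-clock Punctuator. The attr1 extraction is a placeholder, the buffer is not fault-tolerant (a state store would be needed for that), and ordering only holds within each flushed batch:

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.kafka.streams.processor.PunctuationType;
    import org.apache.kafka.streams.processor.api.Processor;
    import org.apache.kafka.streams.processor.api.ProcessorContext;
    import org.apache.kafka.streams.processor.api.Record;

    // Buffers records and forwards them sorted by attr1 once per minute.
    public class SortByAttr1Processor implements Processor<String, String, String, String> {
        private ProcessorContext<String, String> context;
        private final List<Record<String, String>> buffer = new ArrayList<>();

        @Override
        public void init(ProcessorContext<String, String> context) {
            this.context = context;
            // The punctuator fires on wall-clock time and flushes the sorted buffer.
            context.schedule(Duration.ofMinutes(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
                buffer.sort(Comparator.comparing((Record<String, String> r) -> extractAttr1(r.value())));
                buffer.forEach(context::forward);
                buffer.clear();
            });
        }

        @Override
        public void process(Record<String, String> record) {
            buffer.add(record); // hold records until the next punctuation
        }

        private String extractAttr1(String value) {
            return value; // placeholder: parse attr1 out of the message payload
        }
    }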

Related

How to Retrieve Kafka message fast based on key?

I have a scenario where I need to test a Kafka message when a transaction is completed. How do I retrieve the message fast using Java? I know the first 10 digits of the key, which are unique.
Currently I am reading every partition and offset of the relevant topic, which is not efficient (the worst case takes 2 minutes to find a key).
This is not really possible with Kafka: each Kafka partition is an append-only log that uses an offset to specify a position. The key isn't used when reading the partition.
The only way to "seek" a specific message in a partition is through its offset. So instead of reading the whole partition, if you know the message was produced roughly one hour ago (or in some other timeframe), you can consume just that slice of the log.
See this answer on how to initialize a consumer at a specific offset based on a timestamp in Java.
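As a rough sketch of that idea (broker address, topic name, and key prefix are all placeholders): seek to the offset of roughly one hour ago with offsetsForTimes(), then scan forward for the key:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;

    public class FindByKeyPrefix {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            TopicPartition tp = new TopicPartition("transactions", 0); // hypothetical topic
            long oneHourAgo = Instant.now().minus(Duration.ofHours(1)).toEpochMilli();
            String keyPrefix = "1234567890"; // hypothetical known 10-digit prefix

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(List.of(tp));
                // Look up the earliest offset at or after the timestamp and seek to it,
                // so only the last hour of the partition needs to be scanned.
                OffsetAndTimestamp start = consumer.offsetsForTimes(Map.of(tp, oneHourAgo)).get(tp);
                if (start == null) return; // nothing produced in the last hour
                consumer.seek(tp, start.offset());

                boolean found = false;
                while (!found) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        if (record.key() != null && record.key().startsWith(keyPrefix)) {
                            System.out.println("found: " + record.value());
                            found = true;
                            break;
                        }
                    }
                }
            }
        }
    }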

Copy messages from one Kafka topic to another using offsets/timestamps

For some data processing, we need to reprocess all the messages between two timestamps, say between 1st Jan and 15th Jan.
To control the upper bound, we are planning to create a new topic that will hold these messages, so that once this task is complete we can delete the topic too.
The new topic will have data from particular offsets of the source topic:
partition 1 - from offset 100
partition 2 - from offset 2400...
and so on
What is the most suitable solution for this? Approximately 10 lakh (1 million) messages fall in this range.
Create a consumer for the source topic.
Call .assign() for the partitions you want to copy.
Call .seek() for each starting offset of those partitions. You can use the offsetsForTimes() method to look them up for a specific timestamp, then pass the results on to seek().
Create a producer.
Start a poll loop (ideally one thread per partition, each thread holding a reference to the same producer).
While polling, check the timestamp of each record.
If the record timestamp exceeds the date you're reading up to, stop the poll loop / thread.
Otherwise, send that record via the producer to your output topic (a single-threaded sketch of this loop follows).
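Putting those steps together, a single-threaded sketch for one partition (broker address, topic names, and dates are placeholders):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.TopicPartition;

    public class TopicRangeCopier {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
            consumerProps.put("enable.auto.commit", "false");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

            // Example bounds: 1st Jan to 15th Jan (dates are placeholders).
            long startTs = Instant.parse("2023-01-01T00:00:00Z").toEpochMilli();
            long endTs = Instant.parse("2023-01-16T00:00:00Z").toEpochMilli();

            TopicPartition tp = new TopicPartition("source-topic", 0); // one partition shown for brevity

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
                consumer.assign(List.of(tp));
                // Resolve the starting offset for the lower-bound timestamp and seek to it.
                OffsetAndTimestamp start = consumer.offsetsForTimes(Map.of(tp, startTs)).get(tp);
                if (start == null) return; // no messages at/after startTs
                consumer.seek(tp, start.offset());

                // Note: this loop only terminates once a record past endTs is seen;
                // an idle timeout would be needed if the topic may not contain one yet.
                boolean done = false;
                while (!done) {
                    for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
                        if (record.timestamp() >= endTs) { // passed the upper bound: stop
                            done = true;
                            break;
                        }
                        producer.send(new ProducerRecord<>("reprocess-topic", record.key(), record.value()));
                    }
                }
                producer.flush(); // make sure everything copied is actually sent
            }
        }
    }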

Unique messages from a kafka topic within a time interval

I have a Kafka topic to which 1,500 messages/sec are produced by different producers. Each message has two fixed keys, RID and Date (the other keys vary per message).
Is there a way to introduce a delay of 1 minute in the topic and consume only unique messages within that 1-minute window?
Example: in a minute there could be around 90K messages, of which around 1,000 (a random value) have RID 1 and Date 1st Jan 2020.
{"RID": "1" , "Date": "2020-01-01", ....}
I would like to consume only 1 message out of those 1,000 (any one of them, at random) after the 1-minute window is complete.
Note: There are 3 partitions for the topic.
What you want does not seem to be possible. The brokers cannot deliver messages based on your business logic; they can only deliver all messages.
However, you could implement a client-side cache to "de-duplicate" the messages and only process the subset that survives de-duplication, as sketched below.
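A minimal sketch of such a cache, assuming a plain consumer and a placeholder extractRidAndDate() helper: it keeps the first message seen per (RID, Date) within a one-minute window, then processes one message per key when the window closes:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class WindowedDeduplicator {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "dedup-demo");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            Map<String, String> window = new HashMap<>(); // (RID|Date) -> one message
            long windowStart = System.currentTimeMillis();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events")); // hypothetical topic
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        String dedupKey = extractRidAndDate(record.value());
                        window.putIfAbsent(dedupKey, record.value()); // keep first seen per key
                    }
                    if (System.currentTimeMillis() - windowStart >= 60_000) {
                        window.values().forEach(System.out::println); // process one message per key
                        window.clear();
                        windowStart = System.currentTimeMillis();
                    }
                }
            }
        }

        private static String extractRidAndDate(String json) {
            return json; // placeholder: parse RID and Date out of the JSON payload
        }
    }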
I'm not completely sure about your question, but it seems you need log compaction.
Compaction removes older messages from the topic that share a key with a newer one; you only need to configure compaction for the topic and use the RID as the record key.
Hope this can help you.

Kafka to Kafka -> reading source kafka topic multiple times

I'm new to Kafka, and I have a configuration with a source Kafka topic whose messages have the default retention of 7 days. I have 3 brokers, with 1 partition and a replication factor of 1.
When I consume messages from the source Kafka topic into my target Kafka topic, I am able to consume the messages in the same order. Now my question is: when I try to reprocess all the messages from my source Kafka into my target Kafka, I see that my target Kafka is not receiving any messages. I know that duplication should be avoided, but let's say I have a scenario with 100 messages in my source Kafka, and I expect 200 messages in my target Kafka after running the job twice. But I get just 100 messages in my first run, and my second run returns nothing.
Can someone please explain why this is happening and what the mechanism behind it is?
A Kafka consumer reads data from the partitions of a topic. Within a consumer group, each partition is read by only one consumer at a time.
Once a message has been read and its offset committed, that consumer group won't receive it again by default. Let me first explain the current offset. When we call the poll method, Kafka sends us some messages. Let us assume we have 100 records in the partition. The initial position of the current offset is 0. We make our first call and receive 100 messages; Kafka now moves the current offset to 100.
The current offset is a pointer to the last record that Kafka has sent to the consumer in the most recent poll and that has been committed. So the consumer doesn't get the same record twice, because of the current offset. Please go through the following URL for a complete explanation.
https://www.learningjournal.guru/courses/kafka/kafka-foundation-training/offset-management/
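To illustrate the mechanics: the second run returns nothing because the group's committed offset already points past the last record. A sketch of one standard way to re-read anyway (not from the answer above): rewind the assigned partitions with seekToBeginning(), or simply run with a fresh group.id:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ReprocessDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "copy-job"); // committed offsets are stored per group.id
            props.put("auto.offset.reset", "earliest"); // only applies when the group has no committed offset
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("source-topic")); // hypothetical topic
                consumer.poll(Duration.ofSeconds(1));            // join the group, get assignments
                consumer.seekToBeginning(consumer.assignment()); // rewind every assigned partition
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                System.out.println("re-read " + records.count() + " records");
            }
        }
    }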

Reading messages for specific timestamp in kafka

I want to read all the messages starting from a specific time in kafka.
Say I want to read all messages between 0600 and 0800.
The question "Request messages between two timestamps from Kafka" suggests using offsetsForTimes as the solution.
The problem with that solution is:
Say my consumer is switched on every day at 1300. The consumer would not have read any messages that day, which effectively means no offset was committed at/after 0600, which means offsetsForTimes(<partition>, <0600 for that day in millis>) will return null.
Is there any way I can read a message which was published to the Kafka queue at a certain time, irrespective of offsets?
offsetsForTimes() returns the offsets of the earliest messages produced at or after the requested time. It works regardless of whether offsets were committed, because the offsets are fetched directly from the partition logs.
So yes, you should use this method to find the first offset produced at/after 0600, seek to that position, and consume messages until you reach 0800.
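A minimal sketch of that approach (broker address and topic name are placeholders; times are interpreted in the local timezone):

    import java.time.Duration;
    import java.time.LocalDate;
    import java.time.LocalTime;
    import java.time.ZoneId;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;

    public class TimeWindowReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            // Today's 0600 and 0800 in the local timezone, as epoch millis.
            ZoneId zone = ZoneId.systemDefault();
            long from = LocalDate.now().atTime(LocalTime.of(6, 0)).atZone(zone).toInstant().toEpochMilli();
            long to = LocalDate.now().atTime(LocalTime.of(8, 0)).atZone(zone).toInstant().toEpochMilli();

            TopicPartition tp = new TopicPartition("my-topic", 0); // hypothetical single-partition topic

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.assign(List.of(tp));
                // The lookup hits the partition log directly; no committed offsets are needed.
                OffsetAndTimestamp start = consumer.offsetsForTimes(Map.of(tp, from)).get(tp);
                if (start == null) return; // nothing was produced at or after 0600
                consumer.seek(tp, start.offset());

                boolean done = false;
                while (!done) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        if (record.timestamp() >= to) { done = true; break; } // reached 0800
                        System.out.println(record.value());
                    }
                }
            }
        }
    }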