I am new to Apache Kafka, so it might be that this is basic knowledge.
At the moment I am trying to figure out some of the possibilities and functions that Kafka offers. I was wondering whether it is possible to move a message to another topic after a specified period of time.
Scenario:
Producer 1 writes Message (M1) into Topic 1 where Consumer 1 handles the messages.
After a period of time, let's say 1 hour, M1 is moved into Topic 2 to which the Consumer 2 is subscribed.
Is it possible to do something like that with Kafka? I know that there is a way to delete a message after a period of time, but I don't know if there is a way to change the topic or catch the delete action.
I thought about running a timer in a Producer, but with a huge amount of data, I think that this isn't possible anymore.
Thanks in advance
EDIT:
Thanks to @OneCricketeer, I know that my first assumption about using several producers wasn't that bad.
I know that the throughput with one Producer is really good and that one won't take the system down.
But I'm still concerned about the second producer.
In my imagination it looks like the following rough sketch.
If I take 30 messages per minute, that would mean that I would have 31 producer instances: 1 that handles the messages as soon as possible and 30 others waiting for their timers to expire so that they can work on their messages.
Extrapolated to an hour, that would be roughly 1800 instances. That is what I'm concerned about. Or is there a better way to handle this?
I found a solution that might work for my case.
I accidentally stumbled upon a consumer method that allows you to read messages based on their timestamp.
The method is called offsetsForTimes and has been available since version 0.10.
See the Kafka API or the following post, which I found while researching that method.
Maybe this is useful for others, so I decided to publish it.
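For readers who want to see what such a timestamp lookup returns, here is a minimal sketch in Python. It does not talk to Kafka. It only models the documented behaviour of offsetsForTimes (the earliest offset whose timestamp is greater than or equal to the target, or nothing if there is none) on an in-memory list. The function name and the sample data are made up for illustration, and the timestamps are assumed to be non-decreasing within the partition.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def offset_for_time(timestamps_ms: Sequence[int], target_ms: int) -> Optional[int]:
    """Return the earliest offset whose timestamp is >= target_ms, or None.

    timestamps_ms[i] is the timestamp of the record at offset i. The list is a
    hypothetical in-memory stand-in for one partition and must be sorted.
    """
    idx = bisect_left(timestamps_ms, target_ms)
    return idx if idx < len(timestamps_ms) else None


# Hypothetical partition with four records written at these times (ms).
partition = [1_000, 2_000, 2_000, 5_000]
print(offset_for_time(partition, 2_000))  # offset of the first record at or after t=2000
print(offset_for_time(partition, 6_000))  # nothing that new exists yet
```

With a real consumer, the target would be something like "now minus one hour", and every record before the returned offset would be old enough to be handled by the second consumer.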
Related
We have a use case where we need to write all messages of topic a into topic b, but with a delay of 30 minutes for each message. Why, you ask? Because time is of critical importance for this stream of data, so paying customers get the real-time feed, for freeloaders, we offer the delayed stream.
I guess it would be relatively easy to do in a KafkaConsumer poll() loop, by comparing system time and message time (using an ordered message time like producer time or ingestion time), then pause()ing the partitions in question and resume()ing them after the appropriate time interval of up to 30 minutes, all the while continuing to poll() to avoid getting failed over.
As the data, though delayed, still needs to be delivered in a streaming fashion, the difference between the ingestion times of each message in topics a and b should be as close to 30 minutes as possible.
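The time comparison at the heart of such a loop could look roughly like the sketch below. It is plain Python without a Kafka client: records are modelled as (timestamp_ms, value) tuples, and the function names are invented for this example. It relies on the assumption stated above that timestamps are ordered within a partition.

```python
DELAY_MS = 30 * 60 * 1000  # the 30 minute delay from the question


def remaining_delay_ms(record_ts_ms, now_ms, delay_ms=DELAY_MS):
    """Milliseconds until the record may be forwarded; 0 means it is due now."""
    return max(0, record_ts_ms + delay_ms - now_ms)


def split_due(records, now_ms, delay_ms=DELAY_MS):
    """Split an ordered batch into (due, not_yet_due).

    Because timestamps are assumed to be ordered, everything after the first
    record that is not yet due is not due either.
    """
    for i, (ts, _value) in enumerate(records):
        if remaining_delay_ms(ts, now_ms, delay_ms) > 0:
            return records[:i], records[i:]
    return records, []


MIN = 60 * 1000
batch = [(5 * MIN, "m1"), (10 * MIN, "m2"), (15 * MIN, "m3")]
due, waiting = split_due(batch, now_ms=40 * MIN)
print(due)  # records that are old enough to be written to topic b now
print(remaining_delay_ms(waiting[0][0], now_ms=40 * MIN))  # how long to keep the partition paused
```

In a real consumer, the due records would be produced to topic b, the consumer would seek back to the first waiting record, pause the partition, and resume it once the remaining delay has passed.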
1. But is this also easily possible in Kafka Streams, so that we can use its built-in exactly-once guarantees? I wonder whether "it's ok to call Thread.sleep() in Kafka Streams" also applies to longer sleeps of up to 30 minutes. (Of course, we don't want a partition rebalance to occur because Kafka thinks something is wrong with our process.)
2. Assuming we get this to work, is there a way to get proper lag monitoring for this? If we just delay messages, I would think the consumer group lag would always amount to at least 30 minutes' worth of messages. So is it possible to have the lag monitor count only unprocessed messages older than 30 minutes?
(2. is of less importance for us than getting 1. to work)
Edit: https://stackoverflow.com/a/59261274/709537 proposes a solution to a somewhat related problem, but that involves state stores and thus looks more complicated than would seem necessary for our simple (?) "delay all messages by x minutes" task.
Regarding 2., I assume we will have to roll our own lag monitoring for this.
A simple way to do something related (measuring latency instead of the number of lagging messages) would be to do the following periodically for every partition:
get the first unread input message
currentLatency = max(0, systemTime - ingestionTime(firstUnreadMessage) - 30min)
If we wanted to monitor the number of lagging messages, something a little more involved would need to be done:
read input messages backwards until there is one with ingestionTime + 30min <= systemTime
the number of unread messages up to and including that one would be the lag
However, reading messages backwards is not exactly one of Kafka's core competencies. A clever binary-search-style approach could be devised to get the exact value. However, no one really wants to know whether the message lag is 43123 or 40513; what they want to know is the order of magnitude. Determining only that keeps the number of seeks down to a handful per partition, and no binary-search-style back and forth is necessary. The output could, for example, be:
lag < 10
lag < 100
lag < 1000
lag < 10000
...
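Both measurements can be sketched in a few lines of Python. This is only a model on an in-memory list of ingestion timestamps (index = offset), not code against a Kafka client, and the function names are invented here. Latency is read as "how overdue the first unread message is", which is an interpretation of the idea above. Each probe in lag_upper_bound corresponds to one seek on a real partition, and timestamps are assumed to be ordered.

```python
DELAY_MS = 30 * 60 * 1000


def current_latency_ms(first_unread_ts_ms, now_ms, delay_ms=DELAY_MS):
    """How overdue the first unread record is; 0 if it is not due yet."""
    return max(0, now_ms - first_unread_ts_ms - delay_ms)


def lag_upper_bound(timestamps_ms, first_unread, now_ms, delay_ms=DELAY_MS):
    """Return the smallest power of ten B (>= 10) such that lag < B.

    Probes the 10th, 100th, 1000th, ... unread record. If that record exists
    and is already due, at least that many records are lagging.
    """
    bound = 10
    while True:
        probe = first_unread + bound - 1  # offset of the bound-th unread record
        if probe >= len(timestamps_ms) or timestamps_ms[probe] + delay_ms > now_ms:
            return bound  # fewer than `bound` records are due
        bound *= 10


# Hypothetical partition: 250 records, one per second, nothing consumed yet.
partition = [i * 1000 for i in range(250)]
now = DELAY_MS + 119_000  # records 0..119 are due, so the true lag is 120
print("latency ms:", current_latency_ms(partition[0], now))
print("lag <", lag_upper_bound(partition, 0, now))
```

For this sample the loop needs three probes to decide between the buckets, regardless of how large the exact lag is within a bucket.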
Dear Apache Kafka friends,
I have a use case for which I am looking for an elegant solution:
Data is published to a Kafka topic at a relatively high rate. There are two competing requirements:
all records should be kept for 7 days (which is configured by min.compaction.lag.ms)
applications should read the "last status" from the topic during their initialization phase
Log compaction is enabled so that the "last state" is available in the topic.
Now comes the problem. If an application wants to initialize itself from the topic, it has to read a lot of records to get the last state for all keys (the entire topic content must be processed). With this number of records, that cannot be done with acceptable performance.
Idea
A streaming process streams the data of the topic into a corresponding ShortTerm topic, which has a much shorter min.compaction.lag.ms (1 hour). The applications initialize themselves from this topic.
Risk
The streaming process is a potential source of errors. If it temporarily fails, the applications will no longer receive the latest status.
My Question
Are there any other possible solutions that satisfy the two requirements? Did I maybe miss a Kafka concept that helps to handle these competing requirements?
Any contribution is welcome. Thank you all.
If you don't have a strict guarantee of how frequently each key will be updated, there is no alternative to what you proposed.
To avoid the risk that the downstream app does not get new updates (because the data replication job stalls), I would recommend bootstrapping an app only from the short-term topic and letting it consume from the original topic afterwards. To not miss any updates, you can synchronize the switch-over as follows:
On app startup, get the replication job's committed offsets from the original topic.
Get the short term topic's current end-offsets (because the replication job will continue to write data, you just need a fixed stopping point).
Consume the short term topic from beginning to the captured end offsets.
Resume consuming from the original topic using the captured committed offsets (from step 1) as start point.
This way, you might read some messages twice, but you won't lose any updates.
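The steps above can be sketched as follows. This is a plain Python model, not client code: both topics are lists of (key, value) pairs whose index is the offset, and the function and parameter names are assumptions made for this example. It only shows why capturing the committed offset first and the end offset second cannot lose an update.

```python
def bootstrap_then_follow(short_term, original, replication_committed, short_term_end):
    """Build the latest value per key without missing updates.

    short_term, original: lists of (key, value); index == offset (stand-ins for topics).
    replication_committed: offset in `original` committed by the replication job (step 1).
    short_term_end: end offset of `short_term`, captured after step 1 (step 2).
    """
    state = {}
    # Step 3: read the short-term topic from the beginning to the captured end offset.
    for key, value in short_term[:short_term_end]:
        state[key] = value
    # Step 4: continue on the original topic from the captured committed offset.
    for key, value in original[replication_committed:]:
        state[key] = value
    return state


original = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("c", 1)]
# The job had committed offset 3 when we asked (step 1), but had already copied
# one more record by the time the end offset was captured (step 2).
short_term = [("a", 1), ("b", 1), ("a", 2), ("b", 2)]
print(bootstrap_then_follow(short_term, original, replication_committed=3, short_term_end=4))
```

In this sample the record ("b", 2) is seen twice, once in each topic, which is the duplicate read mentioned above. The record ("c", 1), which had not been replicated yet, is still picked up.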
To me, the two requirements you have mentioned together with the requirement for new consumers are not competing. In fact, I do not see any reason why you should keep a message of an outdated key in your topic for 7 days, because
New consumers are only interested in the latest message of a key.
Already existing consumers will have processed the message within 1 hour (as taken from your comments).
Therefore, my understanding is that your requirement "all records should be kept for 7 days" can be replaced by "each consumer should have enough time to consume the message & the latest message for each key should be kept for 7 days".
Please correct me if I am wrong and explain which consumer actually does need "all records for 7 days".
If that is the case you could do the following:
Enable log compaction as well as time-based retention of 7 days for this topic.
Fine-tune the compaction frequency to be very eager, meaning that as few outdated messages as possible are kept for a key.
Set min.compaction.lag.ms to 1 hour so that all consumers have the chance to keep up.
That way, new consumers will read (almost) only the latest message for each key. If that is not performant enough, you can try increasing the partitions and consumer threads of your consumer groups.
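As a rough illustration, the topic-level settings for this could be written down like this. The keys are the standard topic configuration names. The values for min.cleanable.dirty.ratio and segment.ms are only guesses at what "very eager" might mean and would need tuning for the actual load.

```python
HOUR_MS = 60 * 60 * 1000
DAY_MS = 24 * HOUR_MS

topic_config = {
    "cleanup.policy": "compact,delete",        # compaction plus time-based retention
    "retention.ms": str(7 * DAY_MS),           # drop records older than 7 days
    "min.compaction.lag.ms": str(HOUR_MS),     # keep every record uncompacted for 1 hour
    "min.cleanable.dirty.ratio": "0.01",       # assumed value: compact very eagerly
    "segment.ms": str(HOUR_MS),                # assumed value: roll segments often, since
                                               # the active segment is never compacted
}

for name, value in topic_config.items():
    print(f"{name}={value}")
```

A dictionary like this can be passed as the topic configuration when creating the topic with an admin client, or the printed name=value pairs can be applied with the command-line tools.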
I am wondering if there is something I am missing in my setup to facilitate long-running jobs.
For my purposes it is OK to have at-most-once message delivery, which means there is no need to think about committing offsets (or at least it is OK to commit each message offset upon receiving it).
I have the following in order to achieve the competing consumer pattern:
A topic
X consumers in the same group
P partitions in a topic (where P >= X always)
My problem is that I have messages that can take ~15 minutes to process (but this may fluctuate by up to 50%, let's say). In order to avoid consumers having their partition assignments revoked, I have increased the value of max.poll.interval.ms to reflect this.
However, this comes with some negative consequences:
if some message exceeds this length of time, then in a worst-case scenario the consumer processing this message will have to wait up to the value of max.poll.interval.ms for a rebalance
if I need to scale and increase the number of consumers based on load then any new consumers might also have to wait the value of max.poll.interval.ms for a rebalance to occur in order to process any new messages
As it stands at the moment I see that I can proceed as follows:
Set max.poll.interval.ms to be a small value and accept that every consumer processing every message will time out and go through the process of having assignments revoked and waiting a small amount of time for a rebalance
However I do not like this, and am considering looking at alternative technology for my message queue as I do not see any obvious way around this.
Admittedly I am new to Kafka, and it is just a gut feeling that the above is not desirable.
I have used RabbitMQ in the past for these scenarios, however we need Kafka in our architecture for other purposes at the moment and it would be nice not to have to introduce another technology if Kafka can achieve this.
I appreciate any advice that anybody can offer on this subject.
Using Kafka as a job queue for scheduling long-running processes is not a good idea, as Kafka is not a queue in the strictest sense and its semantics for failure handling and retries are limited. Although you might be able to reach a compromise by tuning the rebalance or timeout configuration, the design is likely to remain brittle. The simple answer is that Kafka was not designed for this kind of use case.
The idea of max.poll.interval.ms is to prevent a livelock situation (see), but in your case the consumer will be a false positive from the Kafka broker's point of view and will trigger a rebalance, as there is no way to distinguish between a livelock and a legitimately long-running process.
You should think about the trade-offs between living with the negative consequences you mentioned and introducing a new technology that lets you model a job queue in a better way. For a more complex use case, check out how Slack is doing it.
The way we got around the issues we were having was as suggested in the comments.
We decided to decouple the message processing from the consumer polling.
On each worker/consumer there were 2 threads, one for doing the actual processing and the other for phoning home to Kafka periodically.
We also did some work with trying to reduce the processing times for messages.
However some messages still take time that can be measured in minutes.
This has worked for us now for some time with no issues.
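The general shape of that setup, as a minimal Python sketch: the job runs on a worker thread while the calling thread keeps "phoning home" at a fixed interval. The keepalive callable is a placeholder for whatever call keeps the consumer in the group (for example a poll() while the assigned partitions are paused). Which call was actually used is not stated here, so treat that mapping and all names below as assumptions.

```python
import threading
import time


def process_with_keepalive(job, keepalive, interval_s=1.0):
    """Run job() on a worker thread while calling keepalive() on this thread."""
    result = {}

    def run():
        try:
            result["value"] = job()
        except Exception as exc:  # hand the failure back to the calling thread
            result["error"] = exc

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    while True:
        keepalive()              # stand-in for the periodic call to Kafka
        worker.join(interval_s)  # wait for the job, but at most one interval
        if not worker.is_alive():
            break
    if "error" in result:
        raise result["error"]
    return result["value"]


def slow_job():
    time.sleep(0.05)  # stands in for minutes of real processing
    return "done"


calls = []
answer = process_with_keepalive(slow_job, lambda: calls.append(time.monotonic()), interval_s=0.01)
print(answer, "after", len(calls), "keepalive calls")
```

Keeping the Kafka calls on a single thread matters with the Java client, whose consumer is not safe for use from multiple threads. The worker only does the processing.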
Thanks for the suggestions in the comments, @Donal.
Good Day,
I would like to find out if a Kafka queue can hold data for a few seconds and then release it.
I receive a message from a Kafka topic.
After parsing the data, I hold it in memory for some time (10 seconds). This builds up as unique messages come through, with each message having its own timer. I want Kafka to tell me that a message has expired (after 10 seconds) so that I can continue with other tasks.
But since Flink/Kafka is event driven, I was hoping Kafka has some sort of timing wheel that can re-emit the key of a message to the consumer after 10 seconds.
Any idea on how I can achieve this using Flink windowing or Kafka features?
Regards
Regarding your initial problem:
I would like to find out if a Kafka queue can hold data for a few seconds and then release it
You can set up log.cleanup.policy as delete (this is the default) and change the retention.ms from the default 604800000 (1 week) to 10000.
Can you explain again what else you want to check, and what did you mean after the Regards part?
You could take a closer look at the Kafka Streams library: https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html, https://kafka.apache.org/21/documentation/streams/developer-guide/processor-api.html.
Using Kafka Streams you can do a lot of complex event processing work. The Processor API is the lower-level API and gives you more flexibility. For example, each processed message can be put in a state store (a Kafka Streams abstraction that is replicated to a changelog topic), and then with a Punctuator you can check whether a message has expired.
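To make the "store plus periodic check" idea concrete, here is a small Python model of it. It is not Kafka Streams code: ExpiryStore plays the role of the state store, and punctuate() is what a scheduled Punctuator callback would do on each tick. The class and method names are invented for this sketch, and keys are assumed to be orderable (for example strings).

```python
import heapq


class ExpiryStore:
    """Keeps keys with a deadline; punctuate() returns the ones that expired."""

    def __init__(self, ttl_ms):
        self.ttl_ms = ttl_ms
        self._deadlines = {}  # key -> deadline of the latest put()
        self._heap = []       # (deadline, key); may contain stale entries

    def put(self, key, now_ms):
        deadline = now_ms + self.ttl_ms
        self._deadlines[key] = deadline
        heapq.heappush(self._heap, (deadline, key))

    def punctuate(self, now_ms):
        expired = []
        while self._heap and self._heap[0][0] <= now_ms:
            deadline, key = heapq.heappop(self._heap)
            # Skip entries that were superseded by a later put() for the same key.
            if self._deadlines.get(key) == deadline:
                del self._deadlines[key]
                expired.append(key)
        return expired


store = ExpiryStore(ttl_ms=10_000)
store.put("a", now_ms=0)
store.put("b", now_ms=4_000)
print(store.punctuate(now_ms=9_999))   # nothing has been held for 10 seconds yet
print(store.punctuate(now_ms=10_000))  # "a" has now been held for 10 seconds
```

Only one periodic check is needed no matter how many messages are being held, so there is no need for one timer per message.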
In my Kafka consumer threads (high-level consumer), after I consume a message I apply some business logic to it and forward it to a web service (WS). But this web service may be down sometimes, and since I have already consumed the object from Kafka and the offset has moved forward, I would lose this object.
One way to get rid of this problem is to disable auto-commit of offsets to ZooKeeper and commit them programmatically, but I expect this to be a very costly operation. I will be producing to Kafka at about 2000 TPS, and this may increase later.
Another way, which I am not sure is a good idea, is to produce the consumed object to Kafka again whenever I run into a problem, but I didn't find any post about this in all my googling. Is this something that should not even be considered?
Can you please give me some insights about handling this situation.
Thanks
You can post back the failed message to the same topic or another of your choice.
If you use the same topic, you will push the messages to the end of the topic and they will be picked up after the others (so if order matters to you, don't do this). Also, if the action that you perform before sending the message is not idempotent, you will have to do something to identify these records so they don't perform the action twice.
If you use a failed_topic, you can push the messages that you can't send to this topic and when the WS is healthy again you need to create a consumer that consumes all the messages there and sends them to the WS.
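A minimal sketch of the failed_topic variant in Python. The two callables are placeholders: send_to_ws stands for the web service call and publish for a Kafka producer send. Neither is a real client API, and the function name is invented for this example.

```python
def forward(records, send_to_ws, publish, failed_topic="failed_topic"):
    """Try to deliver each record; park the ones that fail on a separate topic.

    send_to_ws(record) is expected to raise on failure.
    publish(topic, record) is expected to write the record to Kafka.
    Returns the number of records that were delivered.
    """
    delivered = 0
    for record in records:
        try:
            send_to_ws(record)
            delivered += 1
        except Exception:
            publish(failed_topic, record)
    return delivered


parked = []


def flaky_ws(record):
    # Pretend the web service is down for every odd-numbered record.
    if record % 2 == 1:
        raise ConnectionError("WS unavailable")


count = forward([1, 2, 3, 4], flaky_ws, lambda topic, rec: parked.append((topic, rec)))
print(count, parked)
```

A second consumer can later read failed_topic and call the same function again once the web service is healthy. If the publish to failed_topic can itself fail, that case needs its own handling, for example by not committing the offset.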
Hope it helps!
Moving such messages to an error queue and retrying them later is a well known approach.
See Dead letter channel