I wanted to know whether there is a way, in my Kafka Streams or Spring-Kafka application, to read messages from a topic at some time interval.
Read 1000 records per 5 minutes, let's say.
Read 1000 records from a topic, wait 5 minutes, then consume 1000 messages again.
I have read the .poll() documentation, but it does not do what I actually want. It says:
The configuration poll.ms is the maximum "blocking time" within poll() if no data is available.
Think of it like slow notification processing. Can I handle this with the consumer/producer API, or using Kafka Streams?
Thanks!
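A minimal sketch of one possible approach, using the plain consumer API rather than Kafka Streams: cap each poll with max.poll.records and sleep between batches, keeping the sleep under max.poll.interval.ms. The broker address, group id, topic name, and processing step below are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ThrottledConsumer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "throttled-group");          // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1000);        // at most 1000 records per poll
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 600_000); // must exceed the 5-minute sleep

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("notifications")); // placeholder topic
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(10))) {
                    System.out.println(rec.value()); // stand-in for the slow notification processing
                }
                Thread.sleep(Duration.ofMinutes(5).toMillis()); // wait before the next batch
            }
        }
    }
}
```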
Related
I'm creating a scraper. The producer sends data to a Kafka topic with information about links to be scraped. The consumer is an AWS Lambda function that is triggered when a message is received on that topic.
To avoid blocking, I want to cap the maximum number of messages consumed in a given time: for example, consume only 5 messages per minute, while the producer keeps pushing messages to Kafka.
How can I achieve this?
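One possible pattern, sketched here with the plain Java consumer rather than Lambda (the names and the 5-per-minute window are illustrative): keep calling poll() so the consumer stays in its group, but pause() the assigned partitions after each capped batch and resume() them once the minute elapses.

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PausingRateLimiter {
    // Assumes a consumer created with max.poll.records=5 and already subscribed.
    static void run(KafkaConsumer<String, String> consumer) {
        long windowEndsAt = 0L;
        while (true) {
            if (System.currentTimeMillis() >= windowEndsAt && !consumer.paused().isEmpty()) {
                consumer.resume(consumer.paused()); // the minute is up: allow fetching again
            }
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (!records.isEmpty()) {
                records.forEach(rec -> System.out.println(rec.value())); // stand-in for real work
                consumer.pause(consumer.assignment()); // stop fetching for the rest of the minute;
                windowEndsAt = System.currentTimeMillis() + 60_000; // poll() keeps the group alive meanwhile
            }
        }
    }
}
```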
How can I read a specific number of messages per minute from an Apache Kafka topic? For example, imagine there are 100 messages in the queue; how can I have 5 messages read per minute? I don't know how to set max.partition.fetch.bytes, since the byte size is not the same for every message.
Is there a way to dynamically set this to read 5 messages per minute?
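max.partition.fetch.bytes is a byte limit, so it cannot express "5 messages"; max.poll.records is the count-based knob. A minimal sketch under that assumption (consumer setup omitted; the print statement is a stand-in for real processing):

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FivePerMinuteConsumer {
    // Assumes a consumer created with max.poll.records=5 and
    // max.poll.interval.ms raised above 60000, already subscribed.
    static void run(KafkaConsumer<String, String> consumer) throws InterruptedException {
        while (true) {
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(5))) {
                System.out.println(rec.value()); // at most 5 records arrive here per poll
            }
            Thread.sleep(60_000); // wait out the rest of the minute, regardless of message byte size
        }
    }
}
```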
I have a Kafka topic with 3 partitions and only one consumer, consuming in batches. I am using Spring Kafka on the consumer side with the following consumer props:
max.poll.records=10000
fetch.min.bytes=2000000
fetch.max.bytes=15000000
fetch.max.wait.ms=1000
max.poll.interval.ms=300000
auto.offset.reset=earliest
idle.event.interval=120000
Even though there are thousands of messages (GBs of data) waiting in the queue, the Kafka consumer receives only around 10 messages (total size around 1MB) on each poll. The consumer should be fetching batches of fetch.max.bytes (~15MB in my props) or max.poll.records (10000 in my case). What's the problem?
There are several settings that may cause this; try the following changes:
Increase fetch.min.bytes: the consumer may also return a batch as soon as fetch.min.bytes is available, which is only 1.9MB in your configuration.
Increase fetch.max.wait.ms: a poll returns once either fetch.min.bytes is reached or fetch.max.wait.ms elapses, whichever comes first.
fetch.max.wait.ms is 1 second in your configuration, which sounds alright, but increase it just in case this is the problem.
Increase max.partition.fetch.bytes: the default is 1MB, which can cap the poll size for topics with few partitions like yours (at most a 3MB poll for a 3-partition topic read by a single consumer).
Try to use these values:
fetch.min.bytes=12000000
fetch.max.wait.ms=5000
max.partition.fetch.bytes=5000000
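For reference, a sketch of how those values might be set when building the consumer in plain Java (Spring users would supply the equivalents through their consumer properties):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class TunedFetchConfig {
    static Properties tunedProps() {
        Properties props = new Properties();
        // bootstrap servers, group id, deserializers, etc. as before...
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 12_000_000);          // wait for ~12MB of data...
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 5_000);             // ...or 5 seconds, whichever comes first
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 5_000_000); // lift the 1MB per-partition cap
        return props;
    }
}
```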
Deeper explanation:
https://www.oreilly.com/library/view/kafka-the-definitive/9781491936153/ch04.html
Good Day,
I would like to find out if a Kafka queue can hold data for a few seconds and then release it.
I receive a message from a Kafka topic.
After parsing the data, I hold it in memory for some time (10 seconds; this builds up as unique messages come through, with each message having its own timer). I want Kafka to tell me that a message has expired (after 10 seconds) so that I can continue with other tasks.
But since Flink/Kafka is event driven, I was hoping Kafka has some sort of timing-wheel mechanism that can re-emit the key of a message to the consumer after 10 seconds.
Any idea how I can achieve this using Flink windowing or Kafka features?
Regards
Regarding your initial problem:
I would like to find out if kafka queue can hold data for a few seconds and then release data
You can set log.cleanup.policy to delete (this is the default) and change retention.ms from the default 604800000 (1 week) to 10000.
Can you explain again what else you want to check, and what you meant after the Regards part?
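If you would rather do that programmatically than with the kafka-configs CLI, a sketch using the Java AdminClient (the topic name is assumed; note that at the topic level the property is named cleanup.policy):

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetentionSetter {
    public static void main(String[] args) throws Exception {
        Properties conf = new Properties();
        conf.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        try (AdminClient admin = AdminClient.create(conf)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic"); // assumed name
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Map.of(topic, List.of(
                    new AlterConfigOp(new ConfigEntry("cleanup.policy", "delete"), AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("retention.ms", "10000"), AlterConfigOp.OpType.SET)));
            admin.incrementalAlterConfigs(updates).all().get(); // apply the 10-second retention
        }
    }
}
```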
You could take a closer look at the Kafka Streams library: https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html, https://kafka.apache.org/21/documentation/streams/developer-guide/processor-api.html.
Using Kafka Streams you can do a lot of complex event-processing work. The Processor API is a lower-level API and gives you more flexibility: for example, put each processed message in a state store (a Kafka Streams abstraction that is replicated to a changelog topic), and then use a Punctuator to check whether the message has expired. A sketch of that pattern follows.
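In this sketch, the store name, the 10-second threshold, and the forwarded "expired" value are assumptions; wiring the processor and its state store into a topology is omitted.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class ExpiryProcessor implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, Long> arrivals; // key -> wall-clock arrival time

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.arrivals = (KeyValueStore<String, Long>) context.getStateStore("arrivals-store"); // assumed store name
        // Every second, scan for entries older than 10 seconds and emit their keys downstream.
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            List<String> expired = new ArrayList<>();
            try (KeyValueIterator<String, Long> it = arrivals.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    if (now - entry.value >= 10_000) {
                        expired.add(entry.key);
                    }
                }
            }
            for (String key : expired) {
                context.forward(key, "expired"); // tell the downstream the 10-second timer fired
                arrivals.delete(key);
            }
        });
    }

    @Override
    public void process(String key, String value) {
        arrivals.put(key, System.currentTimeMillis()); // start this key's 10-second timer
    }

    @Override
    public void close() {}
}
```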
I would like to consume events from a Kafka topic some time after they arrive. The time at which I want the event to be consumed is in the payload of the message. Is it possible to achieve something like that in Kafka? What are the drawbacks?
Practical example: a message M is produced at 12:10, arrives at my Kafka topic at 12:11, and I want the consumer to poll it at 12:41 (30 minutes after arrival).
Kafka has a default retention period of 7 days for all topics. You can therefore consume up to a week's worth of data at any moment, the drawback being network saturation if you are constantly doing this.
If you want to consume data that is not at the latest offset, then for any new consumer group you would set auto.offset.reset=earliest. Otherwise, for existing groups, you would need to use the kafka-consumer-groups --reset-offsets command in order to re-consume an already consumed record.
Sometimes you may want to start from the beginning of a topic, for example if you have a compacted topic, in order to rebuild the "deltas" of the data within it; look up the "Stream/Table Duality".
The time at which I want the event to be consumed is in the payload of the message
Since KIP-32, every message also has a timestamp outside the payload, by the way.
I want the consumer to poll it ... (30 minutes after arrival)
Sure, you can start a consumer whenever you like; as long as the data is within the retention window, you will get that event.
There isn't a way to finely control when that happens other than actually starting your consumer at that time, for example 30 minutes later. You could play with max.poll.records and max.poll.interval.ms, but I find that anything larger than a few seconds really isn't a use case for Kafka.
For example, you could instead wrap a consumer thread in a TimerTask, or schedule a Spark or MapReduce job via Oozie/Airflow that reads a maximum number of records.
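For illustration, a sketch of the "hold young records" variant with a plain consumer (broker, group, and topic names are placeholders; note that max.poll.interval.ms must be raised above the delay because the thread sleeps between polls):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DelayedConsumer {
    public static void main(String[] args) throws InterruptedException {
        long delayMs = Duration.ofMinutes(30).toMillis();
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "delayed-group");           // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        // The sleep below can approach 30 minutes, so keep the poll interval above it.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, (int) (delayMs * 2));

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // placeholder topic
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                    long age = System.currentTimeMillis() - rec.timestamp(); // KIP-32 timestamp
                    if (age < delayMs) {
                        Thread.sleep(delayMs - age); // hold the record until it is 30 minutes old
                    }
                    System.out.println(rec.value()); // stand-in for real processing
                }
            }
        }
    }
}
```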