I'm trying to use one consumer to continuously read data from Kafka. How should I set the scheduling options?
I have read the User Guide, but I cannot figure out how to set the Run Schedule and Run Duration if I need the consumer to run continuously.
A Timer Driven Run Schedule of 0 seconds means the processor runs continuously, as fast as possible.
Related
I am currently working on an application which schedules tasks as timers. A timer can run on any day of the week, as configured by the user. Currently this is implemented with bullqueue, with Redis for storage. Once a timer fires, it emits an event and the business logic is processed further. There can be thousands of queued messages in Redis.
I am looking to replace Redis with Kafka, as I have read that it is easy to scale and guarantees no message loss.
The question is: is it a good idea to go with Kafka? If yes, how can we schedule jobs in Kafka in combination with bullqueue? I am new to Kafka and am still trying to understand how jobs can be scheduled in it, and whether this is a good architecture to adopt.
My current application stack is NestJS and Node.js.
Kafka doesn't have any scheduling feature like this built in, so you'd need to combine it with some other timer/queue system that triggers a KafkaProducer action.
Similarly, Kafka consumers are typically always running, although you can start and pause them periodically as well.
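For illustration, here is a minimal Java sketch of that start/pause pattern with the plain Kafka consumer API; the broker address, group id, topic name, and the outsideProcessingWindow() predicate are all placeholders for your own scheduling logic:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PausableConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "scheduled-jobs");          // assumed group id
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("jobs")); // assumed topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    records.forEach(r -> System.out.println(r.value()));
                    if (outsideProcessingWindow()) {
                        // pause() stops fetching without leaving the group; poll() must
                        // still be called within max.poll.interval.ms or the broker
                        // evicts the consumer from the group.
                        consumer.pause(consumer.assignment());
                    } else {
                        consumer.resume(consumer.assignment());
                    }
                }
            }
        }

        // Hypothetical predicate standing in for your own timer/schedule check.
        private static boolean outsideProcessingWindow() {
            return false;
        }
    }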
I have a producer application which sends data to a Kafka topic, but only once in a while, as and when it receives data from a source. I also have a consumer application (Spark) which runs all the time and receives data from Kafka whenever the producer sends it.
Since the consumer runs all the time, resources are wasted at times. Moreover, because my producer sends data only once in a while, is there any way to trigger the consumer application only when a Kafka topic receives data?
It sounds like you shouldn't be using Spark, and would rather run some serverless solution that can be triggered to run code on Kafka events.
Otherwise, run cron to check the consumer lag: define a threshold, and only submit your code to batch-read from Kafka once the lag exceeds it. A sketch of that check follows below.
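As a sketch of the cron-driven approach, the Java AdminClient can compute a group's total lag by comparing committed offsets against log-end offsets; the broker address, group id, and threshold are assumptions, and the actual job submission is left as a comment:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.ListOffsetsResult;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class LagCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            try (Admin admin = Admin.create(props)) {
                // Committed offsets for the (assumed) consumer group.
                Map<TopicPartition, OffsetAndMetadata> committed =
                        admin.listConsumerGroupOffsets("spark-batch-group")
                             .partitionsToOffsetAndMetadata().get();

                // Log-end offsets for the same partitions.
                Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
                committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
                Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                        admin.listOffsets(latestSpec).all().get();

                long totalLag = committed.entrySet().stream()
                        .mapToLong(e -> ends.get(e.getKey()).offset() - e.getValue().offset())
                        .sum();

                System.out.println("total lag = " + totalLag);
                if (totalLag > 10_000) { // hypothetical threshold
                    // Submit the batch job here, e.g. by shelling out to spark-submit.
                }
            }
        }
    }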
We are using the Kafka Streams library to build a real-time notification system for incoming messages on a Kafka topic: while the streaming app is running, it processes all incoming messages in real time and sends a notification whenever it encounters certain predefined messages.
If the streaming app goes down and is later restarted, we need it to process only the messages arriving after it is initialized. This is to avoid processing old records that accumulated while the app was down. By default, the streaming app resumes processing old messages from the last committed offset. Is there any setting in a Kafka Streams app that allows processing only the most recent messages?
KafkaConsumer's auto.offset.reset default value is 'latest', but you want to use Kafka Streams, where the default is 'earliest'.
Reference: https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/StreamsConfig.java#L634
Therefore, if you set auto.offset.reset to 'latest', it will do what you want.
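A minimal sketch of that override in a Kafka Streams configuration (the application id and broker address are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    public class NotificationAppConfig {
        public static Properties build() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "notification-app");  // assumed id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            // Kafka Streams defaults this to "earliest"; override it so a consumer
            // group with no committed offsets starts from the end of the log.
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
            return props;
        }
    }

Note that this setting only takes effect when the group has no committed offsets, which is exactly the caveat addressed below.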
Your assumption is correct. Even if you set auto.offset.reset to 'latest', your app already has committed consumer offsets, and auto.offset.reset only applies when no committed offset exists.
So you will have to reset the offsets to latest with the kafka-consumer-groups command, using the options --reset-offsets --to-latest --execute.
Check the different reset scenarios: you can even reset to a particular datetime, by period, from a file, etc.
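Assembled into a full command (the group id, which for a Kafka Streams app equals its application.id, and the broker address are placeholders; stop the app first, since the group must be inactive for the reset to apply):

    kafka-consumer-groups --bootstrap-server localhost:9092 \
      --group notification-app \
      --all-topics \
      --reset-offsets --to-latest --execute

Replacing --execute with --dry-run prints the resulting offsets without applying them.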
I have a use case similar to the one described here, which is that I would like to put failed messages in a place where they can be retried after a pre-configured time interval (probably several minutes). Since I am not using Storm, the Kafka Spout is not an option as described in the accepted solution there. Is there another feature of Kafka that makes a message invisible to consumers until the time period expires?
One of the goals of this project is to not write a scheduler (cron or Java).
Without a scheduler, the only other option is a JMS-style messaging broker. However, if Kafka has this functionality built in, I would like to use Kafka, as we already have the infrastructure built for it.
I'm working with Logstash's Kafka input plugin, and I don't see any setting that would allow me to specify how often the consumer should poll.
If I check the consumer configuration, I don't see anything there either.
How does it work, then?
How does the consumer know when to pull data in?
There are a variety of configurables on that plugin, but poll_timeout_ms seems to be the one controlling the poll interval. It defaults to 100 ms. The Kafka plugin maintains an open TCP connection to the Kafka cluster, so a fast polling interval doesn't incur connection-establishment overhead.
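For context, the underlying Kafka consumer uses a pull model: the plugin calls poll() in a loop, and each call returns as soon as records arrive, or empty once the timeout elapses. A minimal Java sketch of that loop (the broker, group id, and topic are placeholders):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PollLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "logstash-demo");           // assumed group id
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("logs")); // assumed topic
                while (true) {
                    // Analogous to the plugin's poll_timeout_ms: returns early when
                    // records are available, or after 100 ms if the topic is idle.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    records.forEach(r -> System.out.println(r.value()));
                }
            }
        }
    }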