I am having a Kafka Consumer which fetches the broker in a given fetch interval. It is fetching with the given time interval and it is fine when the messages are there in a topic. But i really don't know the reason why consumer is sending more fetch requests when there are no messages in kafka topic.
In general, consumers send two types of requests to broker:
Heartbeat
Poll request
Heartbeat is sent via separate thread and its interval is configured with heartbeat.interval.ms (3 seconds in default)
For poll request there is no specific interval and it's up to your code. (there is just an upper bound for it (max.poll.interval.ms))
It is absolutely reasonable to send more frequent poll requests when there is no data in partition(s) that your consumer is assigned to. Suppose that you have a code like this:
void consumeLoop() {
while (true) {
records = consumer.poll();
if(!records.isEmpty()) {
processMessages(records);
}
}
}
As you see if there is no records returned from poll, then your consumer will immediately send another poll request. But if there is data, you should first process these records before sending next poll request.
Related
I have a use case where I want to implement synchronous request / response on top of kafka. For example when the user sends an HTTP request, I want to produce a message on a specific kafka input topic that triggers a dataflow eventually resulting in a response produced on an output topic. I want then to consume the message from the output topic and return the response to the caller.
The workflow is:
HTTP Request -> produce message on input topic -> (consume message from input topic -> app logic -> produce message on output topic) -> consume message from output topic -> HTTP Response.
To implement this case, upon receiving the first HTTP request I want to be able to create on the fly a consumer that will consume from the output topic, before producing a message on the input topic. Otherwise there is a possibility that messages on the output topic are "lost". Consumers in my case have a random group.id and have auto.offset.reset = latest for application reasons.
My question is how I can make sure that the consumer is ready before producing messages. I make sure that I call SubscribeTopics before producing messages. but in my tests so far when there are no committed offsets and kafka is resetting offsets to latest, there is a possibility that messages are lost and never read by my consumer because kafka sometimes thinks that the consumer registered after the messages have been produced.
My workaround so far is to sleep for a bit after I create the consumer to allow kafka to proceed with the commit reset workflow before I produce messages.
I have also tried to implement logic in a rebalance call back (triggered by a consumer subscribing to a topic), in which I am calling assign with offset = latest for the topic partition, but this doesn't seem to have fixed my issue.
Hopefully there is a better solution out there than sleep.
Most HTTP client libraries have an implicit timeout. There's no guarantee your consumer will ever consume an event or that a downstream producer will send data to the "response topic".
Instead, have your initial request immediately return 201 Accepted status (or 400, for example, if you do request validation) with some tracking ID. Then require polling GET requests by-id for status updates either with 404 status or 200 + some status field within the request body.
You'll need a database to store intermediate state.
I'm creating a scraper. The producer sends data to Kafka's topic with the information about links to be scraped. The consumer is an AWS Lambda function that will be triggered when a message is received on that topic.
To avoid blocking, I want to add a cap on the maximum number of messages consumed in a given time. For example, I just want to consume only 5 messages in a minute. While the producer should keep pushing the messages to Kafka.
How can I achieve this?
When a consumer drops from a group and a rebalance is triggered, I understand no messages are consumed -
But does an in-flight request for messages stay queued passed the max wait time?
Or does Kafka send any payload back during the rebalance?
UPDATE
For clarification, I'm referring specifically to the consumer polling process.
From my understanding, when one of the consumers drop from the consumer group, a rebalance of the partitions to consumers is performed.
During the rebalance, will an error be sent back to the consumer if it's already polled and waiting for max time to pass?
Or does Kafka wait the max time and send an empty payload?
Or does Kafka queue the request passed max wait time until the rebalance is complete?
Bottom line - I'm trying to explain periodic timeouts from consumers.
This may be in the docs, but I'm not sure where to find it.
Kafka producers doesn't directly send messages to their consumers, rather they send them to the brokers.
The inflight requests corresponds to the producer and not to the consumer.
Whether the consumer leaves a group and a rebalance is triggered or not is quite immaterial to the behaviour of the producer.
Producer messages are queued in the buffer, batched, optionally compressed and sent to the Kafka broker as per the configuration.
In-flight requests are the maximum number of unacknowledged requests
the client will send on a single connection before blocking.
Note that when we say ack, it is acknowledgement by the broker and not by the consumer.
Does Kafka send any payload back during the rebalance?
Kafka broker doesn't notify of any rebalance to its producers.
How long does Partition Rebalancing take when a new Consumer joins the group?
What factors can affect this?
In my understanding, these factors play role:
Consumers needs to finish processing the data they polled last time.
Coordinator waits for Consumer to send JoinGroup request - for how long?
Consumers send SyncGroup request (is there a delay between receiving JoinGroup response and sending SyncGroup request?)
In a normal situation, assuming that Consumers process data instantly, how long should one expect to Partition Rebalancing to take place?
Coordinator waits for at most rebalance.timeout.ms (1 minute by default). If Consumer does not join during this time it is considered dead. But if all consumers re-join before this, the Coordinator will not wait further (my assumption).
Usual Consumers send Sync group as soon as they received the JoinGroup response. Leader needs more time (to perform partition assignment logic).
Source:
https://www.confluent.io/online-talks/everything-you-always-wanted-to-know-about-kafkas-rebalance-protocol-but-were-afraid-to-ask-on-demand/
If I poll() from a consumer in a while True: statement, I see that poll() is blocking. If the consumer is up to date with messages from the topic (offset = OFFSET_END) how is the consumer conducting it's blocking poll()?
Does the consumer default adhere to a pub/sub mentality in which it sleeps and waits for a publish and a broadcast/signal from the broker?
Or is the consumer constantly spinning itself checking the topic?
I'm using the confluent python client, if that matters.
Thanks!
kafka consumers are basically long poll loops, driven (asynchronously) by the user thread calling poll().
the whole protocol is request-response, and entirely client driven. there is no form of broker-initiated "push".
fetch.max.wait.ms controls how long any single broker will wait before responding (if no data), while blocking of the user thread is controlled by argument to poll()
Yes, you are right its while a true condition that waits to consume the message till waiting timeout time.
If it receives a message it will return immediately otherwise it will await to passed timeout and return an empty record.
Kafka Broker use the below parameter to control message to send to Consumer
fetch.min.bytes: The broker will wait for this amount of data to fill BEFORE it sends the response to the consumer client.
fetch.wait.max.ms: The broker will wait for this amount of time BEFORE sending a response to the consumer client unless it has enough data to fill the response (fetch.message.max.bytes)
There is a possibility to take a long time to call the next poll() due to the processing of consumed messages. max.poll.interval.ms prevent not to process take so much time and call the next poll within max.poll.interval.ms otherwise consumer leaves the group and trigger rebalance.
You can get more detail about this here
max.poll.interval.ms: By increasing the interval between expected polls, you can give the consumer more time to handle a batch of
records returned from poll(long). The drawback is that increasing this
value may delay a group rebalance since the consumer will only join
the rebalance inside the call to poll. You can use this setting to
bound the time to finish a rebalance, but you risk slower progress if
the consumer cannot actually call poll often enough.
max.poll.records: Use this setting to limit the total records returned from a single call to a poll. This can make it easier to
predict the maximum that must be handled within each poll interval. By
tuning this value, you may be able to reduce the poll interval, which
will reduce the impact of group rebalancing.