Processing kafka messages taking long time - apache-kafka

I have a Python process (or rather, set of processes running in parallel within a consumer group) that processes data according to inputs coming in as Kafka messages from certain topic. Usually each message is processed quickly, but sometimes, depending on the content of the message, it may take a long time (several minutes). In this case, Kafka broker disconnects the client from the group and initiates the rebalance. I could set session_timeout_ms to a really large value but it would be like 10 minutes of more, which means if a client dies, the cluster would not be properly rebalanced for 10 minutes. This seems to be a bad idea. Also, most messages (about 98% of them) are fast, so paying such penalty for just 1-2% of messages seems wasteful. OTOH, large messages are frequent enough to cause a lot of rebalances and cost a lot of performance (since while the group is rebalancing, nothing is getting done, and then the "dead" client re-joins again and causes another rebalance).
So, I wonder, are there any other ways for handling messages that take a long time to process? Is there any way to initiate heartbeats manually to tell the broker "it's ok, I am alive, I'm just working on the message"? I thought the Python client (I use kafka-python 1.4.7) was supposed to do that for me but it doesn't seem to happen. Also, the API doesn't seem to even have separate "heartbeat" function at all. And as I understand, calling poll() would actually get me the next messages - while I am not even done with the current one, and would also mess up iterator API for Kafka consumer, which is quite convenient to use in Python.
In case it's important, the Kafka cluster is Confluent, version 2.3 if I remember correctly.

In Kafka, 0.10.1+ Kafka polling and session heartbeat are decoupled to each other.
You can get an explanationhere
max.poll.interval.ms how much time permit to complete processing by consumer instance before time out means if processing time takes more than max.poll.interval.ms time Consumer Group will presume its die remove from Consumer Group and invoke rebalance.
To increase this will increase the interval between expected polls which give consumers more time to handle a batch of records returned from poll(long).
But at the same time, it will also delay group rebalances since the consumer will only join the rebalance inside the call to poll.
session.timeout.ms is the timeout used to identify if the consumer is still alive and sending a heartbeat on a defined interval (heartbeat.interval.ms). In general, the thumb-rule is heartbeat.interval.ms should be 1/3 of session timeout so in case of network failure consumers can miss at most 3-time heartbeat before session timeout.
session.timeout.ms: low value would be good to detect failure more quickly.
max.poll.interval.ms: large value will reduce the risk of failure due to increased processing time however increases the rebalancing time.
Note: A large number of partition and topics consumed by Consumer Group also effect on overall rebalance time
The other approach if you would really want to get rid of rebalancing you can assign partitions on each consumer instance manually, using partition assign. In that case, each consumer instance will be running independently with their own assigned partitions. But in that case, you would not able to leverage the rebalance features to assign partitions automatically.

Related

Should we use max.poll.records or max.poll.interval.ms to handle records that take longer to process in kafka consumer?

I'm trying to understand what is better option to handle records that take longer to process in kafka consumer? I ran few tests to understand this and observed that we can control this with by modifying either max.poll.records or max.poll.interval.ms.
Now my question is, what's the better option to choose? Please suggest.
max.poll.records simply defines the maximum number of records returned in a single call to poll().
Now max.poll.interval.ms defines the delay between the calls to poll().
max.poll.interval.ms: The maximum delay between invocations of
poll() when using consumer group management. This places an upper
bound on the amount of time that the consumer can be idle before
fetching more records. If poll() is not called before expiration of
this timeout, then the consumer is considered failed and the group
will rebalance in order to reassign the partitions to another member.
For consumers using a non-null group.instance.id which reach this
timeout, partitions will not be immediately reassigned. Instead, the
consumer will stop sending heartbeats and partitions will be
reassigned after expiration of session.timeout.ms. This mirrors the
behavior of a static consumer which has shutdown.
I believe you can tune both in order to get to the expected behaviour. For example, you could compute the average processing time for the messages. If the average processing time is say 1 second and you have max.poll.records=100 then you should allow approximately 100+ seconds for the poll interval.
If you have slow processing and so want to avoid rebalances then tuning either would achieve that. However extending max.poll.interval.ms to allow for longer gaps between poll does have a bit of a side effect.
Each consumer only uses 2 threads - polling thread and heartbeat thread.
The latter lets the group know that your application is still alive so can trigger a rebalance before max.poll.interval.ms expires.
The polling thread does everything else in terms of group communication so during the poll method you find out if a rebalance has been triggered elsewhere, you find out if a partition leader has died and hence metadata refresh is required. The implication is that if you allow longer gaps between polls then the group as a whole is slower to respond to change (for example no consumers start receiving messages after a rebalance until they have all received their new partitions - if a rebalance occurs just after one consumer has started processing a batch for 10 minutes then all consumers will be hanging around for at least that long).
Hence for a more responsive group in situations where processing of messages is expected to be slow you should choose to reduce the records fetched in each batch.

Kafka as a message queue for long running tasks

I am wondering if there is something I am missing about my set up to facilitate long running jobs.
For my purposes it is ok to have At most once message delivery, this means it is not required to think about committing offsets (or at least it is ok to commit each message offset upon receiving it).
I have the following in order to achieve the competing consumer pattern:
A topic
X consumers in the same group
P partitions in a topic (where P >= X always)
My problem is that I have messages that can take ~15 minutes (but this may fluctuate by up to 50% lets say) in order to process. In order to avoid consumers having their partition assignments revoked I have increased the value of max.poll.interval.ms to reflect this.
However this comes with some negative consequences:
if some message exceeds this length of time then in a worst case scenario a the consumer processing this message will have to wait up to the value of max.poll.interval.ms for a rebalance
if I need to scale and increase the number of consumers based on load then any new consumers might also have to wait the value of max.poll.interval.ms for a rebalance to occur in order to process any new messages
As it stands at the moment I see that I can proceed as follows:
Set max.poll.interval.ms to be a small value and accept that every consumer processing every message will time out and go through the process of having assignments revoked and waiting a small amount of time for a rebalance
However I do not like this, and am considering looking at alternative technology for my message queue as I do not see any obvious way around this.
Admittedly I am new to Kafka, and it is just a gut feeling that the above is not desirable.
I have used RabbitMQ in the past for these scenarios, however we need Kafka in our architecture for other purposes at the moment and it would be nice not to have to introduce another technology if Kafka can achieve this.
I appreciate any advise that anybody can offer on this subject.
Using Kafka as a Job queue for scheduling long running process is not a good idea as Kafka is not a queue in the strictest sense and semantics for failure handling and retries are limited. Though you might be able to achieve a compromise by playing around with certain configuration for rebalance or timeout, it is likely to remain brittle design. Simple answer is that Kafka was not designed for these kind of usecases.
The idea of max.poll.interval.ms is to prevent livelock situation (see), but in your case, consumer will send a false positive to the Kafka broker and will trigger a rebalance as there is no way to distinguish between a livelock and a legitimate long process.
You should think about the tradeoffs between living with the negative consequences you mentioned Vs. introducing a new technology which helps you to model a job queue in a better way. For a more complex usecase, check out how slack is doing it.
The way we got around the issues we were having was as suggested in the comments.
We decided to decouple the message processing from the consumer polling.
On each worker/consumer there were 2 threads, one for doing the actual processing and the other for phoning home to Kafka periodically.
We also did some work with trying to reduce the processing times for messages.
However some messages still take time that can be measured in minutes.
This has worked for us now for some time with no issues.
Thanks for this suggestions in comments #Donal

Kafka group re-balancing after consumer failed. org.apache.kafka.clients.consumer.internals.ConsumerCoordinator

I'm running a Kafka cluster with 4 nodes, 1 producer and 1 consumer. It was working fine until consumer failed. Now after I restart the consumer, it starts consuming new messages but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased session timeout, tried to change group id and it still does the same thing.
Also is the client version of Kafka consumer a big deal?
I'd suggest you to decouple the consumer and the processing logic, to start with. E.g. let the Kafka consumer only poll messages and maybe after sanitizing the messages (if necessary) delegate the actual processing of each record to a separate thread, then see if the same error is still occurring. The error says, you're spending too much time between the subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using. Kafka had a different heartbeat management policy before version 0.10 which could make this issue easier to reproduce.

Why is the kafka consumer consuming the same message hundreds of times?

I see from the logs that exact same message is consumed by the 665 times. Why does this happen?
I also see this in the logs
Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies
that the poll loop is spending too much time message processing. You can address this either by increasing the session
timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
Consumer properties
group.id=someGroupId
bootstrap.servers=kafka:9092
enable.auto.commit=false
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
session.timeout.ms=30000
max.poll.records=20
PS: Is it possible to consume only a specific number of messages like 10 or 50 or 100 messages from the 1000 that are in the queue?
I was looking at 'fetch.max.bytes' config, but it seems like it is for a message size rather than number of messages.
Thanks
The answer lies in the understanding of the following concepts:
session.timeout.ms
heartbeats
max.poll.interval.ms
In your case, your consumer receives a message via poll() but is not able to complete the processing in max.poll.interval.ms time. Therefore, it is assumed hung by the Broker and re-balancing of partitions happen due to which this consumer loses the ownership of all partitions. It is marked dead and is no longer part of a consumer group.
Then when your consumer completes the processing and calls poll() again two things happen:
Commit fails as the consumer no longer owns the partitions.
Broker identifies that the consumer is up again and therefore a re-balance is triggered and the consumer again joins the Consumer Group, start owning partitions and request messages from the Broker. Since the earlier message was not marked as committed (refer #1 above, failed commit) and is pending processing, the broker delivers the same message to consumer again.
Consumer again takes a lot of time to process and since is unable to finish processing in less than max.poll.interval.ms, 1. and 2. keep repeating in a loop.
To fix the problem, you can increase the max.poll.interval.ms to a large enough value based on how much time your consumer needs for processing. Then your consumer will not get marked as dead and will not receive duplicate messages.
However, the real fix is to check your processing logic and try to reduce the processing time.
The fix is described in the message you pasted:
You can address this either by increasing the session timeout or by
reducing the maximum size of batches returned in poll() with
max.poll.records.
The reason is a timeout is reached before your consumer is able to process and commit the message. When your Kafka consumer "commits", it's basically acknowledging receipt of the previous message, advancing the offset, and therefore moving onto the next message. But if that timeout is passed (as is the case for you), the consumer's commit isn't effective because it's happening too late; then the next time the consumer asks for a message, it's given the same message
Some of your options are to:
Increase session.timeout.ms=30000, so the consumer has more time
process the messages
Decrease the max.poll.records=20 so the consumer has less messages it'll need to work on before the timeout occurs. But this doesn't really apply to you because your consumer is already only just working on a single message
Or turn on enable.auto.commit, which probably also isn't the best solution for you because it might result in dropping messages though, as mentioned below:
If we allowed offsets to auto commit as in the previous example
messages would be considered consumed after they were given out by the
consumer, and it would be possible that our process could fail after
we have read messages into our in-memory buffer but before they had
been inserted into the database.
Source: https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html

max.poll.intervals.ms set to int.Max by default

Apache Kafka documentation states:
The internal Kafka Streams consumer max.poll.interval.ms default value
was changed from 300000 to Integer.MAX_VALUE
Since this value is used to detect when the processing time for a batch of records exceeds a given threshold, is there a reason for such an "unlimited" value?
Does it enable applications to become unresponsive? Or Kafka Streams has a different way to leave the consumer group when the processing is taking too long?
Does it enable applications to become unresponsive? Or Kafka Streams has a different way to leave the consumer group when the processing is taking too long?
Kafka Streams leverages a heartbeat functionality of the Kafka consumer client in this context, and thus decouples heartbeats ("Is this app instance still alive?") from calls to poll(). The two main parameters are session.timeout.ms (for the heartbeat thread) and max.poll.interval.ms (for the processing thread), and their difference is described in more detail at https://stackoverflow.com/a/39759329/1743580.
The heartbeating was introduced so that an application instance may be allowed to spent a lot of time processing a record without being considered "not making progress" and thus "be dead". For example, your app can do a lot of crunching for a single record for a minute, while still heartbeating to Kafka "Hey, I'm still alive, and I am making progress. But I'm simply not done with the processing yet. Stay tuned."
Of course you can change max.poll.interval.ms from its default (Integer.MAX_VALUE) to a lower setting if, for example, you actually do want your app instance to be considered "dead" if it takes longer than X seconds in-between polling records, and thus if it takes longer than X seconds to process the latest round of records. It depends on your specific use case whether or not such a configuration makes sense -- in most cases, the default setting is a safe bet.
session.timeout.ms: The timeout used to detect consumer failures when using Kafka's group management facility. The consumer sends periodic heartbeats to indicate its liveness to the broker. If no heartbeats are received by the broker before the expiration of this session timeout, then the broker will remove this consumer from the group and initiate a rebalance. Note that the value must be in the allowable range as configured in the broker configuration by group.min.session.timeout.ms and group.max.session.timeout.ms.
max.poll.interval.ms: The maximum delay between invocations of poll() when using consumer group management. This places an upper bound on the amount of time that the consumer can be idle before fetching more records. If poll() is not called before expiration of this timeout, then the consumer is considered failed and the group will rebalance in order to reassign the partitions to another member.