I have written a producer which sends more than 12,000 messages in about 23 seconds, but my consumer seems to be receiving only 6k messages per minute.
I added the metrics plugin to keep an eye on the number of messages acknowledged on that queue, but it is only incrementing at 6k messages per minute:
2019-10-23 11:09:46,815 INFO [io.micrometer.core.instrument.logging.LoggingMeterRegistry] artemis.messages.acknowledged{address=my.controller.queue,broker=0.0.0.0,queue=my.controller.queue} value=40156
2019-10-23 11:10:46,818 INFO [io.micrometer.core.instrument.logging.LoggingMeterRegistry] artemis.messages.acknowledged{address=my.controller.queue,broker=0.0.0.0,queue=my.controller.queue} value=46157
The message count metrics for the same period are as follows:
2019-10-23 11:09:46,818 INFO [io.micrometer.core.instrument.logging.LoggingMeterRegistry] artemis.unrouted.message.count{address=my.controller.queue,broker=0.0.0.0} value=2
2019-10-23 11:10:46,815 INFO [io.micrometer.core.instrument.logging.LoggingMeterRegistry] artemis.delivering.durable.message.count{address=my.controller.queue,broker=0.0.0.0,queue=my.controller.queue} value=0
I have the following connection URL parameters:
?minLargeMessageSize=10485760;compressLargeMessages=true;producerWindowSize=-1;reconnectAttempts=-1;confirmationWindowSize=1048576&consumerWindowSize=-1&throttleRate=-1&consumerMaxRate=-1
I was able to figure out the reason for the throttling. In our framework code, we were setting a
com.google.common.util.concurrent.RateLimiter (outboundRateLimiter) to 100,
so only 100 * 60 = 6,000 messages were going out every minute.
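For illustration, a rate limiter configured like this caps the send path at 100 messages per second no matter how fast the producer can publish. This is a minimal sketch, not our actual framework code; the ThrottledSender class and sendToQueue method are placeholders:

```java
import com.google.common.util.concurrent.RateLimiter;

public class ThrottledSender {
    // 100 permits per second => at most 100 * 60 = 6,000 messages per minute.
    private final RateLimiter outboundRateLimiter = RateLimiter.create(100.0);

    public void send(String message) {
        outboundRateLimiter.acquire();   // blocks until a permit is available
        sendToQueue(message);            // placeholder for the actual JMS/Artemis send
    }

    private void sendToQueue(String message) {
        // the real producer.send(...) would go here
    }
}
```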
Related
We have 2 Kafka clusters, each with 6 nodes, in an active/standby setup, and a topic with 12 partitions; 12 instances of the app are running. Our app is a consumer, and all consumers use the same consumer group ID to receive events from Kafka. Event processing from Kafka is sequential: an event comes in -> the event is processed -> a manual ack is done. This processing takes approximately 5 seconds to complete before the manual acknowledgement. Although there are multiple instances, only one event at a time is processed. Recently we found an issue in production: consumer rebalancing happens every 2 seconds, and because of this the message offset commit (manual ack) fails, the same event is delivered twice, and duplicate records are inserted into the database.
Kafka consumer config values are:
max.poll.interval.ms = 300000 // 5 mins
max.poll.records = 500
heartbeat.interval.ms = 3000
session.timeout.ms = 10000
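For reference, the configuration and processing pattern described above correspond roughly to the poll/process/manual-commit loop below. This is a hedged sketch; the bootstrap servers, group ID, topic name, and deserializers are assumptions, not the actual application code:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SequentialConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");             // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");      // manual ack
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");   // 5 mins
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "10000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events-topic")); // placeholder
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);        // ~5 seconds per event in this scenario
                    consumer.commitSync();  // the "manual ack"
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) { /* ... */ }
}
```

One thing worth noting about these settings: with max.poll.records = 500 and roughly 5 seconds per record, a single batch returned by poll() could take up to ~2,500 seconds to work through, far longer than the 300-second max.poll.interval.ms.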
Errors seen:
1. Commit offset failed since the consumer is not part of an active group.
2. Time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time on message processing.
But message processing takes about 5 seconds, well under the configured max poll interval of 5 minutes. Since processing is sequential, only one consumer can poll and receive an event at a time, and the other instances have to wait for their turn to poll. Is that what is causing the above 2 errors and the rebalancing? Appreciate the help.
For testing purposes, I posted 5k messages on a Kafka topic, and I am using a pull method to read 100 messages per iteration in my Spring Batch application; it runs for ~2 hours before it finishes.
At times I face the error below and execution stops.
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before successfully committing offsets
What could be the reason, and what is the fix?
Did you consume all the messages within the two hours? If it is still consuming, MAX_POLL_INTERVAL_MS_CONFIG may be the trigger. The default is 5 minutes: if the interval between calls to poll() is more than 5 minutes, the consumer is kicked out of the consumer group and a rebalance happens.
During this process, the consumer group is unavailable.
I don't have more information, so this only points in the direction of a solution.
Edit (2021-10-12): "If it is still consuming, MAX_POLL_INTERVAL_MS_CONFIG may be triggered" means that if consumption has not finished, the consumption rate is very slow, so this mechanism may be what is being triggered. You can adjust this parameter and try to verify that.
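To experiment with that, the two relevant knobs can be adjusted on the consumer properties. The values below are illustrative only, not recommendations:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;

import java.util.Properties;

public class PollIntervalTuning {
    public static Properties apply(Properties props) {
        // Allow more time between poll() calls before the consumer is evicted...
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // 10 minutes
        // ...and/or shrink the batch so each batch finishes well within that window.
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        return props;
    }
}
```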
The core application takes 500ms on average to process a record.
I have tried the below patterns. Unfortunately, I couldn't get rid of rebalancing.
Total events: 360,000.
Confluent Platform: 3-node cluster; each node is a VM with a 2 TB SSD disk and a good NIC.
Application node details: 24 cores, 96 GB memory. While the application is running, used memory is 20 GB (other applications may also be consuming some) and CPU usage is 400% (out of 24 cores).
Source topic detail: 1 topic, 10 partitions.
Here, the number of entities determines the processing speed of an event:
all - at most 100 entities per record (event, Kafka message); average processing time for this is 500 ms
20 - at most 20 entities per record (event, Kafka message); average processing time is lower
I have referred to many resources: this answer, KIP-62, the Confluent docs, Matthias J. Sax's answers, and this blog.
I am having a hard time setting these values to avoid rebalancing.
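For context, in a Kafka Streams application these consumer-level settings can be overridden via StreamsConfig.consumerPrefix(). A minimal sketch with purely illustrative values; the application ID and bootstrap servers are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class RebalanceTuning {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "abc");            // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

        // Consumer-level overrides; the numbers below are illustrative only.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 10);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 30000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 10000);
        return props;
    }
}
```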
Heartbeat expiration logs:
[2021-07-27 07:11:50,775] INFO [GroupCoordinator 3]: Preparing to rebalance group
abc in state PreparingRebalance with old generation 594 (__consumer_offsets-13)
(reason: removing member abc-990854d8-
f8d7-4b77-9318-2542000258d2-StreamThread-1-consumer-badbdf8b-6705-4319-be8f-57a71d2366ef
on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
The moment max.poll.records was reduced to 10, there were no more heartbeat expirations, but rebalancing still happened, apparently due to a metadata change(?):
[2021-07-28 12:09:03,030] INFO [GroupCoordinator 3]: Preparing to rebalance group abc
in state PreparingRebalance with old generation 10 (__consumer_offsets-13)
(reason: Updating metadata for member
abc-e0cb67ba-e587-44f9-844b-746cd498392a-StreamThread-1-consumer-
e875d55d-75b5-4b0a-ad10-cc045223690d during Stable) (kafka.coordinator.group.GroupCoordinator)
The thing that confuses me the most is the heartbeat expiration: there was no network error and the application didn't crash. Why does this error occur, and why does the application receive duplicate messages at that time? (The times match between the application log and the Kafka log.)
Another two executions:
Is there any reason for a metadata update when none of the threads is down (judging from the logs, there are no heartbeat expirations)? How do we keep these duplicates to a minimum?
This issue has been eating at me for a few days now, and I have come here to get some help in figuring out the root cause. Let me elaborate on the whole issue.
Problem Statement
I have a Kafka Streams topology which reads JSON strings from a Kafka topic. There is a processor which takes these messages and inserts them into DynamoDB using AmazonDynamoDBAsyncClient. That's pretty much all the topology does.
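For context, a minimal sketch of such a topology might look like the following. The topic name, table name, attribute mapping, and use of default String serdes are assumptions, not the actual implementation:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsync;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Map;

public class JsonToDynamoTopology {
    // Hypothetical table name; the real one is not given in the question.
    private static final String TABLE = "my-table";

    public static StreamsBuilder build(AmazonDynamoDBAsync dynamo) {
        StreamsBuilder builder = new StreamsBuilder();
        // Read JSON strings from the source topic (name is a placeholder;
        // default String serdes are assumed in the Streams config).
        KStream<String, String> source = builder.stream("source-topic");
        // Fire off an async insert into DynamoDB for each record.
        source.foreach((key, json) -> dynamo.putItemAsync(
                new PutItemRequest(TABLE, Map.of(
                        "id", new AttributeValue(key),
                        "payload", new AttributeValue(json)))));
        return builder;
    }
}
```

The resulting builder.build() would then be passed to a KafkaStreams instance together with the application ID, bootstrap servers, and thread-count settings discussed below.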
More Details
The source Kafka topic has 200 partitions. The Kafka Streams topology is currently configured with 50 stream threads (we previously tried 10, 20, 30, 100, and 200 with no luck).
Issue Being Faced
We are visualizing the lag on the Kafka topic along with the consumption rate (per minute) in a Grafana dashboard. What we see is that, after the Streams process is started, there is a steady consumption rate of 300K to 500K messages per minute for around 5 to 6 minutes. After that, the rate drops steeply and stays fixed at 63K per minute. It doesn't go up or down; it is pinned right at 63K messages per minute.
Parameters configured
Poll_ms - 10000 (10 secs)
Max_Poll_Records - 10000
Max.Partition.Fetch.bytes - 50 MB
commit_ms - 15000 (15 secs)
kafka.consumer.heartbeat.interval - 60 secs
session.timeout.ms - 180000 (3 minutes)
partition.assignment.strategy - org.apache.kafka.clients.consumer.RoundRobinAssignor
AmazonAsyncClient Connection Pool Size - 200 (to match the no. of topic partitions)
DynamoDB Model
We even looked at the metrics on the corresponding DynamoDB table and saw throttling for 10 or 15 seconds, after which autoscaling kicked in. We saw no capacity issues/errors in CloudWatch.
Please let me know if more details are needed or if the problem statement is unclear. Appreciate the help.
Thread dumps
We checked the thread dumps for any clues. We only see 200 consumer threads in the "WAITING" state, waiting to poll, and there were no threads in the BLOCKED state.
I'm running a Kafka cluster with 4 nodes, 1 producer, and 1 consumer. It was working fine until the consumer failed. Now, after I restart the consumer, it starts consuming new messages, but after some minutes it throws this error:
[WARN ]: org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - Auto offset commit failed for group eventGroup: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured session.timeout.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records.
And it starts consuming the same messages again and loops forever.
I increased the session timeout and tried changing the group ID, and it still does the same thing.
Also, does the client version of the Kafka consumer matter much here?
I'd suggest you decouple the consumer and the processing logic, to start with. E.g. let the Kafka consumer only poll messages and, maybe after sanitizing them (if necessary), delegate the actual processing of each record to a separate thread, then see if the same error still occurs. The error says you're spending too much time between subsequent polls, so this might resolve your issue. Also, please mention the version of Kafka you're using. Kafka had a different heartbeat management policy before version 0.10, which could make this issue easier to reproduce.
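To illustrate the suggestion (this is a hedged sketch, not the asker's actual code), a minimal hand-off pattern could look like this; the topic name, pool size, and consumer properties are assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DecoupledConsumer {
    public static void run(Properties consumerProps) {
        ExecutorService workers = Executors.newFixedThreadPool(8); // pool size is illustrative
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("eventTopic")); // placeholder topic
            while (true) {
                // The poll loop only fetches; heavy work never blocks it.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    workers.submit(() -> process(record)); // hand off to a worker thread
                }
                // With auto-commit enabled (as in the question), offsets are committed around
                // poll(); note this acknowledges records that may still be in flight on workers.
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) { /* slow work here */ }
}
```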