Kafka producer timeout exception - apache-kafka

[1] 2022-01-18 21:56:10,280 ERROR [org.apa.cam.pro.err.DefaultErrorHandler] (Camel (camel-1) thread #9 - KafkaProducer[test]) Failed delivery for (MessageId: 95835510BC9E9B2-0000000000134315 on ExchangeId: 95835510BC9E9B2-0000000000134315). Exhausted after delivery attempt: 1 caught: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test-0:121924 ms has passed since batch creation
[1]
[1] Message History (complete message history is disabled)
[1] ---------------------------------------------------------------------------------------------------------------------------------------
[1] RouteId ProcessorId Processor Elapsed (ms)
[1] [route1 ] [route1 ] [from[netty://udp://0.0.0.0:8080?receiveBufferSize=65536&sync=false] ] [ 125320]
[1] ...
[1] [route1 ] [to1 ] [kafka:test?brokers=10.99.155.100:9092&producerBatchSize=0 ] [ 0]
[1]
[1] Stacktrace
[1] ---------------------------------------------------------------------------------------------------------------------------------------
[1] : org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test-0:121924 ms has passed since batch creation
Here's the flow for my project:
1. External Service ---> Netty
2. Netty ---> Kafka(consumer)
3. Kafka(producer) ---> processing events
1 and 2 are running in one Kubernetes pod and 3 is running in a separate pod.
I encountered a TimeoutException at the beginning, saying something like:
org.apache.kafka.common.errors.TimeoutException: Expiring 20 record(s) for test-0:121924 ms has passed since batch creation
I searched online and found a couple of potential solutions
Kafka Producer error Expiring 10 record(s) for TOPIC:XXXXXX: 6686 ms has passed since batch creation plus linger time
Based on the suggestions, I have done the following:
made the timeout bigger (double the default value)
set the batch size to 0, which will not send events in batches and keeps memory usage low
Unfortunately I still encounter the error because the memory is used up.
Does anyone know how to solve it?
Thanks!

There are several things to take into account here.
You are not showing what your throughput is; you have to take that value into account and check whether your broker on 10.99.155.100:9092 is able to handle such a load.
Did you check 10.99.155.100 during the time of the transfer? The fact that Kafka can potentially process hundreds of thousands of messages per second doesn't mean that you can do it on any hardware.
Having said that, the timeout is the first thing that comes to mind, but in your case you have 2 minutes and you are still timing out; to me this sounds more like a problem on your broker than on your producer.
To understand the issue: basically, you are filling your mouth faster than you can swallow; by the time you push a message, the broker is not able to acknowledge it in time (in this case, within 2 minutes).
Things you can do here:
Check the broker performance for the given load.
Change your delivery.timeout.ms to an acceptable value; I guess you have SLAs to meet.
Increase your retry backoff timer (retry.backoff.ms).
Do not set the batch size to 0; this attempts a live push to the broker, which for this load seems not to be possible.
Make sure your max.block.ms is set correctly.
Change to bigger batches (even if this increases latency), but not too big; you need to sit down, check how many records you are pushing and size them correctly.
Now, some rules:
delivery.timeout.ms must be bigger than the sum of request.timeout.ms and linger.ms.
All of the above are impacted by batch.size.
If you don't have that many rows, but those rows are huge, then control max.request.size.
So, to summarize, the properties to change are the following: delivery.timeout.ms, request.timeout.ms, linger.ms, max.request.size.
Assuming the hardware is good, and also assuming that you are not sending more than you should, those should do the trick.
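Here is a minimal sketch of those settings using the plain Java producer API. The numeric values are illustrative assumptions, not recommendations, and need to be sized against your actual throughput and SLAs:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.99.155.100:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Rule from above: delivery.timeout.ms >= request.timeout.ms + linger.ms
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 180_000); // total time to report success or failure
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60_000);   // per-request wait for a broker response
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);                // small delay so batches can fill up
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);        // wait between retries
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65_536);           // bigger batches instead of batch.size=0
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60_000);         // how long send() may block on metadata/memory
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1_048_576);  // cap on a single request if records are large

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // producer.send(...) as usual
        }
    }
}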

Related

ArtemisMQ Address Memory Filling Up Too Quickly

I am using ArtemisMQ 2.17. When I look at each of the queues, the message count for each always shows 0, as we have listeners actively scanning to pick up messages quickly. However, when I look at the overall message count on the broker, it grows from 0 to upwards of millions; this starts about 20 minutes into running the broker.
If I restart the broker service, my address memory is cleared. Sometimes (I haven't determined the time interval just yet) when I restart the service, I'll see messages show up in the ExpiryQueue. But this doesn't happen every time I restart the service, only sometimes.
Our application uses Spring JMS for the producer and listeners; the addresses/queues are multicast.
Here is what I am seeing in the broker attributes section of the console. I would expect the message count to always be 0, but after 20 minutes or so it starts to climb indefinitely, and the acknowledged and added counts no longer match up at that point.
What can I do so that messages are picked up and completely removed? Or, what is happening here? Where are these messages going, and why do they randomly appear in my ExpiryQueue after a few restarts?
Output of: ./artemis queue stat
View in console for broker level attributes:

Will the Kafka producer always wait for the value specified by linger.ms before sending a request?

As per LINGER_MS_DOC in ProducerConfig java class:
"The producer groups together any records that arrive in between
request transmissions into a single batched request. Normally this
occurs only under load when records arrive faster than they can be
sent out. However in some circumstances the client may want to reduce
the number of requests even under moderate load. This setting
accomplishes this by adding a small amount of artificial delay; that
is, rather than immediately sending out a record the producer will
wait for up to the given delay to allow other records to be sent so
that the sends can be batched together. This can be thought of as
analogous to Nagle's algorithm in TCP. This setting gives the upper
bound on the delay for batching: once we get "BATCH_SIZE_CONFIG" worth
of records for a partition it will be sent immediately regardless of
this setting, however if we have fewer than this many bytes
accumulated for this partition we will 'linger' for the specified time
waiting for more records to show up. This setting defaults to 0 (i.e.
no delay). Setting "LINGER_MS_CONFIG=5" for example, would have the
effect of reducing the number of requests sent but would add up to 5ms
of latency to records sent in the absence of load."
I searched for a suggested value for linger.ms but nowhere found a higher value recommended; in most places 5 ms is mentioned for linger.ms.
For testing, I have set batch.size to 16384 (16 KB) and linger.ms to 60000 (60 seconds).
As per the doc, I expected that if I send a message of size > 16384 bytes, the producer would not wait and would send it immediately, but I am not observing that behavior.
I am sending events of size > 16384 bytes, but the producer still waits for 60 seconds. Am I misunderstanding the purpose of batch.size? My understanding of batch.size and linger.ms is that whichever is met first, the messages/batch will be sent.
In this case, if linger.ms acts as a minimum wait time and batch.size is not given preference, then I guess setting a high value for linger.ms is not right.
Here are the Kafka properties used in the YAML:
producer:
  properties:
    acks: all
    retries: 5
    batch:
      size: 16384
    linger:
      ms: 10
    max:
      request:
        size: 1046528
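To take Spring out of the picture, here is a small reproduction sketch with the plain Java producer, using the same batch.size and linger.ms as in the test above; the broker address, topic name and payload size are illustrative assumptions. It measures how long a single record larger than batch.size waits before being acknowledged:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class LingerTest {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16_384);  // 16 KB, as in the test above
        props.put(ProducerConfig.LINGER_MS_CONFIG, 60_000);   // 60 seconds, as in the test above
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1_046_528);

        byte[] payload = new byte[20_000];  // deliberately larger than batch.size
        long start = System.currentTimeMillis();

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", payload), (metadata, exception) ->
                    System.out.println("acked after " + (System.currentTimeMillis() - start) + " ms"));
            // Do not call flush() here: flush() forces an immediate send and would hide any lingering.
            Thread.sleep(70_000);  // wait long enough to see whether the full linger.ms elapsed
        }
    }
}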

Kafka Connect fetch.max.wait.ms & fetch.min.bytes combined not honored?

I'm creating a custom SinkConnector using Kafka Connect (2.3.0) that needs to be optimized for throughput rather than latency. Ideally, what I want is:
Batches of ~20 MB or 100k records, whichever comes first; but if the message rate is low, process at least every minute (avoid small batches, but keep a minimum MySinkTask.put() rate of once per minute).
This is what I set for consumer settings in an attempt to accomplish it:
consumer.max.poll.records=100000
consumer.fetch.max.bytes=20971520
consumer.fetch.max.wait.ms=60000
consumer.max.poll.interval.ms=120000
consumer.fetch.min.bytes=1048576
I need this fetch.min.bytes setting; otherwise MySinkTask.put() is called multiple times per second despite the other settings.
Now, what I observe in a low-rate situation is that MySinkTask.put() is called with 0 records multiple times and several minutes pass by, until fetch.min.bytes is reached, and then I get them all at once.
What I fail to understand so far:
Why is fetch.max.wait.ms=60000 not propagating down from the consumer to the put() call of my connector? Shouldn't it take precedence over fetch.min.bytes?
What setting controls the ~2x-per-second calls to MySinkTask.put() when fetch.min.bytes=1 (the default)? I don't understand why it does that; even the verbose output of the Connect runtime settings doesn't show any interval below multiples of seconds.
I've double-checked the log output, and the INFO o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values: lines printed by the Connect runtime show the expected values for the settings I pass with the consumer. prefix.
The "process at least every interval" part seems not possible, as the fetch.min.bytes consumer setting takes precedence and Connect does not allow you to dynamically adjust the ConsumerConfig while the Task is running. :-(
Work-around for now is batching in the Task manually; set fetch.min.bytes to 1 (yikes), buffer records in the Task on put() calls, and flush when necessary. This is not very ideal as it infers some overhead for the Connector which I hoped to avoid.
The logic how Connect does a ~ 2x per second batching from its consumer's poll to SinkTask.put() remains a mystery to me, but it's better than being called for every message.
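For reference, a minimal sketch of that manual-buffering work-around, assuming hypothetical thresholds and a hypothetical writeBatch() helper for the target system; it is not the original connector code:

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

public class BufferingSinkTask extends SinkTask {
    private static final int MAX_BUFFERED_RECORDS = 100_000;  // flush when this many records are buffered...
    private static final long MAX_BUFFER_AGE_MS = 60_000L;    // ...or at least once per minute

    private final List<SinkRecord> buffer = new ArrayList<>();
    private long lastWriteMs = System.currentTimeMillis();

    @Override
    public String version() {
        return "0.1-sketch";
    }

    @Override
    public void start(Map<String, String> props) {
        lastWriteMs = System.currentTimeMillis();
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // put() may arrive many times per second with small (or empty) collections;
        // buffer them and only write downstream once a size or age threshold is hit.
        buffer.addAll(records);
        boolean bigEnough = buffer.size() >= MAX_BUFFERED_RECORDS;
        boolean oldEnough = System.currentTimeMillis() - lastWriteMs >= MAX_BUFFER_AGE_MS;
        if (!buffer.isEmpty() && (bigEnough || oldEnough)) {
            drain();
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Drain before offsets are committed so nothing is lost on rebalance or shutdown.
        if (!buffer.isEmpty()) {
            drain();
        }
    }

    @Override
    public void stop() {
        // nothing to release in this sketch
    }

    private void drain() {
        writeBatch(buffer);
        buffer.clear();
        lastWriteMs = System.currentTimeMillis();
    }

    private void writeBatch(List<SinkRecord> batch) {
        // hypothetical helper: write the accumulated records to the target system here
    }
}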

Kafka producer future metadata in callback

In my application, when I send messages, I use the metadata in the callback to save the offset of the record for future use. However, sometimes metadata.offset() returns -1, which makes things hard later.
Why does this happen, and is there a way to get the offset without consuming the topic to find it?
Edit: I am currently on acks=0; when I switch to acks=1 I don't have these errors anymore, but my performance drops drastically: from 100k messages in 10 seconds to 1 minute.
acks=0 If set to zero then the producer will not wait for any
acknowledgment from the server at all. The record will be immediately
added to the socket buffer and considered sent. No guarantee can be
made that the server has received the record in this case, and the
retries configuration will not take effect (as the client won't
generally know of any failures). The offset given back for each
record will always be set to -1.
This is not exactly true, as out of 100k messages I got 95k with offsets, but I guess that's normal.
Still, I will need to find another solution to get the offset with acks=0.
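For reference, a minimal sketch of reading the offset from the send callback; the broker address, topic and key are illustrative assumptions. As the quoted documentation says, acks=0 never returns a real offset, so acks=1 (or all) is what populates metadata.offset():

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OffsetCallbackExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "1");  // the cheapest setting that still returns real offsets

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    // With acks=0 this would be -1; with acks=1 or acks=all it is the record's offset.
                    System.out.printf("partition=%d offset=%d%n", metadata.partition(), metadata.offset());
                }
            });
        }
    }
}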

Retry logic in kafka consumer

I have a use case where I consume certain logs from a queue and hit some third-party APIs with info from each log. In case the third-party system is not responding properly, I wish to implement retry logic for that particular log.
I can add a time field and re-push the message to the same queue; the message will get consumed again if its time field is valid, i.e. less than the current time, and if not, it gets pushed into the queue again.
But this logic will add the same log again and again until its retry time is reached, and the queue will grow unnecessarily.
Is there a better way to implement retry logic in Kafka?
You can create several retry topics and push the failed task there. For instance, you can create 3 topics with different delays in minutes and rotate a single failed task through them until the max attempt limit is reached.
‘retry_5m_topic’ — for retry in 5 minutes
‘retry_30m_topic’ — for retry in 30 minutes
‘retry_1h_topic’ — for retry in 1 hour
See more for details: https://blog.pragmatists.com/retrying-consumer-architecture-in-the-apache-kafka-939ac4cb851a
In the consumer, if processing throws an exception, produce another message with attempt number 1, so the next time it is consumed it carries attempt number 1 as a property. Handle it on the producing side so that if the attempt count exceeds your retry limit, you stop producing the message.
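A minimal sketch of that attempt-counter idea, combined with the retry topics from the previous answer; the header name, dead-letter topic and attempt limit are illustrative assumptions, not standard Kafka features:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

import java.nio.charset.StandardCharsets;

public class RetryPublisher {
    private static final int MAX_ATTEMPTS = 3;                            // assumed limit
    private static final String RETRY_TOPIC = "retry_5m_topic";
    private static final String DEAD_LETTER_TOPIC = "dead_letter_topic";  // assumed name

    private final KafkaProducer<String, String> producer;

    public RetryPublisher(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // Call this from the consumer's exception handler when the third-party API call fails.
    public void scheduleRetry(ConsumerRecord<String, String> failed) {
        int attempt = 0;
        Header header = failed.headers().lastHeader("attempt");
        if (header != null) {
            attempt = Integer.parseInt(new String(header.value(), StandardCharsets.UTF_8));
        }

        // Past the limit, stop retrying and park the message instead of producing it again.
        String target = attempt + 1 > MAX_ATTEMPTS ? DEAD_LETTER_TOPIC : RETRY_TOPIC;
        ProducerRecord<String, String> retry = new ProducerRecord<>(target, failed.key(), failed.value());
        retry.headers().add("attempt", String.valueOf(attempt + 1).getBytes(StandardCharsets.UTF_8));
        producer.send(retry);
    }
}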
Yes, this could be one straightforward solution that I also thought of. But with this we will end up creating many topics, as it is possible that message processing fails again.
I solved this problem by mapping this use case to RabbitMQ. In RabbitMQ we have the concept of a retry exchange: if message processing fails in an exchange, you can send the message to a retry exchange with a TTL. Once the TTL expires, the message moves back to the main exchange and is ready to be processed again.
I can post some examples explaining how we can implement exponential-backoff message processing using RabbitMQ.