Kafka producer quota and timeout exceptions - apache-kafka

I am trying to come up with a configuration that enforces a producer quota based on the average byte rate of the producer.
I did a test with a 3-node cluster. The topic, however, was created with 1 partition and a replication factor of 1, so that the producer_byte_rate is measured against only 1 broker (the leader).
I set the producer_byte_rate to 20480 on client id test_producer_quota.
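(For reference, a quota like that is typically applied with kafka-configs, roughly as below; this is only a sketch that reuses the broker address and SSL client properties from the perf-test command further down, and on older clusters you may need --zookeeper with the ZooKeeper connect string instead of --bootstrap-server.)
kafka-configs --bootstrap-server kafka-broker1:6667 --command-config /myfolder/client.properties \
  --alter --add-config 'producer_byte_rate=20480' \
  --entity-type clients --entity-name test_producer_quota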
I used kafka-producer-perf-test to test out the throughput and throttle.
kafka-producer-perf-test --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
client.id=test_producer_quota \
--topic quota_test \
--producer.config /myfolder/client.properties \
--record-size 2048 --num-records 4000 --throughput -1
I expected the producer client to learn about the throttle and eventually smooth out the requests sent to the broker. Instead I noticed the throughput alternating between 98 recs/sec and 21 recs/sec for more than 30 seconds. During this time the average latency slowly kept increasing, and when it finally hit 120000 ms I started to see a TimeoutException as below:
org.apache.kafka.common.errors.TimeoutException : Expiring 7 records for quota_test-0: 120000 ms has passed since batch creation.
What is possibly causing this issue?
The producer hits the timeout when latency reaches 120 seconds (the default value of delivery.timeout.ms).
Why isn't the producer learning about the throttle and quota and slowing down or backing off?
What other producer configuration could help alleviate this timeout issue?

(2048 * 4000) / 20480 = 400 (sec)
This means that if your producer tries to send the 4000 records at full speed (which is the case, because you set throughput to -1), it will batch them and put them in the queue in maybe one or two seconds (depending on your CPU).
Then, because of your quota setting (20480 bytes/sec), you can be sure that the broker won't 'complete' the processing of those 4000 records before at least 398 or 399 seconds have passed.
The broker does not return an error when a client exceeds its quota, but instead attempts to slow the client down. The broker computes the amount of delay needed to bring a client under its quota and delays the response for that amount of time.
With delivery.timeout.ms at its default of 120 seconds, batches that are still queued after that long get expired, which is exactly the TimeoutException you see.
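Two mitigations follow directly from that math (a sketch reusing the broker, topic, and client id from the question; the 480000 ms value is only an example chosen to exceed the roughly 400 s of expected throttling):
# Option 1: keep the client under its quota: 20480 / 2048 = 10 records/sec
kafka-producer-perf-test --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
  client.id=test_producer_quota \
  --topic quota_test --producer.config /myfolder/client.properties \
  --record-size 2048 --num-records 4000 --throughput 10
# Option 2: keep sending at full speed but let batches outlive the throttle delay
kafka-producer-perf-test --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
  client.id=test_producer_quota delivery.timeout.ms=480000 \
  --topic quota_test --producer.config /myfolder/client.properties \
  --record-size 2048 --num-records 4000 --throughput -1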

Related

"min.fetch.bytes" not Working Properly when used in Conjunction with "fetch.max.wait.ms" in Kafka Consumer

I have a use case where I need to wait for a specific time interval before fetching records from Kafka. But if a minimum amount of data is present in the topic, I need to get the records immediately and not wait for the interval.
I set the following Kafka consumer config values:
spring:
  kafka:
    consumer:
      auto-offset-reset: latest
      properties:
        security.protocol: SASL_SSL
        sasl.mechanism: PLAIN
        ssl.endpoint.identification.algorithm: https
        max.poll.interval.ms: 3600000
        max.poll.records: 10000
        fetch.min.bytes: 2
        fetch.max.wait.ms: 60000
        request.timeout.ms: 120000
retry:
  interval: 1000
  max.attempts: 3
I am observing the following:
When retryable exceptions are faced, the consumer tries to commit the current offsets and re-seek the current position. But that fetch request also takes 60 s, even though fetch.min.bytes is set to 2.
Can someone please help here or explain why this behaviour is observed?
Why are records not returned even though fetch.min.bytes is set to 2, and why does the fetch always wait the full 60 s? This happens especially when retryable exceptions are faced.
I added a screenshot of the logs; after the exception has occurred, it takes at least 1 minute for the message to be retried.
Note: the Spring Kafka version I am using on the consumer side is org.springframework.kafka:spring-kafka=2.8.8
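One way to check whether this is the broker's fetch behaviour or Spring's retry/seek handling (a sketch; the bootstrap server and topic are placeholders for your own, and your SASL_SSL settings would go in a file passed via --consumer.config) is to run a plain console consumer with the same fetch settings and watch when records come back:
kafka-console-consumer --bootstrap-server <broker:port> --topic <your-topic> \
  --consumer.config <client.properties> \
  --consumer-property fetch.min.bytes=2 \
  --consumer-property fetch.max.wait.ms=60000
With fetch.min.bytes=2 the broker should answer a fetch as soon as at least 2 bytes are available, and only hold the request for the full fetch.max.wait.ms when the partition has no new data.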

AWS MSK, Kafka producer throughput relation with number of partitions

Partitions define the unit of parallelism in Kafka, but increasing partitions may result in decreased producer throughput, because replication uses up more cluster bandwidth.
But in experiments the following was observed:
With 3 brokers: with 2 partitions on each broker, performance decreases compared to 1 partition on each broker.
With 9 brokers: with 3 partitions on each broker, performance increases compared to 1 partition on each broker.
Going by the 3-broker scenario, the performance should have degraded, but it increased.
What can be the reason for such behaviour?
Experiment details:
kafka-producer-perf-test was used to do benchmarking
Parameters passed to tool: --num-records 12000000 --throughput -1 acks=1 linger.ms=100 buffer.memory=5242880 compression.type=none request.timeout.ms=30000 --record-size 1000
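(Assembled into a single invocation, that would look roughly like the following; the topic and bootstrap servers are placeholders, and it assumes the key=value entries above were passed via --producer-props.)
kafka-producer-perf-test --topic <topic> --num-records 12000000 --record-size 1000 --throughput -1 \
  --producer-props bootstrap.servers=<brokers> acks=1 linger.ms=100 \
  buffer.memory=5242880 compression.type=none request.timeout.ms=30000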
Results of test in attached image

Kafka streams throwing InvalidProducerEpochException frequently

I have a Kafka Streams application with 4 instances, each running on a separate EC2 instance with 16 threads. Total threads = 16 * 4. The input topic has only 32 partitions. I understand that some of the threads will remain idle.
I am continuously seeing this exception:
Caused by: org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
01:57:23.971 [kafka-producer-network-thread | bids_kafka_streams_beta_007-fd78c6fa-62bc-437d-add0-c31f5b7c1901-StreamThread-12-1_6-producer] ERROR org.apache.kafka.streams.processor.internals.RecordCollectorImpl - stream-thread [bids_kafka_streams_beta_007-fd78c6fa-62bc-437d-add0-c31f5b7c1901-StreamThread-12] task [1_6] Error encountered sending record to topic kafka_streams_bids_output for task 1_6 due to:
org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
Written offsets would not be recorded and no more records would be sent since the producer is fenced, indicating the task may be migrated out
The only settings I have changed in the streams config are the producer configs, to reduce CPU usage on the brokers:
linger.ms=10000
commit.interval.ms=10000
Records are windowed in 2-minute windows.
Is it due to rebalancing? Why is it so frequent?
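For reference, a minimal sketch of how such overrides usually look in a Streams properties file (the application.id is taken from the log above; linger.ms is a producer setting, so it can be namespaced with the producer. prefix, while commit.interval.ms is a Streams-level setting):
application.id=bids_kafka_streams_beta_007
commit.interval.ms=10000
producer.linger.ms=10000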

Apache Kafka: Lowering `request.timeout.ms` causes metadata fetch failures?

I have a Kafka setup with 9 brokers and a 5-node ZooKeeper ensemble.
In order to reduce the time for reporting failures, we had set request.timeout.ms to 3000. However, with this setting, I'm observing some weird behavior.
Occasionally, I'm seeing the client (producer) get an error:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
This doesn't happen always. There are some producers that work just fine.
When I bumped up the request.timeout.ms value, I didn't see any errors.
Any idea why lowering request.timeout.ms causes metadata fetch timeouts?
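For context, a minimal sketch of the two producer settings involved (3000 is the value from the question; 60000 is the default max.block.ms, which is where the 60000 ms in the error message comes from, since send() blocks at most that long waiting for metadata):
# How long the client waits for any single request, including metadata requests (value from the question)
request.timeout.ms=3000
# Default; upper bound on how long send() blocks waiting for metadata before throwing
# "Failed to update metadata after 60000 ms."
max.block.ms=60000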

What does ProducerPerformance Tool in Kafka give?

What does running the following Kafka tool actually give?
./bin/kafka-run-class.sh org.apache.kafka.tools.ProducerPerformance --throughput=10000 --topic=TOPIC --num-records=50000000 --record-size=200 --producer-props bootstrap.servers=SERVERS buffer.memory=67108864 batch.size=64000
When running with a single producer I get 90 MB/s. When I use 3 separate producers on separate nodes I get only around 60 MB/s per producer. (My Kafka cluster consists of 2 nodes, and the topic has 6 partitions.)
What does 90 MB/s mean? Is it the maximum rate at which a producer can produce?
Does partition count affect this value?
Why does it drop to 60 MB/s when there are 3 producers (still no network saturation on the broker front)?
Thank you
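As a rough way to read those numbers (plain arithmetic on the figures in the question, nothing specific to the tool):
90 MB/s / 200 bytes per record ~ 450,000 to 470,000 records/s for the single producer (depending on whether the tool reports MB as 10^6 or 2^20 bytes)
3 x 60 MB/s = 180 MB/s aggregate across the three producers
So while the per-producer rate drops, the total throughput into the 2-node cluster roughly doubles.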