"min.fetch.bytes" not Working Properly when used in Conjunction with "fetch.max.wait.ms" in Kafka Consumer - apache-kafka

I have a use case where I need to wait for a specific time interval before fetching records from Kafka, but if a minimum amount of data is already present in the topic, I need to get the records immediately without waiting out the interval.
I set the following Kafka consumer config values:
spring:
  kafka:
    consumer:
      auto-offset-reset: latest
      properties:
        security.protocol: SASL_SSL
        sasl.mechanism: PLAIN
        ssl.endpoint.identification.algorithm: https
        max.poll.interval.ms: 3600000
        max.poll.records: 10000
        fetch.min.bytes: 2
        fetch.max.wait.ms: 60000
        request.timeout.ms: 120000
retry:
  interval: 1000
  max.attempts: 3
I am observing the following -
When retryable exceptions occur, the consumer tries to commit the current offsets and re-seek to the current position. But that fetch request also takes 60 s, even though fetch.min.bytes is set to 2.
Can someone please help, or explain why this behaviour is observed?
Why are records not returned even though fetch.min.bytes is set to 2, and why does the consumer always wait the full 60 s? This happens especially after retryable exceptions occur.
I've attached a screenshot of the logs; it shows that after the exception occurs, it takes at least 1 min for the message to be retried.
Note: the Spring Kafka version I am using on the consumer side is org.springframework.kafka:spring-kafka:2.8.8
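To make the expected contract concrete, here is a minimal sketch of how the two fetch settings are supposed to interact. The dictionary keys follow kafka-python-style spellings (an assumption; the question itself uses Spring Boot YAML), and `broker_responds_immediately` is a hypothetical model of the broker-side wait, not real client code:

```python
# Sketch of the expected fetch semantics. The broker should answer a fetch
# as soon as EITHER condition is met:
#   - at least fetch_min_bytes of data is available, OR
#   - fetch_max_wait_ms has elapsed.
consumer_config = {
    "auto_offset_reset": "latest",
    "max_poll_interval_ms": 3_600_000,
    "max_poll_records": 10_000,
    "fetch_min_bytes": 2,         # return as soon as 2 bytes are ready...
    "fetch_max_wait_ms": 60_000,  # ...or after 60 s, whichever comes first
    "request_timeout_ms": 120_000,
}

def broker_responds_immediately(available_bytes: int, cfg: dict) -> bool:
    """Hypothetical model of the broker-side wait: respond now iff
    enough data is already buffered for this fetch request."""
    return available_bytes >= cfg["fetch_min_bytes"]

print(broker_responds_immediately(0, consumer_config))   # empty topic: wait up to 60 s
print(broker_responds_immediately(10, consumer_config))  # data present: immediate
```

Under this model, only a genuinely empty partition should ever make the fetch block for the full fetch.max.wait.ms.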

Related

Kafka : Consumer is taking ~ 2 sec to fetch next batch

I have set the Kafka consumer config as below:
consumer:
  bootstrap-servers: ${VAR1}
  group-id: ${VAR2}
  auto-offset-reset: earliest
  key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
  value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
  properties:
    max:
      poll:
        records: 20
        interval.ms: 300000
listener:
  ack-mode: manual
I observed that after consuming one batch successfully, the consumer takes approximately 1.5 to 2 seconds to fetch the next batch.
Can anyone help me understand how to reduce this ~2 s gap between batches and configure the consumer to get the next batch immediately?
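As a rough way to reason about where the gap comes from: with the default fetch settings the broker replies almost immediately when data exists, so the delay between batches is usually dominated by processing and commit time rather than the fetch itself. The sketch below is a hedged back-of-the-envelope model under that assumption; the property names mirror the standard Apache Kafka consumer configs, and the timing numbers are illustrative, not measured:

```python
# Defaults that bound broker-side fetch latency (per standard Kafka docs):
defaults = {
    "fetch.min.bytes": 1,      # broker replies as soon as any data exists
    "fetch.max.wait.ms": 500,  # max broker-side wait when the topic is empty
    "max.poll.records": 500,   # default batch cap (the question lowers it to 20)
}

# With manual ack-mode, the time between batches is roughly:
#   processing time of previous batch + commit round-trip + fetch wait
def next_batch_delay_ms(processing_ms: float, commit_ms: float,
                        fetch_wait_ms: float) -> float:
    """Simplified additive model of the gap between consecutive batches."""
    return processing_ms + commit_ms + fetch_wait_ms

# Illustrative numbers that reproduce the observed ~2 s gap:
print(next_batch_delay_ms(1500, 50, 500))
```

If the model holds, shrinking the gap means profiling the per-batch processing and commit path first, since the fetch wait is capped at 500 ms by default.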

Kafka producer quota and timeout exceptions

I am trying to come up with a configuration that enforces a producer quota based on the producer's average byte rate.
I ran a test on a 3-node cluster. The topic, however, was created with 1 partition and a replication factor of 1, so that producer_byte_rate would be measured against only 1 broker (the leader).
I set the producer_byte_rate to 20480 on client id test_producer_quota.
I used kafka-producer-perf-test to test out the throughput and throttle.
kafka-producer-perf-test --producer-props bootstrap.servers=SSL://kafka-broker1:6667 \
  client.id=test_producer_quota \
  --topic quota_test \
  --producer.config /myfolder/client.properties \
  --record-size 2048 --num-records 4000 --throughput -1
I expected the producer client to learn about the throttle and eventually smooth out the requests sent to the broker. Instead, I noticed alternating throughput of 98 recs/sec and 21 recs/sec for a period of more than 30 seconds. During this time, average latency slowly kept increasing, and finally when it hit 120000 ms I started to see a timeout exception as below:
org.apache.kafka.common.errors.TimeoutException : Expiring 7 records for quota_test-0: 120000 ms has passed since batch creation.
What is possibly causing this issue?
The producer hits the timeout when latency reaches 120 seconds (the default value of delivery.timeout.ms).
Why isn't the producer learning about the throttle and quota and slowing down or backing off?
What other producer configuration could help alleviate this timeout issue?
(2048 * 4000) / 20480 = 400 (sec)
This means that if your producer tries to send the 4000 records at full speed (which is the case, because you set throughput to -1), it will batch them and put them in the queue within maybe one or two seconds (depending on your CPU).
Then, given your quota setting (20480 bytes/sec), you can be sure the broker won't 'complete' the processing of those 4000 records before at least 398-400 seconds have passed.
The broker does not return an error when a client exceeds its quota, but instead attempts to slow the client down. The broker computes the amount of delay needed to bring a client under its quota and delays the response for that amount of time.
With delivery.timeout.ms at its default of 120 seconds, you then get this TimeoutException.
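The arithmetic above can be written out explicitly. This is just the calculation from the answer in executable form; the figures come from the question's perf-test flags and quota setting:

```python
# How long the broker needs to drain the whole run at the configured quota,
# and why queued batches expire before that happens.
record_size = 2048           # bytes per record (--record-size)
num_records = 4000           # --num-records
quota_bytes_per_sec = 20480  # producer_byte_rate set on the broker

drain_seconds = record_size * num_records / quota_bytes_per_sec
print(drain_seconds)  # 400.0 seconds to push everything through the quota

delivery_timeout_ms = 120_000  # default delivery.timeout.ms
# Records queued at the start must wait far longer than the delivery
# timeout allows, so a TimeoutException is inevitable:
print(drain_seconds * 1000 > delivery_timeout_ms)  # True
```

The throttle mechanism delays responses; it does not push back on how fast the client enqueues records, so the client-side queue simply ages out.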

Kafka Streams: TimeoutException: Failed to update metadata after 60000 ms

I'm running a Kafka Streams application consuming one topic with 20 partitions at ~15K records/sec. The application does 1-hour windowing and suppresses the latest result per window. After running for some time, it starts getting TimeoutException and then the instance is marked down by Kafka Streams.
Error trace:
Caused by: org.apache.kafka.streams.errors.StreamsException:
task [2_7] Abort sending since an error caught with a previous record
(key 372716656751 value InterimMessage [xxxxx..]
timestamp 1566433307547) to topic XXXXX due to org.apache.kafka.common.errors.TimeoutException:
Failed to update metadata after 60000 ms.
You can increase producer parameter `retries` and
`retry.backoff.ms` to avoid this error.
I already increased the values of those two configs:
retries = 5
retry.backoff.ms = 80000
Should I increase them again, as the error message suggests? What would be good values for these two settings?
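One thing worth checking before raising those values further: "Failed to update metadata after 60000 ms" is governed by the metadata-blocking timeout (max.block.ms, default 60000 ms), not by retries/retry.backoff.ms, which only apply once a send is in flight. The sketch below lays out the knobs involved; treat the interpretation as an assumption to verify against your client version rather than a definitive diagnosis:

```python
# Producer timeout knobs relevant to this error (standard Kafka names):
producer_config = {
    "retries": 5,
    "retry.backoff.ms": 80_000,    # value from the question; unusually high
    "max.block.ms": 60_000,        # default; bounds the metadata wait above
    "delivery.timeout.ms": 120_000 # default; caps total time incl. retries
}

# delivery.timeout.ms caps the total retry window, so with an 80 s backoff
# at most one retry can even be attempted before the record expires:
retries_that_fit = (producer_config["delivery.timeout.ms"]
                    // producer_config["retry.backoff.ms"])
print(retries_that_fit)  # 1
```

In other words, a very large retry.backoff.ms can silently neutralise a high retries count; a smaller backoff with a larger delivery.timeout.ms is usually the more coherent combination.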

Kafka Consumer left consumer group

I've run into a problem using Kafka; any help is much appreciated!
I have a ZooKeeper and Kafka cluster, 3 nodes each, running in Docker Swarm. The Kafka broker configuration is below.
KAFKA_DEFAULT_REPLICATION_FACTOR: 3
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
KAFKA_MIN_INSYNC_REPLICAS: 2
KAFKA_NUM_PARTITIONS: 8
KAFKA_REPLICA_SOCKET_TIMEOUT_MS: 30000
KAFKA_REQUEST_TIMEOUT_MS: 30000
KAFKA_COMPRESSION_TYPE: "gzip"
KAFKA_JVM_PERFORMANCE_OPTS: "-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80"
KAFKA_HEAP_OPTS: "-Xmx768m -Xms768m -XX:MetaspaceSize=96m"
My case:
20 producers constantly producing messages to a Kafka topic
1 consumer that reads and logs messages
Kill a Kafka node (docker container stop), so the cluster now has 2 Kafka broker nodes (the 3rd will start and rejoin the cluster automatically)
The consumer then stops consuming messages because it left the consumer group during rebalancing
Is there any mechanism to tell the consumer to rejoin the group after rebalancing?
Logs:
INFO 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] Attempt to heartbeat failed since group is rebalancing
WARN 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
@Rostyslav Whenever the consumer reads a message, it makes 2 major calls:
Poll
Commit
Poll fetches records from the Kafka topic, and commit tells Kafka to save them as read so they are not read again. While polling, a few parameters play a major role:
max_poll_records
max_poll_interval_ms
FYI: variable names are per the Python API.
Hence, the consumer must call poll at least once every max_poll_interval_ms, and the next poll happens only after the previously fetched records (up to max_poll_records of them) have been processed. So whenever max_poll_records are not processed within max_poll_interval_ms, we get this error.
To overcome the issue, we need to alter one of the two variables. Altering max_poll_interval_ms can be tricky, since some records take much longer to process than others. I always advise tuning max_poll_records instead; that works for me.
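The rule described above can be modelled in a few lines. This is a simplified sketch (names follow the answer's Python-style spellings, and the per-record timings are illustrative assumptions): the consumer is evicted when processing one full batch takes longer than the poll interval budget.

```python
def will_leave_group(records_per_poll: int, ms_per_record: float,
                     max_poll_interval_ms: int) -> bool:
    """True if processing a full batch exceeds the allowed gap
    between poll() calls, triggering eviction from the group."""
    return records_per_poll * ms_per_record > max_poll_interval_ms

# 500 records at 700 ms each blows a 300 s budget (350 s of processing)...
print(will_leave_group(500, 700, 300_000))  # True
# ...while capping max_poll_records at 20 keeps the poll loop alive (14 s).
print(will_leave_group(20, 700, 300_000))   # False
```

This is why lowering max_poll_records is usually the safer lever: it shrinks the worst-case processing time per poll without touching the session semantics.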

Apache Kafka: Lowering `request.timeout.ms` causes metadata fetch failures?

I have a 9 broker, 5 node Zookeeper Kafka setup.
In order to reduce the time taken to report failures, we had set request.timeout.ms to 3000. However, with this setting I'm observing some weird behavior.
Occasionally, I'm seeing client (producer) getting an error:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
This doesn't happen always. There are some producers that work just fine.
When I bumped up the request.timeout.ms value, I didn't see any errors.
Any idea why lowering request.timeout.ms causes metadata fetch timeouts?