How to fix 'Kafka Offset commit failed on partition: The request timed out' - apache-kafka

I suddenly got exceptions of type in production Kafka
ERROR[pool-XX-thread-YY] org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer clientId=someclientid, groupId=somegroup] Offset commit failed on partition SomeTopic-SomePartition at offset SomeOffset: The request timed out.
This occurred for 3.5 seconds from a lot of different services (clients) (different threads\different topics\different partitions) and than just self healed...
The offset commit configurations is 5 second auto-commit for all those clients.
Couldn't track anything from kafka broker logs except of some re-balancing right of one group (out of 10 had that issue) which is normal when heart-beating fails, in the metrics server I can see some spikes of commit latency which is the symptom I guess and some TCP spikes on 1 broker (out of 3)
How can I start investigate it? what can cause such an issue? where should I look when things like that is occurring?
Attaching photos of some graphs here:
TCP Spike in server-3
Commit latency spike
Group syncs
Heartbeats

Related

Kafka commit failed due to unsuccessful group coordinator rediscovery

I have an issue with the commit in one of my services. It uses consumer.assign, not subscribe. After processing messages it commits offsets in kafka using commitAsync. Sometimes (one in a few days) commit failed with a RetriableCommitFailedException and in logs I see messages like this:
[Consumer clientId=my-client-id, groupId=my-group-id] Offset commit failed on partition my-topic-28 at offset 283259051: The request timed out.
[Consumer clientId=my-client-id, groupId=my-group-id] Group coordinator 10.54.116.10:9093 (id: 2147483643 rack: null) is unavailable or invalid due to cause: error response REQUEST_TIMED_OUT.isDisconnected: false. Rediscovery will be attempted.
For some reason sometimes this rediscovery has no effect and after 10 minutes of retrying commit is still failing.
At first, I thought that this is somehow related to the fact that I'm using assign, not subscribe. And I somehow receive rebalance that I don't handle properly. But according to the javadocs ConsumerRebalanceListener is not working with the assign, so the problem itself not with the rebalance.
Also, admins said that all kafka nodes was fine when I received an error, and partition leader was not changing.
At the current moment, I have no clue in what direction should I move? Why commit fail even after 10 minutes of retrying? Why group coordinator rediscovery failed sometimes?
I'm using java client 2.8.0, broker version is 2.3.1.

KSQL | Consumer lag | confluent cloud |

I am using kafka confluent cloud as a message queue in the eco-system. There are 2 topics, A and B.
Messages in B arrives a little later after messages of A is being published. ( in a delay of 30 secs )
I am joining these 2 topics using ksql, ksql server is deployed in in-premises and is connected to confluent cloud. In the KSQL i am joining these 2 topics as streams based on the common identifier, say requestId and create a new stream C. C is the joined stream.
At a times, C steam shows it has generated a lag it has not processed messages of A & B.
This lag is visible in the confluent cloud UI. When i login to ksql server i could see following error and after restart of ksql server everything works fine. This happens intermittently in 2 - 3 days.
Here is my configuration in the ksql server which is deployed in in-premises.
# A comma separated list of the Confluent Cloud broker endpoints
bootstrap.servers=${bootstrap_servers}
ksql.internal.topic.replicas=3
ksql.streams.replication.factor=3
ksql.logging.processing.topic.replication.factor=3
listeners=http://0.0.0.0:8088
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="${bootstrap_auth_key}" password="${bootstrap_secret_key}";
# Schema Registry specific settings
ksql.schema.registry.basic.auth.credentials.source=USER_INFO
ksql.schema.registry.basic.auth.user.info=${schema_registry_auth_key}:${schema_registry_secret_key}
ksql.schema.registry.url=${schema_registry_url}
#Additinoal settings
ksql.streams.producer.delivery.timeout.ms=2147483647
ksql.streams.producer.max.block.ms=9223372036854775807
ksql.query.pull.enable.standby.reads=false
#ksql.streams.num.standby.replicas=3 // TODO if we need HA 1+1
#num.standby.replicas=3
# Automatically create the processing log topic if it does not already exist:
ksql.logging.processing.topic.auto.create=true
# Automatically create a stream within KSQL for the processing log:
ksql.logging.processing.stream.auto.create=true
compression.type=snappy
ksql.streams.state.dir=${base_storage_directory}/kafka-streams
Error message in the ksql server logs.
[2020-11-25 14:08:49,785] INFO stream-thread [_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-2] State transition from RUNNING to PARTITIONS_ASSIGNED (org.apache.kafka.streams.processor.internals.StreamThread:220)
[2020-11-25 14:08:49,790] ERROR [Consumer clientId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3-consumer, groupId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0] Offset commit failed on partition yes01-0 at offset 32606388: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1185)
[2020-11-25 14:08:49,790] ERROR [Consumer clientId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3-consumer, groupId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0] Offset commit failed on partition yes01-0 at offset 32606388: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1185)
[2020-11-25 14:08:49,790] WARN stream-thread [_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3] Detected that the thread is being fenced. This implies that this thread missed a rebalance and dropped out of the consumer group. Will close out all assigned tasks and rejoin the consumer group. (org.apache.kafka.streams.processor.internals.StreamThread:572)
org.apache.kafka.streams.errors.TaskMigratedException: Consumer committing offsets failed, indicating the corresponding thread is no longer part of the group; it means all tasks belonging to this thread should be migrated.
at org.apache.kafka.streams.processor.internals.TaskManager.commitOffsetsOrTransaction(TaskManager.java:1009)
at org.apache.kafka.streams.processor.internals.TaskManager.commit(TaskManager.java:962)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:851)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:714)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:510)
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1251)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1158)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1132)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1107)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:206)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:169)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:129)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:602)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:412)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
Edit :
During this exception. i have verified the ksql server has enough RAM and CPU

Kakfa retries Concept - What Basis retries will be stopped in Kafka?

As am new to Kafka , trying to understand the retries concept in Kafka . What basis retries process will be completed ?
Example Retries parameter we set as 7 . Now questions here ,
Kafka will be retried in all 7 times ?
Will be tried until successful process ? If so , How Kafka will come to know about successful ?
If that would be depends upon any parameter what Is that parameter and how ?
In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible.
Kafka will retry until the initiated process is successfully completed or retry count is zero.
Kafka maintains the status of each API call ( producer , consumer, and Streams ), and if the error condition meets then retry count is decreased.
Please go through the completeBatch function of the Sender.java in the following URL to get more information.
https://github.com/apache/kafka/blob/68ac551966e2be5b13adb2f703a01211e6f7a34b/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java
I guess you are talking about producer retrying to send failed messages.
From kafka producer retries property documentation -
"Setting a value greater than zero will cause the client to resend any
record whose send fails with a potentially transient error."
This means that kafka producer will retry if the error it encountered is considered "Retriable". not all errors are retriable - for example, if the target kafka topic does not exist, theres no point in trying to send the message again.
but if for example the connection was interrupted, it makes sense to try again.
Important to note - retries are only relevant if you have set broker ack != 0.
So, in your example you have 7 retries configured.
I assume that ack is set to a value different than 0 because then no retries will be attempted.
If your message failed with a non-retriable error, Kafka producer will not try to send the message again (it will actually 'give-up' on that message and move on to next messages).
If your message failed with a retriable error, Kafka producer will retry sending until message is successfully sent, or until retries are exhausted (when 7 retries were attempted and none of them succeeded).
Kafka client producer knows when your message was successfully sent to broker because when ack is set to 1\all, the kafka broker is "Acknowledging" any message received and informs the producer (in a kind of handshake between the producer and broker).
see acks & retries # https://kafka.apache.org/documentation/#producerconfigs
Kafka reties happens for transient exceptions such as NotEnoughReplicaException.
In Kafka version <=2.0 default retry is 0.
In Kafka version > 2.0 default retry is Integer.MAX
From kafka 2.1 retries are bounded to timeouts, there are couple of producer configuration such as.
delivery.timeout.ms=120000ms - by default producer will retry for 2 mins, if retry is not successful after 2 mins the request will not send to broker and we have to handle manually.
retry.backoff.ms=100ms - by default every 100ms producer will retry till delivery.timeout reaches.

Kafka Consumer left consumer group

I've faced some problem using Kafka. Any help is much appreciated!
I have zookeeper and kafka cluster 3 nodes each in docker swarm. Kafka broker configuration you can see below.
KAFKA_DEFAULT_REPLICATION_FACTOR: 3
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 3
KAFKA_MIN_INSYNC_REPLICAS: 2
KAFKA_NUM_PARTITIONS: 8
KAFKA_REPLICA_SOCKET_TIMEOUT_MS: 30000
KAFKA_REQUEST_TIMEOUT_MS: 30000
KAFKA_COMPRESSION_TYPE: "gzip"
KAFKA_JVM_PERFORMANCE_OPTS: "-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:G1HeapRegionSize=16M -XX:MinMetaspaceFreeRatio=50 -XX:MaxMetaspaceFreeRatio=80"
KAFKA_HEAP_OPTS: "-Xmx768m -Xms768m -XX:MetaspaceSize=96m"
My case:
20x Producers producing messages to kafka topic constantly
1x Consumer reads and log messages
Kill kafka node (docker container stop) so now cluster has 2 nodes of Kafka broker (3rd will start and join cluster automatically)
And Consumer not consuming messages anymore because it left consumer group due to rebalancing
Does exist any mechanism to tell consumer to join group after rebalancing?
Logs:
INFO 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] Attempt to heartbeat failed since group is rebalancing
WARN 1 --- [ | loggingGroup] o.a.k.c.c.internals.AbstractCoordinator : [Consumer clientId=kafka-consumer-0, groupId=loggingGroup] This member will leave the group because consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
#Rostyslav Whenever we make a call by consumer to read a message, it does 2 major calls.
Poll
Commit
Poll is basically fetching records from kafka topic and commit tells kafka to save it as a read message , so that it's not read again. While polling few parameters play major role:
max_poll_records
max_poll_interval_ms
FYI: variable names are per python api.
Hence, whenever we try to read message from Consumer, it makes a poll call every max_poll_interval_ms and the same call is made only after the records fetched (as defined in max_poll_records) are processed. So, whenever, max_poll_records are not processed in max_poll_inetrval_ms, we get the error.
In order to overcome this issue, we need to alter one of the two variable. Altering max_poll_interval_ms ccan be hectic as sometime it may take longer time to process certain records, sometime lesser records. I always advice to play with max_poll_records as a fix to the issue. This works for me.

Kafka UNKNOWN_PRODUCER_ID exception

I sometimes find UNKNOWN_PRODUCER_ID exception when using kafka streams.
2018-06-25 10:31:38.329 WARN 1 --- [-1-1_0-producer] o.a.k.clients.producer.internals.Sender : [Producer clientId=default-groupz-7bd94946-3bc0-4400-8e73-7126b9b9c0d4-StreamThread-1-1_0-producer, transactionalId=default-groupz-1_0] Got error produce response with correlation id 1996 on topic-partition default-groupz-mplat-five-minute-stat-urlCount-counts-store-changelog-0, retrying (2147483646 attempts left). Error: UNKNOWN_PRODUCER_ID
Referred to official documents:
This exception is raised by the broker if it could not locate the
producer metadata associated with the producerId in question. This
could happen if, for instance, the producer's records were deleted
because their retention time had elapsed. Once the last records of the
producerId are removed, the producer's metadata is removed from the
broker, and future appends by the producer will return this exception.
It says one possibility is that a producer is idle for more than retention time (by default a week) so the producer's metadata will be removed from broker. Are there any other reasons that brokers could not locate producer metadata?
You might be experiencing https://issues.apache.org/jira/browse/KAFKA-7190. As it says in that ticket:
When a streams application has little traffic, then it is possible that consumer purging would delete
even the last message sent by a producer (i.e., all the messages sent by
this producer have been consumed and committed), and as a result, the broker
would delete that producer's ID. The next time when this producer tries to
send, it will get this UNKNOWN_PRODUCER_ID error code, but in this case,
this error is retriable: the producer would just get a new producer id and
retries, and then this time it will succeed.
This issue is also being tracked at https://cwiki.apache.org/confluence/display/KAFKA/KIP-360%3A+Improve+handling+of+unknown+producer
Two reasons might delete your producer's metadata:
The log segments are deleted due to hitting retention time.
The producer state might get expired due to inactivity which is controlled by the setting transactional.id.expiration.ms which defaults to 7 days
So if your Kafka is < 2.4 you can workaround this by increasing the retention time(considering that your system allows that) of your topic's log(e.g 30 days) and to increase the transactional.id.expiration.ms setting( to 24 days) until KIP-360 is released:
log.retention.hours=720
transactional.id.expiration.ms=2073600000
This shall guarantee that for low-traffic topics(messages written rarely than 7 days), your producer's metadata state will remain stored in broker's memory for a longer period, thus decreasing the risk of getting UnknownProducerIdException.