I'm running a Kafka Streams app on version 2.1.0. After running for some time, the nodes of my app (63 nodes) enter the ERROR state one by one. Eventually, all 63 nodes are down.
The exception is:
ERROR o.a.k.s.p.i.ProcessorStateManager - task [2_2] Failed to
flush state store KSTREAM-REDUCE-STATE-STORE-0000000014:
org.apache.kafka.streams.errors.StreamsException: task [2_2]
Abort sending since an error caught with a previous record
(key 110646599468 value InterimMessage [sessionStart=1567150872690,count=1]
timestamp 1567154490411) to topic item.interim due to
org.apache.kafka.common.errors.TimeoutException: Failed to update
metadata after 60000 ms.
You can increase producer parameter `retries` and `retry.backoff.ms`
to avoid this error.
I enabled DEBUG logging and found that the exception happens when Kafka Streams requests a metadata update for the internal topics only, and not for the destination topic (item.interim is the destination topic).
Normally,
[Producer clientId=client-autocreate-StreamThread-1-producer] Sending metadata
request (type=MetadataRequest,
topics=item.interim,test-KSTREAM-REDUCE-STATE-STORE-0000000014-changelog)
to node XXX:9092 (id: 7 rack: XXX)
But before the exception, it was
[Producer clientId=client-autocreate-StreamThread-1-producer] Sending metadata
request (type=MetadataRequest,
topics=test-KSTREAM-REDUCE-STATE-STORE-0000000014-changelog)
to node XXX:9092 (id: 7 rack: XXX)
Config I have changed:
max.request.size=14000000
receive.buffer.bytes=32768
auto.offset.reset=latest
enable.auto.commit=false
default.api.timeout.ms=180000
cache.max.bytes.buffering=10485760
retries=20
retry.backoff.ms=80000
request.timeout.ms=120000
commit.interval.ms=100
num.stream.threads=1
session.timeout.ms=30000
I'm really confused. Could anyone help me understand why the producer sends different metadata requests? And is there any way to solve the problem? Thanks a lot!
Related
After migrating our microservice functionality to Spring Cloud Function, we have been facing issues with one of the producer topics.
Event of type: abc and key: xxx_yyy could not be sent to kafka org.springframework.messaging.MessageHandlingException: error occurred in message handler [org.springframework.cloud.stream.binder.kafka.KafkaMessageChannelBinder$ProducerConfigurationMessageHandler#2333d598]; nested exception is org.springframework.kafka.KafkaException: Send failed; nested exception is org.apache.kafka.common.errors.TimeoutException: Topic pc-abc not present in metadata after 60000 ms.
o.s.kafka.support.LoggingProducerListener - Exception thrown when sending a message with key='byte[15]' and payload='byte[256]' to topic pc-abc and partition 6: org.apache.kafka.common.errors.TimeoutException: Topic pc-abc not present in metadata after 60000 ms.
FYI: Topics are already created in our staging/prod environment and are not to be created as the application starts.
My producer config:
spring.cloud.stream.bindings.pc-abc-out-0.content-type=application/json
spring.cloud.stream.bindings.pc-abc-out-0.destination=pc-abc
spring.cloud.stream.bindings.pc-abc-out-0.producer.header-mode=headers
spring.cloud.stream.bindings.pc-abc-out-0.producer.partition-count=5
spring.cloud.stream.bindings.pc-abc-out-0.producer.partitionKeyExpression=payload.key
spring.cloud.stream.kafka.bindings.pc-abc-out-0.producer.sync=true
I am kind of stuck at this point and exhausted. Has anyone else faced this issue?
Spring Cloud version: 2.5.5
Kafka: 2.7.1
The issue is:
The producer is configured with partition-count=5,
and Kafka is looking for partition number 6, which obviously does not exist. I have commented out the auto-add-partitions property, but the issue still turns up. Is it a stale configuration? How do I force Kafka to pick up the new configuration?
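To make the mismatch concrete, here is a minimal sketch of the arithmetic. It assumes the partition index is derived as a non-negative key hash modulo the partition count; the class and method names are hypothetical and this is not Spring Cloud Stream's actual code. Under that assumption a count of 5 can only yield indexes 0 to 4, so seeing partition 6 suggests an effective count of at least 7 was in use. It may also be worth checking the Spring Cloud Stream reference, which describes partitionCount on Kafka as a hint, with the larger of it and the topic's actual partition count being used.

```java
import java.util.TreeSet;

// Hypothetical illustration only, not Spring Cloud Stream's implementation.
final class PartitionIndexSketch {

    // Assumed selection rule: non-negative key hash modulo the partition count.
    static int partitionFor(Object key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Collects every partition index produced for the sample keys "key-0", "key-1", ...
    static TreeSet<Integer> partitionsSeen(int sampleKeys, int partitionCount) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (int i = 0; i < sampleKeys; i++) {
            seen.add(partitionFor("key-" + i, partitionCount));
        }
        return seen;
    }

    public static void main(String[] args) {
        // With a count of 5, index 6 can never be selected.
        System.out.println("count=5 -> " + partitionsSeen(10_000, 5));
        // Index 6 only becomes reachable once the effective count is at least 7.
        System.out.println("count=7 -> " + partitionsSeen(10_000, 7));
    }
}
```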
I have an issue with commits in one of my services. It uses consumer.assign, not subscribe. After processing messages, it commits offsets to Kafka using commitAsync. Sometimes (once every few days) the commit fails with a RetriableCommitFailedException, and in the logs I see messages like this:
[Consumer clientId=my-client-id, groupId=my-group-id] Offset commit failed on partition my-topic-28 at offset 283259051: The request timed out.
[Consumer clientId=my-client-id, groupId=my-group-id] Group coordinator 10.54.116.10:9093 (id: 2147483643 rack: null) is unavailable or invalid due to cause: error response REQUEST_TIMED_OUT.isDisconnected: false. Rediscovery will be attempted.
For some reason, this rediscovery sometimes has no effect, and after 10 minutes of retrying the commit is still failing.
At first, I thought this was somehow related to the fact that I'm using assign, not subscribe, and that I was receiving a rebalance that I don't handle properly. But according to the javadocs, ConsumerRebalanceListener does not work with assign, so the problem is not with rebalancing.
Also, the admins said that all Kafka nodes were fine when I received the error, and the partition leader was not changing.
At the moment, I have no clue which direction to move in. Why does the commit fail even after 10 minutes of retrying? Why does group coordinator rediscovery sometimes fail?
I'm using Java client 2.8.0; the broker version is 2.3.1.
When a kafka-streams app is running and Kafka suddenly goes down, the app enters a "waiting" mode: the consumer and producer threads log warnings about not being able to connect, and when Kafka is back, everything should (theoretically) go back to normal.
I'm trying to get an alert for this situation, but I can't find a place to catch it and send a log/metric.
I tried the following:
streams.setUncaughtExceptionHandler, but this is only invoked on exceptions, which is not the case here.
Implementing ProductionExceptionHandler and setting the default.production.exception.handler property to my class that implements this interface. Again, as with setUncaughtExceptionHandler, no exception is thrown here, so nothing really happens.
I know Kafka has its own metrics which I can monitor to find out whether a broker is down. But there can be situations where the Kafka brokers are just fine and my kafka-streams app is not able to connect (e.g. bad authentication configuration or VPN/VPC issues).
What can I do to catch those issues and log/report them?
Update:
Here are the consumer/producer logs when Kafka is not available:
2020-08-24 21:41:32,055 [my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1] WARN o.apache.kafka.clients.NetworkClient - [] [Consumer clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-consumer, groupId=my-kafka-streams-app] Bootstrap broker localhost:9092 (id: -1 rack: null) disconnected
2020-08-24 21:41:32,186 [kafka-admin-client-thread | my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-admin] WARN o.apache.kafka.clients.NetworkClient - [] [AdminClient clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-admin] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
2020-08-24 21:41:32,250 [kafka-producer-network-thread | my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-producer] WARN o.apache.kafka.clients.NetworkClient - [] [Producer clientId=my-kafka-streams-app-23a462fe-28c6-415a-a08a-b11d3ffffc2e-StreamThread-1-producer] Connection to node -1 (localhost/127.0.0.1:9092) could not be established. Broker may not be available.
This case is not easy to detect programmatically. The problem is that the clients don't really expose their state to Kafka Streams, and thus Kafka Streams does not know about the disconnect. There is a KIP that proposes adding a DISCONNECTED state, but it's not easy to implement (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-457%3A+Add+DISCONNECTED+status+to+Kafka+Streams).
The exception handlers you mention don't help in this situation, as no exception is thrown (at least not within the Kafka Streams code base).
What you can try is to monitor consumer lag or some Kafka Streams metrics (like processing rate). They might provide a good enough proxy. Cf. https://docs.confluent.io/current/streams/monitoring.html
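As a rough sketch of the "proxy" idea, the helper below flags a stall when a progress counter stops advancing for longer than a threshold. The class name, the threshold, and the AtomicLong counter are all assumptions for illustration; in a real app the counter might be a value read from KafkaStreams#metrics() or a count your own processor increments. A stall can also simply mean there is no input, so treat the result as a hint rather than proof of a disconnect.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical helper: reports a stall when a progress counter stops moving.
final class ProgressStallDetector {
    private final LongSupplier progressCounter; // e.g. total records processed so far
    private final long maxIdleMillis;           // how long without progress counts as a stall
    private long lastValue;
    private long lastChangeMillis;

    ProgressStallDetector(LongSupplier progressCounter, long maxIdleMillis, long nowMillis) {
        this.progressCounter = progressCounter;
        this.maxIdleMillis = maxIdleMillis;
        this.lastValue = progressCounter.getAsLong();
        this.lastChangeMillis = nowMillis;
    }

    // Call periodically; true once the counter has been unchanged for more than maxIdleMillis.
    boolean isStalled(long nowMillis) {
        long current = progressCounter.getAsLong();
        if (current != lastValue) {
            lastValue = current;
            lastChangeMillis = nowMillis;
        }
        return nowMillis - lastChangeMillis > maxIdleMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong processed = new AtomicLong(); // stand-in for a real metric value
        ProgressStallDetector detector =
                new ProgressStallDetector(processed::get, 200, System.currentTimeMillis());
        processed.incrementAndGet();             // simulate one record being processed
        System.out.println("stalled right after progress: "
                + detector.isStalled(System.currentTimeMillis()));
        Thread.sleep(300);                       // no progress for longer than the threshold
        System.out.println("stalled after idle period: "
                + detector.isStalled(System.currentTimeMillis()));
    }
}
```

The check could be driven from a scheduled task that logs or emits a metric when it returns true.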
I have a standalone Confluent server that worked fine until Connect suddenly stopped consuming records.
I enabled trace logging on Connect, and this is the debug output:
DEBUG Added READ_UNCOMMITTED fetch request for partition connect-offsets-10 at offset 0 to node 10.1.*.*:9092 (id: 1001 rack: null) (org.apache.kafka.clients.consumer.internals.Fetcher:787)
I did some research and found out that it may be related to KIP-62.
I tried to reconfigure the server properties with the following values, but I got the same result.
group.initial.rebalance.delay.ms=0
session.timeout.ms=10000
heartbeat.interval.ms=3000
Now the Connect service is in a deadlocked state and unable to consume records.
Please set max.poll.records to a minimal value; it should solve your problem.
If processing one polled batch takes longer than the session timeout, a rebalance is triggered and the same set of data is read over and over again.
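A small sketch of how a batch size could be derived from a time budget. The 50 ms per-record cost and the helper names are assumptions for illustration; the 10000 ms budget is the session.timeout.ms from the question. Note that on clients that include KIP-62 (0.10.1 and later) heartbeats are sent from a background thread, and the time allowed between polls is bounded by max.poll.interval.ms, so that value may be the more relevant budget. For a Connect worker, consumer settings are normally given with a consumer. prefix in the worker properties.

```java
import java.util.Properties;

// Hypothetical helper: sizes max.poll.records so one batch fits in a time budget.
final class PollBudget {

    // Largest batch whose worst-case processing time stays within the budget (at least 1).
    static int maxRecordsWithinBudget(long budgetMillis, long worstCaseMillisPerRecord) {
        if (budgetMillis <= 0 || worstCaseMillisPerRecord <= 0) {
            throw new IllegalArgumentException("budget and per-record cost must be positive");
        }
        long records = budgetMillis / worstCaseMillisPerRecord;
        return (int) Math.max(1, Math.min(records, Integer.MAX_VALUE));
    }

    // Builds consumer overrides containing only the computed max.poll.records.
    static Properties consumerOverrides(long budgetMillis, long worstCaseMillisPerRecord) {
        Properties props = new Properties();
        props.setProperty("max.poll.records",
                Integer.toString(maxRecordsWithinBudget(budgetMillis, worstCaseMillisPerRecord)));
        return props;
    }

    public static void main(String[] args) {
        // 10000 ms budget from the question, assumed 50 ms worst case per record.
        System.out.println(consumerOverrides(10_000, 50));
    }
}
```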
We have a Kafka cluster with three brokers (node ids 0, 1, 2) and a ZooKeeper setup with three nodes.
We created a topic "test" on this cluster with 20 partitions and replication factor 2. We are using the Java producer API to send messages to this topic. One of the Kafka brokers intermittently goes down, after which it is unrecoverable. To simulate the case, we killed one of the brokers manually. As per the Kafka architecture, the cluster is supposed to recover by itself, but that is not happening. When I describe the topic on the console, I see the number of ISRs reduced to one for a few of the partitions, since one of the brokers was killed. Now, whenever we try to push messages via the producer API (either the Java client or the console producer), we encounter a SocketTimeoutException. A quick look at the logs shows "Unable to fetch the metadata":
WARN [2015-07-01 22:55:07,590] [ReplicaFetcherThread-0-3][] kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-3],
Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 23711; ClientId: ReplicaFetcherThread-0-3;
ReplicaId: 0; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [zuluDelta,2] -> PartitionFetchInfo(11409,1048576),[zuluDelta,14] -> PartitionFetchInfo(11483,1048576).
Possible cause: java.nio.channels.ClosedChannelException
[2015-07-01 23:37:40,426] WARN Fetching topic metadata with correlation id 0 for topics [Set(test)] from broker [id:1,host:abc-0042.yy.xxx.com,port:9092] failed (kafka.client.ClientUtils$)
java.net.SocketTimeoutException
at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:201)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
at kafka.utils.Utils$.read(Utils.scala:380)
at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
Any leads will be appreciated...
Given your "Unable to fetch metadata" error, the most likely cause is that bootstrap.servers in the producer is set to the broker that has died.
Ideally, you should list more than one broker in bootstrap.servers, because if one of the brokers fails (or is unreachable), another can still give you the metadata.
FYI: Metadata is the information about a particular topic that tells you how many partitions it has, their leader brokers, follower brokers, etc.
So, when a record is produced to a partition, the messages are sent to that partition's leader broker.
From your question, your ISR set has only one broker. You could try setting bootstrap.servers to this broker.
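For example, a minimal sketch of building the producer properties with several bootstrap brokers. The host names below are placeholders, not taken from the question, and the helper class is hypothetical; the resulting Properties would be passed to the producer along with your serializer settings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: puts every known broker into bootstrap.servers.
final class BootstrapConfig {

    // Joins the host:port entries into the comma-separated list the client expects.
    static Properties producerProps(List<String> brokers) {
        if (brokers.isEmpty()) {
            throw new IllegalArgumentException("at least one broker is required");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", String.join(",", brokers));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder host names; replace them with your brokers 0, 1 and 2.
        List<String> brokers = Arrays.asList(
                "broker0.example.com:9092",
                "broker1.example.com:9092",
                "broker2.example.com:9092");
        System.out.println(producerProps(brokers).getProperty("bootstrap.servers"));
    }
}
```

With all three brokers listed, the client can still bootstrap and fetch metadata from a live broker when one of them is down.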