Someone on my team changed some configuration in Kafka and I don't know what was changed. No one admits to the changes, and I have to explain this case.
Since then, our applications have shown errors similar to the one below, even though access to the topics is set to All:
[2022-01-03 09:16:35,398] ERROR [Worker clientId=connect-1, groupId=test-connect] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:290)
org.apache.kafka.common.errors.GroupAuthorizationException: Not authorized to access group: test-connect
Previously, access was granted at the topic and user level, not at the group level, and it worked well.
Do you have any idea what is wrong?
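For context, a group-level grant for the group named in the error would look roughly like this when created with the Kafka AdminClient, assuming the built-in ACL authorizer is in use (the broker address and principal below are placeholders):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GrantGroupRead {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder address

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the Connect worker's principal to read the consumer group named in the error.
            AclBinding binding = new AclBinding(
                new ResourcePattern(ResourceType.GROUP, "test-connect", PatternType.LITERAL),
                new AccessControlEntry("User:connect", "*", AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(Collections.singleton(binding)).all().get();
        }
    }
}

If authorization is managed by an external tool rather than Kafka's own ACLs, the equivalent grant would be made there instead.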
I have an issue with committing offsets in one of my services. It uses consumer.assign, not subscribe. After processing messages it commits offsets to Kafka using commitAsync. Sometimes (once every few days) the commit fails with a RetriableCommitFailedException, and in the logs I see messages like this:
[Consumer clientId=my-client-id, groupId=my-group-id] Offset commit failed on partition my-topic-28 at offset 283259051: The request timed out.
[Consumer clientId=my-client-id, groupId=my-group-id] Group coordinator 10.54.116.10:9093 (id: 2147483643 rack: null) is unavailable or invalid due to cause: error response REQUEST_TIMED_OUT.isDisconnected: false. Rediscovery will be attempted.
For some reason this rediscovery sometimes has no effect, and after 10 minutes of retrying the commit is still failing.
At first I thought this was somehow related to the fact that I'm using assign, not subscribe, and that I was receiving a rebalance that I wasn't handling properly. But according to the javadocs, ConsumerRebalanceListener does not apply when using assign, so the problem is not the rebalance itself.
Also, the admins said that all Kafka nodes were healthy when I received the error, and the partition leader was not changing.
At the moment I have no clue which direction to look in. Why does the commit fail even after 10 minutes of retrying? Why does group coordinator rediscovery sometimes fail?
I'm using Java client 2.8.0; the broker version is 2.3.1.
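For illustration, a minimal sketch of the assign/commitAsync pattern described above (the topic, partition, and client settings are placeholders, not the actual service code):

import java.time.Duration;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AssignAndCommitAsync {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9093"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group-id");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 28);
            consumer.assign(Collections.singleton(tp)); // manual assignment, no subscribe/rebalance

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                long lastOffset = -1;
                for (ConsumerRecord<String, String> record : records) {
                    // process(record) ...
                    lastOffset = record.offset();
                }
                if (lastOffset >= 0) {
                    Map<TopicPartition, OffsetAndMetadata> offsets =
                        Collections.singletonMap(tp, new OffsetAndMetadata(lastOffset + 1));
                    consumer.commitAsync(offsets, (committedOffsets, exception) -> {
                        if (exception != null) {
                            // RetriableCommitFailedException from the logs above surfaces here;
                            // the commit can simply be retried on a later iteration.
                            System.err.println("Async commit failed for " + committedOffsets + ": " + exception);
                        }
                    });
                }
            }
        }
    }
}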
I have an Event Hub Namespace with two Event Hubs (event-hub and event-hub-2). To establish the connection I use Kafka - the namespace is on the Standard tier, of course. When I try to connect to the second Event Hub (event-hub-2 as the Kafka topic, the connection string as the Kafka password), I get the following stack trace:
2021-06-17T15:56:04.976Z - WARN: [NetworkClient] [Consumer clientId=consumer-$Default-1, groupId=$Default] Error while fetching metadata with correlation id 11 : {event-hub=TOPIC_AUTHORIZATION_FAILED}
2021-06-17T15:56:04.980Z - ERROR: [Metadata] [Consumer clientId=consumer-$Default-1, groupId=$Default] Topic authorization failed for topics [event-hub]
2021-06-17T15:56:05.007Z - ERROR: [KafkaConsumerActor] [9e1ad] Exception when polling from consumer, stopping actor: org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [event-hub]
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [event-hub]
My question is: why am I getting this kind of stack trace when I didn't even try to connect to the topic/Event Hub it mentions? It's weird...
If you are using the same consumer group in both scenarios, your consumer needs read access to all topics used in that consumer group. Try changing the group.id and test again.
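For illustration, a consumer configuration along these lines gives each subscriber its own consumer group (the namespace, group name, and connection string below are placeholders):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.config.SaslConfigs;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EventHub2Consumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Event Hubs exposes a Kafka endpoint at <namespace>.servicebus.windows.net:9093.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "my-namespace.servicebus.windows.net:9093");
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "PLAIN");
        // The connection string acts as the password; the value below is a placeholder.
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"$ConnectionString\" password=\"<connection-string>\";");
        // Give this subscriber its own consumer group instead of sharing $Default.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "event-hub-2-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("event-hub-2"));
            // poll loop omitted
        }
    }
}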
The problem came back when I connected my subscribers to the Event Hubs simultaneously. Just like Ran said, connecting with different consumer groups resolved the problem. Many thanks!
I have a Kafka Streams application that launches and runs successfully. We have 4 instances of the application running. Occasionally one of our instances is legitimately killed, which causes several rounds of rebalancing until the old node is replaced.
Sometimes during the rebalance, one or more previously healthy nodes fail. The logs indicate that the Streams application transitions into a PENDING_SHUTDOWN state directly after receiving the following exception:
java.lang.IllegalStateException: No current assignment for partition public.chat.message-28
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:256)
at org.apache.kafka.clients.consumer.internals.SubscriptionState.resetFailed(SubscriptionState.java:418)
at org.apache.kafka.clients.consumer.internals.Fetcher$2.onFailure(Fetcher.java:621)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.RequestFutureAdapter.onFailure(RequestFutureAdapter.java:30)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:571)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:389)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:292)
at org.apache.kafka.clients.consumer.internals.Fetcher.getAllTopicMetadata(Fetcher.java:275)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1849)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1827)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.refreshChangelogInfo(StoreChangelogReader.java:259)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.initialize(StoreChangelogReader.java:133)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:79)
at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:328)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:866)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:804)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:773)
Prior to this error we also often seem to receive some informational logs reporting a disconnect exception:
Error sending fetch request (sessionId=568252460, epoch=7) to node 4: org.apache.kafka.common.errors.DisconnectException
I have a feeling the two are related but I'm unable to reason why at present.
Is anyone able to give me some hints as to what may be causing this issue and any possible solutions?
Additional Info:
Kafka 2.2.1
32 partitions spread evenly across the 4 worker nodes
StreamsConfig settings:
kafkaStreamProps.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
kafkaStreamProps.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
kafkaStreamProps.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
kafkaStreamProps.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 120000);
kafkaStreamProps.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
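For reference, a sketch of how the state transitions above could be logged with a state listener and an uncaught exception handler (the topology and log output here are placeholders, not our actual code):

import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;

public class StreamsLifecycleLogging {

    public static KafkaStreams start(Properties kafkaStreamProps) {
        // Placeholder topology; the real application builds its own.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), kafkaStreamProps);
        streams.setStateListener((newState, oldState) -> {
            // PENDING_SHUTDOWN shows up here right after the exception above is thrown.
            System.out.printf("Streams state transition: %s -> %s%n", oldState, newState);
        });
        streams.setUncaughtExceptionHandler((thread, throwable) ->
            System.err.println("Stream thread " + thread.getName() + " died: " + throwable));
        streams.start();
        return streams;
    }
}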
This looks like it could be related to https://issues.apache.org/jira/browse/KAFKA-9073, which has been fixed in Kafka Streams 2.3.2.
If you can't wait for that release, you could try creating a private build using the changeset from this pull request: https://github.com/apache/kafka/pull/7630/files
Trying to run Kafka Connect for the first time, against an existing Kafka deployment, using SASL_PLAINTEXT and Kerberos authentication.
The first time I try to start connect-distributed, I see:
ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:227)
org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
If I immediately run it a second time, without changing anything, I instead see:
ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:227)
org.apache.kafka.common.errors.TopicAuthorizationException: Not authorized to access topics: [Offsets]
This is reproducible.
Worker config:
producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
bootstrap.servers=mybroker:9092
rest.port=28082
group.id=some-group
config.storage.topic=Configs
offset.storage.topic=Offsets
status.storage.topic=Status
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
rest.advertised.host.name=localhost
log4j.root.loglevel=INFO
security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka
sasl.mechanism=GSSAPI
consumer.security.protocol=SASL_PLAINTEXT
consumer.sasl.kerberos.service.name=kafka
consumer.sasl.mechanism=GSSAPI
producer.security.protocol=SASL_PLAINTEXT
producer.sasl.kerberos.service.name=kafka
producer.sasl.mechanism=GSSAPI
A career in software has taught me to always assume that the problem is completely unrelated to the error log, but for once it was correct:
Ranger was configured incorrectly and I genuinely wasn't authorized to access that topic.
After enabling exactly-once processing on a Kafka Streams application, the following error appears in the logs:
ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort
sending since an error caught with a previous record (key 222222 value
some-value timestamp 1519200902670) to topic exactly-once-test-topic-
v2 due to This exception is raised by the broker if it could not
locate the producer metadata associated with the producerId in
question. This could happen if, for instance, the producer's records
were deleted because their retention time had elapsed. Once the last
records of the producerId are removed, the producer's metadata is
removed from the broker, and future appends by the producer will
return this exception.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
We've reproduced the issue with a minimal test case where we move messages from a source stream to another stream without any transformation. The source stream contains millions of messages produced over several months. The KafkaStreams object is created with the following StreamsConfig:
StreamsConfig.PROCESSING_GUARANTEE_CONFIG = "exactly_once"
StreamsConfig.APPLICATION_ID_CONFIG = "Some app id"
StreamsConfig.NUM_STREAM_THREADS_CONFIG = 1
ProducerConfig.BATCH_SIZE_CONFIG = 102400
The app is able to process some messages before the exception occurs.
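A minimal sketch of such a pass-through topology with this configuration might look like the following (topic names, bootstrap servers, and serdes are placeholders, not the actual test case):

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class PassThroughApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "Some app id");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.producerPrefix(ProducerConfig.BATCH_SIZE_CONFIG), 102400);

        StreamsBuilder builder = new StreamsBuilder();
        // Move records from the source topic to the sink topic without any transformation.
        builder.stream("source-topic").to("sink-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}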
Context information:
we're running a 5 node Kafka 1.1.0 cluster with 5 zookeeper nodes.
there are multiple instances of the app running
Has anyone seen this problem before or can give us any hints about what might be causing this behaviour?
Update
We created a new 1.1.0 cluster from scratch and started to process new messages without problems. However, when we imported old messages from the old cluster, we hit the same UnknownProducerIdException after a while.
Next we tried setting the cleanup.policy on the sink topic to compact while keeping retention.ms at 3 years. With this setting the error did not occur. However, messages seem to have been lost: the source offset is 106 million, but the sink offset is only 100 million.
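For reference, that kind of topic configuration change could be applied with the Kafka admin client roughly as follows (the broker address and topic name are placeholders; note that the legacy alterConfigs call replaces the topic's full set of config overrides):

import java.util.Arrays;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class CompactSinkTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "sink-topic"); // placeholder name
            Config config = new Config(Arrays.asList(
                new ConfigEntry("cleanup.policy", "compact"),
                new ConfigEntry("retention.ms", String.valueOf(3L * 365 * 24 * 60 * 60 * 1000)))); // ~3 years
            admin.alterConfigs(Collections.singletonMap(topic, config)).all().get();
        }
    }
}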
As explained in the comments, there currently seems to be a bug that may cause problems when replaying messages older than the (maximum configurable?) retention time.
At the time of writing this is unresolved; the latest status can always be seen here:
https://issues.apache.org/jira/browse/KAFKA-6817