Kafka Custom Authorizer - apache-kafka

I use Kafka with a custom authorizer.
From the custom authorizer, I call a microservice for authorization. It works fine for a while, then starts throwing the following exception in the logs, and the whole cluster becomes unresponsive. The exception keeps recurring until I restart the cluster. Without the custom authorizer, the cluster runs for months without any issues. Is there a bug in this Kafka version, or is something wrong with the custom authorizer?
TRACE [ReplicaFetcherThread-0-39], Issuing to broker 1 of fetch request kafka.server.ReplicaFetcherThread$FetchRequest#8c63320 (kafka.server.ReplicaFetcherThread)
[2017-06-30 08:29:17,473] TRACE [ReplicaFetcherThread-2-1], Issuing to broker 1 of fetch request kafka.server.ReplicaFetcherThread$FetchRequest#67a143a (kafka.server.ReplicaFetcherThread)
[2017-06-30 08:29:17,473] WARN [ReplicaFetcherThread-3-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest#12d29e06 (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to <HOST:PORT> (id: 1 rack: null) failed
at kafka.utils.NetworkClientBlockingOps$.awaitReady$1(NetworkClientBlockingOps.scala:83)
at kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:93)
at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:248)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
My custom authorizer calls a microservice to check authorization and caches the results in a Guava cache with an expiry time of 10 minutes.

I suggest taking a thread dump to see what all the threads are doing.
Just a guess here, given there isn't much info to go on.
If you have a single cache instance, what could be happening is that once the cache expires, all requests start hitting the microservice for authorization info and, since this adds latency, the thread pool gets exhausted. A thread dump can tell you how many threads are calling the microservice simultaneously.
If this is indeed the problem, one option is to use a separate cache per thread (using a thread-local variable). That way each thread's cache expires at its own time, so the threads don't all hit the microservice at exactly the same moment.
Another, and IMO better, way is to remove the blocking calls to the microservice from the authorize code path completely. Instead of a fall-through cache, keep the cache always up to date by refreshing it from a separate background thread. This way no latency is ever added to the authorize calls.
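A minimal sketch of the background-refresh idea, using only the JDK. The class name RefreshingAclCache, the loader supplier (standing in for the microservice call) and the principal-to-topics map are all hypothetical; this does not implement Kafka's authorizer interface, it only shows the request path reading from an in-memory snapshot while a separate thread does the slow work.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not part of Kafka: holds a snapshot of permissions that a
// background thread refreshes, so lookups on the authorize path never block.
final class RefreshingAclCache implements AutoCloseable {
    // principal -> allowed topics; replaced wholesale on each refresh.
    private volatile Map<String, Set<String>> snapshot = Map.of();
    // Assumption: the loader wraps the (slow) call to the authorization microservice.
    private final Supplier<Map<String, Set<String>>> loader;
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "acl-cache-refresher");
            t.setDaemon(true);
            return t;
        });

    RefreshingAclCache(Supplier<Map<String, Set<String>>> loader) {
        this.loader = loader;
    }

    // Load once synchronously (e.g. at startup), then refresh periodically.
    void start(long period, TimeUnit unit) {
        refresh();
        scheduler.scheduleWithFixedDelay(this::refresh, period, period, unit);
    }

    void refresh() {
        try {
            snapshot = Map.copyOf(loader.get());
        } catch (RuntimeException e) {
            // Keep serving the previous snapshot if the microservice is unavailable.
        }
    }

    // Called on the request path: a plain in-memory lookup, no remote call.
    boolean isAllowed(String principal, String topic) {
        return snapshot.getOrDefault(principal, Set.of()).contains(topic);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

One design consequence: a principal that is missing from the snapshot is denied rather than looked up on demand, so the refresh period bounds how long a newly granted permission takes to become visible.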


Records associated to a Kafka batch listener are not consumed for some partitions after several rebalances (resiliency testing)

Some weeks ago my project was updated to use Kafka 3.2.1 instead of the version that comes with Spring Boot 2.7.3 (3.1.1). We made this upgrade to avoid an issue in Kafka Streams: illegal state and illegal argument exceptions were not reaching the uncaught exception handler.
On the consumer side, we also moved to the cooperative sticky assignor.
In parallel, we started some resiliency tests and began to see Kafka records no longer being consumed on some partitions when using a Kafka batch listener. The issue occurred after several rebalances caused by the tests (the deployment is in Kubernetes, and we stopped some pods, microservices and broker instances). The issue is not present on every listener. The Kafka brokers and microservices are up and running.
During our investigations:
we enabled Kafka events and can clearly see that the consumer is started;
we can see in the logs that the partitions that are not consuming events are assigned;
we enabled debug logging on the KafkaMessageListenerContainer and see many occurrences of Receive: 0 records and Commit list: {}.
Are there any known blockers to using Kafka 3.2.1 with Spring Boot 2.7.3 / Spring Kafka 2.8.8?
Any help or advice to move our investigation forward is more than welcome.
Multiple listeners are defined; the retry seems to be fired from another listener (shared error handler?).
This is a known bug, fixed in the next release:
A temporary workaround is to give each container its own error handler.

How to handle kafka consumer failures

I am trying to understand how to handle failed consumer records. How do
we know that a record has failed? What I am seeing is that when record
processing fails in the consumer with a runtime exception, the consumer
keeps retrying. But when the next record becomes available for
processing, the consumer commits the offset of the latest record, which
is expected. My question is: how do we find out about the failed record?
In older messaging systems, failed messages are rolled back to the queue
and processing stops there. Then we know the queue is down and we can
take action.
I can record the failed record in a DB table, but what happens if this recording fails?
I can move failures to an error/dead letter queue, but again, what happens if this move fails?
I am using Kafka 2.6 with Spring Boot 2.3.4. Any help would be appreciated.
It sounds like you need to disable auto commit and commit the offsets manually once your definition of "successfully processed" has been met. If that includes external systems such as a database, you will also need to increase the Kafka client timeouts so that the consumer is not considered dead while it waits on error logging/handling.
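The settings this answer refers to could look roughly like the following, assuming a plain Kafka consumer rather than a Spring-managed listener container. The property keys are standard consumer configuration names; the class name ManualCommitConsumerConfig and the chosen values are assumptions for illustration only.

```java
import java.util.Properties;

// Hypothetical helper: builds consumer settings for manual offset commits.
final class ManualCommitConsumerConfig {
    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // Offsets are committed by application code only, after processing succeeds.
        props.setProperty("enable.auto.commit", "false");
        // Allow more time between poll() calls before the consumer is considered
        // failed (assumed value: 10 minutes; the default is 5 minutes).
        props.setProperty("max.poll.interval.ms", "600000");
        // Smaller batches make it easier to finish processing within that window
        // (assumed value; the default is 500).
        props.setProperty("max.poll.records", "50");
        return props;
    }
}
```

With auto commit disabled, the application would call a commit method such as commitSync() on the consumer only after a record (or batch) and any related error handling have completed, so a crash before that point causes the records to be redelivered.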

Kafka : Failed to update metadata after 60000 ms with only one broker down

We have a Kafka producer configured as follows:
Replication factor is 3 and the total number of partitions is currently 108.
The rest of the properties are default.
This producer was running absolutely fine. Then, for some reason, one of the brokers went down, and our producer started logging:
"Failed to update metadata after 60000 ms". Nothing else was in the log. At intervals, a few requests were getting blocked, even though the producer was async.
The issue was resolved when the broker was up and running again.
What can be the reason for this? As I understand it, one broker going down should not affect the system as a whole.
Posting the answer for anyone who might face this issue:
The cause is an older version of the Kafka producer. Kafka producers take the bootstrap servers as a list. In older versions, producers try to connect to all of the servers in round-robin fashion when fetching metadata. So, if one of the brokers is down, the requests going to that server fail and this message appears.
Upgrade to a newer producer version.
Alternatively, reduce the metadata.fetch.timeout.ms setting. This ensures the main thread is not blocked for long and send fails sooner. The default value is 60000 ms. This is not needed in newer versions.
Note: the Kafka send method blocks until the producer is able to write to the buffer.
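In newer producer versions, the time send() may block is bounded by max.block.ms (default 60000 ms), which took over the role of metadata.fetch.timeout.ms. A minimal sketch of the corresponding settings follows; the class name FailFastProducerConfig and the example values are assumptions, not something from the original post.

```java
import java.util.Properties;

// Hypothetical helper: producer settings that make send() fail sooner
// when metadata cannot be fetched.
final class FailFastProducerConfig {
    static Properties build(String bootstrapServers, long maxBlockMs) {
        Properties props = new Properties();
        // List several brokers so metadata can still be fetched when one is down.
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Upper bound on how long send() may block waiting for metadata or buffer space.
        props.setProperty("max.block.ms", Long.toString(maxBlockMs));
        return props;
    }
}
```

For example, FailFastProducerConfig.build("broker1:9092,broker2:9092,broker3:9092", 5000L) would cap blocking at five seconds; the broker addresses are placeholders.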
I got the same error because I forgot to create the topic. Once I created the topic the issue was resolved.

Reproducing UnknownTopicOrPartitionException: This server does not host this topic-partition

We have encountered a few exceptions in our production environment:
UnknownTopicOrPartitionException: This server does not host this topic-partition
As per my analysis, one possible workaround is to increase the number of retries, since this is a retriable exception.
I am having difficulty reproducing this issue locally. I tried bringing down a broker while producing, but that fails with a TimeoutException instead.
I am looking for suggestions on how to reproduce this issue.
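To illustrate what "increase the number of retries for a retriable exception" means, here is a generic JDK-only sketch. The Kafka producer applies its own retries setting internally, so this is not how the client is implemented; RetrySketch, RetriableError and withRetries are hypothetical names, with RetriableError standing in for a retriable error such as UnknownTopicOrPartitionException.

```java
import java.util.function.Supplier;

final class RetrySketch {
    // Hypothetical stand-in for a retriable error.
    static class RetriableError extends RuntimeException {
        RetriableError(String message) { super(message); }
    }

    // Runs the action up to maxAttempts times, sleeping backoffMs between attempts.
    // Only RetriableError is retried; any other exception propagates immediately.
    static <T> T withRetries(Supplier<T> action, int maxAttempts, long backoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RetriableError last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RetriableError e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        // Restore the interrupt flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while retrying", ie);
                    }
                }
            }
        }
        // All attempts failed with a retriable error; surface the last one.
        throw last;
    }
}
```

A higher maxAttempts (or a longer backoff) gives a transient condition, such as metadata that has not yet propagated, more time to clear before the error reaches the caller.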
If you get this error in the logs during topic creation, there is an open issue for it:
KAFKA-6221 ReplicaFetcherThread throws UnknownTopicOrPartitionException on topic creation
At some point during batch topic creation, it is likely that UpdateMetadata requests were processed later than the FetchRequest, so the metadata cache was not updated in time.
That issue is about log messages that have no impact on cluster health.

Connection not cleaned up if client failure occurs abruptly without closing the resources

I am using hornetq-2.2.14 Final, and the connection-ttl configured in hornetq-jms.xml is 60000 ms. I have a publisher program that sends messages to a topic and a consumer program that consumes messages from the topic. My consumer program exited abruptly without closing its resources. I waited 1 minute, since the TTL is 60000 ms, but the server did not clean up the resources even after one minute. Can anyone help me resolve this, if it is a configuration issue?
Sometimes it can take as long as 2x the TTL, depending on how the failure happened. We recently made a fix on master to make sure cleanup always happens close to the configured TTL.