Kafka producer does not signal that all brokers are unreachable - apache-kafka

When all brokers/nodes of a cluster are unreachable, the error in the Kafka producer callback is just a generic "Topic XXX not present in metadata after 60000 ms".
When I activate the DEBUG log level, I can see that all attempts to deliver the message to any node are failing:
DEBUG org.apache.kafka.clients.NetworkClient - Initialize connection to node node2.url:443 (id: 2 rack: null) for sending metadata request
DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node node2.url:443 (id: 2 rack: null) using address node2.url:443/X.X.X.X:443
....
DEBUG org.apache.kafka.clients.NetworkClient - Disconnecting from node 2 due to socket connection setup timeout. The timeout value is 16024 ms.
DEBUG org.apache.kafka.clients.NetworkClient - Initialize connection to node node0.url:443 (id: 0 rack: null) for sending metadata request
DEBUG org.apache.kafka.clients.NetworkClient - Initiating connection to node node0.url:443 (id: 0 rack: null) using address node0.url:443/X.X.X.X:443
....
DEBUG org.apache.kafka.clients.NetworkClient - Disconnecting from node 0 due to socket connection setup timeout. The timeout value is 17408 ms.
and so on, until, after the delivery timeout expires, the send() callback gets the error:
ERROR my.kafka.SenderClass - Topic XXX not present in metadata after 60000 ms.
Unlike a wrong bootstrap URL, all nodes could be unreachable, for example because of wrong DNS entries.
How can the application understand that none of the nodes were reachable? This is traced only at DEBUG level and is not available to the producer send() callback.
Such error detail at the application level would speed up troubleshooting.
An error like this is usually signaled explicitly by standard SOAP/REST web service interfaces.

The producer only cares about the cluster Controller for bootstrapping and the leaders of the partitions it needs to write to (one of those leaders could be the Controller). That being said, it doesn't need to know about "all" brokers.
How can the application understand that all nodes were not reachable?
If you set acks=1 or acks=all, then the callback should know that at least one broker had the data written; if not, there was some error.
You can use an AdminClient outside of the producer client to describe the topic(s) and fetch metadata about the partition leaders, then use standard TCP socket requests from Java to check whether those advertised listeners are reachable.
FWIW, port 443 should ideally be reserved for HTTPS traffic, not Kafka. Kafka is not a REST/SOAP service.
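A minimal sketch of the TCP reachability probe the answer suggests. The broker addresses here are the hypothetical node0.url/node2.url hosts from the question's log; in a real application you would first obtain the advertised host:port pairs from AdminClient.describeCluster() or describeTopics() rather than hard-coding them.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class BrokerProbe {
    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers connection refused, connect timeout, and DNS failures.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical broker addresses taken from the question's DEBUG log.
        String[][] brokers = {{"node0.url", "443"}, {"node2.url", "443"}};
        boolean anyReachable = false;
        for (String[] b : brokers) {
            anyReachable |= isReachable(b[0], Integer.parseInt(b[1]), 3000);
        }
        if (!anyReachable) {
            // Surface a precise, application-level error instead of the
            // generic "Topic not present in metadata" timeout.
            System.err.println("All Kafka brokers unreachable");
        }
    }
}
```

Running such a probe when send() fails lets the application distinguish "cluster completely unreachable" from "topic genuinely missing".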

Related

Remote (WAN) Kafka client cannot write data to Kafka which is in the LAN

I'm trying to configure advertised.listeners to receive data from a remote host.
The producer runs on a remote host and sends data to Kafka, which runs in our LAN. There is also a port mapping: public_ip:9092 ---> 10.10.128.125:9792. Here, 9092 is the external port, which maps to 9792, the Kafka broker port.
Below is the configuration from the server.properties file:
listeners=INTERNAL://0.0.0.0:9792,EXTERNAL://0.0.0.0:9092
advertised.listeners=INTERNAL://10.10.128.125:9792,EXTERNAL://external_ip:9092
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL
The metadata sent from Kafka to the remote client contains the broker's internal IP and port; I can see this in the producer log:
DEBUG org.apache.kafka.clients.Metadata - [Producer clientId=producer-1] Updated cluster metadata updateVersion 2 to MetadataCache{clusterId='2YXDmEjfR1iP1R1pUDM6qw', nodes={1=10.10.128.125:9792 (id: 1 rack: null)}, partitions=[PartitionMetadata(error=LEADER_NOT_AVAILABLE, partition=Kafka-9, leader=Optional.empty, leaderEpoch=Optional[57], replicas=2, isr=2, offlineReplicas=2)}
08:23:43.835 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Initiating connection to node 10.10.128.125:9792 (id: 1 rack: null) using address /10.10.128.125
So the producer sends data from the remote host using the internal IP and port, and as a result I cannot receive the data.
Why does the producer receive metadata with the internal IP and port, even after configuring advertised.listeners?
Any advice would be very helpful
What is your bootstrap.servers property set to? You have to set bootstrap.servers to one of the brokers' external ip:port pairs.
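On the client side, that advice looks like the sketch below. String keys are used instead of the ProducerConfig constants so the example stays dependency-free, and external_ip is the same placeholder the question uses, not a real address.

```java
import java.util.Properties;

class ExternalClientConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        // Must match the EXTERNAL advertised listener, not the LAN address.
        // (external_ip is a placeholder from the question.)
        props.setProperty("bootstrap.servers", "external_ip:9092");
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
        // A KafkaProducer built from these props bootstraps via the external
        // port mapping, and the broker then returns the EXTERNAL advertised
        // listener in its metadata response.
    }
}
```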

Kafka consumer should fail on "Bootstrap broker disconnected"

When a Kafka consumer cannot access the bootstrap broker it indefinitely tries to reconnect with the following message:
WARN NetworkClient - [Consumer clientId=consumer-testGroup-1, groupId=testGroup] Connection to node -1 (localhost/127.0.0.1:9999) could not be established. Broker may not be available.
WARN NetworkClient - [Consumer clientId=consumer-testGroup-1, groupId=testGroup] Bootstrap broker localhost:9999 (id: -1 rack: null) disconnected
What I want is for the consumer to throw an exception and abort execution. In the docs I couldn't find a property to limit the retries.
Is there a recommended way to implement this behaviour, or a property I overlooked?
I am using the KafkaReceiver class from Project Reactor.

Kafka Streams applications run for a few days, then "WARN Broker may not be available"

We are running around 30 Kafka Streams applications on Kubernetes. They usually run fine for a few days, but after some random time some of the applications enter a state where they log:
2021-11-05 09:44:01,183 [1-producer] WARN o.a.k.c.NetworkClient - [Producer clientId=appId-cd74cfbc-4c6a-49b4-823d-ac21beae2c27-StreamThread-1-producer] Connection to node -1 (********) could not be established. Broker may not be available.
2021-11-05 09:44:01,183 [amThread-1] WARN o.a.k.c.NetworkClient - [Consumer clientId=appId-cd74cfbc-4c6a-49b4-823d-ac21beae2c27-StreamThread-1-consumer, groupId=groupId] Connection to node -2 (******) could not be established. Broker may not be available.
2021-11-05 09:44:01,183 [amThread-1] WARN o.a.k.c.NetworkClient - [Consumer clientId=appId-cd74cfbc-4c6a-49b4-823d-ac21beae2c27-StreamThread-1-consumer, groupId=groupId] Bootstrap broker ****** (id: -2 rack: null) disconnected
2021-11-05 09:44:01,183 [1-producer] WARN o.a.k.c.NetworkClient - [Producer clientId=cd74cfbc-4c6a-49b4-823d-ac21beae2c27-StreamThread-1-producer] Bootstrap broker ****** (id: -1 rack: null)
These warnings keep being spammed to the log roughly every 2 minutes; last time this went on for 2 days until we realized some apps had stopped processing (I know, we are working on setting up monitoring). The applications won't recover until manually restarted, and since they never throw an error, nothing gets propagated to the liveness probe.
Does anyone know if there are configs we can add to the kafka-streams library so that the applications stop spamming warnings after a certain number of attempts and just throw an error, so we can fail and automatically restart?
Also worth mentioning: the brokers are up and running, and other Streams applications are up and processing against the same Kafka cluster.
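One commonly suggested approach (a sketch, not a built-in Streams config) is to feed the Streams lifecycle state into the liveness probe via KafkaStreams.setStateListener. The flag-keeping part is shown below with plain strings so it stands alone; in real code the listener receives KafkaStreams.State values, as the comment shows.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class StreamsHealth {
    private final AtomicBoolean healthy = new AtomicBoolean(true);

    // In real code, wire this up with:
    //   streams.setStateListener((newState, oldState) ->
    //       health.onStateChange(newState.name(), oldState.name()));
    void onStateChange(String newState, String oldState) {
        // ERROR / NOT_RUNNING / PENDING_SHUTDOWN mean the app will not
        // recover on its own; mark it unhealthy so the Kubernetes liveness
        // probe fails and the pod gets restarted.
        if (newState.equals("ERROR") || newState.equals("NOT_RUNNING")
                || newState.equals("PENDING_SHUTDOWN")) {
            healthy.set(false);
        }
    }

    // Served by whatever HTTP endpoint the liveness probe hits.
    boolean isHealthy() {
        return healthy.get();
    }
}
```

This does not stop the WARN spam, but it turns the silent "stuck" state into a restartable failure, which is what the liveness probe needs.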

How to fix the Java Kafka producer error "Received invalid metadata error in produce request on partition" and OutOfMemoryError when the broker is down

I have been creating a Kafka producer example using Java. I have been sending normal data, which is just "Test" + Integer as the value, to Kafka. I used the properties below; after I started the producer client, while messages were on the way, I killed the broker and suddenly received the error message below instead of the producer retrying.
I am using 3 brokers and a topic with 3 partitions, replication factor 3, and no min.insync.replicas.
Below are the properties configured:
config.put(ProducerConfig.ACKS_CONFIG, "all");
config.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
config.put(CommonClientConfigs.RETRIES_CONFIG, 60);
config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG ,10000);
config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG ,30000);
config.put(ProducerConfig.MAX_BLOCK_MS_CONFIG ,10000);
config.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG , 1048576);
config.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
config.put(ProducerConfig.LINGER_MS_CONFIG, 0);
config.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 1073741824); // 1GB
and the result when I killed all my brokers, or sometimes one of the brokers, is as below.
Error:
WARN org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=producer-1] Got error produce response with correlation id 124 on topic-partition testing001-0, retrying (59 attempts left). Error: NETWORK_EXCEPTION
27791 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=producer-1] Received invalid metadata error in produce request on partition testing001-0 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
28748 [kafka-producer-network-thread | producer-1] ERROR org.apache.kafka.common.utils.KafkaThread - Uncaught exception in thread 'kafka-producer-network-thread | producer-1':
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(Unknown Source)
at java.nio.ByteBuffer.allocate(Unknown Source)
at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:112)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:560)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:496)
at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Unknown Source)
I assume you are testing the producer. When a producer connects to the Kafka cluster, you pass all broker IPs and ports as a comma-separated string. In your case there are three brokers. When the producer connects to the cluster, as part of initialization it receives the cluster metadata. Assume your producer writes to a single topic. The cluster maintains a leader among the brokers for each partition of that topic. After identifying the leader, your producer communicates only with that leader for as long as it is alive.
In your testing scenario, you are deliberately killing broker instances. When that happens, the cluster needs to elect a new leader for your topic's partitions, and the new metadata has to be passed to your producer. If the metadata changes frequently (in your case you may kill another broker in the meantime), the producer may be working with stale metadata and receive the invalid-metadata error.
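On the OutOfMemoryError specifically: the 1 GB buffer.memory in the question can exceed the JVM heap once the producer starts buffering records during a broker outage. A hedged sketch of a more conservative configuration follows (string keys instead of the ProducerConfig constants so it stays dependency-free; the values are illustrative, not recommendations):

```java
import java.util.Properties;

class BoundedProducerConfig {
    static Properties props() {
        Properties p = new Properties();
        p.setProperty("acks", "all");
        p.setProperty("enable.idempotence", "true");
        // Keep the record buffer well under the JVM heap; 32 MB is the
        // Kafka default, versus the 1 GB used in the question.
        p.setProperty("buffer.memory", String.valueOf(32 * 1024 * 1024));
        // Fail the send() future with an exception after this overall
        // budget instead of retrying while buffered records pile up.
        p.setProperty("delivery.timeout.ms", "120000");
        // Bound how long send() itself may block when the buffer is full.
        p.setProperty("max.block.ms", "10000");
        return p;
    }
}
```

With a bounded buffer, a prolonged outage surfaces as a TimeoutException in the send() callback rather than as heap exhaustion on the network thread.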

Kafka client can't receive messages

I have Kafka and ZooKeeper set up on a remote machine. On that machine I can see the below working, using the test method from the official website.
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic listings-incoming
This is a message
This is another message
> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic listings-incoming --from-beginning
This is a message
This is another message
but when I use my local consumer script it is not working:
bin/kafka-console-consumer.sh --bootstrap-server X.X.X.X:9092 --topic listings-incoming --from-beginning --consumer-property group.id=group2
I haven't seen any messages showing up, but what does show up is:
[2017-08-11 14:39:56,425] WARN Auto-commit of offsets {listings-incoming-4=OffsetAndMetadata{offset=0, metadata=''}, listings-incoming-2=OffsetAndMetadata{offset=0, metadata=''}, listings-incoming-3=OffsetAndMetadata{offset=0, metadata=''}, listings-incoming-0=OffsetAndMetadata{offset=0, metadata=''}, listings-incoming-1=OffsetAndMetadata{offset=0, metadata=''}} failed for group group1: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing the session timeout or by reducing the maximum size of batches returned in poll() with max.poll.records. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
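The rebalance warning itself points at poll-loop tuning. If slow processing (rather than connectivity) were the real issue, the knobs the warning names would look like this (string keys, illustrative values; as the update below shows, the actual problem in this thread was advertised.listeners):

```java
import java.util.Properties;

class PollTuning {
    static Properties consumerProps() {
        Properties p = new Properties();
        p.setProperty("group.id", "group2");
        // Allow more time between poll() calls before the group
        // coordinator kicks the consumer out (default 300000 ms)...
        p.setProperty("max.poll.interval.ms", "600000");
        // ...or hand back fewer records per poll() so each batch
        // finishes processing sooner (default 500).
        p.setProperty("max.poll.records", "100");
        return p;
    }
}
```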
*****************update**********************
My ZooKeeper and Kafka are running on the same machine. Right now my advertised.listeners configuration is this:
advertised.listeners=PLAINTEXT://the.machine.ip.address:9092
I tried to change it to:
advertised.listeners=PLAINTEXT://my.client.ip.address:9092
and then ran the client-side consumer script, which gives this error:
[2017-08-11 15:49:01,591] WARN Error while fetching metadata with correlation id 3 : {listings-incoming=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2017-08-11 15:49:22,106] WARN Bootstrap broker 10.161.128.238:9092 disconnected (org.apache.kafka.clients.NetworkClient)
[2017-08-11 15:49:22,232] WARN Error while fetching metadata with correlation id 7 : {listings-incoming=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2017-08-11 15:49:22,340] WARN Error while fetching metadata with correlation id 8 : {listings-incoming=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
[2017-08-11 15:49:40,453] WARN Bootstrap broker 10.161.128.238:9092 disconnected (org.apache.kafka.clients.NetworkClient)
[2017-08-11 15:49:40,531] WARN Error while fetching metadata with correlation id 12 : {listings-incoming=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
You probably have not configured your advertised.listeners properly in the brokers server.properties file.
From https://kafka.apache.org/documentation/
advertised.listeners: Listeners to publish to ZooKeeper for clients to use, if different than the listeners above. In IaaS environments, this may need to be different from the interface to which the broker binds. If this is not set, the value for listeners will be used.
and in the same documentation
listeners: Listener List - Comma-separated list of URIs we will listen on and the listener names. If the listener name is not a security protocol, listener.security.protocol.map must also be set. Specify hostname as 0.0.0.0 to bind to all interfaces. Leave hostname empty to bind to default interface. Examples of legal listener lists:
PLAINTEXT://myhost:9092,SSL://:9091
CLIENT://0.0.0.0:9092,REPLICATION://localhost:9093
So if advertised.listeners is not set and listeners is just listening on localhost:9092, 127.0.0.1:9092, or 0.0.0.0:9092, then clients will be told to connect to localhost when they make a metadata request to the bootstrap server. That works when the client is actually running on the same machine as the broker, but it fails when you connect remotely.
You should set advertised.listeners to be a fully qualified domain name or public IP address for the host that the broker is running on.
For example
advertised.listeners=PLAINTEXT://kafkabrokerhostname.confluent.io:9092
or
advertised.listeners=PLAINTEXT://192.168.1.101:9092