Kafka - org.apache.kafka.common.errors.NetworkException - apache-kafka

I have Kafka client code that connects to Kafka brokers (server 0.10.1, client 0.10.2). The code consumes 2 topics with 2 different consumer groups and also contains a producer. Once in a while (once in 2 days, once in 5 days, ...) the producer code throws a NetworkException. In the logs we see the consumer group (Re)joining info for both consumer groups, followed by the NetworkException from the producer's future.get() call. I am not sure why we are getting this error.
Code :-
final Future<RecordMetadata> futureResponse =
producer.send(new ProducerRecord<>("ping_topic", "ping"));
futureResponse.get();
Exception :-
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:70)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:57)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:25)
The Kafka API documentation defines NetworkException as:
"A misc. network-related IOException occurred when making a request.
This could be because the client's metadata is out of date and it is
making a request to a node that is now dead."
Thanks
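For what it's worth, the NetworkException in the trace above arrives wrapped in an ExecutionException, so the caller has to inspect getCause() before deciding whether to try again. Below is a minimal sketch using only the JDK. SendOutcome and await are hypothetical names, and the retriable predicate is an assumption standing in for a check such as `cause instanceof org.apache.kafka.common.errors.RetriableException` in a real client.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical helper, not part of the Kafka client API.
final class SendOutcome {
    enum Kind { OK, RETRIABLE, FATAL }

    // Waits for the send future and classifies the result.
    // 'retriable' decides whether the unwrapped cause is worth retrying.
    static Kind await(Future<?> future, Predicate<Throwable> retriable) {
        try {
            future.get();
            return Kind.OK;
        } catch (ExecutionException e) {
            // The real failure (e.g. NetworkException) is the cause, not e itself.
            Throwable cause = e.getCause();
            return retriable.test(cause) ? Kind.RETRIABLE : Kind.FATAL;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Kind.FATAL;
        }
    }

    public static void main(String[] args) {
        // Stand-in for the future returned by producer.send(...).
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new java.io.IOException("server disconnected"));
        // Treat I/O problems as retriable in this toy example.
        System.out.println(await(failed, c -> c instanceof java.io.IOException));
    }
}
```

With a real producer the future would come from producer.send(...), and a RETRIABLE result could be followed by another send after a short pause.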

I had the same error while testing a Kafka consumer. I was using a sender template to produce the test messages.
In the consumer configuration I additionally set the following properties:
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 15000);
After sending the message I added a thread sleep:
ListenableFuture<SendResult<String, String>> future =
senderTemplate.send(MyConsumer.TOPIC_NAME, jsonPayload);
Thread.sleep(10000);
It was necessary to make the test work, but maybe not suitable for your case.
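If the sleep is only there to give the consumer time to receive the message, a latch with a timeout is a common alternative to a fixed pause. This is a minimal JDK-only sketch. ReceiptLatch is a hypothetical helper, and it assumes the test can call onMessage from the consumer's listener.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical test helper: the consumer's listener calls onMessage(),
// and the test thread calls await() instead of Thread.sleep().
final class ReceiptLatch {
    private final CountDownLatch latch;

    ReceiptLatch(int expectedMessages) {
        this.latch = new CountDownLatch(expectedMessages);
    }

    // Call this from the message listener for every record received.
    void onMessage(String payload) {
        latch.countDown();
    }

    // Returns true as soon as all expected messages arrived, false on timeout.
    boolean await(long timeout, TimeUnit unit) {
        try {
            return latch.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }
}
```

The test then waits with something like latch.await(10, TimeUnit.SECONDS) and fails if it returns false, so a fast run does not pay the full 10 seconds.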

Related

Kafka Producer Retry and Failed record handling

My requirement is as follows.
Apart from broker-metadata-related errors, I tried to simulate a RecordTooLargeException while sending a message to a Kafka topic.
In the producer configuration I set acks: all and retries: 5.
I also use the addCallback method when sending the message.
I received org.apache.kafka.common.errors.RecordTooLargeException: The message is 2000103 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
However, I did not see any retries (5 expected) in the log.
What I need is: retry 5 times, then mark the record as a permanent failure and hand it to the callback handler so the failed record can be reprocessed (e.g. sent to a DLT or a DB).
How can I achieve this kind of retry and handling?
It's simple. The Kafka producer API does not retry on RecordTooLargeException, because it is a non-retriable exception. If you still want to retry regardless, you can detect that exception when the send fails and retry from your own catch block as many times as you want.
KafkaProducer has two types of errors. Retriable errors are those that can be resolved by sending the message again. For example, a connection error can be resolved because the connection may get reestablished. A “not leader for partition” error can be resolved when a new leader is elected for the partition and the client metadata is refreshed. KafkaProducer can be configured to retry those errors automatically, so the application code will get retriable exceptions only when the number of retries was exhausted and the error was not resolved. Some errors will not be resolved by retrying — for example, “Message size too large.” In those cases, KafkaProducer will not attempt a retry and will return the exception immediately.
-- Kafka: The Definitive Guide 2nd Edition, Chapter 3
RecordTooLargeException is a non-retriable exception: retrying makes no sense as long as the max.request.size configuration does not change. Therefore, the Kafka producer will not attempt a retry and will return the exception immediately. The callback handler is then triggered and can take care of further reprocessing.
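Put differently, by the time the callback sees a non-null exception the producer has already given up on that record, either because the error is non-retriable or because the retries are exhausted. A minimal JDK-only sketch of routing such records follows. FailureRouter is a hypothetical class, and its in-memory queue is an assumption standing in for a DLT producer or a DB writer.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: collects records the producer reported as failed.
final class FailureRouter<V> {
    // A failed record together with the reason it failed.
    static final class Failed<V> {
        final V record;
        final Exception error;

        Failed(V record, Exception error) {
            this.record = record;
            this.error = error;
        }
    }

    // Thread-safe, because producer callbacks run on a different thread
    // than the one that called send().
    private final Queue<Failed<V>> deadLetters = new ConcurrentLinkedQueue<>();

    // Mirrors the shape of a send callback: error is null on success.
    void onCompletion(V record, Exception error) {
        if (error != null) {
            deadLetters.add(new Failed<>(record, error));
        }
    }

    Queue<Failed<V>> deadLetters() {
        return deadLetters;
    }
}
```

In a real application onCompletion would be invoked from the send callback, and draining deadLetters would publish to the dead-letter topic or write to the database.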

Kafka producer dealing with lost connection to broker

With a producer configuration like below, I am creating a Singleton producer that is used throughout the application:
properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka.consul1:9092,kafka.consul2:9092");
properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
properties.setProperty(ProducerConfig.ACKS_CONFIG, "1");
I am connected to a k8s hosted Kafka cluster. The broker's advertised.listeners is configured to return me the IP addresses and not host names. While normally everything works as expected, the problem occurs when Kafka is restarted, sometimes the IP address changes. Since the producer only knows about the older IP it keeps trying to connect to that host to send messages and none of the messages go through.
I observe that an org.apache.kafka.common.errors.TimeoutException is thrown when the send fails. Currently the messages are sent asynchronously:
producer.send(data,
(RecordMetadata recordMetadata, Exception e) -> {
if (e != null) {
LOGGER.error("Exception while sending message to kafka", e);
}
});
How should the TimeoutException be handled now? Given that the producer is shared across the application, closing and recreating it in the callback does not sound right.
According to the JavaDocs on the Callback interface, TimeoutException is a retriable exception that can be handled by increasing the number of retries of the producer.
In the Kafka documentation you find details on the retries configuration:
retries (Default 0): Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first.
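As a rough illustration of the quoted documentation, the settings could be combined as below. This is a sketch only: the retry count and backoff values are arbitrary assumptions, OrderedRetryConfig is a hypothetical class name, and the config keys are passed as plain strings so the snippet needs nothing beyond the JDK.

```java
import java.util.Properties;

// Hypothetical helper that builds producer settings with retries enabled
// while keeping per-partition ordering.
final class OrderedRetryConfig {
    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("acks", "1");
        // Resend on transient errors; 5 is an illustrative value.
        props.setProperty("retries", "5");
        // Pause between attempts; 1000 ms is an illustrative value.
        props.setProperty("retry.backoff.ms", "1000");
        // Only one in-flight request per connection, so a retried batch
        // cannot be overtaken by a later one.
        props.setProperty("max.in.flight.requests.per.connection", "1");
        return props;
    }
}
```

The resulting Properties object can be passed to the KafkaProducer constructor. Whether retries alone recover from a changed broker IP depends on the client being able to refresh its metadata, which this sketch does not address.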

Kafka Producer: Got error produce response with correlation NETWORK_EXCEPTION

We are running Kafka in distributed mode across 2 servers.
I'm sending messages to Kafka through the Java SDK to a queue that has replication factor 2 and 1 partition.
We are running in async mode.
I don't see anything abnormal in the Kafka logs.
Can anyone help find out what the cause could be?
Properties props = new Properties();
props.put("bootstrap.servers", serverAdress);
props.put("acks", "all");
props.put("retries", "1");
props.put("linger.ms",0);
props.put("buffer.memory",10240000);
props.put("max.request.size", 1024000);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, Object> producer = new org.apache.kafka.clients.producer.KafkaProducer<>(props);
Exception trace:
-2017-08-15T02:36:29,148 [kafka-producer-network-thread | producer-1] WARN producer.internals.Sender - Got error produce response with correlation id 353736 on topic-partition BPA_BinLogQ-0, retrying (0 attempts left). Error: NETWORK_EXCEPTION
You are getting a NETWORK_EXCEPTION, which tells you that something is wrong with the network connection to the Kafka broker you were producing to. Either the broker shut down or the TCP connection was closed for some reason.
A quick code dive shows the most probable cause: a lost connection to the upstream broker, which causes the delivery to fail internally inside the Sender. You might want to enable trace logging in Sender to confirm that:
if (response.wasDisconnected()) {
log.trace("Cancelled request with header {} due to node {} being disconnected",
requestHeader, response.destination());
for (ProducerBatch batch : batches.values())
completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION, String.format("Disconnected from node %s", response.destination())),
correlationId, now);
}
Now that the batch has completed unsuccessfully, it gets retried, but from the logs you attached it looks like this is your last retry (0 attempts left); if that attempt fails as well, the error propagates to your level:
if (canRetry(batch, response, now)) {
log.warn(
"Got error produce response with correlation id {} on topic-partition {}, retrying ({} attempts left). Error: {}",
....
reenqueueBatch(batch, now);
}
So the ideas are:
investigate your network connectivity; unfortunately this might mean tracing at least on the client side (especially NetworkClient, which does all the upstream broker management) to see if there is any connection loss;
increase the producer's retries value (newer versions of Kafka already default it to Integer.MAX_VALUE).
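To make the "attempts left" counter in the log line easier to read, here is a simplified model of the bookkeeping. It is not the actual Sender code. It assumes the counter is the configured retries minus the attempts made so far minus one, which is consistent with the asker's retries=1 and the logged 0. RetryCountdown and simulate are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the retry bookkeeping, NOT the actual Sender code.
final class RetryCountdown {
    // Simulates a batch that fails 'failures' times in a row with a retriable
    // error and returns the events an application would observe.
    static List<String> simulate(int retries, int failures) {
        List<String> events = new ArrayList<>();
        int attempts = 0; // retries already scheduled for this batch
        for (int i = 0; i < failures; i++) {
            if (attempts < retries) {
                // Assumed arithmetic behind the "attempts left" placeholder.
                events.add("retrying (" + (retries - attempts - 1) + " attempts left)");
                attempts++;
            } else {
                events.add("error propagated to callback");
                return events;
            }
        }
        events.add("sent");
        return events;
    }

    public static void main(String[] args) {
        System.out.println(simulate(1, 1)); // one failure, then success
        System.out.println(simulate(1, 2)); // two failures in a row
    }
}
```

Under this model, retries=1 logs "retrying (0 attempts left)" on the first failure, and only a second consecutive failure reaches the application. A larger retries value gives the connection more time to recover before that happens.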

Kafka Producer is not retrying when Broker is Down

I have set up Kafka version 0.9 with a basic configuration:
1 broker, 1 topic and 1 partition.
Below are the producer configurations that I added to enable retries from the producer.
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.RETRIES_CONFIG, 5);
props.put(ProducerConfig.RECONNECT_BACKOFF_MS_CONFIG, 500);
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 500);
props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 50);
I understand from the documentation that
"Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error."
Both my broker and ZooKeeper are down, and the retry operation is not working.
ERROR o.s.k.s.LoggingProducerListener - Exception thrown when sending a message to topic TestTopic1|
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 500 ms.
I need to know if I am missing anything here for the retry to work.
Resend (retry) works only if you have a connection to the broker and something goes wrong while sending a message.
So if your broker is dead, there is no point in sending a message at all, because there is no connection. That is what the exception is telling you.
I think retries should work anyway, even if the broker is down. This is the whole reason to have retries in the first place. Could be a temporary network issue after all.
There is a bug in the Kafka 0.9.0.1 producer which causes retries not to work. See here.
Fixed in 0.9.0.2 (which is not released yet) and 0.10. I'd upgrade the broker to 0.10 and try again.
As @artem answered, the Kafka producer config is not designed to retry when the broker is down. It only retries on transient errors, which is pretty much useless to be honest. It beats me why spring-kafka did not take care of it.
Anyway, to solve this I handled it with the @Retry config in Spring Boot. See this SO answer for details: https://stackoverflow.com/a/65248428/6621377
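The same idea can be sketched without any framework: wrap the "send and wait for the result" step in an application-level retry loop with a pause between attempts. This is a minimal JDK-only sketch. SendRetry and withRetry are hypothetical names, and the supplier is assumed to throw a RuntimeException when the send fails (for example by blocking on the future and rethrowing its cause).

```java
import java.util.function.Supplier;

// Hypothetical application-level retry, independent of the producer's
// own 'retries' setting.
final class SendRetry {
    // Runs 'attempt' up to maxAttempts times, sleeping backoffMs between
    // tries. Returns the first successful result or rethrows the last failure.
    static <T> T withRetry(Supplier<T> attempt, int maxAttempts, long backoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and maybe try again
                if (i < maxAttempts) {
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw last; // stop retrying when interrupted
                    }
                }
            }
        }
        throw last;
    }
}
```

A call such as SendRetry.withRetry(() -> sendAndWait(record), 5, 500) would then keep trying while the broker comes back, where sendAndWait is whatever blocking send helper the application already has.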

Apache KafkaProducer throwing TimeoutException when sending a message

I have a KafkaProducer that has suddenly started throwing TimeoutExceptions when I try to send a message. Even though I have set the max.block.ms property to 60000ms, and the test blocks for 60s, the error message I am getting always has a time of less than 200ms. The only time it actually shows 60000ms is if I run it in debug mode and step through the waitOnMetadata method manually.
error example:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 101 ms.
Does anyone know why it would suddenly not be able to update the metadata? I know it's not my implementation of the producer that is faulty, as not only have I not changed it since it was working, if I run my tests on another server they all pass. What server side reasons could there be for this? Should I restart my brokers? And why would the timeout message show an incorrect time if I just let it run?
Producer setup:
val props = new Properties()
props.put("bootstrap.servers", getBootstrapServersFor(datacenter.mesosLocal))
props.put("batch.size","0")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("max.block.ms","60000")
new KafkaProducer[String,String](props)
I tried to use the console producer to see if I could send messages and I got a lot of WARN Error while fetching metadata with correlation id 0 : {metadata-1=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient) message back. After stopping and restarting the broker I was then able to send and consume messages again.