Apache Kafka shows a lot of WARN logs: "Unexpected error" - apache-kafka

I am running a Kafka cluster with 2 brokers.
But I keep getting WARN logs.
I checked all my systems and there was no host using IP 10.8.7.1.
By the way, there were more IPs that looked like they came from ZooKeeper or a broker?
If I shut down one of the Kafka brokers, there are fewer WARN logs.
I am not familiar with Kafka and ZooKeeper; I am just getting started and studying them.
Any ideas?
Kafka version: 1.0.1
The WARN log looks like the one below (I get this kind of log about every 10 seconds):
[2018-04-19 09:13:08,342] WARN [SocketServer brokerId=0] Unexpected error from /10.8.7.1; closing connection (org.apache.kafka.common.network.Selector)
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 369295616 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:235)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:196)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:545)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:483)
at org.apache.kafka.common.network.Selector.poll(Selector.java:412)
at kafka.network.Processor.poll(SocketServer.scala:551)
at kafka.network.Processor.run(SocketServer.scala:468)
at java.lang.Thread.run(Thread.java:748)

One possible cause is that a Kafka producer on 10.8.7.1 is attempting to send about 0.37 GB of data in a single batch instead of streaming it. You may have to track down that producer and see what is going on.
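If that is the case, one way to rule the producer out is to cap its request size explicitly. Here is a minimal sketch (the bootstrap address and sizes are placeholders, not taken from the question); it keeps each request well below the broker's 104857600-byte limit shown in the log:
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties config = new Properties();
// Placeholder broker address; point this at the cluster the suspect producer uses.
config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-host:9092");
// Keep each request well below the 104857600-byte broker-side limit from the log,
// so large payloads go out as many small batches rather than one huge request.
config.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1048576); // 1 MB per request
config.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);         // 16 KB batches (the default)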
Hope this helps.

Related

Kafka: Producer threads get stuck

I have an Apache NiFi workflow that streams data into Kafka. My Kafka cluster is made up of 5 nodes and uses SSL for encryption.
When there is a lot of data going through, my Kafka producer (PublishKafkaRecord) freezes and stops working. I have to restart the processor, and I am getting thread errors.
I am using Confluent Kafka 5.3.1.
I am seeing these errors in the Kafka logs:
ERROR Uncaught exception in scheduled task 'transactionalID-expiration' (kafka.utils.KafkaScheduler)
Retrying leaderEpoch request for partitions XXX-0 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
Could not find offset index file corresponding to log file XXX/*.log recovering segment and rebuilding index files (kafka.log.Log)
ERROR when handing request: .... __transaction_state
ERROR TransactionMetadata (... ) failed: this should not happen (kafka.coordinator.transaction.TransactionMetadata)
I cannot pinpoint the actual error.
How can I fix threads being stuck in Kafka?

Wrong Kafka consumer_offset

I am currently using the Confluent Platform community license. I started ZooKeeper, Kafka and the schema-registry - all are used in local mode. However, when starting the schema-registry for the first time, 50 messages are sent and stored inside the __consumer_offsets topic (__consumer_offsets-0 to __consumer_offsets-49). Those messages are stored in the kafka-logs, and when I try to start the services again, it fails. To be more precise: ZooKeeper works but Kafka fails with the error:
"ERROR Shutdown broker because all log dirs have failed".
As suggested in some other posts, I deleted the log.dirs directory referenced in the zookeeper.properties file and the log.dirs directory referenced in the server.properties file. After doing this I can start Kafka again without any error, but the 50 messages are stored in __consumer_offsets again when starting the schema-registry, and after stopping Kafka and trying to start it again it fails with the same error.
Any help is greatly appreciated. :)
EDIT:
Above that error there is another error saying:
"ERROR Failed to clean up log for _schemas-0 in dir /mnt/c/Users/Username/Desktop/Big_Data/confluent-6.0.0/kafka-logs due to IOException (kafka.server.LogDirFailureChannel) java.io.IOException: Invalid argument"
and also two warnings:
"WARN [ReplicaManager broker=0] Stopping serving replicas in dir /mnt/c/Users/Username/Desktop/Big_Data/confluent-6.0.0/kafka-logs (kafka.server.ReplicaManager)"
and
"WARN [ReplicaManager broker=0] Broker 0 stopped fetcher for partitions __consumer_offsets-22, ... (all of the 50 offsets are then listed)"

Received invalid metadata error in produce request on partition topic-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException

We use a Spring Kafka Streams producer to produce data to a Kafka topic. When we did a resiliency test, we got the below error.
2020-08-28 16:18:35.536 WARN [,,,] 26 --- [ad | producer-3] o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-3] Received invalid metadata error in produce request on partition topic1-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now
2020-08-28 16:18:35.536 WARN [,,,] 26 --- [ad | producer-3] o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-3] Got error produce response with correlation id 80187 on topic-partition topic1-0, retrying (4 attempts left). Error: NOT_LEADER_FOR_PARTITION
[Producer clientId=producer-3] Received invalid metadata error in produce request on partition topic1-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now.
The WARN should appear only while we are running the resiliency test (broker down/up testing), but these warnings keep happening even after the test period, and only for that particular partition (here topic1-0). All the other partitions are working fine.
This is the producer config we have:
spring.cloud.stream.kafka.binder.requiredAcks=all
spring.cloud.stream.kafka.binder.configuration.retries=5
spring.cloud.stream.kafka.binder.configuration.metadata.max.age.ms=3000
spring.cloud.stream.kafka.binder.configuration.max.in.flight.requests.per.connection=1
spring.cloud.stream.kafka.binder.configuration.retry.backoff.ms=10000
We have retry config too, and it is retrying to get the proper metadata, as you can see in the log above, but we keep getting the same warning for that particular partition. Our Kafka team is also analyzing this issue. I checked Google for a solution but could not find anything useful.
Is there any config or anything else missing?
Please help me.
Thanks in advance.
This error comes when Kafka is down. Restarting Kafka worked for me! :)

Unable to read additional data from client session id

We have a Hadoop cluster with 3 Kafka machines and 3 ZooKeeper servers.
Hadoop version - 2.6.4 (Hortonworks)
Under the ZooKeeper logs (/var/log/zookeeper) we saw a million WARN messages like:
2019-06-26 10:48:45,675 [myid:1] - WARN [NIOServerCxn.Factory 0.0.0.0/0.0.0.0:2181:NIOServerCnxn#357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x16b8e15a80ca681, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:745)
What is the meaning of these messages:
caught end of stream exception EndOfStreamException
Unable to read additional data from client sessionid
The real problem is with the Kafka machines: we faced an issue where leaders were not balanced, and Kafka topic partitions eventually ended up with leader -1.
Try enabling client retries; it solved my issue. Somehow I had retries set to zero.
For example:
hbase.client.retries.number
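As a minimal sketch, assuming the disconnecting client is an HBase client (the property above is an HBase setting) and that 5 retries is acceptable in your environment, the value can be set programmatically like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Give the client a non-zero retry budget so dropped connections are retried
// instead of failing on the first attempt.
Configuration conf = HBaseConfiguration.create();
conf.setInt("hbase.client.retries.number", 5); // example value; the key point is that it is not zero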

How to fix the Java Kafka Producer error "Received invalid metadata error in produce request on partition" and Out of Memory when broker is down

I have been creating a Kafka producer example using Java. I have been sending normal data, which is just "Test" + Integer as the value, to Kafka. I used the properties below, and after I started the producer client, while messages were on the way, I killed the broker and suddenly received the error message below instead of the producer retrying.
I am using 3 brokers and a topic with 3 partitions, replication factor 3, and no min.insync.replicas.
Below are the properties configured:
config.put(ProducerConfig.ACKS_CONFIG, "all");
config.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
config.put(CommonClientConfigs.RETRIES_CONFIG, 60);
config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 10000);
config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000);
config.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 10000);
config.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 1048576);
config.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);
config.put(ProducerConfig.LINGER_MS_CONFIG, 0);
config.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 1073741824); // 1GB
And the result when I kill all my brokers, or sometimes just one of them, is as below.
Error:
WARN org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=producer-1] Got error produce response with correlation id 124 on topic-partition testing001-0, retrying (59 attempts left). Error: NETWORK_EXCEPTION
27791 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=producer-1] Received invalid metadata error in produce request on partition testing001-0 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
28748 [kafka-producer-network-thread | producer-1] ERROR org.apache.kafka.common.utils.KafkaThread - Uncaught exception in thread 'kafka-producer-network-thread | producer-1':
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(Unknown Source)
at java.nio.ByteBuffer.allocate(Unknown Source)
at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:112)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:335)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:296)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:560)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:496)
at org.apache.kafka.common.network.Selector.poll(Selector.java:425)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:510)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Unknown Source)
I assume you are testing the producer. When a producer connects to the Kafka cluster, you pass all broker IPs and ports as a comma-separated string. In your case there are three brokers. When the producer tries to connect to the cluster, as part of initialization the cluster controller responds with the cluster metadata. Assume your producer only publishes messages to a single topic; the cluster maintains a leader among the brokers for each topic partition. After identifying the leader for the topic, your producer only communicates with that leader while it is alive.
In your testing scenario, you are deliberately killing broker instances. When that happens, the Kafka cluster needs to identify a new leader for your topic, and the controller has to pass the new metadata to your producer. If the metadata changes quite frequently (in your case, you may kill another broker in the meantime), the producer may receive invalid metadata.
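A minimal sketch of that idea (the broker addresses are placeholders, and the retry settings are reused from the question) is to list every broker in bootstrap.servers so a metadata refresh can still reach a surviving broker after the leader is killed:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

Properties config = new Properties();
// Placeholder addresses: the comma-separated list of all broker IPs and ports.
config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
config.put(ProducerConfig.ACKS_CONFIG, "all");
config.put(ProducerConfig.RETRIES_CONFIG, 60);             // keep retrying while a new leader is elected
config.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 10000); // wait between retries, as in the question
KafkaProducer<String, String> producer =
        new KafkaProducer<>(config, new StringSerializer(), new StringSerializer());
This does not change the explanation above; it just spells out the comma-separated broker list and retry budget it refers to.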