I have an Apache NiFi workflow that streams data into Kafka. My Kafka cluster is made up of 5 nodes and uses SSL for encryption.
When a lot of data is going through, my Kafka producer (PublishKafkaRecord) freezes and stops working. I have to restart the processor, and I am getting thread errors.
I am using Confluent Kafka 5.3.1.
I am seeing these errors in the Kafka logs:
ERROR Uncaught exception in scheduled task 'transactionalID-expiration' (kafka.utils.KafkaScheduler)
Retrying leaderEpoch request for partitions XXX-0 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
Could not find offset index file corresponding to log file XXX/*.log recovering segment and rebuilding index files (kafka.log.Log)
ERROR when handling request: .... __transaction_state
ERROR TransactionMetadata (... ) failed: this should not happen (kafka.coordinator.transaction.TransactionMetadata)
I cannot pinpoint the actual error.
How can I fix the threads getting stuck in Kafka?
Related
I noticed the following message in the Kafka broker logs:
Partition <topic name> marked as failed (kafka.server.ReplicaFetcherThread)
My question is: how and when does Kafka mark a partition as failed?
I need to monitor the logs and take action on messages.
Is this message actionable or something that can be ignored?
I am currently using the Confluent Platform community license. I started ZooKeeper, Kafka and the Schema Registry - all running in local mode. However, when the Schema Registry starts for the first time, the __consumer_offsets topic is created with 50 partitions (__consumer_offsets-0 to __consumer_offsets-49). Those partitions are stored in the kafka-logs directory, and when I try to start the services again, it fails. To be more precise: ZooKeeper works, but Kafka fails with the error:
"ERROR Shutdown broker because all log dirs have failed".
As suggested in some other posts, I deleted the data directory referenced in the zookeeper.properties file and the log.dirs directory referenced in the server.properties file. After doing this I can start Kafka again without any error - but the 50 __consumer_offsets partitions are created again when starting the Schema Registry, and after stopping Kafka and trying to start it again, it fails with the same error.
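For reference, the two settings I am talking about look roughly like this (the kafka-logs path is the one from the errors quoted in the EDIT below; the ZooKeeper dataDir shown is just the stock default, so both may differ from your setup):

# server.properties - broker data directory, which holds the __consumer_offsets-* folders
log.dirs=/mnt/c/Users/Username/Desktop/Big_Data/confluent-6.0.0/kafka-logs

# zookeeper.properties - ZooKeeper data directory (default value, assumed here)
dataDir=/tmp/zookeeper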
Any help is greatly appreciated. :)
EDIT:
Above that error there's another error saying:
"ERROR Failed to clean up log for _schemas-0 in dir /mnt/c/Users/Username/Desktop/Big_Data/confluent-6.0.0/kafka-logs due to IOException (kafka.server.LogDirFailureChannel) java.io.IOException: Invalid argument"
and also two warnings:
"WARN [ReplicaManager broker=0] Stopping serving replicas in dir /mnt/c/Users/Username/Desktop/Big_Data/confluent-6.0.0/kafka-logs (kafka.server.ReplicaManager)"
and
"WARN [ReplicaManager broker=0] Broker 0 stopped fetcher for partitions __consumer_offsets-22, ... (all of the 50 offsets are then listed)"
I had my Kafka Connect connectors paused, and upon restarting them I got these errors in my logs:
[2020-02-19 19:36:00,219] ERROR WorkerSourceTask{id=wem-postgres-source-0} Failed to commit offsets (org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter)
************
************
[2020-02-19 19:36:00,216] ERROR WorkerSourceTask{id=wem-postgres-source-0} Failed to flush, timed out while waiting for producer to flush outstanding 2389 messages (org.apache.kafka.connect.runtime.WorkerSourceTask)
I got this error multiple times, with the number of outstanding messages changing each time. Then it stopped and I haven't seen it since.
Do I need to take any action here, or has Connect retried and committed the offsets, and is that why the error has stopped?
Thanks
The error indicates that a lot of messages are buffered and cannot be flushed before the timeout is reached. To address this issue you can:
either increase the offset.flush.timeout.ms configuration parameter in your Kafka Connect worker config,
or reduce the amount of data being buffered by decreasing producer.buffer.memory in your Kafka Connect worker config. This turns out to be the best option when you have fairly large messages.
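A minimal sketch of what those overrides could look like in the Connect worker properties file (connect-standalone.properties or connect-distributed.properties); the exact values are just examples and should be tuned to your workload:

# Give the offset commit more time to wait for outstanding records to flush (default is 5000)
offset.flush.timeout.ms=10000

# Or shrink the worker producer's buffer so less data can pile up between flushes (default is 33554432, i.e. 32 MB)
producer.buffer.memory=8388608

After editing the worker config, restart the Connect worker for the change to take effect.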
I am using Kafka 2, and for some reason the only NiFi processors that will correctly publish my messages to Kafka are PublishKafka (0_9) and PublishKafka_0_10. The later versions don't push my messages through, which is odd because, again, I'm running Kafka 2.1.1.
For more information, when I try to run my FlowFile through the later PublishKafka processors, I get a timeout exception that repeats voluminously.
2019-03-11 16:05:34,200 ERROR [Timer-Driven Process Thread-7] o.a.n.p.kafka.pubsub.PublishKafka_2_0 PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-cf7f86f22694] PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-cf7f86f22694] failed to process session due to org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 5000ms.; Processor Administratively Yielded for 1 sec: org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 5000ms.
org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 5000ms.
2019-03-11 16:05:34,201 WARN [Timer-Driven Process Thread-7] o.a.n.controller.tasks.ConnectableTask Administratively Yielding PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-cf7f86f22694] due to uncaught Exception: org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 5000ms.
My processor settings are the following:
All other configurations are defaults.
Any ideas on why this is happening?
I was running a Kafka cluster with 2 brokers,
but I keep getting the WARN log below.
I checked all my systems and there was no host using the IP 10.8.7.1.
By the way, there were more IPs in these logs that look like they belong to ZooKeeper or the brokers?
If I shut down one of the Kafka brokers, there are fewer of these WARN logs.
I am not familiar with Kafka and ZooKeeper; I am just getting started and studying them.
Any ideas?
Kafka version: 1.0.1
The WARN log is similar to the one below (I get this kind of log about every 10 seconds):
[2018-04-19 09:13:08,342] WARN [SocketServer brokerId=0] Unexpected error from /10.8.7.1; closing connection (org.apache.kafka.common.network.Selector)
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 369295616 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:235)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:196)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:545)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:483)
at org.apache.kafka.common.network.Selector.poll(Selector.java:412)
at kafka.network.Processor.poll(SocketServer.scala:551)
at kafka.network.Processor.run(SocketServer.scala:468)
at java.lang.Thread.run(Thread.java:748)
One possible cause is that a Kafka producer on 10.8.7.1 is attempting to send roughly 0.369 GB of data in a single request instead of streaming it; 104857600 bytes (100 MB) is the broker's default socket.request.max.bytes limit, which is why the receive is rejected. You may have to track down that Kafka producer and see what is going on.
Hope this helps.
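If it really is an oversized produce request, the two knobs involved are, roughly, the broker's receive limit and the producer's request size cap. A sketch with the default values (104857600 is the broker default that matches the number in your log):

# server.properties (broker) - maximum size of a single request the socket server will accept (default 100 MB)
socket.request.max.bytes=104857600

# producer configuration - maximum size of a single produce request (default 1 MB); a well-behaved producer stays far below the broker limit
max.request.size=1048576

Rather than raising the broker limit, it is usually better to find the client and fix whatever is building such a huge request.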