Kafka cluster streams timeouts at high input - apache-kafka

I'm running an Kafka cluster with 7 nodes and a lot of stream processing. Now I see infrequent errors in my Kafka Streams applications like at high input rates:
[2018-07-23 14:44:24,351] ERROR task [0_5] Error sending record to topic topic-name. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl) org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,021] ERROR stream-thread [StreamThread-2] Failed to commit StreamTask 0_5 state: (org.apache.kafka.streams.processor.internals.StreamThread) org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121) at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129) at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76) at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188) at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281) at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807) at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794) at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769) at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647) at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,033] ERROR stream-thread [StreamThread-2] Failed while executing StreamTask 0_5 due to flush state: (org.apache.kafka.streams.processor.internals.StreamThread) org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121) at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129) at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:423) at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555) at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501) at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551) at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449) at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391) at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,039] WARN stream-thread [StreamThread-2] Unexpected state transition from RUNNING to NOT_RUNNING. (org.apache.kafka.streams.processor.internals.StreamThread) Exception in thread "StreamThread-2" org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121) at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129) at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76) at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188) at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281) at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807) at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794) at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769) at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647) at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
If I reduce the input rate (from 20k to 10k events/s) the errors are gone away. So obviously I'm reaching any sort of limit. I have played around with different options (request.timeout.ms, linger.ms and batch.size) but every time the same result.

You seem to have reached some kind of limit. Based on the message 60060 ms has passed since last append I'd assume it's writher thread starvation due to high load, so disk would be the first thing to check:
disk usage - if you're reaching write speed limit, switching from hdd to ssd might help
load distribution - is your traffic split +- equally to all nodes?
CPU load - lots of processing can

we had similar issue.
in our case we had the following configuration for replication and acknowledgement:
replication.factor: 3
producer.acks: all
and under high load the same error occurred multiple times TimeoutException: Expiring N record(s) for topic: N ms has passed since last append.
after removing our custom replication.factor and producer.acks configs (so we now using default values), and this error has disapearred.
Definitely it takes much more time on producer side until leader will receive full set of in-sync replicas to acknowledge the record, and until records replicated with specified replication.factor.
You will be slightly less protected on fault tolerance with default values.
Also potentially consider to increase the number of partitions per topic and number of application nodes (in which your kafka stream logic processed).

Related

unable to connect Kafa server from TIBCO

I am getting below error while sending message to kafa topic from TIBCO application.
Error emssage -
2022-06-22T18:27:54,167 INFO [EventAdminThread #17] com.tibco.thor.frwk.Application - TIBCO-THOR-FRWK-300006: Started BW Application [elkkafka:1.0]
org.apache.kafka.common.KafkaException: Expiring 1 record(s) for tibcotopic-1:120001 ms has passed since batch creation
at com.tibco.bw.palette.kafka.runtime.SendActivity.buildOutput(SendActivity.java:1385)
at com.tibco.bw.palette.kafka.runtime.SendActivity$SendActivityExecutor.sendMessages(SendActivity.java:1129)
at com.tibco.bw.palette.kafka.runtime.SendActivity$SendActivityExecutor.run(SendActivity.java:972)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
2022-06-22T18:29:56,055 ERROR [pool-16-thread-1] com.tibco.bw.palette.kafka.runtime.SendActivity - TIBCO-BW-PALETTE-KAFKA-500005: Exception occurred while send message to broker. Expiring 1 record(s) for tibcotopic-1:120001 ms has passed since batch creation.
2022-06-22T18:29:56,095 ERROR [bwEngThread:In-Memory Process Worker-2] com.tibco.bw.core - TIBCO-BW-CORE-500050: The BW process [elkkafka.module.Process] instance faulted, JobId [bw0a100], ProcessInstanceId [bw0a100], ParentProcessInstanceId [-], Module [elkkafka.module:1.0.0.qualifier], Application [elkkafka:1.0].
<CausedBy> TIBCO-BW-CORE-500051: Activity [cKafkaSendMessage] fault.
<CausedBy> com.tibco.bw.palette.kafka.runtime.fault.KafkaPluginException: TIBCO-BW-PALETTE-KAFKA-500005: Exception occurred while send message to broker. Expiring 1 record(s) for tibcotopic-1:120001 ms has passed since batch creation.-{ActivityName=cKafkaSendMessage, ProcessName=elkkafka.module.Process, ModuleName=elkkafka.module}
<CausedBy> org.apache.kafka.common.KafkaException: Expiring 1 record(s) for tibcotopic-1:120001 ms has passed since batch creation
Producer parameters -
I have increased timeout value but no luck.
I am able to telnet from my machine to kafka server .

When Kafka Topic partition reassignment, Flink job fails continuously

env
kafka 1.0.1
flink 1.7.1
trouble
I use topic with 200 partitions. and flink uses this topic.
Recently, I do manual partition reassignment.
When i reassigned partitions, Flink continuosly fails with this error.
error1.
[2021-07-28 18:21:15,926] WARN Attempting to send response via channel for which there is no open connection, connection id ..(kafka.network.Processor)
error2.
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for -126: 30042 ms has passed since batch creation plus linger time
error3.
java.lang.Exception: Error while triggering checkpoint 656 for Source: Custom Source -> Sink: ... (32/200)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1174)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not perform checkpoint 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:570)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.triggerCheckpoint(SourceStreamTask.java:116)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1163)
... 5 more
Caused by: java.lang.Exception: Could not complete snapshot 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:422)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1113)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1055)
at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:729)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:641)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:564)
... 7 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for ...-86: 30049 ms has passed since batch creation plus linger time
And When i restarted failed job, this error occurs continuously.
ClassLoader info: URL ClassLoader:
file: '/blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47' (invalid JAR: /blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47 (Too many open files))
Class not resolvable through given classloader.
So I restarted all mesos and flink cluster with zookeeper clearance.
Is there any cause to look for?
There were network issues with certain brokers in the cluster.
If a request for a specific partition is processed slowly due to a network issue, it is expected that the message will be displayed.
Subsequently, the job corresponding to the partition does not work properly, and it seems that the checkpoint issue of flink occurs.
This problem was solved by replacing the equipment of the broker.

Kafka: Producer threads get stuck

I have an Apache Nifi workflow that streams data into Kafka. My Kafka cluster is made of 5 nodes that uses SSL for encryption.
When there is a lot of data that is going throw, my Kafka producer (PublishKafkaRecord) freeze and stop working. I have to restart the processor and I am getting Threads errors.
I am using Kafka Confluent 5.3.1.
I am seeing these errors in the Kafka logs:
ERROR Uncaught exception in scheduled task 'transactionalID-expiration' (Kafka.utils.Kafkascheduler)
Retrying leaderEpoch request for partitions XXX-0 as the leader reported an error: UNKNOWN_LEADER_EPOCH (kafka.server.ReplicaFetcherThread)
Could not find offset index file corresponding to log file XXX/*.log recovering segment and rebuilding index files (kafka.log.Log)
ERROR when handing request: .... __transaction_state
ERROR TransactionMetadata (... ) failed: this should not happen (kafka.coordinator.transaction.TransactionMetadata)
I cannot pin point to the actual error.
How can I fix Threads being stuck in Kafka?

Why is my NiFi PublishKafka processor only working with previous versions?

I am using Kafka 2, and for some reason the only NiFi processors that will correctly publish my messages to Kafka are PublishKafka (0_9) and PublishKafka_0_10. The later versions don't push my messages through, which is odd because again, I'm running Kafka 2.1.1.
For more information, when I try to run my FlowFile through the later PublishKafka processors, I get a timeout exception that repeats voluminously.
2019-03-11 16:05:34,200 ERROR [Timer-Driven Process Thread-7] o.a.n.p.kafka.pubsub.PublishKafka_2_0 PublishKafka_2_0[id=6d7f1896-0169-
1000-ca27-cf7f86f22694] PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-
cf7f86f22694] failed to process session due to
org.apache.kafka.common.errors.TimeoutException: Timeout expired while
initializing transactional state in 5000ms.; Processor Administratively
Yielded for 1 sec: org.apache.kafka.common.errors.TimeoutException:
Timeout expired while initializing transactional state in 5000ms.
org.apache.kafka.common.errors.TimeoutException: Timeout expired while
initializing transactional state in 5000ms.
2019-03-11 16:05:34,201 WARN [Timer-Driven Process Thread-7]
o.a.n.controller.tasks.ConnectableTask Administratively Yielding
PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-cf7f86f22694] due to uncaught
Exception: org.apache.kafka.common.errors.TimeoutException: Timeout
expired while initializing transactional state in 5000ms.
My processor settings are the following:
All other configurations are defaults.
Any ideas on why this is happening?

Failed while executing StreamTask 0_1 due to flush state:

I'm trying to run a basic pipe from one topic to another using Kafka Streams 0.10.2.1
KStream<ByteBuffer, ByteBuffer> stream = builder
.stream("transactions_load");
stream.to("transactions_fact");
KafkaStreams streams = new KafkaStreams(builder, config);
streams.start();
If I watch the destination topic, I can see records are produced there. Records are produced for about 1 minute and then the process fails with the error below:
ERROR task [0_19] Error sending record to topic transactions_fact. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl:102)
[2017-10-02 16:30:54,516]org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
ERROR task [0_9] Error sending record to topic transactions_fact. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl:102)
[2017-10-02 16:30:54,519]org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
....
[2017-10-02 16:30:54,650]org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-14: 30068 ms has passed since last append
ERROR task [0_2] Error sending record to topic transactions_fact. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl:102)
[2017-10-02 16:30:54,650]org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-14: 30061 ms has passed since last append
ERROR stream-thread [StreamThread-1] Failed to commit StreamTask 0_0 state: (org.apache.kafka.streams.processor.internals.StreamThread:813)
[2017-10-02 16:31:02,355]org.apache.kafka.streams.errors.StreamsException: task [0_0] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:280)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
...
ERROR stream-thread [StreamThread-1] Failed while executing StreamTask 0_19 due to flush state: (org.apache.kafka.streams.processor.internals.StreamThread:503)
[2017-10-02 16:31:02,378]org.apache.kafka.streams.errors.StreamsException: task [0_19] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:422)
at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555)
at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501)
at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449)
at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
Exception in thread "StreamThread-1" org.apache.kafka.streams.errors.StreamsException: task [0_0] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:280)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
Some more info:
I am running one instance of the streams app (on my laptop)
I am writing about 400 records per second into the source topic.
The source topic has 20 partitions
The target topic has 20 partitions
The error suggests that the problem is with producing to the target topic? What is the next step in debugging this further?