Failed while executing StreamTask 0_1 due to flush state: - ibm-cloud

I'm trying to run a basic pipe from one topic to another using Kafka Streams 0.10.2.1:
KStreamBuilder builder = new KStreamBuilder(); // pre-1.0 Streams DSL builder; config is a Properties defined elsewhere
KStream<ByteBuffer, ByteBuffer> stream = builder.stream("transactions_load");
stream.to("transactions_fact");
KafkaStreams streams = new KafkaStreams(builder, config);
streams.start();
If I watch the destination topic, I can see records being produced there. Records are produced for about a minute, and then the process fails with the errors below:
ERROR task [0_19] Error sending record to topic transactions_fact. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl:102)
[2017-10-02 16:30:54,516]org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
ERROR task [0_9] Error sending record to topic transactions_fact. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl:102)
[2017-10-02 16:30:54,519]org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
....
[2017-10-02 16:30:54,650]org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-14: 30068 ms has passed since last append
ERROR task [0_2] Error sending record to topic transactions_fact. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl:102)
[2017-10-02 16:30:54,650]org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-14: 30061 ms has passed since last append
ERROR stream-thread [StreamThread-1] Failed to commit StreamTask 0_0 state: (org.apache.kafka.streams.processor.internals.StreamThread:813)
[2017-10-02 16:31:02,355]org.apache.kafka.streams.errors.StreamsException: task [0_0] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:280)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
...
ERROR stream-thread [StreamThread-1] Failed while executing StreamTask 0_19 due to flush state: (org.apache.kafka.streams.processor.internals.StreamThread:503)
[2017-10-02 16:31:02,378]org.apache.kafka.streams.errors.StreamsException: task [0_19] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:422)
at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555)
at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501)
at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449)
at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
Exception in thread "StreamThread-1" org.apache.kafka.streams.errors.StreamsException: task [0_0] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:280)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 24 record(s) for transactions_fact-5: 30012 ms has passed since last append
Some more info:
I am running one instance of the streams app (on my laptop).
I am writing about 400 records per second into the source topic.
The source topic has 20 partitions.
The target topic has 20 partitions.
The error suggests that the problem is with producing to the target topic. What is the next step in debugging this further?
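A reasonable first step is ruling out producer-side tuning before blaming the broker. A minimal sketch, assuming a 0.10.2.x Streams app where producer-level keys set on the streams Properties are forwarded to the internal producer; the application id, bootstrap address, and values are illustrative, not tuned:
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "transactions-pipe"); // hypothetical id
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
// Give batches more time before they expire in the accumulator
// (the "30012 ms has passed since last append" matches the 30000 ms default):
config.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
// Retry transient broker slowness instead of failing the task:
config.put(ProducerConfig.RETRIES_CONFIG, 5);
If raising the timeout only delays the failure, the next place to look is broker-side metrics for the target topic's partitions (request latency, ISR shrinks), since the expiry means the producer could not complete appends for a full timeout window.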

Related

Unable to connect to Kafka server from TIBCO

I am getting the below error while sending a message to a Kafka topic from a TIBCO application.
Error message -
2022-06-22T18:27:54,167 INFO [EventAdminThread #17] com.tibco.thor.frwk.Application - TIBCO-THOR-FRWK-300006: Started BW Application [elkkafka:1.0]
org.apache.kafka.common.KafkaException: Expiring 1 record(s) for tibcotopic-1:120001 ms has passed since batch creation
at com.tibco.bw.palette.kafka.runtime.SendActivity.buildOutput(SendActivity.java:1385)
at com.tibco.bw.palette.kafka.runtime.SendActivity$SendActivityExecutor.sendMessages(SendActivity.java:1129)
at com.tibco.bw.palette.kafka.runtime.SendActivity$SendActivityExecutor.run(SendActivity.java:972)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
2022-06-22T18:29:56,055 ERROR [pool-16-thread-1] com.tibco.bw.palette.kafka.runtime.SendActivity - TIBCO-BW-PALETTE-KAFKA-500005: Exception occurred while send message to broker. Expiring 1 record(s) for tibcotopic-1:120001 ms has passed since batch creation.
2022-06-22T18:29:56,095 ERROR [bwEngThread:In-Memory Process Worker-2] com.tibco.bw.core - TIBCO-BW-CORE-500050: The BW process [elkkafka.module.Process] instance faulted, JobId [bw0a100], ProcessInstanceId [bw0a100], ParentProcessInstanceId [-], Module [elkkafka.module:1.0.0.qualifier], Application [elkkafka:1.0].
<CausedBy> TIBCO-BW-CORE-500051: Activity [cKafkaSendMessage] fault.
<CausedBy> com.tibco.bw.palette.kafka.runtime.fault.KafkaPluginException: TIBCO-BW-PALETTE-KAFKA-500005: Exception occurred while send message to broker. Expiring 1 record(s) for tibcotopic-1:120001 ms has passed since batch creation.-{ActivityName=cKafkaSendMessage, ProcessName=elkkafka.module.Process, ModuleName=elkkafka.module}
<CausedBy> org.apache.kafka.common.KafkaException: Expiring 1 record(s) for tibcotopic-1:120001 ms has passed since batch creation
Producer parameters -
I have increased the timeout value, but no luck.
I am able to telnet from my machine to the Kafka server.
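To rule out the TIBCO layer, a minimal standalone Java producer test against the same broker can help; this is a sketch, with the broker address assumed and the topic name taken from the log:
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSmokeTest {
    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-host:9092"); // your broker
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            // Blocks until the broker acknowledges the record, or times out.
            producer.send(new ProducerRecord<>("tibcotopic", "ping")).get(30, TimeUnit.SECONDS);
            System.out.println("send OK");
        }
    }
}
If the standalone test also times out, the usual culprit is advertised.listeners: telnet only proves the bootstrap port is reachable, while the brokers' metadata response may advertise hostnames that are not resolvable from your machine.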

Flink job fails continuously after Kafka topic partition reassignment

Environment:
Kafka 1.0.1
Flink 1.7.1
Problem:
I use a topic with 200 partitions, and Flink consumes from this topic.
Recently, I did a manual partition reassignment.
When I reassigned the partitions, Flink continuously failed with these errors.
Error 1:
[2021-07-28 18:21:15,926] WARN Attempting to send response via channel for which there is no open connection, connection id ..(kafka.network.Processor)
Error 2:
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for -126: 30042 ms has passed since batch creation plus linger time
Error 3:
java.lang.Exception: Error while triggering checkpoint 656 for Source: Custom Source -> Sink: ... (32/200)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1174)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not perform checkpoint 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:570)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.triggerCheckpoint(SourceStreamTask.java:116)
at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1163)
... 5 more
Caused by: java.lang.Exception: Could not complete snapshot 656 for operator Source: Custom Source -> Sink: ... (32/200).
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:422)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1113)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1055)
at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:729)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:641)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:564)
... 7 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 2 record(s) for ...-86: 30049 ms has passed since batch creation plus linger time
And when I restarted the failed job, this error occurred continuously:
ClassLoader info: URL ClassLoader:
file: '/blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47' (invalid JAR: /blobStore-29c572a3-4ed4-48a6-b604-d93b7e4a9a10/job_8bd41a7e0690e75bd61d148d89dca963/blob_p-5c10d03a5cbb09c9a9459f1bc2a70804d0b08290-26b5562cbe83b0403b06717637e7ab47 (Too many open files))
Class not resolvable through given classloader.
So I restarted the whole Mesos and Flink cluster and cleared ZooKeeper.
Is there any other cause I should look for?
There were network issues with certain brokers in the cluster.
If requests for a specific partition are processed slowly because of a network issue, this message is expected.
The Flink task consuming that partition then stops working properly, and that appears to be what triggers the checkpoint failures.
The problem was solved by replacing the affected broker's hardware.

Why is my NiFi PublishKafka processor only working with previous versions?

I am using Kafka 2, and for some reason the only NiFi processors that will correctly publish my messages to Kafka are PublishKafka (0_9) and PublishKafka_0_10. The later versions don't push my messages through, which is odd because again, I'm running Kafka 2.1.1.
For more information, when I try to run my FlowFile through the later PublishKafka processors, I get a timeout exception that repeats voluminously.
2019-03-11 16:05:34,200 ERROR [Timer-Driven Process Thread-7] o.a.n.p.kafka.pubsub.PublishKafka_2_0 PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-cf7f86f22694] PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-cf7f86f22694] failed to process session due to org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 5000ms.; Processor Administratively Yielded for 1 sec: org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 5000ms.
org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 5000ms.
2019-03-11 16:05:34,201 WARN [Timer-Driven Process Thread-7] o.a.n.controller.tasks.ConnectableTask Administratively Yielding PublishKafka_2_0[id=6d7f1896-0169-1000-ca27-cf7f86f22694] due to uncaught Exception: org.apache.kafka.common.errors.TimeoutException: Timeout expired while initializing transactional state in 5000ms.
My processor settings are the following:
All other configurations are defaults.
Any ideas on why this is happening?
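For context on what that error means: PublishKafka_2_0 with transactions enabled performs the equivalent of the producer's initTransactions() call, which blocks until the broker's transaction coordinator is available. A minimal sketch that reproduces the same timeout outside NiFi; the broker address and transactional id are assumptions:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class TxInitTest {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "nifi-tx-test");    // hypothetical id
        p.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 5000);                  // the 5000ms in the error
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            // Throws TimeoutException after max.block.ms if the transaction
            // coordinator never becomes available (e.g. the __transaction_state
            // topic cannot be created or reach its required ISR on the broker).
            producer.initTransactions();
            System.out.println("transactional state initialized");
        }
    }
}
If this fails too, the older processors working makes sense: PublishKafka (0_9) and PublishKafka_0_10 predate transactions, so they never touch the transaction coordinator. Checking the broker's transaction.state.log.replication.factor and transaction.state.log.min.isr against the actual broker count, or disabling the processor's transaction support, would be the next things to try.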

Kafka cluster streams timeouts at high input

I'm running a Kafka cluster with 7 nodes and a lot of stream processing. Now I see infrequent errors like the following in my Kafka Streams applications at high input rates:
[2018-07-23 14:44:24,351] ERROR task [0_5] Error sending record to topic topic-name. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl)
org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,021] ERROR stream-thread [StreamThread-2] Failed to commit StreamTask 0_5 state: (org.apache.kafka.streams.processor.internals.StreamThread)
org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,033] ERROR stream-thread [StreamThread-2] Failed while executing StreamTask 0_5 due to flush state: (org.apache.kafka.streams.processor.internals.StreamThread)
org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:423)
at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555)
at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501)
at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449)
at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,039] WARN stream-thread [StreamThread-2] Unexpected state transition from RUNNING to NOT_RUNNING. (org.apache.kafka.streams.processor.internals.StreamThread)
Exception in thread "StreamThread-2" org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
If I reduce the input rate (from 20k to 10k events/s), the errors go away, so obviously I'm reaching some sort of limit. I have played around with different options (request.timeout.ms, linger.ms, and batch.size), but I get the same result every time.
You seem to have reached some kind of limit. Based on the message 60060 ms has passed since last append, I'd assume it's writer-thread starvation due to high load, so disk would be the first thing to check (a quick check is sketched after this list):
disk usage - if you're reaching the write speed limit, switching from HDD to SSD might help
load distribution - is your traffic split more or less equally across all nodes?
CPU load - lots of processing can starve the brokers as well
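A minimal sketch of checking the first two points on a broker host, assuming Linux with the sysstat tools installed; the log directory path is an assumption, adjust it to your log.dirs setting:
iostat -dx 5                  # sustained high %util / await points at a disk write limit
df -h /var/kafka-logs         # assumed log.dirs location
du -sh /var/kafka-logs/*      # strongly uneven partition sizes suggest skewed load distribution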
We had a similar issue.
In our case we had the following configuration for replication and acknowledgement:
replication.factor: 3
producer.acks: all
Under high load the same error occurred multiple times: TimeoutException: Expiring N record(s) for topic: N ms has passed since last append.
After removing our custom replication.factor and producer.acks configs (so we are now using the default values), the error disappeared.
It definitely takes much more time on the producer side for the leader to receive acknowledgements from the full set of in-sync replicas, and for records to be replicated with the specified replication.factor.
You will be slightly less protected on fault tolerance with the default values.
Also, potentially consider increasing the number of partitions per topic and the number of application nodes (on which your Kafka Streams logic is processed).
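For illustration, a minimal sketch of the acknowledgement trade-off in plain producer terms; the key names are the standard producer configs, not necessarily the exact keys the poster used:
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
// acks=all: the leader waits until every in-sync replica has the record before
// acknowledging, so batches sit longer in the accumulator under load and can
// expire with "N ms has passed since last append".
props.put(ProducerConfig.ACKS_CONFIG, "all");
// acks=1 (the classic default): only the leader must append the record; faster,
// but records can be lost if the leader fails before followers replicate.
// props.put(ProducerConfig.ACKS_CONFIG, "1");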

kafka-flink : Kafka producer error depending on flink jobs

While I was doing a proof of concept with kafka-flink, I discovered the following: it seems that Kafka producer errors can occur depending on the workload running on the Flink side.
Here are more details:
I have sample files like sample??.EDR, each made of ~700'000 rows with values like "entity", "value", "timestamp".
I use the following command to create the kafka topic:
~/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 1 --replication-factor 1 --topic gprs
I use the following command to load sample files on topic:
[13:00] kafka#ubu19: ~/fms
% /home/kafka/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic gprs < ~/sample/sample01.EDR
On the Flink side, I have jobs that aggregate the value for each entity with sliding windows of 6 hours and 72 hours (aggregationeachsix, aggregationeachsentytwo).
I did three scenarios:
Load files in the topic without any job running
Load files in the topic with aggregationeachsix job running
Load files in the topic with aggregationeachsix and aggregationeachsentytwo jobs running
The result is that the first two scenarios work, but for the third scenario I get the following errors on the Kafka producer side while loading the files (not always at the same file; it can be the first, second, third, or even a later file):
[plenty of lines before this part]
[2017-08-09 12:56:53,409] ERROR Error when sending message to topic gprs with key: null, value: 35 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 233 record(s) for gprs-0: 1560 ms has passed since last append
[2017-08-09 12:56:53,409] ERROR Error when sending message to topic gprs with key: null, value: 37 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 233 record(s) for gprs-0: 1560 ms has passed since last append
[2017-08-09 12:56:53,409] ERROR Error when sending message to topic gprs with key: null, value: 37 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 233 record(s) for gprs-0: 1560 ms has passed since last append
[2017-08-09 12:56:53,412] WARN Got error produce response with correlation id 1627 on topic-partition gprs-0, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2017-08-09 12:56:53,412] WARN Got error produce response with correlation id 1626 on topic-partition gprs-0, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2017-08-09 12:56:53,412] WARN Got error produce response with correlation id 1625 on topic-partition gprs-0, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2017-08-09 12:56:53,412] WARN Got error produce response with correlation id 1624 on topic-partition gprs-0, retrying (2 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2017-08-09 12:56:53,515] ERROR Error when sending message to topic gprs with key: null, value: 35 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 8 record(s) for gprs-0: 27850 ms has passed since batch creation plus linger time
[2017-08-09 12:56:53,515] ERROR Error when sending message to topic gprs with key: null, value: 37 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 8 record(s) for gprs-0: 27850 ms has passed since batch creation plus linger time
[plenty of lines after this part]
My question is: why could Flink have an impact on the Kafka producer, and what do I need to change to avoid this error?
It looks like you are saturating your network when both Flink and the Kafka producer are using it, and thus you get TimeoutExceptions.
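If network or broker saturation is the cause, one quick check is to give the console producer more slack: its request timeout defaults to 1500 ms, which lines up with the "1560 ms has passed since last append" expiries above. A sketch using the same command with the timeout raised (the value is illustrative):
% /home/kafka/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic gprs --request-timeout-ms 30000 < ~/sample/sample01.EDR
If the errors disappear at the higher timeout, the broker is simply falling behind while Flink competes for the same network and disk; throttling the load or separating the hosts would be longer-term fixes.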