IOError(Stalefile) exception being thrown by Kafka Streams RocksDB - apache-kafka

When running my stateful Kafka Streams applications I'm coming across various RocksDB Disk I/O Stalefile exceptions. The exception only occurs when I have at least one KTable implementation, and it happens at various times. I've tried countless times to reproduce it but haven't been able to.
App/Environment details:
Runtime: Java
Kafka library: org.apache.kafka:kafka-streams:2.5.1
Deployment: OpenShift
Volume type: NFS
RAM: 2000 - 8000 MiB
CPU: 200 Millicores to 2 Cores
Threads: 1
Partitions: 1 - many
Exceptions encountered:
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while getting value for key from at org.apache.kafka.streams.state.internals.RocksDBStore.get(RocksDBStore.java:301)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error restoring batch to store at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(RocksDBStore.java:636)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while range compacting during restoring at org.apache.kafka.streams.state.internals.RocksDBStore$SingleColumnFamilyAccessor.toggleDbForBulkLoading(RocksDBStore.java:616)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while executing flush from store at org.apache.kafka.streams.state.internals.RocksDBStore.flush(RocksDBStore.java:616)
Apologies for not being able to post the entire stack trace, but all of the above exceptions seem to reference the org.rocksdb.RocksDBException: IOError(Stalefile) exception.
Additional info:
Using a persisted state directory
Kafka topic settings are created with defaults
Running a single instance on a single thread
Exception is raised during gets and writes
Exception is raised when consuming valid data
Exception also occurs on internal repartition topics
I'd really appreciate any help and please let me know if I can provide any further information.

If you are using a POSIX file system, this error means that the file system returned ESTALE. See the description of that error code at https://man7.org/linux/man-pages/man3/errno.3.html

Related

How to resolve a Java heap space error in Kafka Streams deployed on Kubernetes

I am working on Kafka Streams using Docker. I have deployed this kafka-stream module as a Docker image on a Kubernetes pod. When I started writing data to the Kafka topics, it wrote a few records, but after some time it started showing multiple errors.
Every Kafka topic has 6 partitions and a replication factor of 3.
Below are the errors:
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for test.aggregator.s3-1:316987 ms has passed since batch creation
The broker is either slow or in bad state (like not having enough replicas) in responding the request, or the connection to broker was interrupted sending the request or receiving the response.
Consider overwriting `max.block.ms` and /or `delivery.timeout.ms` to a larger value to wait longer for such scenarios and avoid timeout errors
Exception handler choose to CONTINUE processing in spite of this error but written offsets would not be recorded. (org.apache.kafka.streams.processor.internals.RecordCollectorImpl:221)
Heartbeat thread failed due to unexpected error (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:1392)
java.lang.OutOfMemoryError: Java heap space
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:101) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:27) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.CompactArrayOf.read(CompactArrayOf.java:84) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:114) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:27) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.CompactArrayOf.read(CompactArrayOf.java:84) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.types.Schema.read(Schema.java:114) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.common.protocol.ApiKeys.parseResponse(ApiKeys.java:325) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.NetworkClient.parseStructMaybeUpdateThrottleTimeMetrics(NetworkClient.java:720) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:834) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:553) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:265) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.pollNoWakeup(ConsumerNetworkClient.java:306) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:1321) [org.apache.kafka-kafka-clients-2.6.0.jar:?]
[2022-07-20 10:38:10,977] ERROR [kafka-producer-network-thread | stream-consumer-f7fb108f-6d7c-4736-a69f-a885a3eddc47-StreamThread-2-producer] Uncaught exception in thread 'kafka-producer-network-thread | stream-consumer-f7fb108f-6d7c-4736-a69f-a885a3eddc47-StreamThread-2-producer': (org.apache.kafka.common.utils.KafkaThread:49)
java.lang.OutOfMemoryError: Java heap space
Detected that the thread is being fenced. This implies that this thread missed a rebalance and dropped out of the consumer group. Will close out all assigned tasks and rejoin the consumer group. (org.apache.kafka.streams.processor.internals.StreamThread:572)
org.apache.kafka.streams.errors.TaskMigratedException: Consumer committing offsets failed, indicating the corresponding thread is no longer part of the group; it means all tasks belonging to this thread should be migrated.
at org.apache.kafka.streams.processor.internals.TaskManager.commitOffsetsOrTransaction(TaskManager.java:1009) ~[org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.TaskManager.commit(TaskManager.java:962) ~[org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:851) ~[org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:714) [org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:551) [org.apache.kafka-kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:510) [org.apache.kafka-kafka-streams-2.6.0.jar:?]
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1251) ~[org.apache.kafka-kafka-clients-2.6.0.jar:?]
Please suggest what I am missing here, because the same module works fine when I run it in my local environment.
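For reference, the settings named in the broker hint and in the CommitFailedException above (max.block.ms, delivery.timeout.ms, max.poll.interval.ms, max.poll.records) would be passed to a Streams application roughly as in the sketch below; the application id is taken from the thread names in the log, while the broker address and all the values are illustrative assumptions rather than recommendations:
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class TimeoutTuningSketch {
    // All values below are illustrative assumptions, not recommendations.
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-consumer");   // app id seen in the thread names above
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");    // placeholder
        // Producer-side timeouts mentioned in the RecordCollector warning.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.MAX_BLOCK_MS_CONFIG), 180_000);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG), 300_000);
        // Consumer-side settings mentioned in the CommitFailedException.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 600_000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 100);
        return props;
    }
}
Note that the OutOfMemoryError itself usually points at the JVM heap size (-Xmx) or the pod memory limit rather than at these timeout settings.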

Does Kafka Connect restart a failed task?

We have a source connector that reads from an RDBMS and writes to Kafka. It uses Schema Registry with an Avro schema.
I am finding the following exceptions in the Kafka Connect log and the Schema Registry log, respectively.
1.
Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:426)
WorkerSourceTask{id=A-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:443)
Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:186)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
.
.
Caused by: org.apache.kafka.connect.errors.DataException: Failed to serialize Avro data from topic A :
at io.confluent.connect.avro.AvroConverter.fromConnectData(AvroConverter.java:91)
at org.apache.kafka.connect.storage.Converter.fromConnectData(Converter.java:63)
.
.
Caused by: org.apache.kafka.common.errors.SerializationException: Error registering Avro schema:
.
.
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Register operation timed out; error code: 50002
.
.
Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:187)
Stopping JDBC source task (io.confluent.connect.jdbc.source.JdbcSourceTask:314)
Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer:1182)
2.
Wait to catch up until the offset at 1 (io.confluent.kafka.schemaregistry.storage.KafkaStore:304)
Request Failed with exception (io.confluent.rest.exceptions.DebuggableExceptionMapper:62)
io.confluent.kafka.schemaregistry.rest.exceptions.RestSchemaRegistryTimeoutException: Register operation timed out
at io.confluent.kafka.schemaregistry.rest.exceptions.Errors.operationTimeoutException(Errors.java:132)
.
.
Caused by: io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryTimeoutException: Write to the Kafka store timed out while
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.register(KafkaSchemaRegistry.java:508)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.registerOrForward(KafkaSchemaRegistry.java:553)
.
.
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreTimeoutException: KafkaStoreReaderThread failed to reach target offset within the timeout interval. targetOffset: 3, offsetReached: 1, timeout(ms): 500
So basically, before registering a schema, Schema Registry moves its offset to the latest, and there it times out after 500 ms.
My questions are these:
How can I find out why it is not able to read from Kafka?
Does the source connector restart or poll data again for the failed task? Because in a later section of the log I see this.
Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:426)
WorkerSourceTask{id=A-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:443)
So earlier it failed after this point, but now it is not printing the exception, which means it passed.
The key thing to note is that when the read failed earlier, only the task of connector A failed and the others passed. Later I did not find the exception for connector A again.
If the task does not start or the connector does not poll again, I need to restart the task using the REST API.
Any help will be greatly appreciated.
Thanks in advance.
Regarding your question title, read the error.
task will not recover until manually restarted
If you have more than one task, you would still expect to see logs from other tasks.
As for offset commits, source task offsets would not be committed until the task succeeds, and none of the logs given show anything "moving to latest".
The error has nothing to do with reading from Kafka. The error is a timeout in your Schema Registry client in the AvroConverter, which isn't required for Kafka Connect.
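As a sketch of the manual restart the error message refers to, a single failed task can be restarted through the Connect REST API; the worker URL below is a placeholder, and connector "A" with task id 0 comes from the WorkerSourceTask{id=A-0} entries in the log:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestartConnectTask {
    public static void main(String[] args) throws Exception {
        // POST /connectors/{name}/tasks/{taskId}/restart restarts a single task.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect-worker:8083/connectors/A/tasks/0/restart"))  // placeholder host
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());  // 204 indicates the restart was accepted
    }
}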

Long Running Kafka Stream Suddenly Died with IllegalArgumentException

I have a Kafka Streams application which had been running and processing records for several hours. At some point, all of the threads died with the following message.
Exception in thread "MY_STREAM-29fd50b1-1478-4a82-8014-a42a1eabeb28-StreamThread-7" java.lang.IllegalArgumentException: Assigned partition MY_TOPIC_1-KSTREAM-FLATMAP-0000000015-repartition-24 for non-subscribed topic regex pattern; subscription pattern is MY_TOPIC_1|MY_TOPIC_2|MY_TOPIC_1_V1-KSTREAM-FLATMAP-0000000023-repartition|MY_TOPIC_1_V1-KSTREAM-MAP-0000000024-repartition
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignFromSubscribed(SubscriptionState.java:187)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:220)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:367)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:290)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1149)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:827)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:784)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720)
It appears that this exception was unrelated to any message processing as our exception handler was not called. Restarting the application servers 'fixed' the problem; the streams are running and processing records again. What caused this issue? How can I prevent this from happening in the future?

UnknownProducerIdException in Kafka Streams when enabling exactly-once processing

After enabling exactly-once processing on a Kafka Streams application, the following error appears in the logs:
ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort
sending since an error caught with a previous record (key 222222 value
some-value timestamp 1519200902670) to topic exactly-once-test-topic-
v2 due to This exception is raised by the broker if it could not
locate the producer metadata associated with the producerId in
question. This could happen if, for instance, the producer's records
were deleted because their retention time had elapsed. Once the last
records of the producerId are removed, the producer's metadata is
removed from the broker, and future appends by the producer will
return this exception.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
We've reproduced the issue with a minimal test case where we move messages from a source stream to another stream without any transformation. The source stream contains millions of messages produced over several months. The KafkaStreams object is created with the following StreamsConfig:
StreamsConfig.PROCESSING_GUARANTEE_CONFIG = "exactly_once"
StreamsConfig.APPLICATION_ID_CONFIG = "Some app id"
StreamsConfig.NUM_STREAM_THREADS_CONFIG = 1
ProducerConfig.BATCH_SIZE_CONFIG = 102400
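Assembled into code, a minimal reproduction with that configuration might look like the sketch below; the bootstrap servers, serdes, and topic names are assumptions that were not given above:
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceCopySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "some-app-id");                  // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");               // placeholder
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
        // Producer-level settings are passed through with the producer prefix.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.BATCH_SIZE_CONFIG), 102400);
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());   // assumed serdes
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Copy topology: move records from one topic to another without transformation.
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("source-topic").to("sink-topic");                                // placeholder topic names

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}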
The app is able to process some messages before the exception occurs.
Context information:
We're running a 5-node Kafka 1.1.0 cluster with 5 ZooKeeper nodes.
There are multiple instances of the app running.
Has anyone seen this problem before or can give us any hints about what might be causing this behaviour?
Update
We created a new 1.1.0 cluster from scratch and started to process new messages without problems. However, when we imported old messages from the old cluster, we hit the same UnknownProducerIdException after a while.
Next we tried to set the cleanup.policy on the sink topic to compact while keeping the retention.ms at 3 years. Now the error did not occur. However, messages seem to have been lost. The source offset is 106 million and the sink offset is 100 million.
As explained in the comments, there currently seems to be a bug that may cause problems when replaying messages older than the (maximum configurable?) retention time.
At the time of writing this is unresolved; the latest status can always be seen here:
https://issues.apache.org/jira/browse/KAFKA-6817

Kafka Streams state dir IO error

The error below is thrown after the stream has been running for some time. I am not able to find out what is responsible for creating the .sst file.
Env:
Kafka version 0.10.0-cp1
Scala 2.11.8
org.apache.kafka.streams.errors.ProcessorStateException: Error while executing flush from store agg
at org.apache.kafka.streams.state.internals.RocksDBStore.flushInternal(RocksDBStore.java:424)
at org.apache.kafka.streams.state.internals.RocksDBStore.flush(RocksDBStore.java:414)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.flush(MeteredKeyValueStore.java:165)
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:330)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:247)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:446)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:434)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:422)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:340)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:218)
Caused by: org.rocksdb.RocksDBException: IO error: /tmp/kafka-streams/pos/0_0/rocksdb/agg/000008.sst: No such file or directory
at org.rocksdb.RocksDB.flush(Native Method)
at org.rocksdb.RocksDB.flush(RocksDB.java:1329)
at org.apache.kafka.streams.state.internals.RocksDBStore.flushInternal(RocksDBStore.java:422)
... 9 more
[2016-06-24 11:13:54,910] ERROR Failed to commit StreamTask #0_0 in thread [StreamThread-1]: (org.apache.kafka.streams.processor.internals.StreamThread:452)
org.apache.kafka.streams.errors.ProcessorStateException: Error while batch writing to store agg
at org.apache.kafka.streams.state.internals.RocksDBStore.putAllInternal(RocksDBStore.java:324)
at org.apache.kafka.streams.state.internals.RocksDBStore.flushCache(RocksDBStore.java:379)
at org.apache.kafka.streams.state.internals.RocksDBStore.flush(RocksDBStore.java:411)
at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.flush(MeteredKeyValueStore.java:165)
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.flush(ProcessorStateManager.java:330)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:247)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:446)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:434)
at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:248)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:228)
Caused by: org.rocksdb.RocksDBException: IO error: /tmp/kafka-streams/pos/0_0/rocksdb/agg/000008.sst: No such file or directory
at org.rocksdb.RocksDB.write0(Native Method)
at org.rocksdb.RocksDB.write(RocksDB.java:546)
at org.apache.kafka.streams.state.internals.RocksDBStore.putAllInternal(RocksDBStore.java:322)
... 9 more
RocksDB is used internally by Kafka Streams to handle operator state, and RocksDB writes some files to disk.
Is it possible that somebody deleted files in the /tmp folder, and thus deleted the state of your Kafka Streams application? If yes, configure a different state store location using the parameter state.dir (see http://docs.confluent.io/current/streams/developer-guide.html#optional-configuration-parameters).
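A minimal sketch of setting that parameter, assuming a directory such as /var/kafka-streams outside /tmp (the path and broker address are placeholders; the application id "pos" is taken from the state path in the stack trace):
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StateDirConfigSketch {
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pos");              // application id seen in the state path above
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
        // Keep RocksDB state out of /tmp so periodic /tmp cleanup cannot delete the .sst files.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/kafka-streams");
        return props;
    }
}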