Lagom: Persistence failure when replaying events for persistenceId - scala

I am using lagom for building a, small sample application. I am following the structure of "hello-world" lagom sample application. But whenever I am trying to run the application, I am facing, bellow error:
[error] a.c.s.PersistentShardCoordinator - Persistence failure when replaying events for persistenceId [/sharding/TwitterEntityCoordinator]. Last known sequence number [0]
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.
[error] a.c.s.PersistentShardCoordinator - Persistence failure when replaying events for persistenceId [/sharding/kafkaProducer-tweetsCoordinator]. Last known sequence number [0]
Still, I am not getting what is the meaning of this error and to resolve them?? Because, I am using Apache Kafka also, and my messages are successfully stored into topic but in logs, this error show every time.

Related

IOError(Stalefile) exception being thrown by Kafka Streams RocksDB

When running my stateful Kafka streaming applications I'm coming across various different RocksDB Disk I/O Stalefile exceptions. The exception only occurs when I have at least one KTable implementation and it happens at various different times. I've tried countless times to reproduce it but haven't been able to.
App/Environment details:
Runtime: Java
Kafka library: org.apache.kafka:kafka-streams:2.5.1
Deployment: OpenShift
Volume type: NFS
RAM: 2000 - 8000 MiB
CPU: 200 Millicores to 2 Cores
Threads: 1
Partitions: 1 - many
Exceptions encountered:
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while getting value for key from at org.apache.kafka.streams.state.internals.RocksDBStore.get(RocksDbStore.java:301)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error restoring batch to store at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(RocksDbStore.java:636)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while range compacting during restoring at org.apache.kafka.streams.state.internals.RocksDBStore$SingleColumnFamilyAccessor.toggleDbForBulkLoading(RocksDbStore.java:616)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while executing flush from store at org.apache.kafka.streams.state.internals.RocksDBStore.flush(RocksDbStore.java:616)
Apologies for not being able to post the entire stack trace, but all of the above exceptions seem to reference the org.rocksdb.RocksDBException: IOError(Stalefile) exception.
Additional info:
Using a persisted state directory
Kafka topic settings are created with defaults
Running a single instance on a single thread
Exception is raised during gets and writes
Exception is raised when consuming valid data
Exception also occurs on internal repartition topics
I'd really appreciate any help and please let me know if I can provide any further information.
If you are using Posix file system, this error means that the file system returns ESTALE. See description to the code in https://man7.org/linux/man-pages/man3/errno.3.html

Kafka Streams shutdown after IllegalStateException: No current assignment for partition

I have a Kafka Streams application that launches and runs successfully. We have 4 instances of the application running. Occasionally one of our instance of the application is legitimately killed which causes several rounds of rebalancing until the old node is replaced.
Sometimes during the rebalance, one ore more previously healthy nodes fail. The logs are indicating that the Streams application transitions into a PENDING_SHUTDOWN state directly after receiving the following exception:
java.lang.IllegalStateException: No current assignment for partition public.chat.message-28
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:256)
at org.apache.kafka.clients.consumer.internals.SubscriptionState.resetFailed(SubscriptionState.java:418)
at org.apache.kafka.clients.consumer.internals.Fetcher$2.onFailure(Fetcher.java:621)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.RequestFutureAdapter.onFailure(RequestFutureAdapter.java:30)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:571)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:389)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:292)
at org.apache.kafka.clients.consumer.internals.Fetcher.getAllTopicMetadata(Fetcher.java:275)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1849)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1827)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.refreshChangelogInfo(StoreChangelogReader.java:259)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.initialize(StoreChangelogReader.java:133)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:79)
at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:328)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:866)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:804)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:773)
Prior to this error we often seem to also recieve some informational logs reporting a disconnect exception:
Error sending fetch request (sessionId=568252460, epoch=7) to node 4: org.apache.kafka.common.errors.DisconnectException
I have a feeling the two are related but I'm unable to reason why at present.
Is anyone able to give me some hints as to what may be causing this issue and any possible solutions?
Additional Info:
Kafka 2.2.1
32 partitions spread evenly across the 4 worker nodes
StreamsConfig settings:
kafkaStreamProps.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
kafkaStreamProps.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
kafkaStreamProps.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
kafkaStreamProps.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 120000);
kafkaStreamProps.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
This looks like it could be related to https://issues.apache.org/jira/browse/KAFKA-9073, which has been fixed in Kafka Streams 2.3.2.
If you can't wait for that release, you could try creating a private build using the changeset from this pull request: https://github.com/apache/kafka/pull/7630/files

Long Running Kafka Stream Suddenly Died with IllegalArgumentException

I have a Kafka streams application which had been running and processing records for several hours. At some point, all of the threads died with the following message.
Exception in thread "MY_STREAM-29fd50b1-1478-4a82-8014-a42a1eabeb28-StreamThread-7" java.lang.IllegalArgumentException: Assigned partition MY_TOPIC_1-KSTREAM-FLATMAP-0000000015-repartition-24 for non-subscribed topic regex pattern; subscription pattern is MY_TOPIC_1|MY_TOPIC_2|MY_TOPIC_1_V1-KSTREAM-FLATMAP-0000000023-repartition|MY_TOPIC_1_V1-KSTREAM-MAP-0000000024-repartition
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignFromSubscribed(SubscriptionState.java:187)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:220)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:367)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:290)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1149)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1115)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:827)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:784)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:750)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:720)
It appears that this exception was unrelated to any message processing as our exception handler was not called. Restarting the application servers 'fixed' the problem; the streams are running and processing records again. What caused this issue? How can I prevent this from happening in the future?

UnknownProducerIdException in Kafka streams when enabling exactly once

After enabling exactly once processing on a Kafka streams application, the following error appears in the logs:
ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort
sending since an error caught with a previous record (key 222222 value
some-value timestamp 1519200902670) to topic exactly-once-test-topic-
v2 due to This exception is raised by the broker if it could not
locate the producer metadata associated with the producerId in
question. This could happen if, for instance, the producer's records
were deleted because their retention time had elapsed. Once the last
records of the producerId are removed, the producer's metadata is
removed from the broker, and future appends by the producer will
return this exception.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
We've reproduced the issue with a minimal test case where we move messages from a source stream to another stream without any transformation. The source stream contains millions of messages produced over several months. The KafkaStreams object is created with the following StreamsConfig:
StreamsConfig.PROCESSING_GUARANTEE_CONFIG = "exactly_once"
StreamsConfig.APPLICATION_ID_CONFIG = "Some app id"
StreamsConfig.NUM_STREAM_THREADS_CONFIG = 1
ProducerConfig.BATCH_SIZE_CONFIG = 102400
The app is able to process some messages before the exception occurs.
Context information:
we're running a 5 node Kafka 1.1.0 cluster with 5 zookeeper nodes.
there are multiple instances of the app running
Has anyone seen this problem before or can give us any hints about what might be causing this behaviour?
Update
We created a new 1.1.0 cluster from scratch and started to process new messages without problems. However, when we imported old messages from the old cluster, we hit the same UnknownProducerIdException after a while.
Next we tried to set the cleanup.policy on the sink topic to compact while keeping the retention.ms at 3 years. Now the error did not occur. However, messages seem to have been lost. The source offset is 106 million and the sink offset is 100 million.
As explained in the comments, there currently seems to be a bug that may cause problems when replaying messages older than the (maximum configurable?) retention time.
At time of writing this is unresolved, the latest status can always be seen here:
https://issues.apache.org/jira/browse/KAFKA-6817

Kafka Stream Startup Issue - org.apache.kafka.streams.errors.LockException

I have a Kafka Streams Application version - 0.11 which takes data from few topics and joins the data and puts it in another topic.
Kafka Configuration:
5 kafka brokers - version 0.11
Kafka Topics - 15 partitions and 3 replication factor.
Few millions of records are consumed/produced every hour. Whenever I take any kafka broker down, it throws below Exception:
org.apache.kafka.streams.errors.LockException: task [4_10] Failed to lock the state directory for task 4_10
at org.apache.kafka.streams.processor.internals.ProcessorStateManager.<init>(ProcessorStateManager.java:99)
at org.apache.kafka.streams.processor.internals.AbstractTask.<init>(AbstractTask.java:80)
at org.apache.kafka.streams.processor.internals.StandbyTask.<init>(StandbyTask.java:62)
at org.apache.kafka.streams.processor.internals.StreamThread.createStandbyTask(StreamThread.java:1325)
at org.apache.kafka.streams.processor.internals.StreamThread.access$2400(StreamThread.java:73)
at org.apache.kafka.streams.processor.internals.StreamThread$StandbyTaskCreator.createTask(StreamThread.java:313)
at org.apache.kafka.streams.processor.internals.StreamThread$AbstractTaskCreator.retryWithBackoff(StreamThread.java:254)
at org.apache.kafka.streams.processor.internals.StreamThread.addStandbyTasks(StreamThread.java:1366)
at org.apache.kafka.streams.processor.internals.StreamThread.access$1200(StreamThread.java:73)
at org.apache.kafka.streams.processor.internals.StreamThread$RebalanceListener.onPartitionsAssigned(StreamThread.java:185)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:265)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:363)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:310)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:297)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1078)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:582)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:553)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:527)
I have read at few jira issues that cleaningUp the streams might help to fix the issue. But cleaningUp the streams everytime we start the Kafka Stream Application is a right solution or a patch? Also, stream cleanUp will delay the application startup right?
Note: Do I need to call streams.cleanUp() before calling streams.start(), each time I start the Kafka Streams application
Seeing a org.apache.kafka.streams.errors.LockException: task [4_10] Failed to lock the state directory for task 4_10 is actually expected and should resolve itself. The thread will back off in order to wait until another thread releases the lock and retries later. Thus, you might even see this WARN message is the logs multiple time in case the retry happens before the second thread did release the lock.
However, eventually the lock should be release by the second thread and the first thread will be able to get the lock. Afterwards, Streams should just move forward. Note, it's a WARN message and not an error.