Failed to lock the state directory for task 0_13 - apache-kafka

I am facing a very strange issue with Kafka Streams: under heavy load, when a rebalance happens, my Kafka Streams application keeps getting stuck, with the following error showing up in the logs repeatedly:
org.apache.kafka.streams.errors.LockException: stream-thread [metricsvc-metric-space-aggregation-9f4389a2-85de-43dc-a45c-3d4cc66150c4-StreamThread-1] task [0_13] Failed to lock the state directory for task 0_13
at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:91) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamTask.initializeIfNeeded(StreamTask.java:216) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.TaskManager.tryToCompleteRestoration(TaskManager.java:433) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:849) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:731) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:583) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:556) ~[kafka-streams-2.8.1.jar:?]
I am debugging some old code written by a developer who is no longer with our company, and this part is running into issues. Unfortunately the code is not very well documented. In this part of the code he tried to override some of the Kafka Streams WindowedStore and ReadOnlyWindowedStore classes for optimization. I understand it is quite difficult to find the root cause without seeing the complete code, but is there something really obvious that I should be looking at to solve this?
I am currently running 4 Kubernetes pods for this service, and each of them has its own independent state directory.
I expect not to get the error above, and even if it does happen, Kafka Streams should recover from it gracefully, but that does not happen in our case.

Are there multiple StreamThread instances per pod? If so, you could be affected by https://issues.apache.org/jira/browse/KAFKA-12679.
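If you want to confirm this, the thread count and the state directory both come from the Streams configuration. A minimal sketch of the relevant properties (the application id is taken from the thread name in your log; the broker address and state path are placeholders, not values from your code):

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsConfigCheck {
    public static Properties streamsProps() {
        Properties props = new Properties();
        // Application id taken from the StreamThread name in the log above.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metricsvc-metric-space-aggregation");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        // If this is greater than 1, several StreamThreads inside the same pod share one
        // state directory and can hit the KAFKA-12679 lock race during rebalances.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
        // Each pod should also point at its own state directory, e.g. a per-pod volume.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams"); // placeholder path
        return props;
    }
}
```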

Related

ADF Data Flow stuck in progress and fails with the errors below

An ADF pipeline Data Flow task is stuck in progress. It had been working seamlessly for the last couple of months, but suddenly the Data Flow gets stuck in progress and times out after a certain time. We are using an IR with a managed virtual network. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev environment:
Error code: 4508 (Spark cluster not found)
Error in Prod environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changing the IR configuration
Trying DF retry and a retry interval
Running the ForEach loop one batch at a time instead of 4 batches in parallel
None of the above troubleshooting steps worked. These pipelines had been running for the last 3-4 months without a single failure; suddenly they started failing consistently 3 days ago. The data flow always gets stuck in progress, randomly for a different entity, and eventually times out, throwing the errors above.
Error code 4508, Spark cluster not found:
This error can occur for two reasons.
The first is that the debug session is being closed before the data flow finishes its transformation; in this case the recommendation is to restart the debug session.
The second is a resource problem, or an outage in that particular region.
Error code 5000 Failure type User configuration issue Details [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
This is typically a transient error ("Livy job state dead caused by unknown error"). The data flow uses a Spark cluster at the backend, and this error is generated by that Spark cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.

Possible Stuckness: Google Cloud PubSub to Cloud Storage

I have a Dataflow streaming job that writes PubSub messages to files stored in Cloud Storage in 3-minute windows. After a few hours, I notice that the "Data freshness by stages" graph displays "Possible stuckness" and "Possible slowness".
I have checked the logs, and the info logs display the following: "Setting socket default timeout to 60 seconds."; "socket default timeout is 60.0 seconds."; "Attempting refresh to obtain initial access_token."; "Refreshing due to a 401 (attempt 1/2)". That last log kept repeating every few minutes for four hours before the job reported possible slowness/stuckness.
I am not entirely sure what is happening here. Are these logs related to why the job slowed down and got stuck?
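For reference, the pipeline is set up roughly like this (a Beam Java sketch of the shape described above; the subscription, bucket, and shard count are placeholders, not the real values):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class PubSubToGcs {
    public static void main(String[] args) {
        StreamingOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(StreamingOptions.class);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);
        pipeline
            // Placeholder subscription path.
            .apply("ReadMessages",
                   PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/my-sub"))
            // Fixed 3-minute windows, matching the setup in the question.
            .apply("Window3Min", Window.<String>into(FixedWindows.of(Duration.standardMinutes(3))))
            // Windowed writes to a placeholder bucket; unbounded input needs an explicit shard count.
            .apply("WriteFiles",
                   TextIO.write().to("gs://my-bucket/output/").withWindowedWrites().withNumShards(1));

        pipeline.run();
    }
}
```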
The "potential stuckness" and "potential slowness" are basically the same thing, they are documented here.
The logs might be red herrings.
You can view all available logs by their categories (job-message, worker, worker-startup, etc.). Try to:
identify whether there are any worker logs, to determine whether the workers started successfully with their dependencies installed;
search for "Operation ongoing" to see whether any work item is taking too much time;
search for any errors in the workers that are blocking the streaming job from making progress.

What caused Druid tasks to fail

I have set up a Druid cluster (10 nodes) and am ingesting Kafka data using the indexing service. However, I found that many tasks have failed, as below, but some data does exist in the segments, and I am not sure whether all the data has been pushed into the segments.
failed task lists
Besides that, I looked at some logs of the failed tasks and found no fatal error messages. I have posted the log file; please help me figure out what caused the tasks to fail. Thanks so much.
one log of failed tasks
There are two questions I want to ask: one is how to confirm that all consumed data has been pushed into the segments, and the other is what caused the task failures.
This looks to be a Hadoop issue where multiple threads try to write to the same file at the same time; you need to set overwrite=false.
Check whether you are running multiple ingestion tasks for the same segments.
You can refer to the link below for further debugging:
https://community.hortonworks.com/questions/139150/no-lease-on-file-inode-5425306-file-does-not-exist.html

How can I reproduce the Akka quarantine situation?

Working with an Akka cluster, I sometimes get a quarantine situation with an exception like this:
2016-03-22 10:01:37.090UTC WARN [ClusterSystem-akka.actor.default-dispatcher-2] Remoting - Association to [akka.tcp://ClusterSystem#10.10.80.26:2551] having UID [1417400423] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
The problem is that I can't reproduce it!
I just can't write a simple code snippet that provokes a quarantine event.
Is there any simple way to provoke a quarantine event?
Just start two seed nodes and stop the network on one of them; you will see this issue.
I have reproduced this issue: I created multiple VMs to form a cluster and stopped the network on one of the VMs.
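For a local or multi-VM experiment, a sketch along these lines gives you the two seed nodes to break apart (this assumes classic remoting to match the akka.tcp addresses in the log above; the hostname, ports, and the way you cut the network are placeholders you would adapt):

```java
import akka.actor.ActorSystem;
import com.typesafe.config.Config;
import com.typesafe.config.ConfigFactory;

public class QuarantineRepro {
    // Starts one ClusterSystem node on the given port; both nodes list the same seed nodes.
    static ActorSystem startNode(int port) {
        Config config = ConfigFactory.parseString(
            "akka.actor.provider = cluster\n"
            + "akka.remote.netty.tcp.hostname = 127.0.0.1\n" // placeholder host
            + "akka.remote.netty.tcp.port = " + port + "\n"
            + "akka.cluster.seed-nodes = [\n"
            + "  \"akka.tcp://ClusterSystem@127.0.0.1:2551\",\n"
            + "  \"akka.tcp://ClusterSystem@127.0.0.1:2552\"]\n")
            .withFallback(ConfigFactory.load());
        return ActorSystem.create("ClusterSystem", config);
    }

    public static void main(String[] args) {
        // Run this class twice (ports 2551 and 2552), ideally on two machines or VMs.
        // Once both members are Up, stop the network on one of them for longer than the
        // failure detector tolerates, then restore it; the surviving node should log the
        // "UID is now quarantined" warning from the question.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 2551;
        startNode(port);
    }
}
```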

AkkaCluster: Terminated vs MemberRemoved message

I am using Akka Cluster and having trouble figuring out when to use the Terminated message to determine that a member has left the cluster versus using the UnreachableMember/MemberRemoved messages. What is the difference between the two, and what are some example use cases?
MemberRemoved is a cluster event indicating that a member has been completely removed from the cluster.
You can subscribe to change notifications of the cluster membership by using Cluster(system).subscribe.
On the other hand, there is DeathWatch, a mechanism to watch an actor and check whether it has terminated. When DeathWatch is used, the watcher receives a Terminated(watched) message when the watched actor terminates.
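To make the contrast concrete, here is a small sketch using the Java API (the actor being watched is a hypothetical ActorRef passed in by the caller): one actor subscribes to MemberRemoved for node-level membership changes and uses DeathWatch for a single actor's termination.

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.Terminated;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;

public class MembershipListener extends AbstractActor {
    private final Cluster cluster = Cluster.get(getContext().getSystem());
    private final ActorRef watched; // hypothetical actor whose lifecycle we care about

    public MembershipListener(ActorRef watched) {
        this.watched = watched;
    }

    @Override
    public void preStart() {
        // Node-level view: cluster membership events (a whole member was removed).
        cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(),
                          ClusterEvent.MemberRemoved.class);
        // Actor-level view: DeathWatch on one specific actor.
        getContext().watch(watched);
    }

    @Override
    public void postStop() {
        cluster.unsubscribe(getSelf());
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(ClusterEvent.MemberRemoved.class,
                   m -> System.out.println("Node removed from cluster: " + m.member().address()))
            .match(Terminated.class,
                   t -> System.out.println("Watched actor terminated: " + t.getActor()))
            .build();
    }
}
```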