What caused Druid tasks to fail - apache-kafka

I set up a Druid cluster (10 nodes) and am ingesting Kafka data using the indexing service. However, I found that many tasks have failed, as shown below, but some data does exist in the segments. I am not sure whether all the data has been pushed into the segments.
failed task lists
Besides that, I checked the logs of some failed tasks and found no fatal error messages. I have posted the log file; please help me figure out what caused the tasks to fail. Thanks so much.
one log of failed tasks
There are two questions I want to ask: one is how to confirm that all consumed data has been pushed into the segments; the other is what caused the task failure.

This looks like a Hadoop issue where multiple threads try to write to the same file at the same time; you need to set overwrite=false.
Check whether you are running multiple ingestion tasks for the same segments (a sketch for checking this follows after the link below).
You can refer to the link below for further debugging:
https://community.hortonworks.com/questions/139150/no-lease-on-file-inode-5425306-file-does-not-exist.html
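One way to check for overlapping tasks (a rough sketch, assuming the Overlord is reachable at overlord:8090, which is a placeholder host) is to list the running tasks from the Overlord API and look for more than one task writing to the same datasource:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Lists the currently running indexing tasks from the Druid Overlord.
    // If two tasks target the same datasource and interval, they can
    // collide on the same deep-storage files and fail with lease errors.
    public class ListRunningTasks {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://overlord:8090/druid/indexer/v1/runningTasks")) // placeholder host
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // Each JSON entry has an "id" and a "dataSource" field; multiple
            // entries with the same dataSource are the ones to investigate.
            System.out.println(response.body());
        }
    }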

Related

ADF Dataflow stuck in progress and fails with the errors below

The ADF pipeline's Data Flow task is stuck in progress. It worked seamlessly for the last couple of months, but suddenly the Dataflow gets stuck in progress and times out after a certain time. We are using an IR-managed virtual network. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev environment:
Error code 4508: Spark cluster not found
Error in Prod environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changing the IR configuration as below
Trying DF retry and retry interval
Also tried running the ForEach loop one batch at a time instead of 4 batches in parallel. None of the above troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure; they suddenly started failing consistently in the last 3 days. The data flow always gets stuck in progress randomly for a different entity and eventually times out, throwing the errors above.
Error Code 4508 Spark cluster not found.
This error can occur for two reasons:
The debug session is closed before the data flow finishes its transformation; in this case, the recommendation is to restart the debug session.
The second reason is a resource problem or an outage in that particular region.
Error code 5000, Failure type: User configuration issue, Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
"Livy job state dead caused by unknown error" is a transient error. A Spark cluster is used at the backend of the data flow, and this error is generated by that cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.

AWS DMS task is failing in CDC with broken connection error

We are using AWS DMS to do a live migration from AWS RDS to AWS RDS. This task is in Full load + CDC mode. The full load completed successfully, but CDC is now failing with the error below:
Last Error Load utility network error. Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2883] [1020458] Error executing source loop; Stream component failed at subtask 0, component st_0_ADWXVXURDV4UXYIGPH5US2PQW6XSQVFD5K4NFAY; Stream component 'st_0_ADWXVXURDV4UXYIGPH5US2PQW6XSQVFD5K4NFAY' terminated [reptask/replicationtask.c:2891] [1020458] Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE
In CloudWatch I can only see the error below:
WAL reader terminated with broken connection / recoverable error. [1020458].
I am not sure what might be happening here; my only guess is that to fix this I may need to run CDC again from a custom checkpoint. Can anyone help me with this?
I tried debugging this issue at more verbose logging levels and also tested connectivity. I looked into the CloudWatch metrics, but nothing seems suspicious. Also note that CDC did start successfully but has now entered a failed state.
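If I do go the custom-checkpoint route, I assume resuming would look roughly like this with the AWS SDK for Java v2 (a sketch only; the task ARN and checkpoint string are placeholders, and the real checkpoint would come from the task's recovery checkpoint in DMS):

    import software.amazon.awssdk.services.databasemigration.DatabaseMigrationClient;
    import software.amazon.awssdk.services.databasemigration.model.StartReplicationTaskRequest;
    import software.amazon.awssdk.services.databasemigration.model.StartReplicationTaskTypeValue;

    public class ResumeCdcFromCheckpoint {
        public static void main(String[] args) {
            try (DatabaseMigrationClient dms = DatabaseMigrationClient.create()) {
                StartReplicationTaskRequest request = StartReplicationTaskRequest.builder()
                        .replicationTaskArn("arn:aws:dms:us-east-1:123456789012:task:EXAMPLE") // placeholder ARN
                        // resume CDC without re-running the full load
                        .startReplicationTaskType(StartReplicationTaskTypeValue.RESUME_PROCESSING)
                        // placeholder checkpoint; use the value recorded by DMS
                        .cdcStartPosition("checkpoint:V1#1#mysql-bin-changelog.000001:100#0#0#*#0#0")
                        .build();
                dms.startReplicationTask(request);
            }
        }
    }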

Failed to lock the state directory for task 0_13

I am facing a very weird issue with Kafka Streams: under heavy load, when a rebalance happens, my Kafka Streams application keeps getting stuck, with the following error showing up in the logs repeatedly:
org.apache.kafka.streams.errors.LockException: stream-thread [metricsvc-metric-space-aggregation-9f4389a2-85de-43dc-a45c-3d4cc66150c4-StreamThread-1] task [0_13] Failed to lock the state directory for task 0_13
at org.apache.kafka.streams.processor.internals.StateManagerUtil.registerStateStores(StateManagerUtil.java:91) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamTask.initializeIfNeeded(StreamTask.java:216) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.TaskManager.tryToCompleteRestoration(TaskManager.java:433) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.initializeAndRestorePhase(StreamThread.java:849) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:731) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:583) ~[kafka-streams-2.8.1.jar:?]
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:556) ~[kafka-streams-2.8.1.jar:?]
I am debugging some old code written by a developer who is no longer with our company, and this part is running into issues. Unfortunately, the code is not very well documented. In this part of the code, he tried to override some of the Kafka Streams WindowedStore and ReadOnlyWindowedStore classes for optimization. I understand it is quite difficult to find the root cause without looking at the complete code, but is there something really obvious that I should be looking at to solve this?
I am currently running 4 Kubernetes pods for this service, and each of them has its own independent state directory.
I expect not to get the error above, and even if it happens, Kafka Streams should recover from it gracefully; but that doesn't happen in our case.
Are there multiple StreamThread instances per pod? If so, you could be affected by https://issues.apache.org/jira/browse/KAFKA-12679
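One mitigation in that case (a sketch, not a guaranteed fix) is to run a single StreamThread per pod and scale out with more pods instead, so task ownership never has to move between threads inside the same JVM. The relevant settings would look like this (the application id, bootstrap servers, and state dir values are placeholders):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class StreamsProps {
        public static Properties build() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metricsvc-metric-space-aggregation"); // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
            // One StreamThread per instance avoids the state-directory
            // lock handoff between threads described in KAFKA-12679.
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
            // Each pod keeps its own pod-local state directory.
            props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams"); // placeholder
            return props;
        }
    }

Upgrading kafka-streams beyond 2.8.1 to a release where that ticket is resolved is also worth trying.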

Azure Databricks error- The output of the notebook is too large. Cause: rpc response

Error message: the job failed with the error "The output of the notebook is too large. Cause: rpc response (of 20972488 bytes) exceeds limit of 20971520 bytes".
Details:
We are using Databricks notebooks to run the job. The job runs on a job cluster. This is a streaming job.
The job started failing with the above-mentioned error.
We do not have any display(), show(), print(), or explain() calls in the job.
We are not using the awaitAnyTermination method in the job either.
We also tried adding "spark.databricks.driver.disableScalaOutput true" to the job, but it still did not work. The job keeps failing with the same error.
We have followed all the steps mentioned in this document - https://learn.microsoft.com/en-us/azure/databricks/kb/jobs/job-cluster-limit-nb-output
Do we have any option to resolve this issue, or to find out exactly which command's output is pushing it above the 20 MB limit?
See the docs regarding structured streaming in prod.
I would recommend migrating to workflows based on jar jobs because:
Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook workflows in your streaming jobs.
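For reference, a jar-based version of such a job is just a plain main class that starts the stream and blocks on it; with no notebook cells, there is no cell output to hit the 20 MB limit. A minimal sketch (the Delta source, sink, and checkpoint paths are placeholders for whatever the notebook currently does):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class StreamingJobMain {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("streaming-job")
                    .getOrCreate();

            Dataset<Row> input = spark.readStream()
                    .format("delta")             // placeholder source
                    .load("/mnt/source/events"); // placeholder path

            StreamingQuery query = input.writeStream()
                    .format("delta")             // placeholder sink
                    .option("checkpointLocation", "/mnt/checkpoints/events") // placeholder
                    .start("/mnt/sink/events");  // placeholder path

            // Block until the stream terminates; a jar job has no notebook
            // output channel, so the rpc-response size limit does not apply.
            query.awaitTermination();
        }
    }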

Retry skipped on new execution [duplicate]

Can I restart a job and process only the skipped items after I have corrected the mistakes in the file? I'm reading the documentation and currently not finding this possibility. You can restart a job if it has failed, but I'm thinking of restarting a job after it has completed with some skipped items. If this cannot be achieved with configuration, what would be a good way to implement it myself?
What I have done in a case similar to yours is to log each skipped item to a file.
Then I created a second job that loads that file and processes all the logged items.
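A minimal sketch of that approach, assuming Spring Batch with String items (the item type and file path are placeholders):

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import org.springframework.batch.core.SkipListener;

    // Appends every skipped item to a file so that a second job can
    // later read the file back and reprocess the items.
    public class SkippedItemLogger implements SkipListener<String, String> {

        private final Path skipFile = Path.of("skipped-items.txt"); // placeholder path

        @Override
        public void onSkipInRead(Throwable t) {
            // the raw item is not available on read skips
        }

        @Override
        public void onSkipInProcess(String item, Throwable t) {
            log(item);
        }

        @Override
        public void onSkipInWrite(String item, Throwable t) {
            log(item);
        }

        private void log(String item) {
            try (BufferedWriter w = Files.newBufferedWriter(skipFile,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                w.write(item);
                w.newLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

Register it on the fault-tolerant step with .listener(new SkippedItemLogger()), then point the second job's reader at skipped-items.txt.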