StreamSets pipeline to ingest files to HDFS throwing misleading "File not Found" exception

We have a StreamSets job set up which, although it runs successfully, throws the following error:
"UNKNOWN com.streamsets.pipeline.api.StageException: SPOOLDIR_35 -
Spool Directory Runner Failed. Reason
java.nio.file.NoSuchFileException: "
The error is 'file not found', but the file is actually processed successfully and the error is still raised. This happens intermittently, and not for all of the files being processed.
Here's some background about the job:
The pipeline reads files from the Linux edge node and ingests them into HDFS.
The error occurs on the 'read' stage.
We have been running the same pipeline for almost 2 years and had not seen this issue until the last month or so. Nothing about our process has changed recently. The intermittent errors seem to coincide with the latest StreamSets upgrade.
We process about 7 files every 2 hours through this pipeline, so roughly 84 files a day, and the intermittent error seems to occur on 1-3 files per day. All files are still processed into HDFS.
Any idea why this happens?

It looks like you might be hitting SDC-9740. Please watch/vote/comment on this issue, especially if you can provide any more detail that might help us narrow down the cause. It's a P1, so it should be fixed in the next release.
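Until the fix lands, one way to confirm the error really is cosmetic is to compare what the spooler has processed locally against what ended up in HDFS. Below is a minimal sketch, assuming the SPOOLDIR post-processing archives processed files to a local directory; both paths are hypothetical placeholders for your pipeline's configuration.

```python
#!/usr/bin/env python3
"""Sketch: confirm every file in the local archive directory also landed in HDFS.

Assumes SPOOLDIR post-processing archives processed files to ARCHIVE_DIR and that
the pipeline writes them (with the same base names) under HDFS_TARGET_DIR.
Both paths are hypothetical placeholders -- adjust to your pipeline config.
"""
import os
import subprocess

ARCHIVE_DIR = "/data/spool/archive"   # local edge-node archive (assumption)
HDFS_TARGET_DIR = "/landing/ingest"   # HDFS destination directory (assumption)

# Base names of files the spooler has already processed locally
local_files = set(os.listdir(ARCHIVE_DIR))

# `hdfs dfs -ls -C` prints full paths only, one per line
out = subprocess.check_output(
    ["hdfs", "dfs", "-ls", "-C", HDFS_TARGET_DIR], text=True
)
hdfs_files = {os.path.basename(p) for p in out.splitlines() if p.strip()}

missing = sorted(local_files - hdfs_files)
if missing:
    print("Files not found in HDFS:", missing)
else:
    print(f"All {len(local_files)} archived files are present in HDFS.")
```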

Related

ADF Dataflow stuck in progress and failing with the errors below

The DF task in my ADF pipeline is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the Dataflow gets stuck in progress and times out after a certain time. We are using an IR with managed virtual network. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev Environment:
Error code: 4508
Spark cluster not found
Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changed the IR configuration as below
Tried DF retry and retry interval
Tried running the ForEach loop one batch at a time instead of 4 batches in parallel
None of the above troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure; suddenly they started to fail consistently over the last 3 days. The DF flow always gets stuck in progress, randomly on a different entity and at different times, and eventually times out by throwing the errors above.
Error code 4508: Spark cluster not found.
This error can occur for two reasons:
The debug session is closed before the data flow finishes its transformation; in this case the recommendation is to restart the debug session.
The second reason is a resource problem or an outage in that particular region.
Error code 5000 Failure type User configuration issue Details [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
The "Livy job state dead caused by unknown error" message is a transient error. The data flow runs on a Spark cluster in the backend, and this error is generated by that cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.
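Before opening a ticket, it can also help to pull the activity-level error (the 4508/5000 codes and the Livy message) for the stuck run programmatically and attach it. Below is a minimal sketch assuming the azure-mgmt-datafactory SDK; the subscription, resource group, factory name, and run id are placeholders.

```python
"""Sketch: fetch activity-level error details for a stuck/failed pipeline run so
the 4508 / 5000 codes and Livy message can be inspected or attached to a support
ticket. Subscription, resource group, factory, and run id are placeholders, and
this assumes the azure-mgmt-datafactory SDK."""
from datetime import datetime, timedelta

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY_NAME = "<data-factory-name>"
RUN_ID = "<pipeline-run-id>"

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Query all activity runs for this pipeline run, including each ForEach
# iteration's Execute Data Flow activity.
filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(days=1),
    last_updated_before=datetime.utcnow(),
)
activities = client.activity_runs.query_by_pipeline_run(
    RESOURCE_GROUP, FACTORY_NAME, RUN_ID, filters
)

for run in activities.value:
    if run.status != "Succeeded":
        # `error` carries the error code (e.g. 4508 / 5000) and the Livy details
        print(run.activity_name, run.status, run.error)
```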

Azure Databricks error- The output of the notebook is too large. Cause: rpc response

Error Message - job failed with error message The output of the notebook is too large. Cause: rpc response (of 20972488 bytes) exceeds limit of 20971520 bytes
Details:
We are using Databricks notebooks to run the job. The job runs on a job cluster, and it is a streaming job.
The job started failing with the above-mentioned error.
We do not have any display(), show(), print(), or explain() calls in the job.
We are not using the awaitAnyTermination() method in the job either.
We also tried adding "spark.databricks.driver.disableScalaOutput true" to the job, but it still did not work; the job keeps failing with the same error.
We have followed all the steps mentioned in this document - https://learn.microsoft.com/en-us/azure/databricks/kb/jobs/job-cluster-limit-nb-output
Do we have any option to resolve this issue, or to find out exactly which command's output is causing it to exceed the 20 MB limit?
See the docs on running Structured Streaming in production.
I would recommend migrating to workflows based on JAR jobs, because:
"Notebook workflows are not supported with long-running jobs. Therefore we don’t recommend using notebook workflows in your streaming jobs."
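As a rough illustration of that recommendation, the streaming logic can live in a standalone entry point (run as a spark_python_task or packaged as a wheel/JAR job) instead of a notebook, so nothing is ever returned through the notebook-output channel that is subject to the 20 MB RPC limit. A minimal sketch with placeholder paths and table names:

```python
"""Sketch of a standalone (non-notebook) streaming entry point that could be run
as a Databricks spark_python_task or packaged as a wheel/JAR job. The input path,
checkpoint location, and target table are placeholders."""
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.appName("standalone-streaming-job").getOrCreate()

    # Example source: Auto Loader-style file stream (placeholder input path)
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events")
    )

    # Example sink: Delta table with a placeholder checkpoint location
    query = (
        stream.writeStream.format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/events")
        .toTable("events_bronze")
    )

    # Block on the query; since nothing is returned as a notebook result,
    # the 20 MB notebook-output limit never comes into play.
    query.awaitTermination()


if __name__ == "__main__":
    main()
```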

Dataproc Job not giving any output

I have submitted a Spark job through Airflow; sometimes the job works and sometimes it doesn't give any output at all.
Even after 2-3 hours of waiting, the job gives no detail apart from:
Waiting for job output...
I am using the dataproc-1-4-deb10 image.
It's a simple job, like pulling data from JDBC using PySpark. It sometimes works without error and sometimes doesn't work at all.
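For reference, here is a minimal sketch of the kind of PySpark JDBC pull described, as it might be submitted as a Dataproc PySpark job; the JDBC URL, credentials, table, and output path are placeholders.

```python
"""Minimal sketch of a PySpark JDBC pull, runnable as a Dataproc PySpark job.
The JDBC URL, table, credentials, and output path are placeholders."""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-pull").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")  # placeholder
    .option("dbtable", "public.source_table")              # placeholder
    .option("user", "etl_user")                            # placeholder
    .option("password", "secret")                          # placeholder
    .option("fetchsize", "10000")  # larger fetch size avoids row-by-row pulls
    .load()
)

# Write to storage rather than collecting to the driver; driver-side log lines
# are all that Airflow surfaces after "Waiting for job output..."
df.write.mode("overwrite").parquet("gs://my-bucket/output/source_table")  # placeholder

spark.stop()
```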

Kafka Connect source connector's tasks going up and down (RUNNING or missing) frequently

Occasionally with Kafka Connect, I see my JdbcSourceConnector's task go up and down--that is, the REST interface sometimes reports one task that is RUNNING and sometimes reports no tasks (the connector remains RUNNING this whole time). During these times, the task seems to be working when it's running. Then, if I delete and re-create the connector, the problem seems to go away. I suspect something is wrong--tasks shouldn't churn like this, right? But the INFO/WARN logs on the server don't seem to give me many clues, although there are lots of INFO lines to sort through.
Is it normal for JdbcSourceConnector tasks to oscillate between nonexisting and RUNNING?
Assuming not, what should I look for in the log to help figure it out? (I see lots of INFO lines)
Any idea what could be causing this?
I have monitoring on my REST connectors' statuses, and this one gives me the following (where the value is the number of RUNNING statuses; 2 is Connector RUNNING and task RUNNING, but 1 is Connector RUNNING without a task). Today at 9:01 AM I deleted and created the connector, thus "solving" the problem. Any thoughts?
I have Kafka Connect version "5.5.0-ccs" for use with Confluent platform 5.4, running on Openshift with 2 pods. I have 6 separate connectors each with max 1 task, and I typically see 3 connectors with their tasks on one pod and 3 on the other. For the example above, this was the only 1 of the 6 tasks that showed this behavior, but I have seen where 2 or 3 of them are doing it.
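For what it's worth, the monitoring described above (2 = connector RUNNING plus task RUNNING, 1 = connector RUNNING without a task) can be reproduced directly against the Connect REST API's /connectors/{name}/status endpoint, which also exposes a trace field for FAILED tasks that the INFO logs may not surface. A small sketch with a placeholder Connect URL and connector name:

```python
"""Sketch: poll the Kafka Connect REST API for a connector's status and count
RUNNING states the same way as the monitoring described above (2 = connector +
task, 1 = connector only). The Connect URL and connector name are placeholders."""
import time

import requests

CONNECT_URL = "http://connect:8083"   # placeholder
CONNECTOR = "my-jdbc-source"          # placeholder

while True:
    status = requests.get(f"{CONNECT_URL}/connectors/{CONNECTOR}/status").json()

    running = 0
    if status["connector"]["state"] == "RUNNING":
        running += 1
    for task in status["tasks"]:
        if task["state"] == "RUNNING":
            running += 1
        elif "trace" in task:
            # FAILED tasks include a stack trace that may not appear in INFO logs
            print(task["id"], task["state"], task["trace"])

    # A reading of 1 with an empty task list is the "connector RUNNING, no task" case
    print(time.strftime("%H:%M:%S"), "running =", running,
          "tasks =", len(status["tasks"]))
    time.sleep(60)
```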

What caused Druid tasks to fail?

I have set up a Druid cluster (10 nodes) and am ingesting Kafka data using the indexing service. However, I found that many tasks have failed, as shown below, but some data does exist in the segments, and I am not sure whether all of the data has been pushed into the segments.
failed task lists
Besides that, I looked at the logs of some failed tasks and found no fatal error messages. I have posted the log file; please help me figure out what caused the tasks to fail. Thanks so much.
one log of failed tasks
There are 2 questions I want to ask: one is how to confirm that all consumed data has been pushed into the segments, and the other is what caused the task failures.
This looks to be a Hadoop issue where multiple threads try to write to the same file at the same time; you need to set overwrite=false.
Check whether you are running multiple ingestion tasks for the same segments.
You can refer to the link below for further debugging:
https://community.hortonworks.com/questions/139150/no-lease-on-file-inode-5425306-file-does-not-exist.html
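For the first question (confirming that everything consumed from Kafka made it into segments), one approach is to check the Kafka supervisor's reported lag on the Overlord and compare row counts per hour via Druid SQL on the Broker. A sketch below; the hosts and datasource are placeholders, it assumes the supervisor id equals the datasource name, and the exact lag fields in the supervisor status payload can vary by Druid version.

```python
"""Sketch for the two checks above: (1) is the Kafka supervisor caught up
(low/zero aggregate lag means the tasks have consumed everything so far), and
(2) row counts per hour from the datasource via Druid SQL, to compare against
what the producer side expects. Hosts and datasource are placeholders; the
supervisor id is assumed to equal the datasource name."""
import requests

OVERLORD = "http://overlord:8090"    # placeholder
BROKER = "http://broker:8082"        # placeholder
DATASOURCE = "my_kafka_datasource"   # placeholder

# 1) Supervisor status: state and aggregate lag for the Kafka ingestion
status = requests.get(
    f"{OVERLORD}/druid/indexer/v1/supervisor/{DATASOURCE}/status"
).json()
payload = status.get("payload", {})
print("supervisor state:", payload.get("state"))
print("aggregate lag:", payload.get("aggregateLag"))

# 2) Row counts per hour via Druid SQL, to spot hours that look short
sql = {
    "query": f"""
        SELECT TIME_FLOOR(__time, 'PT1H') AS hr, COUNT(*) AS cnt
        FROM "{DATASOURCE}"
        GROUP BY 1
        ORDER BY 1 DESC
        LIMIT 24
    """
}
for row in requests.post(f"{BROKER}/druid/v2/sql", json=sql).json():
    print(row)
```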