Airflow example DAGs take a long time in GUI and the scheduler is showing this: DagFileProcessorManager (PID=...) last sent a heartbeat - postgresql

While running standard Airflow examples with airflow 2.1.2, DAGs are taking a long time to complete. On every DAG run, this problem occurs. The problem happens when running from the airflow GUI. It isn't a problem when running as a test from the airflow command line. Looking at scheduler log as it runs, this is what is apparent: after a task runs, apparently the DagFileProcessorManager has to be restarted for it to continue to the next tasks, which take 1 to 2 minutes. The restart happens after the absence of heartbeat responses, and this error shows:
{dag_processing.py:414} ERROR - DagFileProcessorManager (PID=67503) last sent a heartbeat 64.25 seconds ago! Restarting it
Question: How can I fix this?

This fixed the problem:
(1) Use postgresql instead of sqlite.
(2) Switch from SequentialExecutor to LocalExecutor.

Just to add to that - we had other similar reports and we decided to make a very clear warning in such case in the UI (will be released in the next version):
https://github.com/apache/airflow/pull/17133

Related

Batch account node restarted unexpectedly

I am using an Azure batch account to run sqlpackage.exe in order to move databases from a server to another. A task that has started 6 days ago has suddenly been restarted and started from the beginning after 4 days of running (extremely large databases). The task run uninterruptedly up until then and should have continued to run for about 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during the execution. Also, the retry count for the task was set to 0 in case it fails.
Unfortunately, I did not have diagnostics settings configured and I could only look at the metrics and there was a short period when there wasn't any node.
What can be the causes for this behavior? Restarting while the node is still running
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check system logs to give you details on why the node restarted. There is no guarantee that a compute node will have 100% uptime as there may be hardware faults or other service interruptions.
In general, it's best practice to have long running tasks checkpoint progress combined with a retry policy. Programs that can reload state can pick up at the time of the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.

Abort a Datastage job at a specified time

I have a scheduled parallel Datastage (11.7) job.
This job has a Hive Connector with a Before and After Statement.
The before statement run ok but After statement remains in running state for several hours (on Hue Log i see this job finished in 1hour) and i have to manually abort it on Datastage Director.
Is there the way to "program an abort"?
For example i want schedule the interruption of the running job every morning at 6.
I hope I was clear :)
Even though you can kill the job - as per other responses - using dsjob to stop the job, this may have no effect because the After statement has been issued synchronously; the job is waiting for it to finish, and (probably) not processing kill signals and the like in the meantime. You would be better advised to work out why the After command is taking too long, and addressing that.

Datastage: How to keep continuous mode job running after a unexpected termination

I have a job that uses the Kafka Connector Stage in order to read a Kafka queue and then load into the database. That job runs in Continuous Mode, which it has no time to conclude, since it keeps monitoring the Kafka queue in real time.
For unexpected reasons (say, server issues, job issues etc) that job may terminate with failure. In general, that happens after 300 running hours of that job. So, in order to keep the job alive I have to manually look to the job status and then to do a Reset and Run, in order to keep the job running.
The problem is that between the job termination and my manual Reset and Run can pass several hours, which is critical. So I'm looking for a way to eliminate the manual interaction and to reduce that gap by automating the job invocation.
I tried to use Control-M to daily run the job, but with no success: The first day the Control-M called the job, it ran it fine. But in the next day, when the Control-M did an attempt to instantiate the job again it failed (since it was already running). Besides, the Datastage will never tell back Control-M that a job was successfully concluded, since the job's nature won't allow that.
Said that, I would like to hear ideas from you that can light me up.
The first thing that came in mind is to create a intermediate Sequence and then schedule it in Control-M. Then, this new Sequence would call the continuous job asynchronously by using command line stage.
For the case where just this one job terminates unexpectedly and you want it to be restarted as soon as possible, have you considered calling this job from a sequence? The sequence could be setup to loop running this job.
Thus sequence starts job and waits for it to finish. When job finishes, the sequence will then loop and start the job again. You could have added conditions on job exit (for example, if the job aborted, then based on that job end status, you could reset the job before re-running it.
This would not handle the condition where the DataStage engine itself was shut down (such as for maintenance or possibly an error) in which case all jobs end including your new sequence. The same also applies for a server reboot or other situations where someone may have inadvertently stopped your sequence. For those cases (such as DataStage engine stop) your team would need to have process in place for jobs/sequences that need to be started up following a DataStage or System outage.
For the outage scenario, you could create a monitor script (regardless of whether running the job solo or from sequence) that sleeps/loops on 5-10 minute intervals and then checks the status of your job using dsjob command, and if not running can start that job/sequence (also via dsjob command). You can decide whether that script startup would occur at DataSTage startup, machine startup, or run it from Control M or other scheduler.

task executed slowly after celery serving for long time but still work quickly after celery restart

This kind of issues occurs several times
The log shows that when celery serves for a long time, there was a stuck about 50s. I dont know what happened in that "break"
But it works properly after restarting celery and indeed this task only consumes less than 10s
I suppose there was something overloaded in the backend of celery then be released after restart.
But I have no idea what's the real problem and hardly come up with such issue

How do I upgrade concourse from 3.4.0 to 3.5.0 without causing jobs to abort with state error?

When I did the upgrade of concourse from 3.4.0 to 3.5.0, suddenly all running jobs changed their state from running to errored. I can see the string 'no workers' appearing at the start of their log now. Starting the jobs manually or triggered by the next changes didn't have any problem.
The upgrade of concourse itself was successful.
I was watching what bosh did at the time and I saw this change of job states took place all at once while either the web or the db VM was upgraded (I don't know which one). I am pretty sure that the worker VMs were not touched yet by bosh.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM it's possible that it was out of service for long enough that all workers expired. Workers continuously heartbeat and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy is finished but if scheduling happened before they heartbeats, that would cause those errors.