How to safely restart Airflow and kill a long-running task? - celery

I have Airflow running in Kubernetes using the CeleryExecutor. Airflow submits and monitors Spark jobs using the DatabricksOperator.
My streaming Spark jobs have a very long runtime (they run forever unless they fail or are cancelled). When the Airflow worker pods are killed while a streaming job is running, the following happens:
The associated task becomes a zombie (running state, but no process with a heartbeat)
The task is marked as failed when Airflow reaps zombies
The Spark streaming job continues to run
How can I force the worker to kill my Spark job before it shuts down?
I've tried killing the Celery worker with a TERM signal, but apparently that causes Celery to stop accepting new tasks and wait for current tasks to finish (docs).

You need to be clearer about the issue. If you are saying that the Spark cluster finishes the jobs as expected without calling the on_kill function, that is expected behavior. As per the docs, the on_kill function is for cleaning up after the task gets killed.
def on_kill(self) -> None:
    """
    Override this method to cleanup subprocesses when a task instance
    gets killed. Any use of the threading, subprocess or multiprocessing
    module within an operator needs to be cleaned up or it will leave
    ghost processes behind.
    """
In your case, when you manually kill the job, it is doing what it is supposed to do.
Now, if you want a clean-up step even after successful completion of the job, override the post_execute function.
As per the docs, post_execute is:
def post_execute(self, context: Any, result: Any = None):
    """
    This hook is triggered right after self.execute() is called.
    It is passed the execution context and any results returned by the
    operator.
    """

Related

ADF Scheduling when existing Job not yet finished

Having read https://learn.microsoft.com/en-us/azure/data-factory/v1/data-factory-scheduling-and-execution, it is unclear to me if:
If a schedule is made every hour for a job to run,
can we stop the concurrent execution of the next job at hour+1 if the job for hour+0 is still running?
It looks as if concurrency = 1 means this,
but will that invocation simply not start until the current execution is finished?
Or will it be discarded?
When we set concurrency to 1, only one instance is allowed to run at a time. When the scheduled trigger fires again and tries to run the pipeline while it is already running, the next invocation is queued and starts after the current instance finishes.
So for your question: the following invocation will be queued, and after the first run finishes, the next run will start.
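If you are on ADF v2 (where concurrency is a pipeline-level property), the setting sits in the pipeline JSON, roughly like this sketch (the pipeline name is a placeholder and the activities are omitted):

{
    "name": "HourlyPipeline",
    "properties": {
        "concurrency": 1,
        "activities": []
    }
}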

How to prevent celery.backend_cleanup from executing in default queue

I am using python + flask + SQS and I'm also using celery beat to execute some scheduled tasks.
Recently I went from having one single default "celery" queue to execute all my tasks to having dedicated queues/workers for each task. This includes tasks scheduled by celery beat which now all go to a queue named "scheduler".
Before dropping the "celery" queue, I monitored it to see if any tasks would wind up in that queue. To my surprise, they did.
Since I had no worker consuming from that queue, I could easily inspect the messages which piled up using the AWS console. What I saw was that all of these tasks were celery.backend_cleanup!
I cannot find in the celery docs how to prevent celery.backend_cleanup from getting tossed into this default "celery" queue, which I want to get rid of. The docs on beat also do not show an option to pass a queue name. So how do I do this?
This is how I am starting celery beat:
/venv/bin/celery -A backend.app.celery beat -l info --pidfile=
And this is how I am starting the worker
/venv/bin/celery -A backend.app.celery worker -l info -c 2 -Ofair -Q scheduler
Keep in mind, I don't want to stop backend_cleanup from executing, I just want it to go in whatever queue I specify.
Thanks ahead for the assistance!
You can override this in the beat schedule setup. You could also change the scheduled run time here if you wanted to.
from celery.schedules import crontab

app.conf.beat_schedule = {
    # use the built-in entry's name so this definition replaces the default one
    'celery.backend_cleanup': {
        'task': 'celery.backend_cleanup',
        # the built-in entry runs daily at 4:00; keep or adjust as needed
        'schedule': crontab(hour=4, minute=0),
        'options': {'queue': 'scheduler',
                    'exchange': 'scheduler',
                    'routing_key': 'scheduler'},
    },
}
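Alternatively (a sketch, not part of the original answer), task routing sends the task to your queue no matter where the beat entry is defined; the queue name below is the "scheduler" queue from the question:

# Route celery.backend_cleanup to the dedicated scheduler queue.
app.conf.task_routes = {
    'celery.backend_cleanup': {'queue': 'scheduler'},
}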

Datastage: How to keep a continuous mode job running after an unexpected termination

I have a job that uses the Kafka Connector Stage in order to read a Kafka queue and then load into the database. That job runs in Continuous Mode, so it never concludes on its own, since it keeps monitoring the Kafka queue in real time.
For unexpected reasons (say, server issues, job issues, etc.) that job may terminate with failure. In general, that happens after 300 running hours. So, in order to keep the job alive, I have to manually check the job status and then do a Reset and Run to get it running again.
The problem is that several hours can pass between the job termination and my manual Reset and Run, which is critical. So I'm looking for a way to eliminate the manual interaction and reduce that gap by automating the job invocation.
I tried to use Control-M to run the job daily, but with no success: the first day, Control-M called the job and it ran fine. But the next day, when Control-M attempted to instantiate the job again, it failed (since the job was already running). Besides, DataStage will never tell Control-M that the job concluded successfully, since the job's nature won't allow that.
That said, I would like to hear any ideas that could point me in the right direction.
The first thing that came to mind is to create an intermediate Sequence and schedule it in Control-M. This new Sequence would then call the continuous job asynchronously by using a command line stage.
For the case where just this one job terminates unexpectedly and you want it restarted as soon as possible, have you considered calling this job from a sequence? The sequence could be set up to loop running this job.
The sequence starts the job and waits for it to finish. When the job finishes, the sequence loops and starts the job again. You could also add conditions on job exit (for example, if the job aborted, then based on that end status you could reset the job before re-running it).
This would not handle the condition where the DataStage engine itself was shut down (such as for maintenance or possibly an error), in which case all jobs end, including your new sequence. The same applies to a server reboot or other situations where someone may have inadvertently stopped your sequence. For those cases (such as a DataStage engine stop) your team would need to have a process in place for jobs/sequences that need to be started up following a DataStage or system outage.
For the outage scenario, you could create a monitor script (regardless of whether you run the job solo or from a sequence) that sleeps/loops on 5-10 minute intervals and checks the status of your job using the dsjob command, and if it is not running, starts that job/sequence (also via dsjob). You can decide whether that script starts up at DataStage startup, machine startup, or is run from Control-M or another scheduler.
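A minimal sketch of such a monitor, assuming placeholder project/job names and a simple parse of the dsjob output (verify the dsjob options against your DataStage version):

import subprocess
import time

PROJECT = "MyProject"            # placeholder project name
JOB = "kafka_continuous_job"     # placeholder job or sequence name

def job_is_running() -> bool:
    # dsjob -jobinfo prints a "Job Status" line; treat RUNNING as alive
    out = subprocess.run(["dsjob", "-jobinfo", PROJECT, JOB],
                         capture_output=True, text=True).stdout
    return "RUNNING" in out.upper()

while True:
    if not job_is_running():
        # reset first so an aborted job can be restarted, then run it again
        subprocess.run(["dsjob", "-run", "-mode", "RESET", "-wait", PROJECT, JOB])
        subprocess.run(["dsjob", "-run", PROJECT, JOB])
    time.sleep(600)  # check every 10 minutes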

Airflow: what do `airflow webserver`, `airflow scheduler` and `airflow worker` exactly do?

I've been working with Airflow for a while now, which was set up by a colleague. Lately I have run into several errors, which require me to know in more depth how to fix certain things within Airflow.
I do understand what the 3 processes are, I just don't understand the underlying things that happen when I run them. What exactly happens when I run one of the commands? Can I somewhere see afterwards that they are running? And if I run one of these commands, does this overwrite older webservers/schedulers/workers or add a new one?
Moreover, if I for example run airflow webserver, the screen shows some of the things that are happening. Can I simply get out of this by pressing CTRL + C? Because when I do this, it says things like Worker exiting and Shutting down: Master. Does this mean I'm shutting everything down? How else should I get out of the webserver screen then?
Each process does what they are built to do while they are running (webserver provides a UI, scheduler determines when things need to be run, and workers actually run the tasks).
I think your confusion is that you may be seeing them as commands that tell some sort of "Airflow service" to do something, but they are each standalone commands that start the processes to do stuff. I.e., starting from nothing, you run airflow scheduler: now you have a scheduler running. Run airflow webserver: now you have a webserver running. When you run airflow webserver, it starts a Python Flask app, and while that process is running, the webserver is running; if you kill the command, it goes down.
All three have to be running for Airflow as a whole to work (assuming you are using an executor that needs workers). You should only ever have one scheduler running, but if you were to run two airflow webserver processes (ignoring port conflicts), you would then have two separate HTTP servers running against the same metadata database. Workers are a little different, in that you may want multiple worker processes running so you can execute more tasks concurrently. So if you create multiple airflow worker processes, you'll end up with multiple processes taking jobs from the queue, executing them, and updating the task instance with the status of the task.
When you run any of these commands you'll see the stdout and stderr output in the console. If you are running them as a daemon or background process, you can check what processes are running on the server.
If you ctrl+c you are sending a signal to kill the process. Ideally for a production Airflow cluster, you should have some supervisor monitoring the processes and ensuring they are always running. Locally you can run the commands in the foreground of separate shells, minimize them, and just keep them running while you need them, or run them as background daemons with the -D argument, e.g. airflow webserver -D.
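For example, a minimal local setup with the daemon flag might look like this (older Airflow CLI syntax, matching the commands above; newer releases use airflow celery worker):

airflow webserver -D
airflow scheduler -D
airflow worker -D
ps aux | grep airflow    # confirm the three processes are running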

Quartz scheduler shutdown(true) wait for all threads started from running Jobs to stop?

If I have a job and from that job I create some threads, what happens when I call scheduler.shutdown(true)?
Will the scheduler wait for all of my threads to finish or not?
Quartz 1.8.1 API docs:
Halts the Scheduler's firing of Triggers, and cleans up all resources associated with the Scheduler.
Parameters:
waitForJobsToComplete - if true the scheduler will not allow this method to return until all currently executing jobs have completed.
Quartz neither knows nor cares about any threads spawned by your job; it will simply wait for the job to complete. If your job spawns new threads and then exits, then as far as Quartz is concerned, it's finished.
If your job needs to wait for its spawned threads to complete, then you need to use something like an ExecutorService (see the javadoc for java.util.concurrent), which will allow the job thread to wait for its spawned threads to complete. If you're using raw Java threads, then use Thread.join().
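A minimal sketch of that idea (the pool size, timeout and doWork() placeholder are assumptions, not something from the answer):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

public class WaitingJob implements Job {
    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(this::doWork);
        }
        pool.shutdown();
        try {
            // Do not return until the spawned work finishes, so that
            // scheduler.shutdown(true) really waits for all of it.
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void doWork() {
        // placeholder for the work each spawned thread performs
    }
}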