I am using the Celery task queue.
When I start a task while no workers are running, I am unable to revoke it without starting the workers.
When I run revoke(terminate=True, signal='SIGKILL'), the task's state is still reported as PENDING.
Is it possible to revoke a task that was started while no workers were running?
I've tried persistent revokes, but that doesn't seem to work in this specific case.
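For reference, revoke is normally called with a task id; a minimal sketch of the call described above (the app name, broker URL, and task id are example values, not taken from the question):

```python
def revoke_task(app, task_id):
    """Broadcast a revoke for `task_id`; workers that are up (now or after
    syncing with the cluster) will refuse to execute it."""
    app.control.revoke(task_id, terminate=True, signal='SIGKILL')

if __name__ == '__main__':
    from celery import Celery
    app = Celery('proj', broker='amqp://localhost')  # assumed broker URL
    revoke_task(app, 'd9078da5-9915-40a0-bfa1-392c7bde42ed')  # example task id
```

Note that the broadcast only reaches workers that are running (or that later sync with a surviving worker), which is exactly the limitation the question runs into.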
How can I run Celery Beat with a database scheduler across different schemas? I have a multi-tenant application in which each tenant has its own schema. Celery Beat runs the scheduled task only if I save it in the public schema; it does not run if I save the scheduled task in another schema.
Is there any way for Celery Beat to check all available schemas and run accordingly?
I am new to Spring Batch and have a few questions:
I have a question about restarts. As per the documentation, the restart feature is enabled by default. What I am not clear on is whether I need to write any extra code for a restart. If so, I am thinking of adding a scheduled job that looks for failed executions and restarts them.
I understand Spring Batch Admin is deprecated. However, we cannot use Spring Cloud Data Flow right now. Is there any other alternative to monitor and restart jobs on demand?
The restart feature you mention only determines whether a job is restartable or not. It does not mean Spring Batch will restart a failed job for you automatically.
Instead, it provides the following building blocks for developers to achieve this on their own:
JobExplorer to find the id of the job execution that you want to restart
JobOperator to restart a job execution given a job execution id
Also, a restartable job can only be restarted if its status is FAILED. So if you want to restart a running job that stopped because of a server breakdown, you first have to find that running job and update its job execution status, and all of its step execution statuses, to FAILED in order to restart it. (See this for more information.) One solution is to implement a SmartLifecycle that uses the building blocks above to achieve this goal.
I'm running embarrassingly parallel workloads, but the number of parallel tasks is not known beforehand. Instead, my job manager task performs a simple computation to determine the number of parallel tasks and then adds the tasks to the job.
Now, as soon as I know the number of parallel tasks, I would like to resize the pool I'm running in accordingly (I am running the job in an auto-pool). Here is how I try to do this.
When I create the JobManagerTask I supply
...
authentication_token_settings=AuthenticationTokenSettings(
access=[AccessScope.job]),
...
At run time the task receives AZ_BATCH_AUTHENTICATION_TOKEN in its environment, uses it to create a BatchServiceClient, uses the client to add worker tasks to the job, and finally calls client.pool.resize() to increase target_dedicated_nodes. At this point the task gets an error from the service:
.../site-packages/azure/batch/operations/_pool_operations.py", line 1310, in resize
raise models.BatchErrorException(self._deserialize, response)
azure.batch.models._models_py3.BatchErrorException: Request encountered an exception.
Code: PermissionDenied
Message: {'additional_properties': {}, 'lang': 'en-US', 'value': 'Server failed to authorize the request.\nRequestId:4b34d8e5-7c28-4af2-9e1f-9cf88a486511\nTime:2020-11-26T17:32:55.7673310Z'}
AuthenticationErrorDetail: The supplied authentication token does not have permission to call the requested Url.
How can I give the task permission to resize the pool?
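For context, the flow described in the question corresponds roughly to the sketch below. The wrapping of the task token in msrest's BasicTokenAuthentication, the batch_url keyword, the pool id source, and the node count are assumptions, not taken from the question:

```python
import os

def resize_pool(client, pool_id, target_nodes):
    """Request that the Batch service grow `pool_id` to `target_nodes`
    dedicated nodes; this is the call that fails with PermissionDenied."""
    from azure.batch import models  # deferred so the sketch reads standalone
    client.pool.resize(
        pool_id,
        models.PoolResizeParameter(target_dedicated_nodes=target_nodes))

if __name__ == '__main__':
    from azure.batch import BatchServiceClient
    from msrest.authentication import BasicTokenAuthentication
    # The Batch service injects these variables into the job manager task.
    creds = BasicTokenAuthentication(
        {'access_token': os.environ['AZ_BATCH_AUTHENTICATION_TOKEN']})
    client = BatchServiceClient(
        creds, batch_url=os.environ['AZ_BATCH_ACCOUNT_URL'])
    resize_pool(client, os.environ['AZ_BATCH_POOL_ID'], 4)
```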
Currently, AZ_BATCH_AUTHENTICATION_TOKEN is limited to permissions on the job itself. The pool is a separate resource, even in the auto-pool configuration, so it cannot be modified with the token.
There are two main approaches you can take. You can add a certificate to your account and install it on your pool, which lets you authenticate with a service principal that has permissions on your account; or you can set your pool to autoscale based on the number of pending tasks, which doesn't give you immediate resizes but performs them at set intervals as needed.
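A sketch of the second (autoscale) approach, assuming an already-authenticated BatchServiceClient; the cap of 10 nodes and the 5-minute sample window are arbitrary choices for illustration:

```python
# Evaluated by the Batch service on its own schedule (at set intervals,
# roughly every 5-15 minutes), not on demand.
AUTOSCALE_FORMULA = """
pending = max($PendingTasks.GetSample(TimeInterval_Minute * 5));
$TargetDedicatedNodes = min(10, pending);
$NodeDeallocationOption = taskcompletion;
"""

def enable_autoscale(client, pool_id):
    """Switch `pool_id` from fixed size to formula-driven scaling."""
    client.pool.enable_auto_scale(pool_id, auto_scale_formula=AUTOSCALE_FORMULA)
```

`$NodeDeallocationOption = taskcompletion` makes scale-down wait for running tasks to finish, which fits the embarrassingly parallel workload described above.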
I have a Kubernetes cluster running Django, Celery, RabbitMq and Celery Beat. I have several periodic tasks spaced out throughout the day (so as to keep server load down). There are only a few hours when no tasks are running, and I want to limit my rolling-updates to those times, without having to track it manually. So I'm looking for a solution that will allow me to fire off a script or task of some sort that will monitor the Celery server, and trigger a rolling update once there's a window in which no tasks are actively running. There are two possible ways I thought of doing this, but I'm not sure which is best, nor how to implement either one.
Run a script (bash or otherwise) that checks up on the Celery server every few minutes, and initiates the rolling-update if the server is inactive
Increment the Celery app name before each update (in the Beat run command, the Celery run command, and in the celery.py config file), create a new Celery pod, rolling-update the Beat pod, and then delete the old Celery pod 12 hours later (a reasonable time span for all running tasks to finish)
Any thoughts would be greatly appreciated.
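The first option could be sketched as follows; the broker URL, polling interval, deployment name, and the use of kubectl rollout restart (Kubernetes 1.15+) are assumptions:

```python
import subprocess
import time

def workers_idle(inspector):
    """True when no worker reports an active or reserved (prefetched) task."""
    active = inspector.active() or {}
    reserved = inspector.reserved() or {}
    return all(not tasks
               for tasks in list(active.values()) + list(reserved.values()))

if __name__ == '__main__':
    from celery import Celery
    app = Celery(broker='amqp://rabbitmq')  # assumed broker URL
    while not workers_idle(app.control.inspect()):
        time.sleep(120)  # re-check every two minutes
    subprocess.run(
        ['kubectl', 'rollout', 'restart', 'deployment/celery-worker'],
        check=True)
```

Checking reserved tasks as well as active ones avoids updating while a worker holds prefetched messages it is about to execute.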
Here's the use case.
Day 1: Invoke a celery task with countdown 7 days from now
Day 2: Revoke this task
Day 3: Upgrade happens, so all worker processes are down and then come back up again in some time
I have tested a similar scenario and found that all worker processes keep a list of revoked tasks. But the message (corresponding to the task) remains with the worker process to which the task was delegated. So once all worker processes go down, the revoke-list information is lost too.
I want to understand: if that's the case, then after all workers come back up, wouldn't that task be executed without getting cancelled/revoked? I am saying so because the revoke-list information resides (from what I can tell) only in the worker processes, not in the broker.
Can someone please confirm this behavior?
You're correct - Celery workers keep the list of revoked tasks in-memory and if all workers are restarted, the list disappears. Quoting the Celery user guide on workers:
Revoking tasks works by sending a broadcast message to all the workers, the workers then keep a list of revoked tasks in memory. When a worker starts up it will synchronize revoked tasks with other workers in the cluster.
The list of revoked tasks is in-memory so if all workers restart the list of revoked ids will also vanish. If you want to preserve this list between restarts you need to specify a file for these to be stored in by using the --statedb argument to celery worker.
For more information, see the section on Persistent Revokes in the Celery User Guide.
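A minimal sketch of wiring this up in app configuration (the app name, broker URL, and state-file path are examples); the worker_state_db setting has the same effect as passing --statedb on the worker command line:

```python
def configure_persistent_revokes(app, state_path):
    """Persist the worker's revoked-task list to `state_path` so it
    survives worker restarts (same effect as the --statedb flag)."""
    app.conf.worker_state_db = state_path
    return app

if __name__ == '__main__':
    from celery import Celery
    app = Celery('proj', broker='amqp://localhost')  # assumed broker URL
    configure_persistent_revokes(app, '/var/run/celery/worker.state')
```

With this in place, a worker that restarts reloads its revoke list from the state file instead of losing it, which addresses the upgrade scenario in the question.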