Revoke multiple celery tasks - celery

I want to revoke multiple tasks in Celery.
The official docs suggest the approach below.
Is there any limit on this? I will have around 10k to 100k tasks to revoke.
>>> app.control.revoke([
... '7993b0aa-1f0b-4780-9af0-c47c0858b3f2',
... 'f565793e-b041-4b2b-9ca4-dca22762a55d',
... 'd9d35e03-2997-42d0-a13e-64a66b88a618',
... ])

There is no limit. But please note the limitations around task revocation: in particular, this list of task IDs will be kept in memory or on disk, so there may be some memory or disk overhead.
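If you are worried about sending tens of thousands of IDs in a single call, one option is to revoke them in smaller batches. A minimal sketch, assuming `app` is your Celery application and `task_ids` is your list of IDs (both names, the broker URL and the chunk size are placeholders); note that this only bounds the size of each broadcast message, the worker-side limits discussed below still apply:

from celery import Celery

app = Celery('myapp', broker='amqp://guest@localhost//')  # placeholder app/broker

def revoke_in_batches(task_ids, chunk_size=1000):
    # Send several smaller broadcast messages instead of one huge one.
    for i in range(0, len(task_ids), chunk_size):
        app.control.revoke(task_ids[i:i + chunk_size])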

There is an undocumented limit of at most 50000 task IDs flagged as revoked stored in worker memory (at least for Celery 4.3 and 5.2).
This means you cannot track more than 50000 revoked tasks per worker, and running a Celery cluster will not help either, since every worker keeps the same list.
When a revoked task is discarded, its ID is removed from celery.worker.state.revoked and the count of tasks flagged as revoked is decremented by one.
Related Celery source code: celery.worker.state
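A minimal sketch of where to look: the cap comes from constants in celery.worker.state (the names and values below match the 4.x/5.x sources I have seen, but verify against your installed version):

from celery.worker import state

print(state.REVOKES_MAX)     # 50000 -- max revoked task ids kept per worker
print(state.REVOKE_EXPIRES)  # 10800 -- seconds a revoked id is kept once the limit is exceeded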
Experimental verification:
Run a celery worker (optionally with --statedb to persist state to a file):
$ celery worker --statedb=STATE_DB_FILEPATH
Revoke 60000 tasks (with generated IDs) in the celery shell:
$ celery shell
>>> from uuid import uuid4
>>> REVOKE_TASK_NUMBER = 60000
>>> celery.control.revoke([str(uuid4()) for i in range(REVOKE_TASK_NUMBER)])
Validate that the worker stores only 50000 revoked IDs:
$ celery inspect revoked | wc -l
50003
Shut down the worker and validate that the statedb stores only 50000 records (if it was created fresh):
import zlib
import shelve
from kombu.serialization import pickle, pickle_protocol

# Open the worker's state database and count the compressed set of revoked ids.
filename = STATE_DB_FILEPATH  # the path passed to --statedb above
db = shelve.open(filename, protocol=pickle_protocol)
data = pickle.loads(zlib.decompress(db['zrevoked']))
print(len(data))
# 50000

Related

How can I increase the max num of concurrent jobs in Dataproc?

I need to run hundreds of concurrent jobs in a Dataproc cluster. Each job is pretty lightweight (e.g., a Hive query that fetches table metadata) and doesn't take many resources, but there seem to be some unknown factors which limit the maximum number of concurrent jobs. What can I do if I want to increase the max concurrency limit?
If you are submitting the jobs through the Dataproc API / CLI, these are the factors which affect the max number of concurrent jobs:
The number of master nodes;
The master memory size;
The cluster properties dataproc:agent.process.threads.job.max and dataproc:dataproc.scheduler.driver-size-mb; see Dataproc Properties for more details (an example of setting them is sketched after this answer).
For debugging, when submitting jobs with gcloud, SSH into the master node and run ps aux | grep dataproc-launcher.py | wc -l every few seconds to see how many concurrent jobs are running. At the same time, you can run tail -f /var/log/google-dataproc-agent.0.log to monitor how the agent launches the jobs. You can tune the parameters above to get higher concurrency.
You can also try submitting the jobs directly from the master node through spark-submit or Hive beeline, which will bypass the Dataproc job concurrency control mechanism. This can help you identify where the bottleneck is.
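As a sketch, the two properties above are set at cluster creation time; the cluster name, region, machine type and property values below are purely illustrative, only the property names come from the answer above:
$ gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n1-highmem-8 \
    --properties=dataproc:dataproc.scheduler.driver-size-mb=512,dataproc:agent.process.threads.job.max=100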

How to have celery expire results when using a database backend

I'm not sure I understand how result_expires works.
I read,
result_expires
Default: Expire after 1 day.
Time (in seconds, or a timedelta object) for when after stored task tombstones will be deleted.
A built-in periodic task will delete the results after this time (celery.backend_cleanup), assuming that celery beat is enabled. The task runs daily at 4am.
...
When using the database backend, celery beat must be running for the results to be expired.
(from here: http://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-result_expires)
So, in order for this to work, I have to actually do something like this:
python -m celery -A myapp beat -l info --detach
?
Is that what the documentation is referring to by "celery beat is enabled"? Or, rather than executing this manually, is there some configuration that needs to be set which would cause celery beat to be run automatically?
Re: celery beat: you are correct. If you use a database backend, you have to run celery beat as you posted in your original post. By default celery beat sets up a daily task that deletes older results from the results database. If you are using a Redis results backend, you do not have to run celery beat. How you choose to run celery beat is up to you; personally, we do it via systemd.
If you want the expiration time to be something other than the default 1 day, you can use the result_expires setting to specify the number of seconds after a result is recorded that it should be deleted, e.g. 1800 for 30 minutes.
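For example, a minimal configuration sketch (result_expires is the setting quoted above; the backend URL is just a placeholder for whatever database you use):

# celeryconfig.py
result_backend = 'db+sqlite:///results.sqlite'  # placeholder SQLAlchemy database backend
result_expires = 1800  # delete stored results 30 minutes after they are recorded

Remember that with a database backend the built-in celery.backend_cleanup task only runs if celery beat is running, as discussed above.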

What if I schedule tasks for celery to perform every minute and it is not able to complete them in time?

If I schedule a task to run every minute and it cannot be completed within that minute, will it wait in the queue, and will this keep happening? If so, after a few hours the queue will be overloaded. Is there any solution for this kind of problem?
I am using a beat and worker combination for this. It works fine when there are few records to process, but for a large database I think this could cause problems.
Tasks are assigned to a queue (RabbitMQ, for example).
Workers are queue consumers; more workers (or a worker with higher concurrency) means more tasks can be handled in parallel.
Your periodic task produces messages of the same type (I guess) and your Celery router routes them to the same queue.
Just set your workers to consume messages from that queue and that's all.
celery worker -A celeryapp:app -l info -Q default -c 4 -n default_worker#%h -Ofair
In the example above I used -c 4 for a concurrency of four (equivalent to 4 consumers/workers). You can also start more workers and let them consume from the same queue with -Q <queue_name> (in my example it's the default queue).
EDIT:
When using Celery (the worker code) you instantiate a Celery object. In the Celery constructor you set your broker and backend (Celery uses them as parts of the system).
For more info: http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#application
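A minimal sketch of that constructor; the module name, broker URL, backend URL and task name are placeholders for whatever your deployment uses:

from celery import Celery

app = Celery(
    'celeryapp',
    broker='amqp://guest@localhost//',  # RabbitMQ broker (placeholder URL)
    backend='rpc://',                   # result backend (placeholder)
)

@app.task
def process_records():
    # the work your periodic task schedules goes here
    ...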

How to ensure a revoked Celery task never runs after all worker processes go down and come back

Here's the use case.
Day 1: Invoke a celery task with a countdown of 7 days from now
Day 2: Revoke this task
Day 3: An upgrade happens, so all worker processes go down and then come back up again after some time
I have tested a similar scenario and figured out that every worker process keeps a list of revoked tasks. But the message (corresponding to the task) remains with the worker process to which the task was delegated, so once all worker processes go down, the revoke list information is lost too.
I want to understand: if that's the case, then after all workers come back up, wouldn't the task start executing without getting cancelled/revoked? I am saying so because the revoke list information resides (from what I can tell) only in the worker processes, and not in the broker.
Can someone please confirm this behavior?
You're correct: Celery workers keep the list of revoked tasks in memory, and if all workers are restarted, the list disappears. Quoting the Celery user guide on workers:
Revoking tasks works by sending a broadcast message to all the workers, the workers then keep a list of revoked tasks in memory. When a worker starts up it will synchronize revoked tasks with other workers in the cluster.
The list of revoked tasks is in-memory so if all workers restart the list of revoked ids will also vanish. If you want to preserve this list between restarts you need to specify a file for these to be stored in by using the --statedb argument to celery worker:
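For example (the app module and the statedb path here are just illustrative, same pattern as the --statedb command shown near the top of this page):
$ celery -A proj worker -l INFO --statedb=/var/shared/worker.state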
For more information, see the section on Persistent Revokes in the Celery User Guide.

In celery, what would be the purpose of having multiple workers process the same queue?

In the documentation for celeryd-multi, we find this example:
# Advanced example starting 10 workers in the background:
# * Three of the workers process the images and video queue
# * Two of the workers process the data queue with loglevel DEBUG
# * The rest process the 'default' queue.
$ celeryd-multi start 10 -l INFO -Q:1-3 images,video -Q:4,5 data
-Q default -L:4,5 DEBUG
( From here: http://docs.celeryproject.org/en/latest/reference/celery.bin.celeryd_multi.html#examples )
What would be a practical example of why it would be good to have more than one worker on a single host process the same queue, as in the above example? Isn't that what setting the concurrency is for?
More specifically, would there be any practical difference between the following two lines (A and B)?:
A:
$ celeryd-multi start 10 -c 2 -Q data
B:
$ celeryd-multi start 1 -c 20 -Q data
I am concerned that I am missing some valuable bit of knowledge about task queues by my not understanding this practical difference, and I would greatly appreciate if somebody could enlighten me.
Thanks!
What would be a practical example of why it would be good to have more than one worker on a single host process the same queue, as in the above example?
Answer:
So, you may want to run multiple worker instances on the same machine node if:
You're using the multiprocessing pool and want to consume messages in parallel. Some report better performance using multiple worker instances instead of running a single instance with many pool workers.
You're using the eventlet/gevent pool (and, due to the infamous GIL, also the 'threads' pool), and you want to execute tasks on multiple CPU cores.
Reference: http://www.quora.com/Celery-distributed-task-queue/What-is-the-difference-between-workers-and-processes