Resume Celery job.apply_async.join()

TL;DR: If my producer process crashes after sending some work to the consumers, how can it resume waiting for the consumers to complete their work once it restarts?
producer.py dispatches work items to a group of consumers (registered Celery tasks), like so:
from celery import group, signature
job = group(
    signature(task_name, args=(x,)) for x in xrange(100)
)
group_result = job.apply_async()
group_result.join() # blocks until tasks complete
The consumers take a long time to complete, so it's possible/expected that the producer will sometimes die during the call to join(). When the producer dies, it is restarted.
When the producer restarts, is there a way to resume the join?
I'm a Celery newbie; have combed through docs and examples but haven't found an answer to this. Hoping the experts can help point me in the right direction.

For the record, my solution was to store the task ids (in some external file/db/whatever).
job = group(
    signature(task_name, args=(x,)) for x in xrange(100)
)
group_result = job.apply_async()
# store group_result.results somewhere durable
persist(some_db, group_result.results)
group_result.join() # blocks until tasks complete
Now, if the join above fails, we can recover like this:
# read tasks from persistent storage
previous_pending_results = read_previously_persisted_results(some_db)
for result in previous_pending_results:
    if result.status != 'SUCCESS':
        ...
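A minimal sketch of that recovery path, assuming the task ids (rather than the result objects themselves) were persisted; read_previously_persisted_task_ids, some_db and the app import are placeholders for your own storage and Celery app:
from celery.result import AsyncResult
from tasks import app  # the Celery app the tasks are registered with (assumed module name)

# read back the ids written by persist() above (placeholder helper)
task_ids = read_previously_persisted_task_ids(some_db)

# rebuild result handles from the stored ids and wait for any unfinished tasks
pending = [AsyncResult(task_id, app=app) for task_id in task_ids]
for result in pending:
    if result.status != 'SUCCESS':
        result.get()  # blocks until this task completes (raises if the task failed)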


How to use Queue in concurrent.futures.ProcessPoolExecutor()?

Disclaimer: I'm new to Python in general. I have some experience with Go, where implementing a queue using a channel is really easy.
I want to know how I can implement a Queue with ProcessPoolExecutor in Python 3.
I want my N processes to access a single queue so that I can insert many jobs into the queue from the main thread, and the processes will just grab the jobs from the queue.
Or is there a better way to share a list/queue between multiple processes? (A job queue / worker pool, maybe?)
Thanks.
concurrent.futures does this for you. The executor object implements a queue internally, so when you submit tasks, they get put into the queue and your worker threads or worker processes pick jobs up and run them.
It may feel as though it's "too easy", but that's what concurrent.futures is all about - abstracting away all the complexity of managing threadpools or processpools, job queues, etc. so you can trade a little overhead for a lot of saved time and effort.
Here's what it looks like:
import concurrent.futures

def send_email(from_addr, to, subject, message):
    # magic to send an email
    pass

executor = concurrent.futures.ProcessPoolExecutor()
future = executor.submit(send_email, 'me@example.com', 'you@example.com', 'Hi!', 'Nice to meet you')
That one simple submit call takes your function and its arguments, wraps them into a work item, and puts them into a queue; the process pool that was initialised when you created your executor picks the jobs off and runs them.
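If you later need the return value (or want to surface an exception raised in the worker), ask the future for it; a small sketch of standard concurrent.futures usage:
# blocks until send_email has finished in the worker process,
# re-raising any exception the function threw
result = future.result()

# when you are done submitting work, shut the pool down cleanly
executor.shutdown(wait=True)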

Distributed queue consumers in an unstable net

I'm working on the design of a distributed system. The system consists of multiple producers, distributed queue and multiple consumers aka workers.
Worker instances reside in datacentres in different locations. Sometimes one location is manually disconnected.
In such a case, the issue is that a worker in the disconnected location may have taken a task from the queue and then shut down before completing it. I want:
workers in a live location to be able to pick up such a task and complete it
when a disconnected worker finally comes back up, it should determine whether the task was already completed by another worker and decide what to do with it
What is a convenient way to solve such an issue?
This design might help you. Every time a worker consumes a task, move the task from the queue to some other distributed list of consumed tasks. In this list, maintain a timestamp with every task.
The worker that consumed the task should then send some kind of still-alive message every second or so (similar to Hadoop's heartbeat message) that updates the task's timestamp in the consumed-tasks list. This indicates that the worker that consumed the task is still alive and was heard from recently.
Now, implement a daemon to monitor this consumed-tasks list and move back to the queue any task whose timestamp is older than a threshold number of seconds (allowing for message losses).
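A minimal sketch of that heartbeat/requeue idea, assuming Redis as the shared store (a sorted set keyed by task id, with the last heartbeat time as the score); the key names, threshold and queue layout are illustrative, not part of any existing system:
import time
import redis

r = redis.Redis()  # assumed connection settings

CONSUMED = 'consumed_tasks'   # sorted set: task_id -> last heartbeat timestamp
QUEUE = 'task_queue'          # plain list acting as the work queue
STALE_AFTER = 30              # seconds without a heartbeat before requeueing

def heartbeat(task_id):
    # called by the worker every second or so while it is processing task_id
    r.zadd(CONSUMED, {task_id: time.time()})

def requeue_stale_tasks():
    # daemon loop: move tasks whose heartbeat is too old back onto the queue
    while True:
        cutoff = time.time() - STALE_AFTER
        for task_id in r.zrangebyscore(CONSUMED, 0, cutoff):
            r.zrem(CONSUMED, task_id)
            r.rpush(QUEUE, task_id)
        time.sleep(1)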

Celery chain's place of passing arguments

1) Celery chain.
In the docs I read this:
Here’s a simple chain, the first task executes passing its return value to the next task in the chain, and so on.
>>> from celery import chain
>>> # 2 + 2 + 4 + 8
>>> res = chain(add.s(2, 2), add.s(4), add.s(8))()
>>> res.get()
16
But where exactly is a chain item's result passed to the next item in the chain? On the Celery server side, or is it passed back to my app, which then passes it to the next chain item?
This matters to me because my results are quite big to be passing back to the app, and I want all of this messaging to stay inside the Celery server.
2) Celery group.
>>> g = group(add.s(i) for i in xrange(10))
>>> g(10).get()
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Can I be sure that these tasks will be executed as close together as possible? Will Celery give priority to a group once its first task has started executing?
For example, I have 100 requests and each request runs a group of tasks, and I don't want tasks from different groups to be mixed with each other. The first request to start being processed could be the last to complete, because its last tasks are waiting for free workers that are busy with tasks from other requests. It seems better if the tasks of a group are executed as close together as possible.
I will really appreciate if you can help me.
1. Celery Chain
Results are passed on the Celery side using a message-passing broker such as RabbitMQ. Results are stored via the result backend (explicitly required for chord execution). You can verify this by running your Celery worker with log level 'INFO' and watching how the tasks are invoked.
Celery maintains a dependency graph once you invoke tasks, so it knows exactly how to chain your tasks.
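As a minimal sketch (assuming a local RabbitMQ broker and a Redis result backend; swap in whatever you actually run), both are configured when the app is created:
from celery import Celery

# the broker carries the task messages; the backend stores the results that a chain passes along
app = Celery(
    'tasks',
    broker='amqp://guest@localhost//',
    backend='redis://localhost:6379/0',
)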
Consider callbacks, where you link two different tasks:
http://docs.celeryproject.org/en/latest/userguide/canvas.html#callbacks
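A small illustration of such a callback link, reusing the add task from the docs; the result of the first task is prepended to the arguments of the linked signature:
# add(2, 2) runs first; its result (4) becomes the first argument
# of the linked signature, so the callback runs add(4, 16)
add.apply_async((2, 2), link=add.s(16))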
2. Celery Group
When you call tasks in a group, Celery invokes them in parallel. The Celery worker will pick them up depending on how much load it can handle. If you invoke more tasks than your worker can handle, it is certainly possible that your first few tasks will be executed first and the worker will pick up the rest gradually.
If you have a very large number of tasks to be invoked in parallel, it is better to invoke them in chunks of a certain pool size, for example as sketched below.
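A minimal sketch using Celery's chunks primitive with the add task from above; this splits 100 (i, i) argument pairs into 10 worker invocations of 10 calls each:
# each of the 10 chunk tasks processes 10 of the 100 additions
res = add.chunks(zip(range(100), range(100)), 10)()
print(res.get())  # a list of 10 lists of 10 results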
You can also set the priority of tasks, as mentioned in this answer.
Completion of the tasks in a group depends on how much time each task takes. Celery tries to schedule tasks as fairly as possible.

Celery Group task for use in a map/reduce workflow

Can I use a Celery Group primitive as the umbrella task in a map/reduce workflow?
Or more specific: Can the subtasks in a Group be run on multiple workers on multiple servers?
From the docs:
However, if you call apply_async on the group it will send a special grouping task, so that the action of calling the tasks happens in a worker instead of the current process
That seems to imply the tasks are all sent to one worker...
Before 3.0 (and still), one could fire off the subtasks in a TaskSet, which would run on multiple servers. The problem is determining whether all the tasks have finished executing. That is normally done by polling all subtasks, which is not really elegant.
I am wondering if the Group primitive can be used to mitigate this problem.
I found out it is possible to use chords for such a map/reduce-like problem.
@celery.task(name='ic.mapper')
def mapper():
    # split your problem into embarrassingly parallel maps
    maps = [map.s(), map.s(), map.s(), map.s(), map.s(), map.s(), map.s(), map.s()]
    # and put them in a chord that executes them in parallel and calls 'reduce' after they finish
    mapreduce = celery.chord(maps)(reduce.s())
    return "{0} mapper ran on {1}".format(celery.current_task.request.id, celery.current_task.request.hostname)

@celery.task(name='ic.map')
def map():
    # do something useful here
    import time
    time.sleep(10.0)
    return "{0} map ran on {1}".format(celery.current_task.request.id, celery.current_task.request.hostname)

@celery.task(name='ic.reduce')
def reduce(results):
    # put the maps together and do something with the results
    return "{0} reduce ran on {1}".format(celery.current_task.request.id, celery.current_task.request.hostname)
When the mapper is executed on a cluster of three workers/servers, it first splits your problem and creates new subtasks that are again submitted to the broker. These run in parallel because the queue is consumed by all workers. A chord task is also created that polls all maps to see whether they have finished. When they are done, the reduce task is executed, where you can glue your results back together.
In all: yes, it is possible. Thanks for the vegetable, guys!

End Celery worker task on time limit, job stage, or instruction from client

I'm new to celery and I would appreciate a little help with a design pattern(or example code) for a worker I have yet to write.
Below is a description of the desired characteristics of the worker.
The worker will run a task that collects data from an endless source, a generator.
The worker task will run forever feeding from the generator unless it is directed to stop.
The worker task should stop gracefully on the occurrence of any one of the following triggers.
It exceeds an execution time limit in seconds.
It exceeds a number of iterations of the endless generator loop.
The client sends a message instructing the worker task to finish immediately.
Below is some pseudocode for how I believe I need to handle trigger scenarios 1 and 2.
What I don't know is how I send the 'finish immediately' signal from the client and how it is received and executed in the worker task.
Any advice or sample code would be appreciated.
from celery.task import task
from celery.exceptions import SoftTimeLimitExceeded

COUNTLIMIT = ...  # some value sent to the worker task by the client

@task()
def getData():
    try:
        for count, data in enumerate(endlessGeneratorThing()):
            # process data here
            if count > COUNTLIMIT:  # Handle trigger scenario 2
                clean_up_task_nicely()
                break
    except SoftTimeLimitExceeded:  # Handle trigger scenario 1
        clean_up_task_nicely()
My understanding of revoke is that it only revokes a task prior to its execution. For (3), I think what you want to do is use an AbortableTask, which provides a cooperative way to end a task:
http://docs.celeryproject.org/en/latest/reference/celery.contrib.abortable.html
On the client end you can call abort() on the task's result; on the task end you can poll is_aborted().
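A minimal sketch based on the celery.contrib.abortable docs, adapted to the loop above; endlessGeneratorThing and clean_up_task_nicely are the placeholders from the question, and the exact decorator spelling depends on your Celery version:
from celery.task import task
from celery.contrib.abortable import AbortableTask

@task(bind=True, base=AbortableTask)
def getData(self):
    for count, data in enumerate(endlessGeneratorThing()):
        # process data here
        if self.is_aborted():  # Handle trigger scenario 3
            clean_up_task_nicely()
            return

# client side: keep the result returned by delay()/apply_async()
result = getData.delay()
# ... later, when the client wants the task to stop:
result.abort()  # marks the task ABORTED in the result backend; the task sees it on its next is_aborted() check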