How to schedule tasks in celery beat with canvas functionality - celery

Celery tasks have the limitation that they cannot call subprocesses. Callbacks and other canvas functionality are handled by how the tasks are called and related through the various canvas functions.
But when you schedule tasks through celery beat, it appears that the only option is to call a single task without any of the canvas functions.
I need to schedule a task that has callbacks. Is there any way to do that in Celery?
It's possible I could accomplish what I need in a single task, but that would involve significant memory overhead and take a long time to execute with many DB hits. Is that a good idea?

You can just generate the signature for whatever you need and pass it in the options dictionary. Here is a working tasks.py for a simple chain:
from datetime import timedelta
from celery import Celery, signature

app = Celery('tasks', broker='redis://localhost')

app.conf.update(
    beat_schedule={
        'add_divide': {
            'task': 'tasks.add',
            'args': (5, 7),
            'schedule': timedelta(seconds=10),
            'options': {
                'queue': 'testq',
                'link': signature('tasks.divide',
                                  args=(4, ),
                                  queue='testq'),
            },
        },
    },
)

@app.task
def add(x, y):
    z = x + y
    print('sum: {0}'.format(z))
    return z

@app.task
def divide(x, y):
    z = x / y
    print('divide: {0}'.format(z))
    return z
Running celery worker -A tasks -Q testq --beat -c1 will output something like:
[2017-04-07 00:00:00,000: WARNING/PoolWorker-2] sum: 12
[2017-04-07 00:00:00,050: WARNING/PoolWorker-2] divide: 3
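An alternative sketch (my own variation, not part of the answer above): keep the beat entry a plain task and let that task launch the canvas workflow itself. This assumes the same Redis broker and module name tasks as above:

from celery import Celery, chain

app = Celery('tasks', broker='redis://localhost')

app.conf.beat_schedule = {
    'add-then-divide': {
        'task': 'tasks.kickoff',
        'schedule': 10.0,
    },
}

@app.task
def kickoff():
    # Build and launch the workflow from inside an ordinary scheduled task;
    # add's result (12) is passed as the first argument to divide.
    chain(add.s(5, 7), divide.s(4)).apply_async()

@app.task
def add(x, y):
    return x + y

@app.task
def divide(x, y):
    return x / y

This trades a little indirection for keeping the beat schedule free of signature objects.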

Related

How to pass the number of total users to simulate and spawn rate in the locust script itself

I currently pass the number of total users to simulate and the spawn rate in the Web UI when I run the locust file. Instead, I would like to set them as variables in the script itself.
import time

from locust import HttpUser, task, between

class QuickstartUser(HttpUser):
    wait_time = between(1, 2.5)
    users = 10
    spawn_rate = 1

    @task
    def on_start(self):
        filenumber = "ABC"
        # Get file info
        response = self.client.get(f"/files/" + filenumber)
        json_var = response.json()
        print("response Json: ", json_var)
        time.sleep(1)
You could probably do it by accessing the Runner in code, but it would be much easier if you used a Load Shape.
from locust import LoadTestShape

class MyCustomShape(LoadTestShape):
    time_limit = 600
    spawn_rate = 20

    def tick(self):
        run_time = self.get_run_time()

        if run_time < self.time_limit:
            # User count rounded to the nearest hundred.
            user_count = round(run_time, -2)
            return (user_count, self.spawn_rate)

        return None
tick is called automatically; you just have to return a tuple of the user count and spawn rate you want. You can do whatever work you need to calculate what the user count and rate should be. There are more examples in the GitHub repo.
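If all you actually want is the fixed values from the question (10 users, spawn rate 1), a minimal sketch could look like the one below; the class name FixedUserShape and the ten-minute time limit are assumptions for illustration. Locust picks up a LoadTestShape subclass automatically when it is defined in the locustfile:

from locust import LoadTestShape

class FixedUserShape(LoadTestShape):
    # Fixed values taken from the question.
    users = 10
    spawn_rate = 1
    time_limit = 600  # assumed: stop the run after ten minutes

    def tick(self):
        if self.get_run_time() < self.time_limit:
            return (self.users, self.spawn_rate)
        return None  # returning None ends the test run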

Twitter's Future.collect not working concurrently (Scala)

Coming from a Node.js background, I am new to Scala, and I tried using Twitter's Future.collect to perform some simple concurrent operations, but my code shows sequential behavior rather than concurrent behavior. What am I doing wrong?
Here's my code:
import com.twitter.util.Future

def waitForSeconds(seconds: Int, container: String): Future[String] = Future[String] {
  Thread.sleep(seconds * 1000)
  println(container + ": done waiting for " + seconds + " seconds")
  container + " :done waiting for " + seconds + " seconds"
}

def mainFunction: String = {
  val allTasks = Future.collect(Seq(waitForSeconds(1, "All"), waitForSeconds(3, "All"), waitForSeconds(2, "All")))
  val singleTask = waitForSeconds(1, "Single")

  allTasks onSuccess { res =>
    println("All tasks succeeded with result " + res)
  }

  singleTask onSuccess { res =>
    println("Single task succeeded with result " + res)
  }

  "Function Complete"
}

println(mainFunction)
println(mainFunction)
and this is the output I get:
All: done waiting for 1 seconds
All: done waiting for 3 seconds
All: done waiting for 2 seconds
Single: done waiting for 1 seconds
All tasks succeeded with result ArraySeq(All :done waiting for 1 seconds, All :done waiting for 3 seconds, All :done waiting for 2 seconds)
Single task succeeded with result Single :done waiting for 1 seconds
Function Complete
The output I expect is:
All: done waiting for 1 seconds
Single: done waiting for 1 seconds
All: done waiting for 2 seconds
All: done waiting for 3 seconds
All tasks succeeded with result ArraySeq(All :done waiting for 1 seconds, All :done waiting for 3 seconds, All :done waiting for 2 seconds)
Single task succeeded with result Single :done waiting for 1 seconds
Function Complete
Twitter's futures are more explicit about where computations are executed than the Scala standard library futures. In particular, Future.apply will capture exceptions safely (like s.c.Future), but it doesn't say anything about which thread the computation will run in. In your case the computations are running in the main thread, which is why you're seeing the results you're seeing.
This approach has several advantages over the standard library's future API. For one thing it keeps method signatures simpler, since there's not an implicit ExecutionContext that has to be passed around everywhere. More importantly it makes it easier to avoid context switches (here's a classic explanation by Brian Degenhardt). In this respect Twitter's Future is more like Scalaz's Task, and has essentially the same performance benefits (described for example in this blog post).
The downside of being more explicit about where computations run is that you have to be more explicit about where computations run. In your case you could write something like this:
import com.twitter.util.{ Future, FuturePool }

val pool = FuturePool.unboundedPool

def waitForSeconds(seconds: Int, container: String): Future[String] = pool {
  Thread.sleep(seconds * 1000)
  println(container + ": done waiting for " + seconds + " seconds")
  container + " :done waiting for " + seconds + " seconds"
}
This won't produce exactly the output you're asking for ("Function complete" will be printed first, and allTasks and singleTask aren't sequenced with respect to each other), but it will run the tasks in parallel on separate threads.
(As a footnote: the FuturePool.unboundedPool in my example above is an easy way to create a future pool for a demo, and is often just fine, but it isn't appropriate for CPU-intensive computations—see the FuturePool API docs for other ways to create a future pool that will use an ExecutorService that you provide and can manage yourself.)

Can I use celery for every-second task?

I'm running a task every second, and it seems celery doesn't actually perform the task every second.
I guess celery might be a good scheduler for a once-a-minute task, but might not be adequate for an every-second task.
Here's the picture which illustrates what I mean.
I'm using the following options
'schedule': 1.0,
'args': [],
'options': {
    'expires': 3
}
And I'm using celery 4.0.0
Yes, Celery actually handles intervals as low as 1 second, and possibly lower since it takes a float. See this entry on periodic tasks in the docs http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html:
from celery import Celery
from celery.schedules import crontab

app = Celery()

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # Calls test('hello') every 10 seconds.
    sender.add_periodic_task(10.0, test.s('hello'), name='add every 10')

    # Calls test('world') every 30 seconds
    sender.add_periodic_task(30.0, test.s('world'), expires=10)

    # Executes every Monday morning at 7:30 a.m.
    sender.add_periodic_task(
        crontab(hour=7, minute=30, day_of_week=1),
        test.s('Happy Mondays!'),
    )

@app.task
def test(arg):
    print(arg)
A better-written example can be found about a third of the way down https://github.com/celery/celery/issues/3589:
# file: tasks.py
from celery import Celery

celery = Celery('tasks', broker='pyamqp://guest@localhost//')

@celery.task
def add(x, y):
    return x + y

@celery.on_after_configure.connect
def add_periodic(**kwargs):
    celery.add_periodic_task(10.0, add.s(2, 3), name='add every 10')
So sender is the Celery app instance itself, i.e. app = Celery().
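Applied to the every-second schedule in the question, a minimal sketch (assuming a Redis broker at redis://localhost and a module named tasks) could look like this:

from celery import Celery

app = Celery('tasks', broker='redis://localhost')

app.conf.beat_schedule = {
    'tick-every-second': {
        'task': 'tasks.tick',
        'schedule': 1.0,            # float seconds, as in the question
        'options': {'expires': 3},  # discard runs picked up too late
    },
}

@app.task
def tick():
    print('tick')

Note that expires, as used in the question, only discards runs that sit in the queue too long; whether the task really executes every second also depends on the worker keeping up with the beat schedule.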

Spawning more threads than you have in a gevent pool

As I understand it, the idea of a pool in gevent is to limit the total number of concurrent requests at any time, to a database, an API, or similar.
Say I have code like this where I am spawning more greenlets than I have room for in the Pool:
import gevent.pool

pool = gevent.pool.Pool(50)
jobs = []

for number in xrange(300):
    jobs.append(pool.spawn(do_something, number))

total_result = [x.get() for x in jobs]
What is the actual behavior when trying to spawn the 51st request? When is the 51st request handled?
The Pool class uses a semaphore to count active greenlets, initialized with the size given in the constructor:
class Pool(Group):

    def __init__(self, size=None, greenlet_class=None):
        if size is not None and size < 1:
            raise ValueError('Invalid size for pool (positive integer or None required): %r' % (size, ))
        Group.__init__(self)
        self.size = size
        if greenlet_class is not None:
            self.greenlet_class = greenlet_class
        if size is None:
            self._semaphore = DummySemaphore()
        else:
            self._semaphore = Semaphore(size)
Every time spawn() is called, it tries to acquire the semaphore:
def spawn(self, *args, **kwargs):
    self._semaphore.acquire()
    try:
        greenlet = self.greenlet_class.spawn(*args, **kwargs)
        self.add(greenlet)
    except:
        self._semaphore.release()
        raise
    return greenlet
If the pool is full, the calling greenlet will thus block on the _semaphore.acquire() call. The semaphore is released whenever any of the greenlets finishes execution:
def discard(self, greenlet):
    Group.discard(self, greenlet)
    self._semaphore.release()
So in your case, I'd expect the 51st request to be handled (or started, to be precise) as soon as any of the first 50 requests is done.
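To see that behavior directly, here is a small hypothetical timing sketch (do_something just sleeps for a second; it is not from the question):

import gevent
import gevent.pool

def do_something(number):
    gevent.sleep(1)  # stand-in for a real request
    return number

pool = gevent.pool.Pool(50)
jobs = []
for number in range(300):
    # spawn() blocks here whenever 50 greenlets are already active,
    # so this loop only advances as earlier greenlets finish.
    jobs.append(pool.spawn(do_something, number))

total_result = [x.get() for x in jobs]
# With 300 one-second jobs and a pool of 50, the whole run takes
# roughly 300 / 50 = 6 seconds rather than 1 second.
print(len(total_result))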

Retrying celery failed tasks that are part of a chain

I have a celery chain that runs some tasks. Each of the tasks can fail and be retried. Please see below for a quick example:
from celery import task

@task(ignore_result=True)
def add(x, y, fail=True):
    try:
        if fail:
            raise Exception('Ugly exception.')
        print '%d + %d = %d' % (x, y, x + y)
    except Exception as e:
        raise add.retry(args=(x, y, False), exc=e, countdown=10)

@task(ignore_result=True)
def mul(x, y):
    print '%d * %d = %d' % (x, y, x * y)
and the chain:
from celery.canvas import chain
chain(add.si(1, 2), mul.si(3, 4)).apply_async()
Running the two tasks (and assuming that nothing fails), you would see printed:
1 + 2 = 3
3 * 4 = 12
However, when the add task fails the first time and succeeds on a subsequent retry, the rest of the tasks in the chain do not run. That is, the add task fails, the other tasks in the chain are not run, and after a few seconds the add task runs again and succeeds, but the rest of the tasks in the chain (in this case mul.si(3, 4)) still do not run.
Does celery provide a way to continue failed chains from the task that failed onwards? If not, what would be the best approach to accomplishing this, making sure that a chain's tasks run in the order specified and only after the previous task has executed successfully, even if that task is retried a few times?
Note 1: The issue can be solved by doing
add.delay(1, 2).get()
mul.delay(3, 4).get()
but I am interested in understanding why chains do not work with failed tasks.
You've found a bug :)
Fixed in https://github.com/celery/celery/commit/b2b9d922fdaed5571cf685249bdc46f28acacde3; it will be part of 3.0.4.
I'm also interested in understanding why chains do not work with failed tasks.
I dug into some Celery code and what I've found so far is:
The implementation happens in app.builtins.py:
@shared_task
def add_chain_task(app):
    from celery.canvas import chord, group, maybe_subtask
    _app = app

    class Chain(app.Task):
        app = _app
        name = 'celery.chain'
        accept_magic_kwargs = False

        def prepare_steps(self, args, tasks):
            steps = deque(tasks)
            next_step = prev_task = prev_res = None
            tasks, results = [], []
            i = 0
            while steps:
                # First task get partial args from chain.
                task = maybe_subtask(steps.popleft())
                task = task.clone() if i else task.clone(args)
                i += 1
                tid = task.options.get('task_id')
                if tid is None:
                    tid = task.options['task_id'] = uuid()
                res = task.type.AsyncResult(tid)
                # automatically upgrade group(..) | s to chord(group, s)
                if isinstance(task, group):
                    try:
                        next_step = steps.popleft()
                    except IndexError:
                        next_step = None
                if next_step is not None:
                    task = chord(task, body=next_step, task_id=tid)
                if prev_task:
                    # link previous task to this task.
                    prev_task.link(task)
                    # set the results parent attribute.
                    res.parent = prev_res
                results.append(res)
                tasks.append(task)
                prev_task, prev_res = task, res
            return tasks, results

        def apply_async(self, args=(), kwargs={}, group_id=None, chord=None,
                        task_id=None, **options):
            if self.app.conf.CELERY_ALWAYS_EAGER:
                return self.apply(args, kwargs, **options)
            options.pop('publisher', None)
            tasks, results = self.prepare_steps(args, kwargs['tasks'])
            result = results[-1]
            if group_id:
                tasks[-1].set(group_id=group_id)
            if chord:
                tasks[-1].set(chord=chord)
            if task_id:
                tasks[-1].set(task_id=task_id)
                result = tasks[-1].type.AsyncResult(task_id)
            tasks[0].apply_async()
            return result

        def apply(self, args=(), kwargs={}, **options):
            tasks = [maybe_subtask(task).clone() for task in kwargs['tasks']]
            res = prev = None
            for task in tasks:
                res = task.apply((prev.get(), ) if prev else ())
                res.parent, prev = prev, res
            return res

    return Chain
You can see that at the end of prepare_steps, prev_task is linked to the next task.
When prev_task fails, the next task is not called.
I'm testing with adding a link_error from the previous task to the next:
if prev_task:
    # link and link_error previous task to this task.
    prev_task.link(task)
    prev_task.link_error(task)
    # set the results parent attribute.
    res.parent = prev_res
But then the next task must take care of both cases (except, maybe, when it's configured to be immutable, i.e. not accept more arguments).
I think chain could support that by allowing syntax like this:
c = chain(t1, (t2, t1e), (t3, t2e))
which means:
t1 is linked to t2, with t1e as its error callback
t2 is linked to t3, with t2e as its error callback
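As a rough sketch of what that proposed syntax would have to do with today's primitives (t1, t2, t3, t1e and t2e are the placeholder tasks from the example above):

# Hypothetical immutable signatures, mirroring chain(t1, (t2, t1e), (t3, t2e)).
s1, s2, s3 = t1.si(), t2.si(), t3.si()
e1, e2 = t1e.si(), t2e.si()

s1.link(s2)        # run t2 after t1 succeeds
s1.link_error(e1)  # run t1e if t1 fails
s2.link(s3)        # run t3 after t2 succeeds
s2.link_error(e2)  # run t2e if t2 fails

s1.apply_async()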