Use current Celery task in chord?

I'd like to run tasks in parallel that have a data dependency at the beginning of the first task. It seems that I should be able to start a chord with the current task in the header group that's used as the args for the body callback. I don't see a way to reference the signature of the current task in the documentation, but is there a way to do this?
I was thinking it would be something like this with the get_signature() being the missing piece:
@app.task(bind=True)
def chord_test(self, id_) -> int:
    data, next_id = get_data(id_)
    chord([self.get_signature(), chord_test.s(next_id)])(handle_results.s())
    return expensive_processing(data)
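One possible workaround, as an untested sketch rather than a confirmed API: instead of referencing the running task's own signature, split the expensive work into its own task (process_data below is hypothetical) and replace the current task with the chord, using the Task.replace mechanism discussed in the next thread. Note that each recursive chord_test is itself replaced, so whether the nested chords compose the way you want depends on your Celery version.

@app.task(bind=True)
def chord_test(self, id_):
    data, next_id = get_data(id_)
    # replace this task with a chord whose header runs the expensive
    # processing and the next lookup in parallel, feeding both results
    # to the callback
    return self.replace(
        chord([process_data.s(data), chord_test.s(next_id)],
              handle_results.s()))

@app.task
def process_data(data):
    return expensive_processing(data)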


Celery canvas behavior differs between async and eager mode

There are some discrepancies in the way the Celery canvas works in async and eager mode. I've noticed that a group followed by a chain, in a dynamic task that replaces itself, does not send the results along to the next task in the chain.
Well, that seems complicated, let me show an example:
Given the following task:
@shared_task(bind=True)
def grouped(self, val):
    task = (
        group(asum.s(val, n) for n in range(val)) | asum.s(val)
    )
    raise self.replace(task)
when it's grouped in another canvas like this:
@shared_task(bind=True)
def flow(self, val):
    workflow = (asum.s(1, val) |
                asum.s(2) |
                grouped.s() |
                amul.s(3))
    return self.replace(workflow)
the task amul will not receive the results from grouped when in eager mode.
To really illustrate the issue, I've created a sample project on GitHub where you can dive into the problem and help me out with some quick solutions and, possibly, some PRs on the Celery project.
https://github.com/gutomaia/celery_equation
---- edited ----
In the project, I demonstrate the different behavior in both modes of using Celery. In async mode, those tasks work as expected:
>>> from equation.main import *
>>> from equation.tasks import *
>>> flow.delay(1).get()
78
>>> flow.delay(2).get()
120
>>> flow.delay(100).get()
47895
I was struggling with this situation in a test case. For future readers: at least as of Celery 4.4.0, the following idiom will work in all contexts, including synchronous, in-process execution:
return self.replace(...)
Using raise or simply letting the function end right after Task.replace will only work in asynchronous mode. The relevant code is right at the end of Task.replace:
if self.request.is_eager:
    return sig.apply().get()
else:
    sig.delay()
    raise Ignore('Replaced by new task')
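Applied to the grouped task from above, the fix is the one-word change from raise to return, so the eager branch of Task.replace can hand the group's result back to the enclosing canvas:

@shared_task(bind=True)
def grouped(self, val):
    task = (
        group(asum.s(val, n) for n in range(val)) | asum.s(val)
    )
    # return (not raise): in eager mode, Task.replace returns
    # sig.apply().get(), which must be propagated to the caller
    return self.replace(task)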
Sadly, eager mode will never be the same as running an actual worker. There are too many intricacies in running an actual worker for eager mode to behave identically.
I agree that things like this should be handled as special cases when using eager mode, but some discrepancy is expected.
Please submit a PR if you know how to fix this issue and we can review the fix there. Thank you!
grouped() is not returning anything, so how do you expect amul to get the result??

How to use delta trigger in flink?

I want to use the DeltaTrigger in Apache Flink (Flink 1.3), but I have some trouble with this code:
.trigger(DeltaTrigger.of(100, new DeltaFunction[uniqStruct] {
  override def getDelta(oldFp: uniqStruct, newFp: uniqStruct): Double = newFp.time - oldFp.time
}, TypeInformation[uniqStruct]))
And I have this error:
error: object org.apache.flink.api.common.typeinfo.TypeInformation is not a value [ERROR] }, TypeInformation[uniqStruct]))
I don't understand why DeltaTrigger needs a TypeSerializer[T], and I don't know what to do to fix this error.
Thanks a lot everyone.
I would read into this a bit: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/types_serialization.html. It sounds like you can create a serializer using typeInfo.createSerializer(config) on your type info. Note that what you're passing in currently is the type itself and NOT the type info, which is why you're getting the error you are.
You would need to do something more like
val uniqStructTypeInfo: TypeInformation[uniqStruct] = createTypeInformation[uniqStruct]
val uniqStructTypeSerializer = uniqStructTypeInfo.createSerializer(config)
To quote the page above regarding the config parameter you need to pass to createSerializer:
The config parameter is of type ExecutionConfig and holds the information about the program's registered custom serializers. Wherever possible, try to pass the program's proper ExecutionConfig. You can usually obtain it from DataStream or DataSet via calling getExecutionConfig(). Inside functions (like MapFunction), you can get it by making the function a Rich Function and calling getRuntimeContext().getExecutionConfig().
DeltaTrigger needs a TypeSerializer because it uses Flink's managed state mechanism to store each element for later comparison with the next one (it just keeps one element, the last one, which is updated as new elements arrive).
You will find an example (in Java) here.
But if all you need is a window that triggers every 100msec, then it'll be easier to just use a TimeWindow, such as
input
  .keyBy(<key selector>)
  .timeWindow(Time.milliseconds(100))
  .apply(<window function>)
Updated:
To have hour-long windows that trigger every 100msec, you could use sliding windows. However, you would have 10 * 60 * 60 windows, and every event would be placed into each of these 36000 windows. So that's not a great idea.
If you use a GlobalWindow with a DeltaTrigger, then the window will be triggered only when events are more than 100msec apart, which isn't what you've said you want.
I suggest you look at ProcessFunction. It should be straightforward to get what you want that way.

celery_tasktree does not support DAG workflow in general. Is there an alternative?

celery_tasktree (https://pypi.python.org/pypi/celery-tasktree) provides a cleaner workflow API compared to the Celery workflow canvas (http://docs.celeryproject.org/en/latest/userguide/canvas.html). However, it only supports tree-like workflow structures, not general DAG-like workflows. The Celery canvas does have the chord primitive, but it seems cumbersome to use.
Is there any other celery-based library similar to celery_tasktree that works with general DAG workflow?
Here are a couple of libraries that support DAG-based job scheduling:
https://github.com/thieman/dagobah
https://github.com/apache/incubator-airflow
They are not based on Celery. However, you can create your own primitives in Celery to forward results, which can be used to build a DAG job scheduler.
@app.task(bind=True)
def forward(self, result, sig):
    # convert the JSON-serialized signature back to a Signature
    sig = self.app.signature(sig)
    # get the return value of the provided task signature
    result2 = sig()
    # the next task will receive a tuple of (result_A, result_B)
    return (result, result2)

@app.task
def C(Ares_Bres):
    Ares, Bres = Ares_Bres
    return Ares + Bres

workflow = (A.s() | forward.s(B.s()) | C.s())
See here for a more detailed discussion.
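For illustration, a hedged usage sketch (A and B are assumed to be ordinary tasks returning numbers): passing B.s() into forward works because signatures serialize to JSON, which is why forward reconstitutes it via self.app.signature(sig).

# A runs first; forward executes B inline and pairs both results
result = workflow.delay()
print(result.get())  # C receives (A_result, B_result) and returns their sum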

Any way to ensure frisby.js test API calls go in sequential order?

I'm trying a simple sequence of tests on an API:
Create a user resource with a POST
Request the user resource with a GET
Delete the user resource with a DELETE
I have a single frisby test spec file, mytest_spec.js. I've broken the test into 3 discrete steps, each with its own toss(), like:
f1 = frisby.create("Create");
f1.post(post_url, {user_id: 1});
f1.expectStatus(201);
f1.toss();
// stuff...
f2 = frisby.create("Get");
f2.get(get_url);
f2.expectStatus(200);
f2.toss();
//Stuff...
f3 = frisby.create("delete");
f3.delete(delete_url);
f3.expectStatus(200);
f3.toss();
Pretty basic stuff, right? However, as far as I can tell there is no guarantee they'll execute in order, since they're asynchronous, so I might get a 404 on test 2 or 3 if the user doesn't exist by the time they run.
Does anyone know the correct way to create sequential tests in Frisby?
As you correctly pointed out, Frisby.js is asynchronous. There are several approaches to force it to run more synchronously. The easiest, though not the cleanest, is to chain the dependent request inside an .after(() => ...) callback; you can find more about after() in the Frisby.js docs.

Celery - error handling and data storage

I'm trying to better understand common strategies regarding results and errors in Celery.
I see that tasks have statuses/states and store results if requested -- when would I use this data? Should error handling and data storage be contained within the task?
Here is a sample scenario, in case it helps better understand my objective:
I have a geocoding task that geocodes user addresses. Whether the task fails or succeeds, I'd like to update a field in the database letting the user know (error handling). On success, I'd like the geocoded data to be inserted into the database (data storage).
What approach should I take?
Let me preface this by saying that I'm still getting a feel for Celery myself. That being said, I have some general inclinations about how I'd go about tackling this, and since no one else has responded, I'll give it a shot.
Based on what you've written, a relatively simple (though I suspect non-optimized) solution is to follow the broad contours of the blog comment spam task example from the documentation.
app.models.py
class Address(models.Model):
    GEOCODE_STATUS_CHOICES = (
        ('pr', 'pre-check'),
        ('su', 'success'),
        ('fl', 'failed'),
    )
    address = models.TextField()
    ...
    geocode = models.TextField()
    geocode_status = models.CharField(max_length=2,
                                      choices=GEOCODE_STATUS_CHOICES,
                                      default='pr')

class AppUser(models.Model):
    name = models.CharField(max_length=100)
    ...
    address = models.ForeignKey(Address)
app.tasks.py
from celery import task
from app.models import Address, AppUser
from some_module import geocode_function  # assuming this returns a string

@task()
def get_geocode(appuser_pk):
    user = AppUser.objects.get(pk=appuser_pk)
    address = user.address
    try:
        result = geocode_function(address.address)
        address.geocode = result
        address.geocode_status = 'su'  # set address object as successful
        address.save()
        return address.geocode  # this is optional -- your task doesn't have to return anything
        # On the other hand, you could also choose to decouple the
        # geocode function from the database update for the object
        # instance. Also, if you're thinking about chaining tasks
        # together, you might consider whether it's advantageous to
        # pass a parameter as an input or partial input into the child task.
    except Exception:
        address.geocode_status = 'fl'  # address object fails
        address.save()
        # do_something_else()
        raise  # re-raise the error, in case you want to trigger retries, etc.
app.views.py
from app.tasks import *
from app.models import *
from django.shortcuts import get_object_or_404

def geocode_for_address(request, app_user_pk):
    app_user = get_object_or_404(AppUser, pk=app_user_pk)
    # ...etc. etc. -- somewhere in here, call your tasks with appropriate args/kwargs
I believe this meets the minimal requirements you've outlined above. I've intentionally left the view undeveloped since I don't have a sense of how exactly you want to trigger it. It sounds like you also may want some sort of user notification when their address can't be geocoded ("I'd like to update a field in the database letting the user know"). Without knowing more about the specifics of this requirement, it sounds like something that might be best accomplished in your HTML templates (if an instance's attribute value is X, display Q in the template) or by using Django signals (set up a signal for when a user.address.geocode_status switches to failure -- say, to email the user and let them know).
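A hedged sketch of that signals idea, using Django's post_save signal (notify_on_geocode_failure is hypothetical, and since the AppUser model above has no email field, the actual notification is left as a placeholder):

from django.db.models.signals import post_save
from django.dispatch import receiver
from app.models import Address

@receiver(post_save, sender=Address)
def notify_on_geocode_failure(sender, instance, **kwargs):
    if instance.geocode_status == 'fl':
        # walk the reverse ForeignKey from Address back to its users;
        # replace the print with an email or in-app notification
        for user in instance.appuser_set.all():
            print('notify %s: address could not be geocoded' % user.name)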
In the comments to the code above, I mentioned the possibility of decoupling and chaining the component parts of the get_geocode task. You could also decouple the exception handling from the get_geocode task by writing a custom error-handler task and using the link_error parameter (for instance, add.apply_async((2, 2), link_error=error_handler.s()), where error_handler has been defined as a task in app.tasks.py). Also, whether you choose to handle errors in the main task (get_geocode) or via a linked error handler, I would think you'd want to get much more specific about how to handle different sorts of errors (e.g., handle connection errors differently from incorrectly formatted address data).
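A minimal link_error sketch, modeled on the error-callback example in the Celery docs (in Celery 4+, error callbacks receive (request, exc, traceback)):

@app.task
def error_handler(request, exc, traceback):
    # request.id identifies the failed task; this is where you could
    # look up the related Address and set geocode_status to 'fl' instead
    print('Task {0} raised exception: {1!r}\n{2!r}'.format(
        request.id, exc, traceback))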
I suspect there are better approaches, and I'm just beginning to understand how inventive you can get by chaining tasks, using groups and chords, etc. Hope this helps at least get you thinking about some of the possibilities. I'll leave it to others to recommend best practices.