Adding tasks to existing group or chord - celery

Consider the following scenario:
There are three different types of tasks: A, B, and C. A is meant to produce an input for B, and B is supposed to create many C tasks after receiving the input from A.
At the start, I can only define group(A, B), since the C tasks are created by B. But I want to wait for all the C tasks to finish as well before concluding that the main task is done.
Is there a way of doing that with Celery's utilities?

The solution I use so far is waiting for the C tasks inside B.
Something like:
from celery import group
from celery.result import allow_join_result

def B():
    tasks = get_c_tasks()
    g = group(tasks)
    gr = g.apply_async()
    with allow_join_result():
        return gr.join(propagate=False)

Related

ScalaTest: How to mix parallel and sequential tests

Say I have 6 test suites: A B C D E F, and I want A, B, and C to run sequentially and THEN run D, E, and F in parallel.
With output like this:
A
B
C // always in that order
E
D
F // The order doesn't matter
The idea is to be able to test A, B, and C in isolation from the rest of the tests.
What I have already tried
Create a super sequential test class like this and add @DoNotDiscover on the sequential tests.
class MasterSuite extends Stepwise(
  new Sequential(new A, new B, new C)
)
But even if A, B, and C are run sequentially, they are run in parallel with the other tests.
I have also tried this:
class MasterSuite extends Stepwise(
  new Sequential(new A, new B, new C),
  new Suites(new D, new E, new F)
)
But for me it runs all the tests sequentially (maybe I have missed something in the build.sbt file).
The documentation for Stepwise says the following:
The difference between Stepwise and Sequential is that although Stepwise executes its own nested suites sequentially, it passes whatever distributor was passed to it to those nested suites. Thus the nested suites could run their own nested suites and tests in parallel if that distributor is defined. By contrast, Sequential always passes None for the distributor to the nested suites, so any and every test and nested suite contained within the nested suites passed to the Sequential constructor will be executed sequentially.
So the obvious question is: what Distributor is passed to the runNestedSuites method of MasterSuite? Because that Distributor is what's ultimately going to be passed to the runNestedSuites method of the Suites object that contains D, E and F.
Through experimentation with a debugger, I found that the Distributor is normally None. But if you mix the ParallelTestExecution trait into your MasterSuite class, you will get a Some instead, and I've verified that in a debugger too.
class MasterSuite extends Stepwise(
  new A,
  new B,
  new C,
  new Suites(new D, new E, new F)) with ParallelTestExecution
Now, the MasterSuite will run A, B and C sequentially and then start running the other suites in parallel.
So, problem solved? Unfortunately no, because while it apparently started running D, E and F in parallel, it didn't wait for them to finish and just declared them all successful – even though I deliberately added a failing test in F to see if everything works correctly. So as far as I can see, this is how it's supposed to be done and it's just broken.
Which leads me to my personal conclusion after many years of experience with ScalaTest: it's a bug-ridden piece of garbage, and I would highly recommend staying away from it. I'm sorry I can't give a more optimistic answer than that.
Could be related to this issue, which I just reported.

How does building a big task computation compare to executing several steps synchronously?

I have the following two pieces of code written in Scala/Monix:
def f1(input) =
  (for {
    a <- task1(input)
    b <- task2(a)
    c <- task3(b)
  } yield c).runSyncUnsafe()
and
def f2(input) = {
  val a = task1(input).runSyncUnsafe()
  val b = task2(a).runSyncUnsafe()
  task3(b).runSyncUnsafe()
}
I think version f1 is better, as it is fully async and doesn't block threads; my assumption is that, if there are many tasks running, the first version should perform better under multithreading.
I know I should write a test to compare the two implementations, but it would require a lot of refactoring of the legacy code. Also, profiling the two versions is not easy in our specific situation, so I'm asking here first, hoping for an answer from somebody with a lot of Scala/Monix experience:
How should the two compare in terms of performance under heavy load? Is this a real concern or is it a non-issue?
As a general rule, it is better to stay async for as long as possible. So you could write f1 like this:
def f1(input) =
  for {
    a <- task1(input)
    b <- task2(a)
    c <- task3(b)
  } yield c
The caller can then decide whether to call runSyncUnsafe or an async call (runAsync, runOnComplete) or flatMap it with another task. This removes the Unsafe call from your code and leaves it to the caller to decide whether to be safe or not.
As far as performance goes, the tasks will be evaluated sequentially either way because later tasks depend on the results of earlier tasks.
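To make the "caller decides" point concrete, here is a minimal, self-contained sketch assuming Monix 3.x; task1/task2/task3 and the Int payloads below are placeholders, not the code from the question:

import monix.eval.Task
import monix.execution.Scheduler.Implicits.global
import scala.concurrent.Future
import scala.concurrent.duration._

object StayAsync {
  // Placeholder tasks standing in for the real task1/task2/task3.
  def task1(i: Int): Task[Int] = Task(i + 1)
  def task2(i: Int): Task[Int] = Task(i * 2)
  def task3(i: Int): Task[Int] = Task(i - 3)

  // f1 stays lazy and asynchronous: nothing runs until the caller evaluates it.
  def f1(input: Int): Task[Int] =
    for {
      a <- task1(input)
      b <- task2(a)
      c <- task3(b)
    } yield c

  def main(args: Array[String]): Unit = {
    val program: Task[Int] = f1(10)

    // Keep composing asynchronously...
    val decorated: Task[Int] = program.flatMap(c => Task(c * 100))

    // ...run to a Future at the edge of the application...
    val asFuture: Future[Int] = decorated.runToFuture

    // ...or block only at the very end, if you really must.
    println(program.runSyncUnsafe(10.seconds))
  }
}

Only the caller (here main) decides how the composed Task is evaluated; f1 itself never blocks a thread.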

Issues with Spark Serialization?

Consider a job that contains two phases which (for convenience) cannot be merged. Let's call A the first step and B the second step. Hence, we always need to do A, then B.
Workflow
start a new cluster with job A
build MyOutput
count(MyOutput) = 2000 (*)
write(MyOutput)
start a new cluster with job B
read(MyOutput)
count(MyOutput) = 1788 (**)
Details
A provides an output which is an RDD[MyObject], namely MyOutput. To write MyOutput, here is what I do: MyOutput.saveAsObjectFile("...")
Then B uses MyOutput as an input, reading the previously written file. Here is what I do: val MyOutput: RDD[MyObject] = sc.objectFile("...")
A and B happen on two separate clusters.
Problem
The main point is that the problem does not ALWAYS happen, but when it does appear, (**) < (*) -- it seems we lost some data, whereas there should not be any difference between these datasets. Obviously something is wrong.
Do you know what happens here? How to solve that?
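For reference, the write/read round trip described above can be sketched like this (MyObject's fields, the path, and the record counts are placeholders, not the real job code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

case class MyObject(id: Long, payload: String)

object JobA {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-A"))
    val myOutput: RDD[MyObject] =
      sc.parallelize(1L to 2000L).map(i => MyObject(i, s"row-$i"))
    println(s"count before write: ${myOutput.count()}") // (*)
    myOutput.saveAsObjectFile("hdfs:///tmp/my-output")
    sc.stop()
  }
}

object JobB {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-B"))
    val myOutput: RDD[MyObject] = sc.objectFile[MyObject]("hdfs:///tmp/my-output")
    println(s"count after read: ${myOutput.count()}") // (**) -- should equal (*)
    sc.stop()
  }
}

This mirrors the workflow above: the count printed by JobA corresponds to (*) and the one printed by JobB to (**).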

What exactly are celery partials for?

I didn't get a clear idea from the Celery documentation of what Celery partials are for. I may want to use them, but I'm not sure if my idea is correct.
Let's say I have the following two tasks:
add(a, b, c)
multiply(d, e)
Let's assume both tasks take a while to complete. Is it possible to use partials to:
run add(?, b, c) in parallel with multiply(d, e)
pass the result of multiply(d, e) as the missing argument to add()?
This way, adding b and c and multiplying d and e would run in parallel, and when both are done, only the result of the multiplication would be passed to the add task. This could save some time, because the sum of b and c would already be computed, and in the second step only a would need to be added to the pre-computed result.
If so, how can I achieve that? I mean, how can the add task wait for the a argument to be provided? I tried, but didn't find any relevant docs on this topic...
No, you have an incorrect idea of how Celery partials work.
They cannot be executed until all parameters have been specified.
If you do the following:
ch = chain(multiply.s(d, e), add.s(b, c))
ch.apply_async()
what happens is that multiply runs asynchronously. Once it is done, its result is passed to add, which is then run asynchronously.
In order to achieve the parallelization you speak of, you could use something like the following:
@app.task
def add(a, b):
    return a + b

# The chord callback is invoked with the list of the group's results once both tasks in the group finish.
ch = chord(group(multiply.s(d, e), add.s(b, c)))(add.s())

Best way to execute concurrent tasks with dependencies

I have several tasks I need to execute. Some have dependencies on other tasks. Is there anything in scala.concurrent (or other libraries) that would make this easier?
For example, there are four tasks, A, B, C, and D. Task A depends on nothing, task B depends on tasks A and D, task C depends on A and B, and task D depends on nothing. Given a thread pool, do as much as possible as soon as possible.
This is similar to what Make or sbt can do when parallelizing tasks, using a dependency graph. (But neither of these is a good fit, since I am not building; I am executing application logic that benefits from concurrent execution.)
"A, B, C, and D. Task A depends on nothing, task B depends on tasks A and D, task C depends on A and B, and task D depends on nothing. "
val D = Future(…)        // D depends on nothing
val A = Future(…)        // A depends on nothing
val B = A zip D map (…)  // B depends on A and D
val C = A zip B map (…)  // C depends on A and B
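A self-contained version of the same wiring, with placeholder Int results standing in for the real work:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DependentTasks {
  def main(args: Array[String]): Unit = {
    val D: Future[Int] = Future { 4 }                           // D depends on nothing
    val A: Future[Int] = Future { 1 }                           // A depends on nothing
    val B: Future[Int] = A.zip(D).map { case (a, d) => a + d }  // B depends on A and D
    val C: Future[Int] = A.zip(B).map { case (a, b) => a * b }  // C depends on A and B

    // A and D start immediately and run concurrently on the pool;
    // B and C each start as soon as their inputs are ready.
    println(Await.result(C, 10.seconds))
  }
}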
Perhaps not the answer you are looking for, but one of the main features of the Disruptor is to deal with task dependencies. It only works if you have fixed task dependencies, not dynamic task dependencies, because you need to set up this layout at the beginning.
https://github.com/LMAX-Exchange/disruptor
As a nice bonus, it also is very fast.
I don't know about solutions for Scala, but some other alternatives for the JVM are:
For Java 8: CompletableFuture. Use the thenCombine method to implement a dependency on two tasks (see the sketch after this list). Beware of a bug with large dependency graphs.
For Groovy: GPars dataflow.
For Java 7: dataflow-for-java - the most compact and simple to use. I developed it specifically for tasks like yours. It has two base classes: one for one-time tasks with dependencies on other tasks (org.df4j.core.func.Function) and one for repeating tasks (org.df4j.core.actor.Actor - it is like an Akka actor, but with several input ports). And it is an order of magnitude more performant (or rather, has an order of magnitude less overhead) than GPars.
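To make the CompletableFuture option concrete, here is a minimal sketch of the thenCombine wiring for the same A/B/C/D example, written in Scala against the Java 8 API (the Int payloads are placeholders; anonymous classes are used instead of lambdas so the snippet does not rely on SAM conversion):

import java.util.concurrent.CompletableFuture
import java.util.function.{BiFunction, Supplier}

object CompletableFutureDeps {
  def main(args: Array[String]): Unit = {
    // A and D depend on nothing, so both can start right away.
    val a: CompletableFuture[Int] = CompletableFuture.supplyAsync(new Supplier[Int] {
      override def get(): Int = 1
    })
    val d: CompletableFuture[Int] = CompletableFuture.supplyAsync(new Supplier[Int] {
      override def get(): Int = 4
    })

    // B depends on A and D: thenCombine waits for both, then merges their results.
    val b: CompletableFuture[Int] = a.thenCombine(d, new BiFunction[Int, Int, Int] {
      override def apply(x: Int, y: Int): Int = x + y
    })

    // C depends on A and B.
    val c: CompletableFuture[Int] = a.thenCombine(b, new BiFunction[Int, Int, Int] {
      override def apply(x: Int, y: Int): Int = x * y
    })

    println(c.join()) // blocks until the whole dependency graph has completed
  }
}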