Say I have 6 test suites: A, B, C, D, E, F, and I want A, B and C to run sequentially and THEN run D, E and F in parallel.
With output like this:
A
B
C // always in that order
E
D
F // The order doesn't matter
The idea is to be able to test A, B and C in isolation from the rest of the tests.
What I have already tried
Creating a sequential super-suite like this and adding @DoNotDiscover on the sequential suites:
class MasterSuite extends Stepwise(
  Sequential(new A, new B, new C)
)
But, even if A, B and C are run sequentially among themselves, they are run in parallel with the other tests.
I have also tried this:
class MasterSuite extends Stepwise(
  Sequential(new A, new B, new C),
  Suites(new D, new E, new F)
)
But for me it ran all the tests sequentially (maybe I have missed something in the build.sbt file).
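For reference, a minimal sketch of the sbt switches that usually decide whether suites may run in parallel at all; the values below are assumptions about the setup, not my actual file:

// build.sbt
// sbt runs test classes concurrently only when this is true (it is the default)
Test / parallelExecution := true

// when tests are forked, parallelism inside the forked JVM has its own switch
Test / fork := false
// Test / testForkedParallel := true   // only relevant when fork is true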
The documentation for Stepwise says the following:
The difference between Stepwise and Sequential is that although Stepwise executes its own nested suites sequentially, it passes whatever distributor was passed to it to those nested suites. Thus the nested suites could run their own nested suites and tests in parallel if that distributor is defined. By contrast, Sequential always passes None for the distributor to the nested suites, so any and every test and nested suite contained within the nested suites passed to the Sequential constructor will be executed sequentially.
So the obvious question is: what Distributor is passed to the runNestedSuites method of MasterSuite? Because that Distributor is what's ultimately going to be passed to the runNestedSuites method of the Suites object that contains D, E and F.
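One crude way to check what actually arrives at the nested suites, without a debugger, is a probe suite nested somewhere inside MasterSuite that logs the distributor from its run method. This is a sketch of mine, assuming ScalaTest 3.x, where run receives an Args value:

import org.scalatest.{Args, Status}
import org.scalatest.funsuite.AnyFunSuite

// hypothetical nested suite that just reports which distributor reaches it
class DistributorProbe extends AnyFunSuite {
  test("probe") { succeed }

  override def run(testName: Option[String], args: Args): Status = {
    println(s"distributor handed to $suiteName: ${args.distributor}")
    super.run(testName, args)
  }
}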
Through experimentation with a debugger, I found that the Distributor is normally None. But if you mix the ParallelTestExecution trait into your MasterSuite class, you will get a Some instead, and I've verified that in a debugger too.
class MasterSuite extends Stepwise(
  new A,
  new B,
  new C,
  new Suites(new D, new E, new F)) with ParallelTestExecution
Now, the MasterSuite will run A, B and C sequentially and then start running the other suites in parallel.
So, problem solved? Unfortunately no, because while it apparently started running D, E and F in parallel, it didn't wait for them to finish and just declared them all successful, even though I deliberately added a failing test in F to see if everything works correctly. So as far as I can see, this is how it's supposed to be done and it's just broken.
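For illustration, a deliberately failing F could look like the following hypothetical sketch (imports assume ScalaTest 3.1+):

import org.scalatest.DoNotDiscover
import org.scalatest.funsuite.AnyFunSuite

// stand-in for F: hidden from automatic discovery so it only runs through
// MasterSuite, with one test that is guaranteed to fail
@DoNotDiscover
class F extends AnyFunSuite {
  test("this should make the whole run fail") {
    assert(1 == 2)
  }
}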
Which leads me to my personal conclusion after many years of experience with ScalaTest: it's a bug-ridden piece of garbage, and I would highly recommend staying away from it. I'm sorry I can't give a more optimistic answer than that.
Could be related to this issue, which I just reported.
I'm working with dask delayed functions and I'm getting familiar with the do's and don'ts when using the @dask.delayed decorator on functions. I realized that sometimes I need to call compute() twice to get the result, despite the fact that I thought I followed the best practices, i.e. don't call a dask delayed function within another dask delayed function.
I've run into this problem in two scenarios: when there are nested functions, and when calling a member function in a class that uses class members that are delayed objects.
import dask

@dask.delayed
def add(a, b):
    return a + b

def inc(a):
    return add(a, 1)

@dask.delayed
def foo(x):
    return inc(x)

x = foo(3)
x.compute()
class Add():
    def __init__(self, a, b):
        self.a = a
        self.b = b

    @dask.delayed
    def calc(self):
        return self.a + self.b

a = dask.delayed(1)
b = dask.delayed(2)
add = Add(a, b)
add.calc().compute()
In the first example, x.compute() does not return the result but another delayed object, and I will have to call x.compute().compute() to get the actual result. But I believe inc is not a delayed function and therefore it's not against the rule of not calling a delayed function within another delayed function?
In the second example, again I will have to call add.calc().compute().compute() to get the actual result. In this case self.a and self.b are just delayed attributes and there are no nested delayed functions anywhere.
Can anyone help me understand why I need to call compute() twice in these two cases? Or even better, could someone briefly explain the general 'rule' when using dask delayed functions? I read the documentation and there's not much to be found there.
Update:
@malbert pointed out that the examples require calling compute() twice because there are delayed results involved in a delayed function, and therefore it counts as 'calling a delayed function within another delayed function'. But why does something like the following only require calling compute() once?
@dask.delayed
def add(a, b):
    return a + b

a = dask.delayed(1)
b = dask.delayed(2)
c = add(a, b)
c.compute()
In this example, a and b are also both delayed results, and they are used in a delayed function. My guess would be that what actually matters is where the delayed results appear inside a delayed function: is it perhaps only fine if they are passed in as arguments?
I think the key lies in understanding more precisely what dask.delayed does.
Consider
my_delayed_function = dask.delayed(my_function)
When used as a decorator on my_function, dask.delayed returns a function my_delayed_function which delays the execution of my_function. When my_delayed_function is called with an argument
delayed_result = my_delayed_function(arg)
this returns an object which contains all the necessary information about the execution of my_function with the argument arg.
Calling
result = delayed_result.compute()
triggers the execution of the function.
Now, the effect of using operators such as + on two delayed results is that a new delayed result is returned which bundles the two executions contained in its inputs. Calling compute on this object triggers this bundle of executions.
So far so good. Now, in your first example, foo calls inc, which calls a delayed function, which returns a delayed result. Therefore, computing foo does exactly this and returns this delayed result. Calling compute on this delayed result (your "second" compute) then triggers its computation.
In your second example, a and b are delayed results. Adding two delayed results using + returns a delayed result that bundles the execution of a, b and their addition. Now, since calc is a delayed function, it returns a delayed result when given delayed results. Therefore again, its computation will return a delayed object.
In both cases, you didn't quite follow the best practices. Specifically the point
Avoid calling delayed within delayed functions
since in your first example the delayed add is called within inc, which is called in foo. Therefore you are calling delayed within the delayed foo. In your second example, the delayed calc is working on the delayed a and b, therefore again you are calling delayed within a delayed function.
In your question, you say
But I believe inc is not a delayed function and therefore it's not against the rule of not calling a delayed function within another delayed function?
I suspect you might be understanding "calling delayed within delayed functions" wrongly. This refers to everything that happens within the function and is therefore part of it: inc includes a call of the delayed add, therefore delayed is being called in foo.
Addition after the question update: Passing delayed arguments to a delayed function bundles the delayed executions into the new delayed result. This is different from "calling delayed within the delayed function" and is part of the intended use case. Actually, I also didn't find a clear explanation of this in the documentation, but one entry point might be this: unpack_collections is used to process delayed arguments. Even if this remains somewhat unclear, sticking to the best practices (interpreted this way) should produce reproducible behaviour regarding the output of compute().
The following code results from sticking to "Avoid calling delayed within delayed functions" and returns a result after a single call of compute():
First example:
# @dask.delayed   <- decorator removed, so add is now a plain function
def add(a, b):
    return a + b

def inc(a):
    return add(a, 1)

@dask.delayed
def foo(x):
    return inc(x)

x = foo(3)
x.compute()
Second example:
class Add():
    def __init__(self, a, b):
        self.a = a
        self.b = b

    # @dask.delayed   <- decorator removed, so calc is now a plain method
    def calc(self):
        return self.a + self.b

a = dask.delayed(1)
b = dask.delayed(2)
add = Add(a, b)
add.calc().compute()
Consider a job that contains two phases which (for convenience) cannot be merged. Let's call A the first step and B the second step. Hence, we always need to do A, then B.
Workflow
start new cluster with a job A
  build MyOutput
  count(MyOutput) = 2000 (*)
  write(MyOutput)
start new cluster with a job B
  read(MyOutput)
  count(MyOutput) = 1788 (**)
Details
A provides an output which is an RDD[MyObject], namely MyOutput. To write MyOutput, here is what I do: MyOutput.saveAsObjectFile("...")
Then B uses MyOutput as an input, reading the previously written file. Here is what I do: val MyOutput: RDD[MyObject] = sc.objectFile("...")
A and B happen on two separate clusters.
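For reference, here is a minimal sketch of the two jobs as described above; the MyObject fields, the placeholder build step and the output path are assumptions of mine, not the real code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

case class MyObject(id: Long, payload: String)   // placeholder for the real type

object JobA {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jobA"))
    val MyOutput: RDD[MyObject] = sc.parallelize(Seq.empty[MyObject])   // placeholder build step
    println(s"count before write: ${MyOutput.count()}")                 // (*)
    MyOutput.saveAsObjectFile("hdfs:///path/to/MyOutput")               // placeholder path
    sc.stop()
  }
}

object JobB {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jobB"))
    val MyOutput: RDD[MyObject] = sc.objectFile[MyObject]("hdfs:///path/to/MyOutput")
    println(s"count after read: ${MyOutput.count()}")                   // (**)
    sc.stop()
  }
}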
Problem
The main point is that the problem does not ALWAYS happen, but when it does appear, (**) < (*) -- it seems we lost some data, whereas there should not be any difference between these counts. Obviously something is wrong.
Do you know what is happening here? How can I solve it?
Consider the following scenario:
There are 3 different types of tasks: A, B and C. A is meant to produce an input for B, and B is supposed to create many C tasks after receiving the input from A.
At the start, I can only define group(A, B), as the C tasks are created by B. But I want to wait for all C tasks to finish as well, in order to conclude that the main task is done.
Is there a way of doing that using Celery utilities?
The solution I use so far is to wait for the C tasks inside B.
Something like:
from celery import group
from celery.result import allow_join_result

def B():
    tasks = get_c_tasks()
    g = group(tasks)
    gr = g.apply_async()
    with allow_join_result():
        return gr.join(propagate=False)
I didn't quite get from the Celery documentation what Celery partials are for. I may want to use them, but I'm not sure whether my idea is correct.
Let's say I have the following two tasks:
add(a, b, c)
multiply(d, e)
Let's assume both tasks take a while to complete. Is it possible to use partials to:
run add(?, b, c) in parallel with multiply(d, e)
pass the result of multiply(d, e) as the last argument to add()?
This way, adding b and c and the multiplication of d and e would run in parallel, and when both are done, only the result of the multiplication would be passed to the add task. This could save some time, because the sum of b and c is already computed, and in the second step only a has to be added to the pre-computed result.
If so, how can I achieve that? I mean, what is the way to make the add task wait for the a argument to be provided? I searched, but didn't find any relevant docs on that topic...
No, you have an incorrect idea about how celery partials work.
They cannot be executed until all parameters have been specified.
If you do the following
ch = chain(multiply.s(d, e), add.s(b, c))
ch.apply_async()
what happens is that multiply is run asynchronously. Once it is done, its result is passed to add, which is then run asynchronously.
In order to achieve the parallelization you speak of, you could use the following:
@app.task
def add(a, b):
    return a + b

# multiply.s(d, e) and add.s(b, c) run in parallel in the chord header;
# the callback is invoked once both results are available
ch = chord(group(multiply.s(d, e), add.s(b, c)))(add.s())
I have several tasks I need to execute. Some have dependencies on other tasks. Is there anything in scala.concurrent (or other libraries) that would make this easier?
For example, there are four tasks, A, B, C, and D. Task A depends on nothing, task B depends on tasks A and D, task C depends on A and B, and task D depends on nothing. Given a thread pool, do as much as possible as soon as possible.
This is similar to what Make or sbt can do when parallelizing tasks, using a dependency graph. (But neither of these is a good fit, since I am not building; I am executing application logic which benefits from concurrent execution.)
"A, B, C, and D. Task A depends on nothing, task B depends on tasks A and D, task C depends on A and B, and task D depends on nothing. "
val D = Future(…) //D depends on Nothing
val A = Future(…) //A depends on Nothing
val B = A zip D map (…) // B depends on A and D
val C = A zip B map (…) // C depends on A and B
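Filled in with placeholder task bodies of my own (the actual work each Future does is up to you), a runnable version of this sketch could look like:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object TaskGraph extends App {
  // A and D depend on nothing, so they start as soon as the pool picks them up
  val D = Future { 10 }                              // placeholder work
  val A = Future { 1 }                               // placeholder work
  // B needs A and D; C needs A and B
  val B = A zip D map { case (a, d) => a + d }
  val C = A zip B map { case (a, b) => a * b }
  println(Await.result(C, 10.seconds))               // prints 11 for these placeholders
}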
Perhaps not the answer you are looking for, but one of the main features of the Disruptor is dealing with task dependencies. It only works if you have fixed task dependencies, not dynamic task dependencies, because you need to set up this layout at the beginning.
https://github.com/LMAX-Exchange/disruptor
As a nice bonus, it also is very fast.
I don't know about solutions for Scala, but some other alternatives for the JVM are:
for Java 8: CompletableFuture. Use the thenCombine method to implement a dependency on 2 tasks. Beware of a bug for large dependency graphs.
for Groovy: GPARS dataflow.
for Java 7: dataflow-for-java - the most compact and simple to use. I developed it specifically for tasks like yours. It has 2 base classes: one for one-time tasks with dependencies on other tasks (org.df4j.core.func.Function), and one for repeating tasks (org.df4j.core.actor.Actor - it is like an Akka actor, but with several input ports). And it has an order of magnitude less overhead than GPARS.