Celery - running a set of tasks with complex dependencies

In the application I'm working on, a user can perform a "transition" which consists of "steps". A step can have an arbitrary number of dependencies on other steps. I'd like to be able to call a transition and have the steps execute in parallel as separate Celery tasks.
Ideally, I'd like something along the lines of celery-tasktree, except for directed acyclic graphs in general, rather than only trees, but it doesn't appear that such a library exists as yet.
The first solution that comes to mind is a parallel adaptation of a standard topological sort - rather than determining a linear ordering of steps which satisfy the dependency relation, we determine the entire set of steps that can be executed in parallel at the beginning, followed by the entire set of steps that can be executed in round 2, and so on.
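A minimal sketch of what I mean, in plain Python (the step names and the deps mapping are just illustrative):

def topological_rounds(deps):
    # deps maps each step to the set of steps it depends on
    remaining = {step: set(d) for step, d in deps.items()}
    while remaining:
        # every step with no unsatisfied dependency can run in this round
        ready = {s for s, d in remaining.items() if not d}
        if not ready:
            raise ValueError("dependency cycle detected")
        yield ready
        for s in ready:
            del remaining[s]
        for d in remaining.values():
            d -= ready

# each yielded set could be dispatched as one Celery group, waiting
# for the whole group to finish before starting the next round
for batch in topological_rounds({"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}):
    print(batch)  # {'a'}, then {'b', 'c'}, then {'d'}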
However, this is not optimal when tasks take a variable amount of time: workers sit idle waiting for a longer-running task in the current round even while other tasks have already become ready to run. (For my specific application, this solution is probably fine for now, but I'd still like to figure out how to optimise this.)
As noted in https://cs.stackexchange.com/questions/2524/getting-parallel-items-in-dependency-resolution, a better way is to operate directly off the DAG - after each task finishes, check whether any of its dependent tasks are now able to run, and if so, run them.
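In code, that might look like keeping a pending-dependency count per step (again just a sketch; mark_finished would be called from wherever task completion is observed):

from collections import defaultdict

def build_dag(deps):
    # deps maps each step to the set of steps it depends on
    pending = {step: len(d) for step, d in deps.items()}
    dependents = defaultdict(list)
    for step, d in deps.items():
        for parent in d:
            dependents[parent].append(step)
    return pending, dependents

def mark_finished(step, pending, dependents):
    # returns the steps unblocked by this completion
    ready = []
    for child in dependents[step]:
        pending[child] -= 1
        if pending[child] == 0:
            ready.append(child)
    return ready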
What would be the best way to go about implementing something like this? It's not clear to me that there's an easy way to do this.
From what I can tell, Celery's group/chain/chord primitives aren't flexible enough to allow me to express a full DAG - though I might be wrong here?
I'm thinking I could create a wrapper for tasks which notifies dependent tasks once the current task finishes - I'm not sure what the best way to handle such a notification would be though. Accessing the application's Django database isn't particularly neat, and would make it hard to spin this out into a generic library, but Celery itself doesn't provide obvious mechanisms for this.
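One sketch of that wrapper idea, assuming a Redis instance for the shared counters (the DEPENDENTS table, do_work, and the key names are all made up for illustration; none of this is a Celery built-in):

import redis
from celery import Celery

app = Celery("transitions", broker="redis://localhost:6379/0")
store = redis.Redis()

# illustrative DAG: step -> steps that depend on it
DEPENDENTS = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
DEPENDENCY_COUNTS = {"a": 0, "b": 1, "c": 1, "d": 2}

def do_work(name):
    print(f"executing step {name}")  # placeholder for the real step logic

def start_transition():
    for step, count in DEPENDENCY_COUNTS.items():
        store.set(f"pending:{step}", count)
    for step, count in DEPENDENCY_COUNTS.items():
        if count == 0:
            run_step.delay(step)

@app.task
def run_step(name):
    do_work(name)
    for child in DEPENDENTS[name]:
        # DECR is atomic, so exactly one finishing parent sees the
        # counter reach zero and enqueues the dependent step
        if store.decr(f"pending:{child}") == 0:
            run_step.delay(child)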

I also faced this problem, and I couldn't really find a better solution or library, except for one. For anyone still interested, you can check out
https://github.com/selinon/selinon. Although it's only for Python 3, it seems to be the only thing that does exactly what you want.
Airflow is another option, but Airflow, like other DAG libraries, is geared towards more static workflows.

Related

Manually check requests on port in kdb

From what I understand, the main q thread monitors its socket descriptors for requests and responds to them.
I want to use a while loop in my main thread that will go on for an indefinite period of time. This would mean that I will not be able to use hopen on the process port and perform queries.
Is there any way to manually check for requests within the while loop?
Thanks.
Are you sure you need to use a while loop? Is there any chance you could, for instance, use the timer functionality of KDB+ instead (the \t system command paired with a .z.ts handler)?
This could allow you to run a piece of code periodically instead of looping over it continually. Depending on your use case, this may be more appropriate as it would allow you to repeatedly run a piece of code (e.g. that could be polling something periodically), without using the main thread constantly.
KDB+ is by default single-threaded, which makes it tricky to do what you want to do. There might be something you can do with slave threads.
If you're interested in using timer functionality, but the built-in timer is too limited for your needs, there is a more advanced set of timer functionality available free from AquaQ Analytics (disclaimer: I work for AquaQ). It is distributed as part of the TorQ KDB framework; the specific script you'd be interested in is timer.q, which is documented here. You may be able to use this code without the full TorQ if you like, though you may need some of the other "common" code from TorQ to provide functions used within timer.q.

Dynamically control order of tests with pytest

I would like to control the order of my tests using logic which will reorder them on the fly, while they are already running.
My use case is this: I am parallelizing my tests with xdist, and each test uses external resources from a common and limited pool. Some tests use more resources than others, so at any given time when only a fraction of the resources are available, some of the tests have the resources they need to run and others don't.
I want to optimize the usage of the resources, so I would like to dynamically choose which test will run next, based on the resources currently available. I would calculate an optimal ordering during the collection stage, but I don't know in advance how long each test will take, so I can't predict which resources will be available when.
I haven't found a way to implement this behavior with a plugin because the collection stage seems to be distinct from the running stage, and I don't know another way to modify the list of tests to run other than the collection hooks.
Would very much appreciate any suggestions, whether a way to implement this or an alternative idea which would optimize the resources utilization. My alternative is to write my own simplistic test runner, but I don't want to give up on the rest of what pytest offers.
Thanks!
Not really a full answer to your problem, but here are some notes on the pytest side and a few hints.
To change the order of the tests in pytest (just pytest, not pytest-xdist), you can alter the order of the test items on the fly by installing this hook wrapper:
conftest.py:
import pytest
import random

random.seed()

@pytest.hookimpl(hookwrapper=True)
def pytest_runtest_protocol(item, nextitem):
    yield
    idx = item.session.items.index(item)
    remaining = item.session.items[idx + 1:]
    random.shuffle(remaining)
    item.session.items[idx + 1:] = remaining
It makes no sense to change what was already executed, only what remains; hence [idx+1:]. In this example, I just shuffle the items, but you can do whatever you want with the list of the remaining items.
Keep in mind: this can affect how the fixtures are set up and torn down (those with a scope above 'function'). Normally, pytest orders the tests so they can reuse fixtures as efficiently as possible. Also, the nextitem argument is used internally to decide whether a fixture should be finalized; since you effectively change it every time, the effects can be unpredictable.
With pytest-xdist, things work quite differently. You have to read up on how pytest-xdist distributes the tests across the slaves.
First, every slave collects all the same tests in exactly the same order. If the order differs, pytest-xdist will fail, so you cannot reorder the tests at collection.
Second, the master process sends the test indexes in that collected list as the next tests to execute. So, the collection must be unchanged all the time.
Third, you can redefine the pytest_xdist_make_scheduler hook. There are a few sample implementations in pytest-xdist itself. You can define your own logic for scheduling the tests in the schedule() method, using the nodes added/removed and the tests collected via the corresponding methods (a registration sketch follows at the end of this answer).
Fourth, that would be too easy. The schedule() method is called only on the slave_collectionfinish event sent from a slave. I'm sad to say this, but you will have to kill and restart the slave processes all the time to trigger this event and re-schedule the remaining tests.
As you can see, the pytest-xdist implementation will be huge and complex. But I hope this gives you some hints on where to look.
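For reference, the hook registration itself is small; it's the scheduling logic that is hard. A sketch, assuming a pytest-xdist version that exposes LoadScheduling from xdist.scheduler (ResourceAwareScheduling and its resource logic are hypothetical):
conftest.py:
from xdist.scheduler import LoadScheduling

class ResourceAwareScheduling(LoadScheduling):
    # placeholder: reorder self.pending (indexes into the collected
    # test list) based on which resources are currently available
    pass

def pytest_xdist_make_scheduler(config, log):
    return ResourceAwareScheduling(config, log)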

Getting Recursive Tasks in Asana with reasonable performance

I'm using the Asana REST API to iterate over workspaces, projects, and tasks. After I achieved the initial crawl over the data, I was surprised to see that I only retrieved the top-level tasks. Since I am required to provide the workspace and project information, I was hoping not to have to recurse any deeper. It appears that I can recurse on a single task with the /subtasks endpoint and re-query... wash/rinse/repeat... but that amounts to a potentially massive number of REST calls (one for each subtask to see if they, in turn, have subtasks to query - and so on).
I can partially mitigate this by adding to the opt_fields query parameter something like:
&opt_fields=subtasks,subtasks.subtasks
However, this doesn't scale well. It means I have to elongate the query for each layer of depth. I suppose I could say "don't put tasks deeper than x layers deep" - but that seems to fly in the face of Asana's functionality and design. Also, since I need lots of other properties, it requires me to make a secondary query for each node in the hierarchy to gather those. Ugh.
I can use the path method to try to mitigate this a bit:
&opt_fields=(this|subtasks).(id|name|etc...)
but again, I have to do this for every layer of depth. That's impractical.
There's documentation about this great REPEATER + operator. Supposedly it would work like this:
&opt_fields=this.subtasks+.name
That is supposed to apply to ALL subtasks anywhere in the hierarchy. In practice, this is completely broken, and the REST API chokes and returns only the ids of the top-level tasks. :( Apparently their documentation is just wrong here.
The only method that seems remotely functional (if not practical) is to iterate first on the top-level tasks, being sure to include opt_fields=subtasks. Whenever this is a non-empty array, I would need to recurse on that task, query for its subtasks, and continue in that manner, until I reach a null subtasks array. This could be of arbitrary depth. In practice, the first REST call yields me (hopefully) the largest number of tasks, so the individual recursion may be mitigated by real data... but it's a heck of an assumption.
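A sketch of that recursion with Python's requests library (YOUR_TOKEN is a placeholder; pagination and error handling are omitted):

import requests

API = "https://app.asana.com/api/1.0"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

def walk_subtasks(task_id, depth=0):
    # ask only for ids here; fetch the full field set separately
    resp = requests.get(f"{API}/tasks/{task_id}/subtasks",
                        headers=HEADERS, params={"opt_fields": "id"})
    for sub in resp.json()["data"]:
        print("  " * depth + str(sub["id"]))
        walk_subtasks(sub["id"], depth + 1)  # recurse until no subtasks remain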
I also noticed that the limit parameter applies ONLY to the top-level tasks. If I choose to expand the subtasks, say, I could get a thousand tasks back instead of 100. The call could time out if the data is too large. The safest thing to do would be to request only the ids of subtasks until recursion, and, as always, ask for all the desired top-level properties at that time.
All of this seems incredibly wasteful - what I really want is a flat list of tasks which include the parent.id and possibly a list of subtasks.id - but I don't want to query for them hierarchically. I also want to page my queries with rational data sizes in mind. I'd like to get 100 tasks at a time until Asana runs out - but that doesn't seem possible, since the limit only applies to top-level items.
Unfortunately the repeater didn't solve my problem, since it just doesn't work. What are other people doing to solve this problem? And, secondarily, can anyone with intimate Asana insight provide any hope of getting a better way to query?
While I'm at it, a suggested way to design this: the task endpoint should not require a workspace or project predicate. I should be able to filter by them, but not be required to. I am limited to 100 objects already, so why force me to filter unnecessarily? In the same vein, navigating the hierarchy of Asana seems an unnecessary tax for clients that are not Asana (and possibly even for the Asana UI itself).
Any ideas or insights out there?
Have you ensured that the + you send is URL-encoded? Whatever library you are using should usually handle this (which language are you using, by the way? We have some first-party client libraries available).
Try &opt_fields=this.subtasks%2B.name if you're creating the URL manually, or (better yet) use a library that correctly encodes URL query parameters.
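For example, Python's requests library percent-encodes the + for you when the field list is passed via params (TASK_ID and the token are placeholders):

import requests

resp = requests.get(
    "https://app.asana.com/api/1.0/tasks/TASK_ID/subtasks",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    params={"opt_fields": "this.subtasks+.name"},  # '+' is sent as '%2B'
)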

Make a step in a SAS macro time out after a set interval

I'm on SAS 9.1.3 (on a server) and have a macro looping over an array to feed a computationally intensive set of modelling steps which are appended out to a table. I'm wondering if it is possible to set a maximum time to run for each element of the array. This is so that any element which takes longer than 3 minutes to run is skipped and the next item fed in.
Say for example I'm using a proc nlin with a by statement to build separate models per class on a large data set, and one class is failing to converge; how do I skip over that class?
Bit of a niche requirement, hope someone can assist!
The only approach I can think of here would be to rewrite your code so that it runs each by group separately from the rest, in one or more SAS/CONNECT sessions, have the parent session kill each one after a set timeout, and then recombine the surviving output.
As Dom and Joe have pointed out, this is not a trivial task, but it's possible if you're sufficiently keen on learning about that aspect of SAS. A good place to get started for this sort of thing would be this page:
http://support.sas.com/rnd/scalability/tricks/connect.html
I was able to use the examples there and elsewhere as the basis of a simple parallel processing framework (in SAS 9.1.3, coincidentally!), but there are many details you will need to consider. To give you an idea of the sorts of adventures in store if you go down this route:
Learning how to sign on to your server via SAS/CONNECT within whatever infrastructure you're using (will the usual autoexec file work? What invocation options do you need to use?)
Explaining to your sysadmin/colleagues why you need to run multiple processes in parallel
Managing asynchronous sessions
Syncing macro variables, macro definitions, libraries and formats between sessions
Obscure bugs (I wasn't able to use the usual option for syncing libraries and had to roll my own via call execute...)
One could write a (lengthy) SUGI paper on this topic, and I'm sure there are plenty of them out there if you look around.
In general, SAS runs in a linear manner, so you cannot write a step to monitor another step in the same program. What you could do is run your code in a SAS/CONNECT session and monitor it from the process that started the session. That's not trivial, and the how-to is beyond the scope of Stack Overflow.
For a data step, use the datetime() function to get the current system date and time. This is measured in seconds. You can check the time inside your data step. Stop a data step with the stop; statement.
Now you specifically asked about breaking a specific step inside a PROC. That must be implemented in the PROC by the SAS developer. If it is possible, it will be documented in the procedure's documentation. View SAS documentation at http://support.sas.com/documentation/.
For PROC NLIN, I do not think there is a "break after X" parameter. You can use the trace parameters to track model execution and see where it is hanging up. You can then work on changing the convergence parameters to attempt to speed up slow, badly converging models.

SpringXD Job split flow steps running in separate containers in distributed mode

I am aware of the nested job support (XD-1972) work and looking forward to that. A question regarding split flow support. Is there a plan to support running parallel steps, as defined in split flows, in separate containers?
Would it be as simple as providing a custom implementation of a proper TaskExecutor, or is it something more involved?
I'm not aware of support for splits to be executed across multiple containers being on the roadmap currently. Given the orchestration needs of something like that, I'd probably recommend a more composed approach anyway.
A custom TaskExecutor could be used to farm out the work, but it would be pretty specialized. Each step within the flows in a split is executed within the scope of a job. That scope (and all its rights and responsibilities) would need to be carried over to the "child" containers.