Make a step in a SAS macro time out after a set interval - macros

I'm on SAS 9.1.3 (on a server) and have a macro looping over an array to feed a computationally intensive set of modelling steps which are appended out to a table. I'm wondering if it is possible to set a maximum time to run for each element of the array. This is so that any element which takes longer than 3 minutes to run is skipped and the next item fed in.
Say for example I'm using a proc nlin with a by statement to build separate models per class on a large data set, and one class is failing to converge; how do I skip over that class?
Bit of a niche requirement, hope someone can assist!

The only approach I can think of here would be to rewrite your code so that it runs each by group separately from the rest, in one or more SAS/CONNECT sessions, have the parent session kill each one after a set timeout, and then recombine the surviving output.
As Dom and Joe have pointed out, this is not a trivial task, but it's possible if you're sufficiently keen on learning about that aspect of SAS. A good place to get started for this sort of thing would be this page:
http://support.sas.com/rnd/scalability/tricks/connect.html
I was able to use the examples there and elsewhere as the basis of a simple parallel processing framework (in SAS 9.1.3, coincidentally!), but there are many details you will need to consider. To give you an idea of the sorts of adventures in store if you go down this route:
Learning how to sign on to your server via SAS/CONNECT within whatever infrastructure you're using (will the usual autoexec file work? What invocation options do you need to use?)
Explaining to your sysadmin/colleagues why you need to run multiple processes in parallel
Managing asynchronous sessions
Syncing macro variables, macro definitions, libraries and formats between sessions
Obscure bugs (I wasn't able to use the usual option for syncing libraries and had to roll my own via call execute...)
One could write a (lengthy) SUGI paper on this topic, and I'm sure there are plenty of them out there if you look around.
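The parent-kills-child pattern described above is not specific to SAS. As a rough, language-neutral illustration (this is Python's subprocess module, not SAS/CONNECT), the sketch below runs one child process per group and skips any that exceed the time limit. The function name and the shape of the `commands` mapping are hypothetical; in practice each command would be whatever batch invocation runs the model for a single by group.

```python
import subprocess

def run_each_with_timeout(commands, timeout_seconds):
    """Run each command in its own child process, skipping slow ones.

    commands: mapping of group name -> argument list for the child process
              (hypothetical; one entry per by group).
    Returns (finished, skipped) lists of group names.
    """
    finished, skipped = [], []
    for name, command in commands.items():
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # if it has not exited within timeout_seconds.
            subprocess.run(command, timeout=timeout_seconds, check=True)
            finished.append(name)
        except subprocess.TimeoutExpired:
            skipped.append(name)
    return finished, skipped
```

The surviving groups in `finished` are the ones whose output would then be recombined. A child that exits with an error raises `subprocess.CalledProcessError` here, which this sketch deliberately leaves unhandled.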

In general, SAS runs steps sequentially, so you cannot write a step to monitor another step in the same program. What you could do is run your code in a SAS/CONNECT session and monitor it from the process that started the session. That's not trivial, and the how-to is beyond the scope of Stack Overflow.
For a data step, use the datetime() function to get the current system date and time. This is measured in seconds. You can check the time inside your data step. Stop a data step with the stop; statement.
Now you specifically asked about breaking a specific step inside a PROC. That must be implemented in the PROC by the SAS developer. If it is possible, it will be documented in the procedure's documentation. View SAS documentation at http://support.sas.com/documentation/.
For PROC NLIN, I do not think there is a "break after X" parameter. You can use the trace parameters to track model execution and see what is hanging up. You can then work on changing the convergence parameters to attempt to speed up slow, badly converging models.

Related

kdb - customized data streaming/ticker plant?

We've been using kdb to handle a number of calculations focused more on traditional desktop sources. We have deployed our web application and are looking to make the leap as to how best to pick up data changes and re-calculate them in kdb to render a "real-time" view of the data as it changes.
From what I've been reading, the use of data loaders (feed handlers) into our own equivalent of a "ticker plant" as a data store is the most documented ideal solution. So far, we have been "pushing" data into kdb directly and calculating as part of a script, so we are trying to make the leap from calculation-on-demand to a "live" calculation as data inputs are edited by the user.
I'm trying to understand how to manage the feed handlers and the timing of updates. We really only want to move data when it changes; since this is a web front end, we are trying to figure out how best to "trigger" on changes (such as a save or lost focus on an editable data grid, for example). We are also thinking of using our database as the "ticker plant" itself, which may minimize the number of feedhandlers.
I found the reference below, and it looks like it's running a forever-loop, which feels excessive, but I understand the original use case for kdb and streaming data.
Feedhandler - sending data to tickerplant
Does this sound like a solid workflow?
Many thanks in advance!
Resources we've been referencing:
Official Manual: https://github.com/KxSystems/kdb/blob/master/d/tick.htm
kdb+ Tick overview: http://www.timestored.com/kdb-guides/kdb-tick-data-store
Source code: https://github.com/KxSystems/kdb-tick
There's a lot to parse here but some general thoughts/ideas:
Yes, most examples of feedhandlers are set up as forever loops but this is often just for convenience for demoing.
Ideally a live data flow should be event-driven, i.e. based on on-event triggers. Kdb/q has this out of the box in the form of the .z handlers. Other languages should have similar concepts of event handling.
Some more examples of python/java feeders are here: https://github.com/exxeleron
There's also some details on the official Kx site: https://code.kx.com/q/wp/capi/#publishing-to-a-kdb-tickerplant
It still might be a viable option to have a forever loop, or at least a short timer in the event you want to batch data.
Depending on the amount of dataflow a tickerplant might be overkill for your use-case, but a tickerplant is still useful for (a) separating your processing from the processing of dataflow (i.e. data can still flow through the tickerplant while another process is consuming/calculating) and (b) logging data for recovery purposes.

How to find $plusargs with same string in different locations

This is a very general issue when integrating a large verification environment.
Our verification development involves a large group spread across different time zones.
The group prefers to use $plusargs instead of the factory mechanism.
Probably the main reason is that it is hard to set the factory from the command line processor,
as we have several layers of scripts to start a simulation.
Recently I found that the same string had been used in different environments to control their behavior. In this case two different scoreboards used the same string to disable some checking, and the test passed. Both of those environments are sometimes created at run time. Also, sometimes it is OK to re-use the same string, and deciding that will require the owner to be involved.
Is there any way to find duplication like this from the final elaborated model, and report the locations in the code as a warning?
I thought of creating our own wrapper, but the problem is that we are integrating some code that we do not own, as was the case here.
Thanks,
This is a perfect example of how people think they can get things done quicker by not following the recommended UVM methodology, and instead create time-consuming complexity later on.
I see at least two possible options.
Write a script that searches the source code for $plusargs and hopefully they have used string literals for you to trace for duplicates.
You can override $plusargs with PLI code and have it trace duplicates.
The choice depends on whether you are better at writing Perl/Python or C code.
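As a minimal sketch of the first option, the Python below scans source text for `$test$plusargs` and `$value$plusargs` calls whose first argument is a string literal, and reports any plusarg name that appears in more than one file. The function names and the `sources` mapping (file name to file contents) are hypothetical. It is a simple line-based scan, so it will miss calls split across lines or built from non-literal strings, and it does not skip commented-out code.

```python
import re
from collections import defaultdict

# Matches $test$plusargs("...") and $value$plusargs("...", var),
# capturing the string literal passed as the first argument.
PLUSARG_CALL = re.compile(r'\$(?:test|value)\$plusargs\s*\(\s*"([^"]*)"')

def plusarg_key(literal):
    """Reduce a literal such as "NAME=%d" to the bare name "NAME"."""
    return re.split(r"[=%]", literal, maxsplit=1)[0]

def find_plusargs(sources):
    """Map each plusarg name to the (file, line number) pairs that use it.

    sources: mapping of file name -> source text (hypothetical input shape).
    """
    locations = defaultdict(list)
    for filename, text in sources.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for match in PLUSARG_CALL.finditer(line):
                locations[plusarg_key(match.group(1))].append((filename, line_no))
    return locations

def duplicates(locations):
    """Keep only the names that are used in more than one file."""
    return {
        key: locs
        for key, locs in locations.items()
        if len({filename for filename, _ in locs}) > 1
    }
```

Since re-using a string is sometimes intentional, the result is best treated as a list of warnings for the owners to review rather than as errors.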

Manually check requests on port in kdb

From what I understand, the main q thread monitors its socket descriptors for requests and responds to them.
I want to use a while loop in my main thread that will go on for an indefinite period of time. This would mean that I will not be able to use hopen on the process port and perform queries.
Is there any way to manually check requests within the while loop?
Thanks.
Are you sure you need to use a while loop? Is there any chance you could, for instance, instead use the timer functionality of KDB+?
This could allow you to run a piece of code periodically instead of looping over it continually. Depending on your use case, this may be more appropriate as it would allow you to repeatedly run a piece of code (e.g. that could be polling something periodically), without using the main thread constantly.
KDB+ is by default single-threaded, which makes it tricky to do what you want to do. There might be something you can do with slave threads.
If you're interested in using timer functionality, but the built-in timer is too limited for your needs, there is a more advanced set of timer functionality available free from AquaQ Analytics (disclaimer: I work for AquaQ). It is distributed as part of the TorQ KDB framework; the specific script you'd be interested in is timer.q, which is documented here. You may be able to use this code without the full TorQ if you like, though you may need some of the other "common" code from TorQ to provide functions used within timer.q.

Multi-instance and Loop in BPMN

I am trying to model a certain behaviour, where a couple of activities in different swimlanes are supposed to be processed in a loop. Now BPMN uses tokens to illustrate the flow and paths taken. I wonder how such tokens work in the case of loops. Does every activity iteration create a token which consequently travels through the connected activities?
E.g. let's say Activity1 will be performed in a loop 10 times. Will that create 10 tokens where each will travel through the remaining activities of the process? Such behaviour would be undesirable; however, if I am not mistaken, multi-instance activities work that way.
The only solution I can think of which would comply with the BPMN specification would be to create a Call activity for the whole block of activities and then run the Call activity in a loop.
Can anyone clarify for me the use of loops and multi-instances in BPMN from the view of tokens?
Thank you in advance!
Based upon my reading of the documentation (https://www.omg.org/spec/BPMN/2.0/PDF), the answer from @qwerty_so does not seem to conform to the standard, although in part this seems to be because the question is also imprecise or at least underspecified.
A token (see glossary) is simply an imaginary object that represents the flow unit in the process diagram. There are at least three different types of loops specified in the standard, which suggest different implications for the flow unit.
Sections 13.2.6 and 13.2.7 describe Loop Activity and Multiple Instance Activities respectively. While the latter, on its face, might not seem like a loop, the standard defines attributes of the activity that suggest otherwise, including MultipleInstanceLoopCharacteristics and ExpressionloopCardinality.
In the former case, it seems that the operational semantics suggest a single flow unit that repeats multiple times according to some policy or even unbounded.
In the latter case, the activity has "multiple instances spawned," including a parallel variant.
That multiple instances can flow forward in parallel, on its face, suggests that the system must at least allow for the possibility of spawning multiple tokens (or conceptually splitting the original token) to support multiple threads proceeding simultaneously along different paths.
That said, the Loop Activity (13.2.6) appears to support the OP's desired semantics.

Celery - running a set of tasks with complex dependencies

In the application I'm working on, a user can perform a "transition" which consists of "steps". A step can have an arbitrary number of dependencies on other steps. I'd like to be able to call a transition and have the steps execute in parallel as separate Celery tasks.
Ideally, I'd like something along the lines of celery-tasktree, except for directed acyclic graphs in general, rather than only trees, but it doesn't appear that such a library exists as yet.
The first solution that comes to mind is a parallel adaptation of a standard topological sort - rather than determining a linear ordering of steps which satisfy the dependency relation, we determine the entire set of steps that can be executed in parallel at the beginning, followed by the entire set of steps that can be executed in round 2, and so on.
However, this is not optimal when tasks take a variable amount of time and workers have to idle waiting for a longer running task while there are tasks that are now ready to run. (For my specific application, this solution is probably fine for now, but I'd still like to figure out how to optimise this.)
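The round-based scheme described above can be sketched as a layered variant of Kahn's algorithm. This is only an illustration in plain Python with no Celery involved; the function name and the input shape (a mapping from each step to the set of steps it depends on) are assumptions, and every dependency is assumed to appear as a key of that mapping.

```python
def parallel_rounds(dependencies):
    """Group steps into rounds that can each run in parallel.

    dependencies: mapping of step -> set of steps it depends on.
    Returns a list of rounds, each a sorted list of step names.
    """
    # Copy the sets so the caller's mapping is not modified.
    remaining = {step: set(deps) for step, deps in dependencies.items()}
    rounds = []
    while remaining:
        # Every step with no unmet dependencies can run in this round.
        ready = sorted(step for step, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle detected")
        rounds.append(ready)
        for step in ready:
            del remaining[step]
        # Steps in this round no longer block anything.
        for deps in remaining.values():
            deps.difference_update(ready)
    return rounds
```

Each round would then be dispatched as one batch of parallel tasks, with the next round starting only once the whole batch has finished, which is exactly where the idle time mentioned above comes from.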
As noted in https://cs.stackexchange.com/questions/2524/getting-parallel-items-in-dependency-resolution, a better way is to operate directly off the DAG: after each task finishes, check whether any of its dependent tasks are now able to run, and if so, run them.
What would be the best way to go about implementing something like this? It's not clear to me that there's an easy way to do this.
From what I can tell, Celery's group/chain/chord primitives aren't flexible enough to allow me to express a full DAG - though I might be wrong here?
I'm thinking I could create a wrapper for tasks which notifies dependent tasks once the current task finishes - I'm not sure what the best way to handle such a notification would be though. Accessing the application's Django database isn't particularly neat, and would make it hard to spin this out into a generic library, but Celery itself doesn't provide obvious mechanisms for this.
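For reference, the DAG-driven approach mentioned above (start a step as soon as its last dependency finishes) can be sketched with the standard library alone. This is not a Celery implementation: a thread pool stands in for the workers, `graphlib.TopologicalSorter` (Python 3.9+) tracks which steps are ready, and the names `run_dag` and `run_step` are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

def run_dag(dependencies, run_step, max_workers=4):
    """Run each step as soon as all of its dependencies have finished.

    dependencies: mapping of step -> set of steps it depends on.
    run_step: callable that takes a step name and does the work.
    Returns the step names in the order they completed.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    completed = []
    in_flight = {}  # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit everything whose dependencies are now satisfied.
            for step in sorter.get_ready():
                in_flight[pool.submit(run_step, step)] = step
            # Block until at least one running step finishes.
            done, _ = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                step = in_flight.pop(future)
                future.result()  # re-raise any exception from the step
                completed.append(step)
                sorter.done(step)  # may unlock dependent steps
    return completed
```

In a Celery setting, the `pool.submit` call would be replaced by dispatching a task and the `wait` call by whatever completion notification is available, which is the part the question identifies as unclear and which this sketch does not address.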
I also faced this problem and couldn't really find a better solution or library, with one exception. For anyone still interested, you can check out
https://github.com/selinon/selinon. Although it's only for Python 3, it seems to be the only thing that does exactly what you want.
Airflow is another option, but Airflow is used in more static environments, just like other DAG libraries.