spark iterative programming - exit condition without launching a job - scala

When writing iterative programs, a common situation is that you need to define a condition at which the program stops executing and returns the result. This stop condition can be, for example, rdd.isEmpty. The problem is that this "condition test" is an action, which triggers a job and therefore incurs scheduling, serialisation and other costs on every iteration.
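import org.apache.spark.rdd.RDD

// A recursive method (and one using `return`) needs an explicit result type.
def iterate(layer: RDD[Long]): RDD[Long] = {
  layer.cache()
  if (layer.isEmpty()) return null // isEmpty is an action: it launches a job
  val nextLayer = process(layer) // contains hash joins, joins, filters, cache
  iterate(nextLayer)
}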
The timeline will look like:
[isempty][------spacing----][isempty][------spacing----][isempty]
What is the best way to do iterative programming in such a situation? We should not be forced to launch a job in each iteration.
Is there a method to check for an empty RDD without executing an action?
Possible solution:
As you can see in the image below, isEmpty is now executed only every 5 iterations. Each iteration is represented by a periodic triplet of blue rectangles. I did this by modifying the stop condition to the following:
if(layer.index%5==0 && layer.isEmpty) return null;
But as you can see in the figure below, I am still getting actions that are executed as "run at ThreadPoolExecutor.java". Some research shows that those actions happen because I am doing "broadcast hash joins" of small DFs with larger ones.
[Figure: ThreadPoolExecutor reason]
[Figure: timeline]

You can try using
layer.cache()
layer.isEmpty
This means that the check for empty is going to trigger an action, but the rdd will be cached, so when you pass it to the process method, the things that were done in isEmpty will be "skipped".
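The "test only every fifth iteration" idea from the question can be sketched independently of Spark. The following is a minimal illustration in plain Java, not Spark code: PeriodicExitCheck, process, isEmpty and checkEvery are hypothetical names, and the predicate merely stands in for an expensive action such as rdd.isEmpty.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

final class PeriodicExitCheck {
    /**
     * Applies process repeatedly and returns the iteration count at which the
     * emptiness test first succeeded. The test (assumed to be expensive, like
     * a Spark action) is evaluated only on every checkEvery-th iteration.
     */
    static <T> int iterate(T layer, UnaryOperator<T> process,
                           Predicate<T> isEmpty, int checkEvery) {
        int iteration = 0;
        while (true) {
            // Pay for the expensive test only on every checkEvery-th iteration.
            if (iteration % checkEvery == 0 && isEmpty.test(layer)) {
                return iteration;
            }
            layer = process.apply(layer);
            iteration++;
        }
    }
}
```

The trade-off is that up to checkEvery - 1 iterations may run on an already empty layer before the loop notices, so the processing step has to tolerate empty input.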

Related

Abort execution of parsim

For the use case of being able to abort parallel simulations with a MATLAB GUI, I would like to stop all scheduled simulations after the user pressed the Stop button.
All simulations are submitted at once using the parsim command, hence something like a callback to my GUI variables (App Designer) would be the most preferable solution.
Approaches I have considered but were not exactly providing a desirable solution:
The Simulation Manager provides the functionality to close simulations using its own interface. If I only had the code its Stop button executes...
parsim uses the Simulink.SimulationInput class as input to run simulations, which allows modifying the preSimFcn at the beginning of each simulation. So far I have not found a way to "skip" the simulation at its initialization phase apart from intentionally throwing an error.
Thank you for your help!
Update 1: Using the preSimFcn to set the termination time equal to the start time drastically reduces simulation time. But since the first step is still computed, there has to be a better solution.
simin = simin.setModelParameter('StopTime',get_param(mdl,'StartTime'))
Update 2: Intentionally throwing an error executing the preSimFcn, for example by setting it to
simin = simin.setModelParameter('SimulationCommand','stop')
provides the shortest termination times for me so far. However, it requires catching and identifying the error in the ErrorMessage of the Simulink.SimulationOutput object. As this is exactly the "ugly" implementation I wanted to avoid, the issue is still open.
If you are using R2017b or later, parsim provides an option to 'RunInBackground'. It returns an array of Future objects.
F = parsim(in, 'RunInBackground', 'on')
Please note that this is only available for parallel simulations. The Simulink.Simulation.Future object F provides a cancel method which will terminate the simulation. You can use the fetchOutputs method to fetch the output from the simulation.
F.cancel();

Talend: how to fork the output of a parent job and call either of child job based on some condition

I am learning Talend. I have a scenario where I have to apply an if / else-if condition to the output of the parent job and, based on the outcome, call one of the child jobs. I have thought of a few options, such as using global variables or context variables. Is it possible to configure the child jobs to listen for a change in a global/context variable and run if the condition matches? I tried to configure this, but could not work out where these settings can be made.
I even tried feeding the output of the parent job's components into a tJavaRow, where I can write Java code with if / else-if conditions. I was thinking of explicitly calling the sub jobs from the if / else-if branches, but I am not able to make any headway. Can someone please point me to the right approach? Any new approach is also welcome.
NOTE: We are using free version of Talend.
If I understand correctly, this can be achieved using "Run If" triggers, like this:
Inside tJava, you can write some logic to calculate your variables.
On the If trigger, you write a condition that determines whether or not the component after it is run.
In my example, I'm not actually making use of what's inside tJava, I'm just getting the number of lines output by tLogRow, so it can be left out and the "Run If" triggers connected directly to tLogRow.

Managing multiple anylogic simulations within an experiment

We are developing an ABM under AnyLogic 7 and are at the point where we want to make multiple simulations from a single experiment. Different parameters are to be set for each simulation run so as to generate results for a small suite of standard scenarios.
We have an experiment that auto-starts without the need to press "Run". Subsequently pressing Run increments the experiment counter and reruns the model.
What we'd like is a way to have the auto-run, or single press of Run, launch a loop of simulations. Within that loop would be the programmatic adjustment of the variables linked to passed parameters.
EDIT: One wrinkle is that some parameters are strings. The Optimization and Parameter Variation experiments don't lend themselves to enumerating a set of strings to be used across a set of simulation runs. You can only set one string per parameter for all the simulation runs within one experiment.
We've used the help sample for "Running a Model from Outside Without Presentation Window", to add the auto-run capability to the initial experiment setup block of code. A method to wait for Run 0 to complete, then dispatch Run 1, 2, etc, is needed.
Pointers to tutorial models with such features, or to a snip of code for the experiment's java blocks are much appreciated.
Maybe I don't understand your need, but this certainly sounds like you'd want to use a "Parameter Variation" experiment. You can specify which parameters should be varied in which steps, and running the experiment automatically starts as many simulation runs as needed, all without animation.
hope that helps
Like you, I was confronted with this problem. My aim was to use parameter variation with a model where the varied values were non-numeric, and I knew the number of runs to start.
I then succeeded in this task with the help of Custom Variation.
First I built an experiment of type 'multiple run' and created my GUI (the user was able to select the string values used in each run).
Then I created a new Java class which inherits from the previous 'multiple run' experiment.
This class (called MyMultipleRunClass) contained:
- an override of the getMaximumIterations method from the default experiment, to provide the correct number of iterations to the default AnyLogic callback; the index was also used to retrieve my parameter value from an array,
- an implementation of the static method start:
public static void start() {
    prepareBeforeExperimentStart_xjal(MyMultipleRunClass.class);
    MyMultipleRunClass ex = new MyMultipleRunClass();
    ex.setCommandLineArguments_xjal(null);
    ex.setup(null);
}
Then the experiment to run is the 'empty' customExperiment, which automatically starts the other multiple-run experiment through the presented subclass.
Maybe a shorter path exists, but from my point of view AnyLogic is used correctly (no tricks with non-exposed interfaces) and it works as expected.
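The index-to-string lookup described above can be illustrated without any AnyLogic classes. The following is only a sketch in plain Java: ScenarioTable, maximumIterations and parameterFor are hypothetical names (not AnyLogic API), the scenario labels are made up, and the index is assumed to be zero-based.

```java
final class ScenarioTable {
    private final String[] scenarios;

    ScenarioTable(String... scenarios) {
        // Defensive copy so later changes to the caller's array have no effect.
        this.scenarios = scenarios.clone();
    }

    /** Number of runs: one per configured string value. */
    int maximumIterations() {
        return scenarios.length;
    }

    /** String parameter for the run with the given zero-based index. */
    String parameterFor(int index) {
        return scenarios[index];
    }
}
```

In a setup like the one described, the overridden iteration count would return something like maximumIterations(), and each run would fetch its string through its own run index.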

Let Matlab continue without waiting for a result

I have the following question: How do I tell Matlab that it should not wait for the results of a function? Is there a way other than threads?
My problem: I have a function A that is called by a Timer every few seconds. If a specific event is met, another function B is called inside function A. Function B opens a Batch File.
I want function A to go on without waiting for function B to end. Is there a way to easily do it?
I'm sorry if this question was already asked, but I couldn't find a satisfying answer. Please also excuse my bad English.
I would like to thank everyone who answers for their help.
In your function B, just call the batch file with a & at the end of the line.
For example:
!mybatch.bat &
This will run the file mybatch.bat in background mode and will return execution to Matlab immediately after the call.
or if you prefer the complete form:
[status, result] = system('mybatch.bat &')
But in this case it is a bit useless: since the system call runs mybatch in the background, the result variable is always empty and status is always 0 (whether or not a file mybatch.bat was found and executed).
edit: That is the quick trick in case it is only the batch file execution which is slowing down your program.
If you have more matlab instructions in function B and you really need function A to go on without waiting, you will have to set up a listener object with function B as a callback. Then in your function A, trigger the event (which will activate the listener and call function B).

Force parfor to respect some order

I understand that some indeterminism stems from parfor's parallel nature but I don't understand why it should be entirely random. Is there any way to force parfor to respect (at least loosely) the order of the loop? More specifically I would like that in the case of:
parfor i=1:100
do_independent_stuff()
end
each worker of the pool, when asking for a new task (i.e. in this case a new iteration of the loop), would be assigned the lowest i that hasn't yet been computed or assigned to a worker.
I think it's by design that running something in parallel assumes that order is not important, at least in Matlab. Each thread/worker should be independent of the others. However, as indicated in this question, you could try the job and task control interface to get some level of control.
Firstly, in practice, PARFOR isn't "entirely random" - you can easily observe that it sends out chunks of loop iterates in reverse order. In R2013b and later, if you need more control over ordering (if, for example, you know that certain of your independent things are likely to take a long time, and therefore wish to start computing them first), you can use PARFEVAL.
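The general idea of controlling start order by submitting tasks one by one, rather than handing a whole loop to the scheduler, can be sketched outside MATLAB. The following is a Java analogue using a fixed thread pool, not PARFEVAL itself: OrderedSubmission and startOrder are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class OrderedSubmission {
    /**
     * Submits one task per entry of submissionOrder, in that order, and
     * returns the order in which the tasks recorded their start.
     */
    static List<Integer> startOrder(List<Integer> submissionOrder, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Integer> started = Collections.synchronizedList(new ArrayList<>());
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i : submissionOrder) {
                // Idle workers take queued tasks first-in, first-out.
                futures.add(pool.submit(() -> { started.add(i); }));
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion, rethrowing task failures
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return new ArrayList<>(started);
    }
}
```

With a single worker the recorded order equals the submission order exactly. With several workers, tasks are still dequeued in submission order, but tasks running at the same time may record their start in a slightly different order.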
If you need to loosely synchronize things, for instance wait until some thread has finished or has reached some point before starting another one, it would be best to use semaphores, locks, mutexes, etc.
I don't know if the Parallel Computing Toolbox includes such synchronization objects, but here is a workaround to create a semaphore, for instance:
https://stackoverflow.com/a/22874669/684399
You can also use objects in 'System.Threading' namespace (requires .NET):
Init:
someResultAvailable = System.Threading.ManualResetEvent(false);
In some job:
... do work ...
someResultAvailable.Set();
... continue ...
In another one:
... do work ...
if ~someResultAvailable.WaitOne(10000)
    error('Timeout waiting for result from other thread');
end
... continue now knowing that results are available ...
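The same signal-and-wait-with-timeout pattern can be shown as a self-contained program. The following is a Java analogue, not the .NET API used above: a CountDownLatch with a count of 1 plays the role of the ManualResetEvent, and ResultSignal, produceAndAwait and the value 42 are hypothetical placeholders.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class ResultSignal {
    /** Starts a producer thread and waits up to timeoutMillis for its result. */
    static int produceAndAwait(long timeoutMillis) {
        CountDownLatch resultAvailable = new CountDownLatch(1);
        int[] result = new int[1];

        Thread producer = new Thread(() -> {
            result[0] = 42;               // stands in for "... do work ..."
            resultAvailable.countDown();  // like Set(): the latch stays open afterwards
        });
        producer.start();

        try {
            // await returns false if the timeout elapses before countDown().
            if (!resultAvailable.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException(
                    "Timeout waiting for result from other thread");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
        // countDown() happens-before a successful await(), so the write is visible.
        return result[0];
    }
}
```

As with the manual-reset event, any number of waiters can pass once the signal has been given, because the latch never closes again.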