Is appdata shared between workers in a parallel pool? - matlab

I'm working on a complicated function that calls several subfunctions (within the same file). To pass data around, the setappdata/getappdata mechanism is used occasionally. Moreover, some subfunctions contain persistent variables (initialized once in order to save computations later).
I've been considering whether this function can be executed on several workers in a parallel pool, but became worried that there might be some unintended data sharing (which would otherwise be unique to each worker).
My question is - how can I tell if the data in global and/or persistent and/or appdata is shared between the workers or unique to each one?
Several possibly-relevant things:
In my case, tasks are completely parallel and their results should not affect each other in any way (parallelization is done simply to save time).
There aren't any temporary files or folders being created, so there is no risk of one worker mistakenly reading the files that were left by another.
All persistent and appdata-stored variables are created/assigned within subfunctions called from the parfor loop.
I know that each worker corresponds to a new process with its own memory space (and presumably, global/persistent/appdata workspace). Based on that and on this official comment, I'd say it's probable that such sharing does not occur... But how do we ascertain it?
Related material:
This Q&A.
This documentation page.

This is quite straightforward to test, and we shall do it in two stages.
Step 1: Manual Spawning of "Workers"
First, create these 3 functions:
%% Worker 1:
function q52623266_W1
global a; a = 5;
setappdata(0, 'a', a);
someFuncInSameFolder();
end

%% Worker 2:
function q52623266_W2
global a; disp(a);
disp(getappdata(0,'a'));
someFuncInSameFolder();
end

%% Shared subfunction (a 3rd file in the same folder):
function someFuncInSameFolder()
persistent b;
if isempty(b)
    b = 10;
    disp('b is now set!');
else
    disp(b);
end
end
Next, we boot up 2 MATLAB instances (representing two different workers of a parallel pool), run q52623266_W1 on one of them, wait for it to finish, and then run q52623266_W2 on the other. If data were shared, the 2nd instance would print 5 (global), 5 (appdata) and 10 (persistent). Instead, this results (on R2018b) in:
>> q52623266_W1
b is now set!
>> q52623266_W2
b is now set!
Which means that data is not shared. So far so good, but one might wonder whether this setup represents an actual parallel pool. So we adjust our functions a bit and move on to the next step.
Step 2: Automatic Spawning of Workers
function q52623266_Host
spmd(2)
    if labindex == 1
        setupData();
    end
    labBarrier; % make sure that the setup stage was executed.
    if labindex == 2
        readData();
    end
end
end

function setupData
global a; a = 5;
setappdata(0, 'a', a);
someFunc();
end

function readData
global a; disp(a);
disp(getappdata(0,'a'));
someFunc();
end

function someFunc()
persistent b;
if isempty(b)
    b = 10;
    disp('b is now set!');
else
    disp(b);
end
end
Running the above we get:
>> q52623266_Host
Starting parallel pool (parpool) using the 'local' profile ...
connected to 2 workers.
Lab 1:
b is now set!
Lab 2:
b is now set!
Which again means that data is not shared. Note that in the second step we used spmd, which should function similarly to parfor for the purposes of this test.
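For reference, a parfor flavour of the same probe might look like the sketch below (untested; function names are made up). Keep in mind that while nothing is shared between workers, each worker's own global/persistent/appdata state does survive across the parfor iterations (and any earlier pool work) that land on that worker, so repeated visits to a worker print whatever was set there previously.
function q52623266_parforCheck
% Global/persistent declarations are not allowed directly inside a parfor
% body, so all the probing happens inside a helper function on the worker.
parfor k = 1:8
    probeState(k);
end
end

function probeState(k)
global a
persistent b
fprintf('iteration %2d: global a = [%s], appdata a = [%s], persistent b = [%s]\n', ...
    k, num2str(a), num2str(getappdata(0, 'a')), num2str(b));
if isempty(b)
    b = 10;   % set once per worker process
end
end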

There is another case of data not being shared that bit me: persistent variables are not even copied from the client MATLAB session to the workers.
To demonstrate, create a simple function with a persistent variable (MATLAB R2017a):
function [ output_args ] = testPersist( input_args )
%TESTPERSIST Simple persistent variable test.
persistent var
if (isempty(var))
var = 0;
end
if (nargin == 1)
var = input_args;
end
output_args = var;
end
And a short script is executed:
testPersist(123); % Set persistent variable to 123.
tpData = zeros(100,1);
parfor i = 1 : 100
tpData(i) = testPersist;
testPersist(i);
end
any(tpData == 0) % This implies the workers started from 0 instead of the 123 set on the first line.
The output is 1: the workers disregarded the 123 set in the client session and started afresh.
Inspecting the values in tpData also shows how the work was distributed; for example, tpData(14) = 15 means that the worker which had just completed iteration 15 picked up iteration 14 next.
So creating a worker means creating a completely new MATLAB process, unrelated to the MATLAB instance you have open in front of you, and this holds for every worker separately.
The lesson I took from this: don't use plain persistent variables as a stand-in for a simulation configuration. It worked fine and looked elegant as long as no parfor was used... but broke horribly afterwards. Use objects.
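A minimal sketch of that lesson (hypothetical field names): build the configuration once on the client and pass it in explicitly, so every worker receives its own consistent copy instead of silently starting from an empty persistent variable.
cfg = struct('gain', 123);        % or a small value class holding the config
tpData = zeros(100, 1);
parfor i = 1:100
    tpData(i) = cfg.gain + i;     % cfg is broadcast to each worker automatically
end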


Clear persistent variables in local functions from within the main function

I have code consisting of a single file with multiple functions, some of which use persistent variables. For the code to work correctly, the persistent variables must be empty when the main function is called.
There are documented ways to clear persistent variables in a multi-function file, such as:
clear functionName % least destructive
clear functions % more destructive
clear all % most destructive
Unfortunately, I cannot guarantee that the user remembers to clear the persistent variables before calling the function, so I'm exploring ways to perform the clearing operation at the beginning of the code. To illustrate the problem, consider the following example:
function clearPersistent(methodId)
if ~nargin, methodId = 0; end
switch methodId
    case 0
        % do nothing
    case 1
        clear(mfilename);
    case 2
        eval(sprintf('clear %s', mfilename));
    case 3
        clear functions;
    case 4
        clear all;
end
subfunction();
subfunction();
end

function [] = subfunction()
persistent val
if isempty(val)
    disp("val is empty");
    val = 123;
else
    disp("val is not empty");
end
end
When first running this, we get:
>> clearPersistent
val is empty
val is not empty
I would expect that running the function again at this point with any of the non-zero inputs would result in val being cleared, but alas, this is not the case. Once val is set, it remains set unless we use one of the alternatives shown in the top snippet externally, or modify the .m file.
My question: Is it possible to clear persistent variables in subfunctions from within the body of the main function, and if so, how?
In other words, I'm looking for some code that I can put in clearPersistent before calling the subfunctions, such that the output is consistently:
val is empty
val is not empty
P.S.
Here's a related past question (which doesn't deal with this specific use case): List/view/clear persistent variables in Matlab.
I'm aware of the possibility of rewriting the code to not use persistent variables at all (e.g. by passing data around, using appdata, adding a 'clear' flag to all subfunctions, etc.).
Please note that editing the source code of the function and saving implicitly clears it (along with all persistent variables).
I'm aware that the documentation states that "The clear function does not clear persistent variables in local or nested functions."
Additional background on the problem:
The structure of the actual code is as follows:
Main function (called once)
└ Global optimization solver (called once)
└ Objective function (called an unknown N≫1 times)
└ 1st function that uses persistents
└ 2nd function that uses persistents
As mentioned in the comments, there are several reasons why some variables were made persistent:
Loose coupling / SoC: The objective function does not need to be aware of how the subfunctions work.
Encapsulation: It is an implementation detail. The persistent variables do not need to exist outside the scope of the function that uses them (i.e. nobody else ever needs them).
Performance: The persistent variables contain matrices that are fairly expensive to compute, but this operation needs to happen only once per invocation of the main function.
One (side?) effect of using persistent variables is making the entire code stateful (with two states: before and after the expensive computations). The original issue stems from the fact that the state was not being correctly reset between invocations of the main function, causing runs to rely on a state created with previous (and thus invalid) configurations.
It is possible to avoid being stateful by computing the one-time values in the main function (which currently only parses user-supplied configurations, calls the solver, and finally stores/displays outputs), then passing them alongside the user configurations into the objective function, which would then pass them on to the subfunctions. This approach solves the state-clearing problem, but hurts encapsulation and increases coupling, which in turn might hurt maintainability.
Unfortunately, the objective function has no flag that says 'init' etc., so we don't know if it's called for the 1st or the nth time, without keeping track of this ourselves (AKA state).
The ideal solution would have several properties:
Compute expensive quantities once.
Be stateless.
Not pass irrelevant data around (i.e. "need to know basis"; individual function workspaces only contain the data they need).
clear fname and clear functions remove the M-file from memory. The next time you run the function, it is read from disk again, parsed, and compiled into bytecode. Thus, you slow down the next execution of the function.
Clearing a function or subfunction from within that same function thus does not work: while you're running the function, you cannot clear its file from memory.
My solution would be to add an option to subfunction to clear its persistent variable, like so:
function clearPersistent()
subfunction('clear');
subfunction();
subfunction();
end

function [] = subfunction(option)
persistent val
if nargin > 0 && ischar(option) && strcmp(option, 'clear')
    val = [];
    return
end
if isempty(val)
    disp("val is empty");
    val = 123;
else
    disp("val is not empty");
end
end
Of course you could initialize your value when called as subfunction('init') instead.
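For instance, an 'init' branch might look like the sketch below (expensiveComputation is a placeholder for whatever the real one-time setup is):
function [] = subfunction(option)
persistent val
if nargin > 0 && strcmp(option, 'init')
    val = expensiveComputation();   % hypothetical one-time setup
    return
end
% ... the rest of the function as before, now assuming val is initialized
end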
A different solution that might work for your use case is to separate the computation of val from its use. I would find this easier to read than any of the other solutions, and it performs just as well.
function main()
val = computeval();
subfunction(val);
subfunction(val);
end
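The snippet above leaves computeval and subfunction undefined; a minimal sketch of what they might look like (hypothetical bodies, just to make the idea concrete):
function val = computeval()
% The expensive one-off computation that previously hid behind the persistent variable.
val = 123;
end

function subfunction(val)
% val now arrives as an ordinary argument, so there is no state to clear.
disp(val);
end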
Given your edit, you could put the objective function in a separate file (in the private subdirectory). You will be able to clear it.
An alternative to persistent variables would be to create a user class with a constructor that computes the expensive state, and another method to compute the objective function. This could also be a classdef file in the private subdirectory. I think this is nicer because you won’t need to remember to call clear.
In both these cases you don’t have a single file containing all the code any more. I think you need to give up on one of those two ideals: either break data encapsulation or split the code across two files (code encapsulation?).
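A rough sketch of that classdef alternative (all names here are hypothetical; expensiveSetup stands in for whatever builds the costly matrices):
classdef ExpensiveObjective
    properties (SetAccess = immutable)
        M   % the precomputed matrices, built once per construction
    end
    methods
        function obj = ExpensiveObjective(cfg)
            obj.M = expensiveSetup(cfg);   % the one-time expensive work
        end
        function f = evaluate(obj, x)
            f = sum(obj.M * x);            % placeholder objective using the cached matrices
        end
    end
end
The main function would then construct one ExpensiveObjective per run and hand the solver a handle such as @(x) obj.evaluate(x), so every fresh run automatically starts from fresh state.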
Why not use global variables?
You can create a global struct that contains your variables and it can be managed using a variable_manager:
function main
variable_manager('init')
subfunction1()
subfunction2()
end
function variable_manager(action)
global globals
switch action
    case 'init'
        globals = struct('val',[],'foo',[]);
    case 'clear'
        globals = structfun(@(x)[], globals, 'UniformOutput', false);
    % case ...
    %     ...
end
end
function subfunction1
global globals
if isempty(globals.val)
disp("val is empty");
globals.val = 123;
else
disp("val is not empty");
end
end
function subfunction2
global globals
if isempty(globals.foo)
disp("foo is empty");
globals.foo = 321;
else
disp("foo is not empty");
end
end
As mentioned in the question, one of the possibilities is using appdata, which is not too different from global variables (at least when attaching them to "object 0", i.e. the MATLAB root object). To avoid collisions with other scripts/functions we introduce a random string (if every function that uses this storage technique generates its own random string, collisions are essentially ruled out). The main downside of this approach is that the string has to be hard-coded in multiple places, or the structure of the code has to be changed such that the functions that use this appdata are nested within the function that defines it.
The two ways to write this are:
function clearPersistent()
% Initialization - clear the first time:
clearAppData();
% "Business logic"
subfunction();
subfunction();
% Clear again, just to be sure:
clearAppData();
end % clearPersistent
function [] = subfunction()
APPDATA_NAME = "pZGKmHt6HzkkakvdfLV8"; % Some random string, to avoid "global collisions"
val = getappdata(0, APPDATA_NAME);
if isempty(val)
disp("val is empty");
val = 123;
setappdata(0, APPDATA_NAME, val);
else
disp("val is not empty");
end
end % subfunction
function [] = clearAppData()
APPDATA_NAME = "pZGKmHt6HzkkakvdfLV8";
if isappdata(0, APPDATA_NAME)
rmappdata(0, APPDATA_NAME);
end
end % clearAppData
and:
function clearPersistent()
APPDATA_NAME = "pZGKmHt6HzkkakvdfLV8";
% Initialization - clear the first time:
clearAppData();
% "Business logic"
subfunction();
subfunction();
% Clear again, just to be sure:
clearAppData();
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    function [] = subfunction()
        val = getappdata(0, APPDATA_NAME);
        if isempty(val)
            disp("val is empty");
            val = 123;
            setappdata(0, APPDATA_NAME, val);
        else
            disp("val is not empty");
        end
    end % subfunction

    function [] = clearAppData()
        if isappdata(0, APPDATA_NAME)
            rmappdata(0, APPDATA_NAME);
        end
    end % clearAppData
end % clearPersistent

Pause JModelica and Pass Incremental Inputs During Simulation

Hi Modelica Community,
I would like to run two models in parallel in JModelica but I'm not sure how to pass variables between the models. One model is a python model and the other is an EnergyPlusToFMU model.
The examples in the JModelica documentation have the inputs for the full simulation period defined prior to simulating the model. I don't understand how one would configure a model that pauses for inputs, which is a key feature of FMUs and co-simulation.
Can someone provide me with an example or piece of code that shows how this could be implemented in JModelica?
Do I put the simulate command in a loop? If so, how do I handle warm up periods and initialization without losing data at prior timesteps?
Thank you for your time,
Justin
Late answer, but in case it is picked up by others...
You can indeed put the simulation into a loop; you just need to keep track of the state of your system so that you can re-initialize it at every iteration. Consider the following example:
Ts = 100
x_k = x_0
for k in range(100):
    # Do whatever you need to get your input here
    u_k = ...
    FMU.reset()
    FMU.set(x_k.keys(), x_k.values())
    sim_res = FMU.simulate(
        start_time=k*Ts,
        final_time=(k+1)*Ts,
        input=u_k
    )
    x_k = get_state(FMU, sim_res, -1)  # final state of this segment (helper below)
Now, I have written a small function to grab the state, x_k, of the system:
# Get state names and their values at given index
def get_state(fmu, results, index):
    # Identify states as variables with a _start_ value
    identifier = "_start_"
    keys = fmu.get_model_variables(filter=identifier + "*").keys()
    # Now, loop through all states, get their value and put it in x
    x = {}
    for name in keys:
        x[name] = results[name[len(identifier):]][index]
    # Return state
    return x
This relies on compiling the FMU with the "state_initial_equations": True compiler option, which exposes each continuous state through a _start_<state name> parameter that can then be set before simulating the next segment.

What is the benefit of automatic variables?

I'm looking for the benefits of "automatic" in SystemVerilog.
I have seen the "automatic" factorial example, but I can't get through it. Does anyone know why we use "automatic"?
Traditionally, Verilog has been used for modelling hardware at the RTL and gate-level abstractions. Since both RTL and gate-level abstractions are static/fixed (non-dynamic), Verilog supported only static variables. So, for example, any reg or wire in Verilog would be instantiated/mapped at the beginning of simulation and would remain mapped in the simulation memory until the end of simulation. As a result, you can dump any wire/reg as a waveform, and the reg/wire will have a value from the beginning till the end, since it is always mapped. From a programmer's perspective, such variables are termed static. In the C/C++ world, to declare such a variable you have to use the storage class specifier static. In Verilog, every variable is implicitly static.
Note that until the advent of SystemVerilog, Verilog supported only static variables. Even though Verilog also supported some constructs for modelling at behavioural abstraction, the support was limited by absence of automatic storage class.
automatic (called auto in software world) storage class variables are mapped on the stack. When a function is called, all the local (non-static) variables declared in the function are mapped to individual locations in the stack. Since such variables exist only on the stack, they cease to exist as soon as the execution of the function is complete and the stack correspondingly shrinks.
Amongst other advantages, one possibility that this storage class enables is recursive functions. In the Verilog world, a function cannot be re-entrant. Recursive (or re-entrant) functions do not serve any useful purpose in a world where the automatic storage class is not available. To understand this, you can imagine a re-entrant function as a function which dynamically makes multiple recursive instantiations of itself. Each instance gets its automatic variables mapped on the stack. As we progress into the recursion, the stack grows and each call gets to make its computations using its own set of variables. When the function calls return, the computed values are collated and a final result is made available. With only static variables, each function call would store the variable values at the same common locations, thus erasing any benefit of having multiple calls (instantiations).
Coming to the factorial algorithm, it is relatively easy to conceptualize factorial as a recursive algorithm. In maths we write factorial(n) = n * factorial(n-1). So you need to calculate factorial(n-1) in order to know factorial(n). Note that the recursion cannot complete without a terminating case, which in the case of factorial is n = 1.
function automatic int factorial;
  input int n;
  if (n > 1)
    factorial = factorial(n - 1) * n;
  else
    factorial = 1;
endfunction
Without automatic storage class, since all variables in a function would be mapped to a fixed location, when we call factorial(n-1) from inside factorial(n), the recursive call would overwrite any variable inside the caller context. In the factorial function as defined in the above code snippet, if we do not specify the storage class as automatic, both n and the result factorial would be overwritten by the recursive call to factorial(n-1). As a result the variable n would consecutively be overwritten as n-1, n-2, n-3 and so on till we reach the terminating condition of n = 1. The terminating recursive call to factorial would have a value of 1 assigned to n and when the recursion unwinds, factorial(n-1) * n would evaluate to 1 in each stage.
With the automatic storage class, each recursive function call has its own place in memory (actually on the stack) to store variable n. As a result, consecutive calls to factorial do not overwrite the caller's variable n, and when the recursion unwinds we get the right value for factorial(n), namely n*(n-1)*(n-2)*...*1.
Note that it is possible to define factorial using iteration as well. And that can be done without use of automatic storage class. But in many cases, recursion makes it possible for the user to code algorithms in a more intuitive fashion.
I propose an example below (using fork...join_none):
Ex. 1 (without automatic): all spawned threads print the same value, namely whatever i holds after the loop has exited, because the threads only start running once the loop is done and i lives in a single static memory location. This may be a bug in your code.
initial begin
  for (int i = 0; i <= 3; i++)
    fork
      $write("%d ", i);
    join_none
end
Ex. 2 (using automatic): the output will be "0 1 2 3". In each iteration the value of i is copied into k, and fork...join_none spawns a thread that uses that value of k (each iteration allocates its own storage for k: k0, k1, k2, k3):
initial begin
  for (int i = 0; i <= 3; i++)
    fork
      automatic int k = i;
      $write("%d ", k);
    join_none
end
Another example is the use of fork...join_none inside a for loop. Without automatic, the fork inside the loop will not work as intended, because every spawned thread would read the same loop variable:
for (int i = 0; i < `SOME_VALUE; i++) begin
  automatic int id = i;
  fork
    // task/function call using the per-iteration copy `id` goes here
    // ...
  join_none
end

MATLAB: Parpool shared data access

I use:
if isempty(gcp('nocreate'))
parpool([ 1, Inf ]);
end
... to create a parpool in my wrapper function wrapper.m, which gives me 4 workers on my desktop. wrapper.m calls a file foo.m, which in turn calls bar.m several times.
The wrapper function generates heavy data, which is required in bar.m on a purely read-only basis:
wrapper.m:
genSpline = griddedInterpolant({ gridData.xgv, gridData.ygv, gridData.zgv }, ...
gridData.data, 'spline', 'spline');
bar.m:
val = genSpline(interPts);
When genSpline is passed as an argument to bar.m via foo.m, each worker in the pool keeps its own private copy of it, causing enormous memory overhead due to the redundant data. However, the program works fine this way.
In an effort to work around this, I prefixed the definition and use of gridData and genSpline with:
global gridData genSpline;
... as the documentation seems to suggest. However, this fails with:
'Subscript indices must either be real positive integers or logicals.'
... in bar.m. Reverting to passing via arguments proves that there is nothing wrong with interPts. Printing the definition and use in the global-variable version gives this:
wrapper.m:
genSpline =
griddedInterpolant with properties:
GridVectors: {[1x41 double] [1x41 double] [1x12 double]}
Values: [41x41x12 double]
Method: 'spline'
ExtrapolationMethod: 'spline'
bar.m:
genSpline =
[]
... implying that either the global variable isn't being set properly, or for some reason is inaccessible to bar.m. There is no distributed network involved, and all files are within the same directory, which is on the MATLAB (R2014a 64-bit UNIX) path. Any suggestions?
PS: The same approach towards declaring and using global variables works with a 'regular' 2x2 matrix.
The MATLAB workers in a parallel pool are completely separate processes, and as such you cannot share data from the client simply by declaring it to be global. You might be able to use a shared matrix to solve this.
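To see why the global approach fails, a small check along these lines can be run (a sketch, untested, with hypothetical function names): a global set in the client session simply does not exist in the workers' own global workspaces.
function checkGlobalOnWorkers()
global gridData
gridData = rand(10);          % set in the client process
parfor i = 1:2
    probeGlobal(i);           % each iteration runs in a worker process
end
end

function probeGlobal(i)
global gridData               % this refers to the WORKER's global workspace
if isempty(gridData)
    fprintf('iteration %d: gridData is empty on this worker\n', i);
else
    fprintf('iteration %d: gridData is visible on this worker\n', i);
end
end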

Variables in a vector change back after the call stack returns

I am using a recursive call on a tree in MATLAB; the basic structure of the function is here:
function recursion(tree, targetedFeatures)
if (some conditions fulfilled)
return;
end
for i = 1:1:size(targetedFeatures,2)
.....
.....
if (some conditions that using index i is true)
targetedFeatures(1,i) = 1;
end
end
if(tree has child nodes)
recursion(tree.child(j), targetedFeatures)
end
end
The structure of the tree is like this:
root
/ | \
/ | \
/ | \
leaf1 leaf2 leaf3
The input parameter of the function recursion is a vector named targetedFeatures. Assume its initial value is [0 0 0]; in the process of visiting leaf1 the vector is changed to [1 0 0], BUT when visiting leaf2, targetedFeatures has changed back to [0 0 0].
I suspect this is because a vector in MATLAB is not like a reference to an object in other programming languages?
How can I avoid this issue? Thanks.
MATLAB uses call-by-value for normal types of variables, see here. A way to work around this is to have the function return the modified copy as an output argument:
function targetedFeatures = recursion(tree, targetedFeatures)
...
targetedFeatures = recursion(tree.child(j), targetedFeatures);
...
end
Instead, call-by-reference might be simulated by using evalin('caller', ...) and inputname.
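A minimal sketch of that trick (using assignin, the writing counterpart of evalin; note it only works when the caller passed a plain variable, since inputname returns '' for expressions, and it is generally discouraged):
function pseudoByRef(x)
x(1) = 1;                               % modify the local copy as usual
assignin('caller', inputname(1), x);    % then push it back into the caller's variable
end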
When the recursion function needs to modify targetedFeatures, a copy of targetedFeatures is created which is local to that function call. If you want your updates to be communicated back to the calling scope, then you will need to return the updated targetedFeatures from your function.
function targetedFeatures = recursion(tree, targetedFeatures)
if (some conditions fulfilled)
return;
end
for i = 1:1:size(targetedFeatures,2)
.....
.....
if (some conditions that using index i is true)
targetedFeatures(1,i) = 1;
end
end
if(tree has child nodes)
targetedFeatures = recursion(tree.child(j), targetedFeatures);
end
end
This is not nearly as effective as doing things with pointers like you might do in C for example, but you should not see a significant performance hit on what your code is already doing, since you are already creating local copies whenever you update targetedFeatures.
Thanks to chappjc for providing a link to this post which discusses the copy-on-write mechanism.
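In short, MATLAB's copy-on-write means that handing a large array to a function is cheap until (and unless) the function actually writes to it, for example:
A = rand(5000);    % ~200 MB of data
B = A;             % no copy is made yet; B shares A's storage
B(1) = 0;          % the actual copy happens here, on the first write to B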
Depending on the size and depth of your tree, the return-based solutions above quickly become unwieldy, since you always have to propagate changes back up to the root node, while in principle you only want to change one of many leaves.
Instead, you might want to look into implementing a handle class for the TreeNode object.
This would start off with something as simple as:
classdef TreeNode < handle
    properties
        targetedFeatures;
        child;   % vector keeping handles to TreeNode children
        parent;  % handle of the parent node, of which this node is a child
    end
    methods
        ...
    end
end
You'd obviously have to fill in methods to add/remove children etc.
With such a tree you can recurse down to the deepest leaf and change its value without the need to carry around a reference to the top level root node all the time.
Once you have this in place, you should be able to use your function without modification.
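For example, a recursive marker working on such handle nodes could look like this sketch (assuming the targetedFeatures and child properties from the skeleton above):
function markFeature(node, featureIdx)
% Because TreeNode is a handle class, this change is visible to every
% holder of the same node handle; nothing needs to be returned.
node.targetedFeatures(featureIdx) = 1;
for c = 1:numel(node.child)
    markFeature(node.child(c), featureIdx);
end
end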
An implementation of a somewhat similar class is one for linked lists, demonstrated in the MATLAB docs:
http://www.mathworks.de/de/help/matlab/matlab_oop/example--implementing-linked-lists.html
Here every node has a previous and next "child", instead of a parent and multiple children, but the general structure is pretty similar.
If you're planning to have a lot of other operations on this tree, like adding/removing nodes, searching, etc. it will definitely be worth it at some point.
If you just happened to come across that tree and you're done once you've fixed this single issue, then go for the return-based solutions.