I have a MATLAB program that I want to run in parallel so that it runs faster. However, when I do that parallel workers seem not to be able to access global variables created beforehand. Here is what my code looks like:
createData % a .m file that creates a global variable (Var)
parfor i = 1:j
processData() % a function that is dependent on some global variables
end
However, I get the error message "Undefined function or variable 'Var'". I've already included a declaration (global Var) inside the function processData(), but this is not working either. Is there any way of making global variables visible within the parallel loop?
This is not the same question as here, as I declared the global variables outside of the parfor loop and only want to read them within the loop, without the need to modify or update their values across the workers of the parallel loop.
The simplest advice is: don't use global for the myriad reasons already described/linked here. Ideally, you would restructure your code like so:
Var = createData(); % returns 'Var' rather than creating a global 'Var'
parfor idx = ...
% simply use 'Var' inside the parfor loop.
out(idx) = processData(Var, ...);
end
Note that parfor is smart enough to send Var to each worker exactly once for the above loop. However, it isn't smart enough to avoid sending it again if you have multiple parfor loops. In that case, I would suggest using parallel.pool.Constant. How you use that depends on the cost of creating Var compared to its size. If it is small but expensive to create, you're best off creating it only once at the client and sending it to the workers, like this:
cVar = parallel.pool.Constant(Var);
If it is large, but relatively quick to construct, you could consider getting the workers each to construct their own copy independently, like this:
cVar = parallel.pool.Constant(@createData); % invokes 'createData' on each worker
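To illustrate, here is a minimal sketch of the first variant across two loops; processData and N are placeholders standing in for the question's code:

% Sketch: reusing one parallel.pool.Constant across multiple parfor loops.
% Assumes 'createData' returns the shared data rather than setting a global.
Var = createData();
cVar = parallel.pool.Constant(Var);   % transferred to each worker once

out1 = zeros(1, N);
parfor idx = 1:N
    out1(idx) = processData(cVar.Value, idx);  % read via the Value property
end

out2 = zeros(1, N);
parfor idx = 1:N
    out2(idx) = processData(cVar.Value, idx);  % no re-transfer for this loop
end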
Citing the author of the parallel toolbox:
GLOBAL data is hard to use inside PARFOR because each worker is a separate MATLAB process, and global variables are not synchronised from the client (or any other process) to the workers.
Emphasis mine. So the only way to get a global variable on a worker (which is a bad idea for reasons mentioned in the linked post) is to write a function which sets up the global variables, run that on each worker, then run your own, global-dependent function.
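If you must go this route anyway, a hedged sketch of that setup could look like the following; setupGlobals is an assumed helper that declares and fills the globals, and pctRunOnAll executes it once on every worker in the pool:

% Run the setup function on every worker process, so each worker
% gets its own copy of the global in its own MATLAB session.
parpool(4);
pctRunOnAll setupGlobals   % setupGlobals declares 'global Var' and assigns it

parfor i = 1:j
    processData();         % processData declares 'global Var' and reads it
end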
Citing another comment of mine to illustrate why this is a bad idea:
One of the pitfalls in terms of good practice is that you can suddenly overwrite a variable used inside one function from other functions. It can therefore be difficult to keep track of changes, and going back and forth between functions might cause unexpected behaviour because of that. This happens especially often if you call your global variables things like h, a, etc. (which of course also makes for bad reading when the variable is not global).
And finally an article outlining most of the reasons using global variables is generally a bad idea.
Bottom line: what you want is not possible, and is generally considered bad practice.
Related
First, I have had a look at this excellent article already.
I have a MATLAB script called sdp, and another MATLAB script called track. I run track after sdp, as track uses some of the outputs from sdp. To run track I need to call a function called action many, many times. I have action defined as a function in a separate MATLAB file. Each call of action has some inputs, say x1, x2, x3, but x2, x3 are just "data" which will never change: they were the same in sdp, the same in track, and will remain the same in action. Here, x2, x3 are huge matrices, and there are many of them (think x2, x3, ..., x10).
The lame way is to define x2, x3 as global in sdp and then in track, so I can call action with only x1. But this slows down my performance incredibly. How can I call action again and again with only x1 such that it remembers what x2, x3 are? Each call is very fast, and if I inline the computation, for example, it is super fast.
Perhaps I can use some persistent variables. But I don't understand exactly if they are applicable to my example. I don't know how to use them exactly either.
Have a look at object-oriented programming in MATLAB. Make an action object and assign its member variables x2, ... from the results of sdp. You can then call a method of that object with only x1. Think of the object as a function with state, where the state in your case consists of the constant results of sdp.
Another way to do this would be to use a functional approach where you pass action to track as a function handle, where it can operate on the variables of track.
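A minimal sketch of the object-oriented variant; the class and property names are illustrative, not taken from the original scripts:

% ActionRunner.m - caches the constant data once; 'run' then needs only x1.
classdef ActionRunner < handle
    properties
        x2   % constant data produced by 'sdp'
        x3   % constant data produced by 'sdp'
    end
    methods
        function obj = ActionRunner(x2, x3)
            obj.x2 = x2;
            obj.x3 = x3;
        end
        function result = run(obj, x1)
            % Stand-in for the real computation done in 'action'
            result = x1 + obj.x2 * obj.x3;
        end
    end
end

Construct it once (runner = ActionRunner(bigX2, bigX3);) and then call result = runner.run(x1) inside the loop; since it derives from handle, the big matrices are not copied on each call.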
Passing large matrices in MATLAB is efficient. Semantically it uses call-by-value, but it's implemented as call-by-reference until modified. Wrap all the unchanging parameters in a struct of parameters and pass it around.
params.x2 = 1;
params.x3 = [17 39];
params.minimum_velocity = 19;
action('advance', params);
You've already discovered that globals don't perform well. Don't worry about the syntactic sugar of hiding variables somewhere... there are advantages to clearly seeing where the inputs come from, and performance will be good.
This approach also makes it easy to add new data members, or even auxiliary metadata, like a description of the run, the time it was executed, etc. The structs can be combined into arrays to describe multiple runs with different parameters.
I hate using global variables, and everyone should. If a language has no way around using global variables it should be updated. Currently, I don't know any good alternative to using global variables in Matlab, when efficiency is the goal.
Sharing data between callbacks can be done in only 4 ways that I am aware of:
nested functions
getappdata (what guidata uses)
handle-derived class objects
global variables
Nested functions force the entire project to be in a single m-file, and handle-derived class objects (passed to callbacks) gave unreasonable overhead last I checked.
Comparing getappdata/guidata with global variables: in a given callback you can write (assuming uglyGlobal exists as a 1000x1000 matrix):
global uglyGlobal;
prettyLocal = uglyGlobal;
prettyLocal(10:100,10:100) = 0;
uglyGlobal = prettyLocal;
or you can write (assuming uglyAppdata exists as a 1000x1000 matrix):
prettyLocal = getappdata(0,'uglyAppdata');
prettyLocal(10:100,10:100) = 0;
setappdata(0,'uglyAppdata',prettyLocal);
The two snippets above work the same way. It could be (but is not guaranteed to be) more efficient to write just:
global uglyGlobal;
uglyGlobal(10:100,10:100) = 0;
This snippet, unlike the previous ones, may not trigger a copy-on-write in Matlab. The data in the global workspace is modified, and (potentially) only there.
However, if we make the innocent-looking modification:
global uglyGlobal;
prettyLocal = uglyGlobal;
uglyGlobal(10:100,10:100) = 0;
MATLAB will ensure that prettyLocal gets its own copy of the data, and the third line above will show up as the culprit when you profile. To make things worse (globals tend to do that to my brain), every other workspace that holds a local reference to the global will trigger its own copy-on-write for that variable, one copy each.
However, provided one makes sure that no local references exist, it is arguably true that global variables, used carefully, can yield some of the best-performing programs in MATLAB.
Note: I would provide some timing results, but unfortunately I no longer have access to MATLAB.
I have to work with a lot of data and run the same MATLAB program more than once, and every time the program is run it stores the data in the same preset variables. The problem is that every time the program is run the values are overwritten and replaced, most likely because all the variables are of type double and are not arrays that accumulate. I know how to make a variable store multiple values within a program, but only when the program is run once.
This is the code I am able to provide:
volED = reconstructVolume(maskAlignedED1,maskAlignedED2,maskAlignedED3,res)
volMean = (volED1+volED2+volES3)/3
strokeVol = volED-volES
EF = strokeVol/volED*100
The program I am running depends on a ton more MATLAB files that I cannot provide at this moment; however, I believe the double variables strokeVol and EF are created at this point. How do I create a variable that will store multiple values and keep adding values every time the program is run?
The reason your variables are "overwritten" with each run is that every function (or standalone program) has its own workspace where the local variables are located, and these local variables cease to exist when the function (or standalone program) returns/terminates. In order to preserve the value of a variable, you have to return it from your function. Since MATLAB passes its variables by value (rather than reference), you have to explicitly provide a vector (or more generally, an array) as input and output from your function if you want to have a cumulative set of data in your calling workspace. But it all depends on whether you have a function or a deployed program.
Assuming your program is a function
If your function is now declared as something like
function strokefraction(inputvars)
you can change its definition to
function [EFvec]=strokefraction(inputvars,EFvec)
%... code here ...
%volES initialized somewhere
volED = reconstructVolume(maskAlignedED1,maskAlignedED2,maskAlignedED3,res);
volMean = (volED1+volED2+volES3)/3;
strokeVol = volED-volES;
EF = strokeVol/volED*100;
EFvec = [EFvec; EF]; %add EF to output (column) vector
Note that it's legal to have the same name for an input and an output variable. Now, when you call your function (from MATLAB or from another function) each time, you add the vector to its call, like this:
EFvec=[]; %initialize with empty vector
for k=1:ndata %simulate several calls
inputvar=inputvarvector(k); %meaning that the input changes
EFvec=strokefraction(inputvar,EFvec);
end
and you will see that the size of EFvec grows from call to call, saving the output from each run. If you want to save several variables or arrays, do the same (for arrays, you can always introduce an input/output array with one more dimension for this purpose, but you probably have to use explicit indexing instead of just shoving the next EF value to the bottom of your vector).
Note that if your input/output array eventually grows large, it will cost you a lot of time to keep reallocating memory in small chunks. You could then preallocate the EFvec (or equivalent) array instead of initializing it to [], and introduce a counter variable telling you where to write the next data point.
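A minimal sketch of that preallocation pattern (sizes and the someComputation placeholder are illustrative):

ndata = 1000;             % expected number of runs (illustrative)
EFvec = zeros(ndata, 1);  % preallocate instead of growing from []
count = 0;                % counter: index of the last filled slot

for k = 1:ndata
    EF = someComputation(k);   % placeholder for the per-run result
    count = count + 1;
    EFvec(count) = EF;         % overwrite in place instead of concatenating
end
EFvec = EFvec(1:count);        % trim any unused tail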
Disclaimer: what I said about the workspace of functions is only true for local variables. You could also define a global EFvec in your function and in your workspace, and then you don't have to pass it in and out of the function. As I haven't yet seen a problem which actually needed global variables, I would avoid this option. There are also persistent variables, which are basically globals whose scope is limited to their own function's workspace (run help global and help persistent in MATLAB if you'd like to know more; these help pages are surprisingly informative compared to the usual help entries).
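For completeness, here is a hedged sketch of the persistent alternative (the function name recordEF is made up); the accumulator survives between calls without being passed in and out:

function EFvec = recordEF(EF)
% Accumulates EF values across calls using a persistent variable.
persistent store
if isempty(store)
    store = [];            % first call: initialize the accumulator
end
if nargin > 0
    store = [store; EF];   % append the new value
end
EFvec = store;             % return everything collected so far
end

Call recordEF(newEF) after each run to append; clear recordEF resets the accumulator, which also illustrates the fragility of this approach.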
Assuming your program is a standalone (deployed) program
While I don't have any experience with standalone MATLAB programs, it seems to me that it would be hard to do what you want for that. A MathWorks Support answer suggests that you can pass variables to standalone programs, but only as you would pass to a shell script. By this I mean that you have to pass filenames or explicit numbers (but this makes sense, as there is no MATLAB workspace in the first place). This implies that in order to keep a cumulative set of output from your program you would probably have to store those in a file. This might not be so painful: opening a file to append the next set of data is straightforward (I don't know about issues such as efficiency, and anyway this all depends on how much data and how many runs of your function we're talking about).
I want my script to run faster, so the goal is to use my cores simultaneously. The problem is that I somehow miss a layer of "globality": I want my globals to be persistent within some functions, but I want them to be different in each loop iteration.
what i want to do:
parfor i = 1:T
createData() % global variables are being created
useData() % several functions need access to global vars
end
I am thankful for any idea to make this loop run in parallel while keeping my variables global. Thanks for your advice :)
I had the same issue; I was unable to use global variables within the parallel loop (parfor or spmd), as they turned out empty when accessed.
Instead of rewriting the entire code, I did a quick fix by saving the needed global variables to disk before starting the parallel pool and then loading them in the relevant functions if they are empty. This way I keep my global-variable logic and only load from disk when inside a parallel pool.
% Store global variables to be reused in parallel workers
global Var1
save('temp_global_parallel','Var1');
% Parallel pool functions
parpool(4)
spmd
someFunctions();
anotherFunction();
end
% Optionally: delete to avoid the bug as explained below
delete('temp_global_parallel.mat');
And then inside the function that uses the Global variable:
global Var1
if isempty(Var1)
load('temp_global_parallel')
end
Caution: the disadvantage is of course that if the global variable is genuinely empty, you would not detect it. You can mitigate this by deleting the .mat file after the parallel loop.
Second caution: I would not recommend storing big variables this way (and in any case, do not keep them as global variables), as this might lower the speed significantly in each loop. Saving and loading variables like this is bad practice in general; try to store only things like constants or a few parameters. In my case I was storing a string with the current path extension (less than 1 kB).
I came accross this sentence in MATLAB doc:
The body of a parfor-loop cannot make reference to a nested function. However, it can call a nested function by means of a function handle.
Can someone please explain what this means?
A parfor loop is different from a normal loop in that the body of the loop has an independent workspace for every iteration. In fact, when you run the parfor loop on a parallel pool, the variables that need to be transmitted to the loop body are saved and reloaded (that, by the way, is the reason for the "variable x cannot be sliced which may lead to communication overhead" warning: having to save and reload huge variables can add quite a bit to your processing time).
Consequently, calls to nested functions won't work - the nested function in the parent function no longer shares its workspace with the loop body. Furthermore, nested function calls may alter workspace variables across iterations of a loop, which won't mesh with parallel execution.
In contrast, passing a function handle, or calling a separate function, works fine. The function defined in the function handle, as well as the separate function, have their own workspaces, nothing gets shared across iterations of the parfor body, and thus the iterations can run completely independently.
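As a minimal illustration of the rule (the function names are made up):

function out = runLoop(N)
% A nested function cannot be called directly inside parfor,
% but a handle to it can be.
    f = @nestedFun;          % create the handle outside the loop
    out = zeros(1, N);
    parfor k = 1:N
        out(k) = f(k);       % OK: call via function handle
        % out(k) = nestedFun(k);  % error: direct call to a nested function
    end
    function y = nestedFun(x)
        y = x^2;
    end
end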
/aside: Creating a function handle to a nested function may still cause you problems: a live function handle (as opposed to one stored as a string which you "activate" with str2func) can carry quite a bit of the enclosing workspace, including handle objects. Both the size of that workspace and the not-being-passed-by-reference (because of the save and reload) may lead to unhappiness.