I am in the middle of instrumenting a fairly large code with OpenACC. Right now, I am dealing with a routine foo that calls a few other routines bar, far, and boo, like so:
subroutine foo
real x(100,25),y(100,25),z(100,25)
real barout(25), farout(25), booout(25)
do i=1,25
call bar(barout, x(1,i),y(1,i),z(1,i))
call far(farout, x(1,i),y(1,i),z(1,i))
call boo(booout, x(1,i),y(1,i),z(1,i))
enddo
....
end subroutine foo
Couple of points: 1) x, y, and z stay constant through the loop. 2) You might not like the structure of the code here, but that is beyond my job description. I am supposed to instrument with OpenACC, period.
I am currently concentrating on the call to "bar". I want to make bar a vector routine, so I would like to call it from within a parallel region, but I am not ready to do the same for far and boo. (I said this is a work in progress, right?) Now, I could -- I think! -- sandwich bar in its own parallel region and copy data to and from it in each loop iteration:
!$acc data copy(barout) &
!$acc& copyin(x(:,:),y(:,:),z(:,:))
!$acc parallel
call bar( .... )
!$acc end parallel
!$acc end data
But that's a lot of data transfer. It would be great if I could transfer x, y, and z to the device just once. Each of the routines has its own data regions, so as I understand it (please correct me if I am wrong!) I cannot encase the entire loop in a single data region. Here is an alternative I tried:
subroutine foo
!$acc routine(bar) vector
real x(100,25),y(100,25),z(100,25)
real barout(25), farout(25), booout(25)
!$acc data create(x(:,:),y(:,:),z(:,:))
!$acc end data
do i=1,25
!$acc data copy(barout(:)) &
!$acc& present(x(:,:),y(:,:),z(:,:))
!$acc parallel
call bar(barout, x(1,i),y(1,i),z(1,i))
!$acc end parallel
!$acc end data
call far(farout, x(1,i),y(1,i),z(1,i))
call boo(booout, x(1,i),y(1,i),z(1,i))
enddo
....
end subroutine foo
But this doesn't work because the data in the copyin doesn't persist on the device. It is gone when the data present clause appears. (I've tried data create as well as data copyin.)
So is there a way to do what I am trying to do here? Thanks.
Have the outer data region span the "i" loop. As you have it, "end data" comes directly after the start of the region, so x, y, and z are deleted from the device before the "i" loop and are no longer present. I'd also recommend using update directives within the loop to manage the data transfers. Something like:
subroutine foo
!$acc routine(bar) vector
real x(100,25),y(100,25),z(100,25)
real barout(25), farout(25), booout(25)
!$acc data copyin(x, y, z), create(barout)
do i=1,25
!$acc update device(barout)
!$acc parallel
call bar(barout, x(1,i),y(1,i),z(1,i))
!$acc end parallel
!$acc update host(barout)
call far(farout, x(1,i),y(1,i),z(1,i))
call boo(booout, x(1,i),y(1,i),z(1,i))
enddo
!$acc end data
....
end subroutine foo
Notes:
Since "bar" is a vector routine, calling it from a "parallel" region means that you'll only be using a single gang. It's not wrong code, but you will lose performance. It might be better to keep it as a host routine and then put the "parallel" inside of "bar" so you can use both "gang" and "vector". Granted, if your intent is to later move the inner "parallel" region to a "parallel loop gang" around the "i" loop, then it would make sense to leave it as is.
I changed your code to copyin x, y, and z since I wasn't sure where these variables get initialized. If they are initialized in "bar", you can change these back to use "create", but then add update directives to synchronize the host and device copies.
I am running a very large meta-simulation where I go through two hyperparameters (let's say x and y), and for each set of hyperparameters (x_i & y_j) I run a modest-sized subsimulation. Thus:
for x=1:I
for y=1:j
subsimulation(x,y)
end
end
For each subsimulation however, about 50% of the data is common to every other subsimulation, or subsimulation(x_1,y_1).commondata=subsimulation(x_2,y_2).commondata.
This is very relevant since so far the total simulation results file size is ~10Gb! Obviously, I want to save the common subsimulation data only once to save space. However, the obvious solution, saving it in one place, would screw up my plotting function, since it directly calls subsimulation(x,y).commondata.
I was wondering whether I could do something like
subsimulation(x,y).commondata=% pointer to 1 location in memory %
If that can't work, what about this less elegant solution:
subsimulation(x,y).commondata='variable name' %string
and then adding
if(~isstruct(subsimulation(x,y).commondata)),
subsimulation(x,y).commondata=eval(subsimulation(x,y).commondata)
end
What solution do you guys think is best?
Thanks
DankMasterDan
You could do this fairly easily by defining a handle class. See also the documentation.
An example:
classdef SimulationCommonData < handle
properties
someData
end
methods
function this = SimulationCommonData(someData)
% Constructor
this.someData = someData;
end
end
end
Then use like this,
commonData = SimulationCommonData(something);
subsimulation(x, y).commondata = commonData;
subsimulation(x, y+1).commondata = commonData;
% These now point to the same reference (handle)
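A minimal sketch of the same sharing behaviour that runs as a plain script, assuming a built-in handle class is an acceptable stand-in: containers.Map is a handle object, so it is used here in place of the SimulationCommonData class above (a classdef has to live in its own file). The key name 'grid' and the values are made up for illustration.

```matlab
% containers.Map is a handle class, so copies of it refer to the same object.
common = containers.Map();
common('grid') = [1 2 3];            % hypothetical shared data

% Build a 2-by-2 struct array and store the same handle in every element.
subsimulation = struct('commondata', cell(2, 2));
for x = 1:2
    for y = 1:2
        subsimulation(x, y).commondata = common;
    end
end

% Changing the data through one reference is visible through all of them.
common('grid') = [10 20 30];
shared = subsimulation(2, 2).commondata;
disp(shared('grid'));
```

Because each field holds a reference rather than a copy, the common data exists once in memory however many subsimulations point at it, and code that reads subsimulation(x,y).commondata keeps working.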
As per my comment, as long as you do not modify the common data, you can pass it as third input and still not copy the array in memory on each iteration (a very good read is Internal Matlab memory optimizations). This image will clarify:
As you can see, the first jump in memory is due to the creation of common and the second one to the allocation of the output c. If the data were copied on each iteration, you would have seen many more memory fluctuations. For instance, a third jump, then a decrease, then back up again and so on...
The code follows (I added a pause between iterations to make it clearer that no big jumps occur during the loop):
function out = foo(a,b,common)
out = a+b+common;
end
for ii = 1:10; c = foo(ii,ii+1,common); pause(2); end
I have three looped operations O1 O2 O3 each with an IF statement and the operation with the largest flag=[F1 F2 F3] value has a higher priority to run.
How can I switch between operations depending on the value of that flag ? The flag value for each operation varies with time.
For simplicity, operation 1 is going to run first, and by the end of its loop its flag value will be the lowest, hence operation 2 or 3 should run next. So for this example, at t=0: F1=5, F2=3 and F3=1.
The over-simplified pseudo code for what I'm trying to achieve:
while 1
find largest flag value using [v index]=max(flag)
Run operation with highest flag value
..loop back..
end
I am not sure how the flag values should be compared between operations, so I am hoping someone can shed some light on this.
EDIT
Currently, all operations are written in one matlab file, and each is triggered with an IF statement. The operations run systematically one after the other (depending on which one is written first in matlab). I want to avoid that and trigger them depending on the flag value instead.
If your operations are functions (a little hard to tell from the question), then make a cell array of function handles, where fun1 is the name of one of your actual functions.
handles = {@fun1, @fun2, @fun3}
Then you can just use the index returned from your max term to get the correct function from the array. You can pass any arguments to the function using the following syntax.
handles{index}(args)
Using the style above makes the solution scalable, so you don't need a stack of if statements that require maintenance when the number of operations expands. If the functions are really simple you can always use lambdas (or anonymous functions in Matlab speak).
However, if you have a limited number of simple operations that are not likely to expand, you may choose to use a switch statement in your while loop instead. It conveys your intention better than a stack of if statements.
while 1
[~, index]=max(flag);
switch index
case 1
operation1
flag = [x y z]
case 2
operation2
flag = [x y z]
otherwise
operation3
flag = [x y z]
end
end
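As a rough sketch of the function-handle approach described above, the loop below picks the operation with the largest flag and calls it through the cell array. The three anonymous functions are hypothetical stand-ins for the real operations: each one simply zeroes its own flag when it finishes, and the loop is limited to three passes so the script terminates.

```matlab
% Hypothetical operations: each takes the flag vector and returns the
% updated one, with its own priority dropped to zero.
op1 = @(f) f .* [0 1 1];
op2 = @(f) f .* [1 0 1];
op3 = @(f) f .* [1 1 0];
handles = {op1, op2, op3};

flag = [5 3 1];          % F1, F2, F3 at t = 0, as in the question
order = zeros(1, 3);     % records which operation ran at each step

for step = 1:3
    [~, index] = max(flag);        % operation with the highest priority
    order(step) = index;
    flag = handles{index}(flag);   % run it and get the new flags back
end

disp(order);
```

Adding a fourth operation then only means appending another handle to the cell array; the loop itself does not change.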
In order to test an algorithm in different scenarios, I need to iteratively call a matlab function alg.m.
The bottleneck in alg.m is something like:
load large5Dmatrix.mat
small2Dmatrix=large5Dmatrix(:,:,i,j,k) % i,j and k change at every call of alg.m
clear large5Dmatrix
In order to speed up my tests, I would like to have large5Dmatrix loaded only at the first call of alg.m, and valid for future calls, possibly only within the scope of alg.m.
Is there a way to achieve this in matlab other than setting large5Dmatrix as global?
Can you think of a better way to work with this large matrix of constant values within alg.m?
You can use persistent for static local variables:
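function small2Dmatrix = myfun(i, j, k)
persistent large5Dmatrix
if isempty(large5Dmatrix)
% Runs only on the first call; the value survives between calls.
data = load('large5Dmatrix.mat');
large5Dmatrix = data.large5Dmatrix;
end
small2Dmatrix = large5Dmatrix(:,:,i,j,k); % i, j and k change at every call of alg.m
% ...
end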
but since you're not changing large5Dmatrix, @High Performance Mark's answer is better suited and has no computational implications. Unless you really, really don't want large5Dmatrix in the scope of the caller.
When you pass an array as an argument to a Matlab function, the array is only copied if the function updates it; if the function only reads the array, no copy is made. So any performance penalty the function pays, in time and space, should only kick in if the function updates the large array.
I've never tested this with a recursive function but I don't immediately see why it should start copying the large array if it is only read from.
So your strategy would be to load the array outside the function, then pass it into the function as an argument.
This note may clarify.
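A small sketch of that strategy, under the assumption that the algorithm only reads the array. The reshape call stands in for loading large5Dmatrix.mat, and alg is a hypothetical anonymous function playing the role of alg.m.

```matlab
% Stand-in for "load large5Dmatrix.mat": a small 5-D array built once,
% outside the function that uses it.
large5Dmatrix = reshape(1:48, [2 3 2 2 2]);

% Hypothetical replacement for alg.m: it only reads M, so no copy of the
% large array is needed when it is passed in.
alg = @(M, i, j, k) M(:, :, i, j, k);

% Call it repeatedly with different indices; the array is created only once.
small2Dmatrix = alg(large5Dmatrix, 2, 1, 2);
disp(size(small2Dmatrix));
```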
I want fsolve to calculate the output for a different uc each time (increasing uc by 0.001 each time). Each output from fsolve should be sent to a simulink model separately. So I set up a loop to do so, but I believe that the current setup (if it works) will just calculate 1000 different values? Is there a way to send out the values separately?
If not, how can I create a parameter uc that goes from 0 to, say, 1000? I tried uc=0:0.001:1000, but again, the dimension doesn't seem to fit.
How do I create a function that takes the next element of a vector/matrix each time the function is called?
best regards
The general approach to iterating over an array of values and feeding them one-by-one into a series of evaluations of a function follows this form:
for ix = 0:0.1:10
func(arg1, arg2, ix)
end
See how each call to func includes the current value of ix ? On the first iteration ix==0, on the next ix==0.1 and so forth. You should be able to adapt this to your needs; in your code the loop index (which you call i) is not used inside the loop.
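To make the pattern concrete, here is a hedged sketch that solves one equation per value of uc and keeps each result separately. The equation x^3 + x - uc = 0 and the three uc values are invented for illustration, and fzero is used in place of fsolve only so the script runs without the Optimization Toolbox; the loop structure is the same either way.

```matlab
uc = [0.5 1.0 1.5];        % hypothetical parameter values
x0 = 0;                    % starting guess, set once outside the loop
sol = zeros(size(uc));     % one result per value of uc

for k = 1:numel(uc)
    % The current value uc(k) is captured when the handle is created.
    residual = @(x) x^3 + x - uc(k);
    sol(k) = fzero(residual, x0);
    % sol(k) is available here, e.g. to hand to the next processing step.
end

disp(sol);
```

Indexing with k rather than looping over the values directly makes it easy to store each solution next to the parameter that produced it.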
Now some un-asked-for criticism of your code. The lines
x0=[1,1,1];
y=x0(1);
u=x0(2);
yc=x0(3);
options=optimset('Display','off');
do not change as the loop iterations advance; they always return the same values whatever the value of the loop iterator (i in your code) may be. It is pointless including them inside the loop.
Leaving them inside the loop may even be a waste of a lot of time if Matlab decides to calculate them at every iteration. I'm not sure what Matlab does in this case, it may be smart enough to figure out that these values don't change at each iteration, but even if it does it is bad programming practice to write your code this way; lift constant expressions such as these out of loops.
It's not clear from the fragment you've posted why you have defined y, u and yc at all, they're not used anywhere; perhaps they're used in other parts of your program.
I have a function that returns a large vector and is called multiple times, with some logic going on between calls that makes vectorization not an option.
An example of the function is
function a=f(X,i)
a=zeros(size(X,1),1);
a(:)=X(:,i);
end
and I am doing
for i=1:n a=f(X,i); end
When profiling this (size(X,1)=5.10^5, n=100 ) times are 0.12s for the zeros line and 0.22s for a(:)=X(:,i) the second line. As expected memory is allocated at each call of f in the 'zeros' line.
To get rid of that line and its 0.12s, I thought of allocating the returned value just once, and passing it in as return space each time to an appropriate function g like so:
function a=g(X,i,a)
a(:)=X(:,i);
end
and doing
a=zeros(m,1);
for i=1:n a=g(X,i,a); end
What is surprising to me is that profiling inside g still shows memory being allocated in the same amounts at the a(:)=X(:,i); line, and the time taken is very much like 0.12+0.22s..
1)Is this just "lazy copy on write" because I am writing into a?
2)Going forward, what are the options?
-a global variable for a (messy..)?
-writing a matrix handle class (must I really?)
(The nested function way means some heavy redesigning to make a nesting function to which X is known (the matrix A with notations from that answer)..)
Perhaps this is a bit tangential to your question, but if this is a performance critical application, I think a good way to go is to rewrite your function as a mex file. Here is a quote from http://www.mathworks.com/support/tech-notes/1600/1605.html#intro,
The main reasons to write a MEX-file are:...
Speed; you can rewrite bottleneck computations (like for-loops) as a MEX-file for efficiency.
If you are not familiar with mex files, the link above should get you started. Converting your existing function to C/C++ should not be overly difficult. The yprime.c example included with MATLAB is similar to what you're trying to do, since it is iteratively being called to calculate the derivatives inside ode45, etc.