I am using MATLAB and I want to use both of my 2 GPUs.
I have a big matrix that does not fit on 1 GPU, but half of it does. So I want to split the matrix in half and let each of my 2 GPUs work on one half. This is what I did:
try
parpool('local',gpuDeviceCount)
end
spmd
gpuDevice
end
dp = 0.00001;
R = zeros(1,2);
parfor k=1:1:2
if k==1
M = gpuArray([dp:2*dp:10])
else
M = gpuArray([2*dp:2*dp:10])
end
R(k) = arrayfun(@(x) myfun,M);
end
My question is: how can I verify that I really created 2 different M vectors, one on each of my GPUs? Is there a single built-in function to show this? Why do we need the spmd gpuDevice end block? In terms of speed, the parfor currently seems to be about 2 times faster than a regular for. But how can I confirm that each GPU really stores a different M? I don't know whether it actually copied gpuArray([dp:2*dp:10]) twice and gpuArray([2*dp:2*dp:10]) twice.
Also, after this block of code has finished, my vector M does not appear in the workspace, although variables defined outside the parfor loop are still there. If I use for instead of parfor (with a smaller M so that it fits on 1 GPU), M is in the workspace after the loop and its type is gpuArray. Why are variables defined inside a parfor loop gone after the loop?
To answer some of your questions.
M is not present after the end of the parfor loop because it is a temporary (loop-local) variable: it exists only on the workers. To make sure your code works as intended, I would suggest closing the parallel pool (leaving the parfor in your code) and then using the debugger to set a breakpoint and check M. This way you can check that your code does everything right and then enable the parallel pool again. If you need M for later computations, you have to make it a sliced variable. Given the constraints on sliced variables, you can use either a 2-D matrix or a cell array that you initialize outside the parfor.
About the use of gpuDevice: according to the documentation it is not required, but "To identify which GPU each worker is using, call gpuDevice inside an spmd block. The spmd block runs gpuDevice on every worker." So it looks like you don't have to include it.
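One way to check which GPU handled which half is to return a small record from each iteration through a sliced cell array. The following is only a sketch: myfun here is a hypothetical stand-in for your own function, info is a name made up for illustration, and it assumes a pool with one worker per GPU. Note that parfor does not promise that the two iterations land on different workers, so the recorded device indices are what tell you what actually happened.

```matlab
dp = 0.00001;
starts = [dp, 2*dp];            % first element of each half
myfun = @(x) x.^2 + 1;          % hypothetical stand-in for the real function
info = cell(1, 2);              % sliced output: one record per iteration

parfor k = 1:2
    M = gpuArray(starts(k):2*dp:10);   % temporary, lives only on the worker
    dev = gpuDevice;                   % GPU currently selected on this worker
    Rk = arrayfun(myfun, M);           % element-wise evaluation on the GPU
    info{k} = struct('deviceIndex', dev.Index, ...
                     'firstValue', gather(M(1)), ...
                     'numElements', numel(M), ...
                     'resultSum', gather(sum(Rk)));
end

% Inspect on the client which device processed which half
for k = 1:2
    fprintf('half %d: GPU %d, M(1) = %g, %d elements\n', ...
        k, info{k}.deviceIndex, info{k}.firstValue, info{k}.numElements);
end
```

If both records report the same deviceIndex, both halves ran on the same GPU. Only the gathered summary values come back to the client; M itself is still discarded at the end of each iteration.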
I have a code that solves a scientific problem with many different inputs/parameters. I'm using a parallel for loop to iterate through a range of parameters, and running into trouble with memory usage. I've done my best to put together a MWE that represents my code.
Basically, for each parameter combination I run a small loop over several different solver options. In my real code, this changes the solver tolerances and the equations used (we have a few different transformations which can help conditioning). Each computation is effectively a shooting method for a small ODE system (3 equations, but each is quite complicated and generally stiff), with an optimisation routine calling the ODE solver. This takes seconds/minutes to run each time, so the parallelisation overhead is negligible, and the speedup scales pretty much exactly with the number of cores.
To explain the code below, start with driver. First define some parameters (a and f in the MWE) and save them in a file. The filename gets passed around between functions. Then create the 3 (in this case) sets of solver parameters, which choose the ode solver, tolerance, and set of equations to use. Then enter the for loop, looping over some other parameter c, at each iteration using each of the sets of solver parameters to call the optimisation function. Finally, I save a temporary file with the results of each iteration (so I don't lose everything if the server goes down). These files are about 1kB, and I will only have around 10,000 of them, so the overall size is on the order of 10MB. After the main loop I recombine everything back into single vectors.
The equations function creates the actual differential equations to solve; this is done using a switch statement to choose which equations to return. The objectiveFunction function uses str2func to specify the ODE solver, calls equations to get the equations to solve, then solves them and computes an objective function value.
The problem is that there appears to be some sort of memory leak. After some time, on the order of days, the code slows down and finally gives an out-of-memory error (running on 48 cores with ~380GB memory available, ode15s gave the error). The increase in memory usage over time is fairly gradual, but is definitely there, and I can't figure out what is causing it.
The MWE with 10,000 values c takes quite a while to run (1,000 is probably sufficient actually), and the memory usage per worker does increase over time. I think the file loading/saving and job distribution cause quite a lot of overhead, unlike my actual code, but this doesn't affect memory usage.
My question is, what could be causing this slow increase in memory usage?
My ideas for what is causing the problem are:
Using str2func isn't great, should I use a switch instead and accept having to write the solvers into the code explicitly?
All the anonymous functions getting called all the time (in the ODE solver) are holding on to workspace data and not releasing it at the end of each parfor iteration
Suppressed warnings are causing issues: I suppress lots of ODE step size warnings (this shouldn't be a factor because the bug that means this causes issues was fixed in 2017a, and the server I use runs 2017b)
Something in fminbnd or ode15s is actually leaking memory
I can't come up with a way to get around 1 and 2 nicely and efficiently (both from a code performance and code writing point of view), and I doubt 3 or 4 are actually the problem.
Here is the driver function:
function [xi,mfv] = driver()
% a and f are used in all cases. In actual code these are defined in a
% separate function
paramFile = 'params';
a = 4;
f = @(x) 2*x;
% this filename (params) gets passed around from function to function
save('params.mat','a','f')
% The struct setup has specific options for each iteration
setup(1).method = 'ode45'; % any ODE solver can be used here
setup(1).atol = 1e-3; % change the ODE solver tolerance
setup(1).eqs = 'second'; % changes what equations are solved
setup(2).method = 'ode15s';
setup(2).atol = 1e-3;
setup(2).eqs = 'second';
setup(3).method = 'ode15s';
setup(3).atol = 1e-4;
setup(3).eqs = 'first';
c = linspace(0,1);
parfor i = 1:numel(c) % loop over parameter c
xi = 0;
minFVal = inf;
for j = 1:numel(setup) % loop over each set configuration setup
% find optimal initial condition and record corresponding value of
% objective function
[xInitial,fval] = fminsearch(@(x0) objectiveFunction(x0,c(i),...
paramFile,setup(j)),1);
if fval<minFVal % keep the best solution
xi = xInitial;
minFVal = fval;
end
end
% save some variables
saveInParForLoop(['tempresult_' num2str(i)],xi,minFVal);
end
% Now combine temporary files into single vectors
xi = zeros(size(c)); mfv = xi;
for i = 1:numel(c)
S = load(['tempresult_' num2str(i) '.mat'],'xi','minFVal');
xi(i) = S.xi;
mfv(i) = S.minFVal;
end
% delete the temporary files now that the data has been consolidated
for i = 1:numel(c)
delete(['tempresult_' num2str(i) '.mat']);
end
end
function saveInParForLoop(filename,xi,minFVal)
% you can't save directly in a parfor loop, this is the workaround
save(filename,'xi','minFVal')
end
Here is the function to define the equations
function [der,transform] = equations(paramFile,setup)
% Defines the differential equation and a transformation for the solution
% used to calculate the objective function
% Note in my actual code I generate these equations earlier
% and pass them around directly, rather than always redefining them
load(paramFile,'a','f')
switch setup.eqs
case 'first'
der = @(x) f(x)*2+a;
transform = @(x) exp(x);
case 'second'
der = @(x) f(x)/2-a;
transform = @(x) sqrt(abs(x));
end
and here is the function to evaluate the objective function
function val = objectiveFunction(x0,c,paramFile,setup)
load(paramFile,'a')
% specify the ODE solver and AbsTol from setup
solver = str2func(setup.method);
options = odeset('AbsTol',setup.atol);
% get the differential equation and transform equations
[der,transform] = equations(paramFile,setup);
dxdt = @(t,y) der(y);
% solve the IVP
[~,y] = solver(dxdt,0:.05:1,x0,options);
% calculate the objective function value
val = norm(transform(y)-c*a);
If you run this code it will create 100 temporary files, then delete them, and it will also create the params file, which won't be deleted. You will need the parallel computing toolbox.
There's just a chance you might be running into this known problem: https://uk.mathworks.com/support/bugreports/1976165 . This is marked as being fixed in R2019b, which has just been released. (The leak caused by this is tiny but persistent - so it might indeed take days to become apparent).
I need to create a square matrix $V$ iteratively over 100000+ times per pack.
When I do it the traditional way, the computation takes around 70 s (over 1 minute) per pack. I need to repeat this process for over 100 packs, which adds up to more than an hour of extra time.
It turned out that when I calculate the matrix $V(x,y)$ using a double for loop, MATLAB uses only a single thread. However, the computer has 12 threads, and there should be a way to use all of them to fill the matrix faster.
The type of function is
$V(x,y)=\exp\left((x-\mathrm{variation}_1)^2+(y-\mathrm{variation}_2)^2\right)$
I thought about using GPU. However, as it turned out, the GPU array is calculating it much slower than CPU.
I also thought about using the parpool function. However, not only does it cost more time to send the matrix to the parallel pool, it also denies access to $V$ itself.
How can I tell the CPU to calculate the matrix faster, using all the threads?
You should always use matrix and vector operations rather than for loops.
If x and y are constant for all cases, you can use meshgrid to generate x and y once.
For example, consider the following code, which uses a double for loop:
v = zeros(10000,10000);
tic;
for x=1:10000
for y = 1:10000
v(x,y) = exp((x/10000).^2+(y/10000).^2);
end
end
toc
On my computer it runs about 11 seconds.
Now by using meshgrid:
%This is done only once
[x,y] = meshgrid((1:10000)/10000,(1:10000)/10000);
tic;
v = exp(x.^2+y.^2);
toc
Which takes about 4 seconds, not including the meshgrid.
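For the form in the question, with the two shifts included, the same idea can be written without meshgrid by combining a column vector and a row vector (implicit expansion, available from R2016b). This is only a sketch: the grid size and the values of variation1 and variation2 are made up for illustration, and the loop is there purely to check the result on a small grid.

```matlab
n = 200;                        % small grid, just for the check
x = (1:n).' / n;                % column vector: x varies along rows
y = (1:n) / n;                  % row vector: y varies along columns
variation1 = 0.1;               % hypothetical shift values
variation2 = 0.2;

% Column + row expands to an n-by-n matrix, so V(i,j) uses x(i) and y(j)
V = exp((x - variation1).^2 + (y - variation2).^2);

% Reference: the same matrix built with a double loop
Vloop = zeros(n);
for i = 1:n
    for j = 1:n
        Vloop(i,j) = exp((x(i) - variation1)^2 + (y(j) - variation2)^2);
    end
end
maxDiff = max(abs(V(:) - Vloop(:)));
fprintf('largest difference from the loop version: %g\n', maxDiff);
```

Element-wise functions such as exp on large arrays are typically multithreaded by MATLAB itself, so the vectorized form can use several cores without a parallel pool. This variant also avoids storing the two full n-by-n grids that meshgrid produces.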
I am having a hard time grasping how to count FLOPs. One moment I think I get it, and the next it makes no sense to me. Some help explaining this would greatly be appreciated. I have looked at all other posts about this topic and none have completely explained in a programming language I am familiar with (I know some MATLAB and FORTRAN).
Here is an example, from one of my books, of what I am trying to do.
For the following piece of code, the total number of flops can be written as (n*(n-1)/2)+(n*(n+1)/2) which is equivalent to n^2 + O(n).
[m,n]=size(A)
nb=n+1;
Aug=[A b];
x=zeros(n,1);
x(n)=Aug(n,nb)/Aug(n,n);
for i=n-1:-1:1
x(i) = (Aug(i,nb)-Aug(i,i+1:n)*x(i+1:n))/Aug(i,i);
end
I am trying to apply the same principle above to find the total number of FLOPs as a function of the number of equations n in the following code (MATLAB).
% e = subdiagonal vector
% f = diagonal vector
% g = superdiagonal vector
% r = right hand side vector
% x = solution vector
n=length(f);
% forward elimination
for k = 2:n
factor = e(k)/f(k-1);
f(k) = f(k) - factor*g(k-1);
r(k) = r(k) - factor*r(k-1);
end
% back substitution
x(n) = r(n)/f(n);
for k = n-1:-1:1
x(k) = (r(k)-g(k)*x(k+1))/f(k);
end
I'm by no means expert at MATLAB but I'll have a go.
I notice that none of the lines of your code index ranges of your vectors. Good, that means that every operation I see before me is involving a single pair of numbers. So I think the first loop is 5 FLOPS per iteration, and the second is 3 per iteration. And then there's that single operation in the middle.
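Taking those per-iteration counts at face value (5 for the elimination loop, 3 for the back-substitution loop, plus the single division for x(n)), and counting only the operations on the vectors, the total as a function of n works out as follows. Both loops run n-1 times.

```latex
\begin{aligned}
\text{forward elimination:} \quad & 5(n-1) \\
\text{division for } x(n): \quad & 1 \\
\text{back substitution:} \quad & 3(n-1) \\
\text{total:} \quad & 5(n-1) + 1 + 3(n-1) = 8n - 7
\end{aligned}
```

So the cost grows linearly in n, compared with the n^2 + O(n) of the dense back substitution in your book's example.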
However, MATLAB stores everything by default as a double. So the loop variable k is itself being operated on once per loop and then every time an index is calculated from it. So that's an extra 4 for the first loop and 2 for the second.
But wait - the first loop has 'k-1' twice, so in theory one could optimise that a bit by calculating and storing that, reducing the number of FLOPs by one per iteration. The MATLAB interpreter is probably able to spot that sort of optimisation for itself. And for all I know it can work out that k could in fact be an integer and everything is still okay.
So the answer to your question is that it depends. Do you want to know the number of FLOPs the CPU does, or the minimum number expressed in your code (ie the number of operations on your vectors alone), or the strict number of FLOPs that MATLAB would perform if it did no optimisation at all? MATLAB used to have a flops() function to count this sort of thing, but it's not there anymore. I'm not an expert in MATLAB by any means, but I suspect that flops() has gone because the interpreter has gotten too clever and does a lot of optimisation.
I'm slightly curious to know why you wish to know. I used to use flops() to count how many operations a piece of maths did as a crude way of estimating how much computing grunt I'd need to make it work in real time written in C.
Nowadays I look at the primitives themselves (eg there's a 1k complex FFT, that'll be 7us on that CPU according to the library datasheet, there's a 2k vector multiply, that'll be 2.5us, etc). It gets a bit tricky because one has to consider cache speeds, data set sizes, etc. The maths libraries (eg fftw) themselves are effectively opaque so that's all one can do.
So if you're counting the FLOPs for that reason you'll probably not get a very good answer.
I don't quite get the vectorized way of thinking in MATLAB, mostly because the examples in the documentation are so simple, and I hope someone can help me understand it a little better.
What I'm trying to accomplish is to take an NxN sample from a matrix of size ncols x nrows x ielements, compute the average of the sample for each ielement, and store the maximum of those averages. Using for loops, the code would look like this:
for x = 1+margin : nrows-margin
for y = 1+margin : ncols-margin
for i=1:ielem
% take a NxN sample
sample = input_matrix(y-margin:y+margin,x-margin:x+margin,i)
% compute the average of all elements
result(i) = mean2(sample);
end %for i
% store the max of the computed averages
output_matrix(y,x)=max(result);
end %for y
end %for x
Can anyone show a good vectorization of this kind of situation?
First of all, vectorization is not as important as it once was, due to improvements in how the code is compiled before it is run, but it is still a very common practice and can lead to some speedups. Older MATLAB versions executed one line at a time, which made a for loop much slower than a vectorized version of the same code.
The part of your code that could be vectorized is the innermost for loop. I'll show a simple example of what you are trying to do, and I'll let you take the example and put it into your code.
input=randn(5,5,3);
max(mean(mean(input,1),2))
Basically, the two inner mean calls take the mean of the input array over its first two dimensions, and the outer max finds the maximum of the resulting values. If you want, you can break it out step by step and see what it does. mean(input,1) takes the mean over the first dimension, mean(input,2) over the second, etc. After the first two means are done, all that is left is a vector, which max handles easily. Note that the size of the vector before max is [1 1 3]: the dimensions are preserved by this operation.
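The two outer loops can also be removed by computing every local NxN average at once with a convolution and then taking the max along the third dimension. This is a sketch of a swapped-in technique (convn with a box kernel), not the approach shown above; the array sizes and names are made up, and the loop is only there to check the interior values. Pixels within margin of the border are averaged against zero padding, so only the interior matches the loop version.

```matlab
rng(0);
margin = 1;
N = 2*margin + 1;                         % window is N-by-N
A = rand(6, 7, 3);                        % rows x cols x elements (made-up sizes)
[nr, nc, ne] = size(A);

kernel = ones(N) / N^2;                   % box filter: local mean
avg = convn(A, kernel, 'same');           % filters each slice independently
out = max(avg, [], 3);                    % max of the averages per pixel

% Reference: loop version over the interior only
ref = zeros(nr, nc);
for r = 1+margin : nr-margin
    for c = 1+margin : nc-margin
        m = zeros(1, ne);
        for i = 1:ne
            s = A(r-margin:r+margin, c-margin:c+margin, i);
            m(i) = mean(s(:));
        end
        ref(r, c) = max(m);
    end
end
rows = 1+margin : nr-margin;
cols = 1+margin : nc-margin;
maxDiff = max(max(abs(out(rows, cols) - ref(rows, cols))));
fprintf('largest interior difference: %g\n', maxDiff);
```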
Sorry if this is obvious but I searched a while and did not find anything (or missed it).
I'm trying to solve linear systems of the form Ax=B with A a 4x4 matrix, and B a 4x1 vector.
I know that for a single system I can use mldivide to obtain x: x=A\B.
However I am trying to solve a great number of systems (possibly > 10000) and I am reluctant to use a for loop because I was told it is notably slower than matrix formulation in many MATLAB problems.
My question is then: is there a way to solve Ax=B using vectorization with A 4x4x N and B a matrix 4x N ?
PS: I do not know if it is important but the B vector is the same for all the systems.
You should use a for loop. There might be a benefit in precomputing a factorization and reusing it, if A stays the same and B changes. But for your problem where A changes and B stays the same, there's no alternative to solving N linear systems.
You shouldn't worry too much about the performance cost of loops either: the MATLAB JIT compiler means that loops can often be just as fast on recent versions of MATLAB.
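For the opposite case mentioned above (A fixed, B changing), reusing a factorization could look like the sketch below. The sizes and the diagonal shift that keeps A well conditioned are made up for illustration. When all right-hand sides are available up front, a single A\Bs does the same job; the explicit LU is mainly useful when they arrive one at a time.

```matlab
rng(2);
A = rand(4) + 4*eye(4);          % one fixed, well-conditioned matrix
Bs = rand(4, 1000);              % many right-hand sides, one per column

[L, U, P] = lu(A);               % factor once: P*A = L*U
X = zeros(4, size(Bs, 2));
for i = 1:size(Bs, 2)
    % two triangular solves per right-hand side, no refactoring
    X(:, i) = U \ (L \ (P * Bs(:, i)));
end

Xdirect = A \ Bs;                % all right-hand sides in one call
maxErr = max(abs(X(:) - Xdirect(:)));
fprintf('largest difference: %g\n', maxErr);
```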
I don't think you can optimize this further. As explained by @Tom, since A is the one changing, there is no benefit in factoring the various A's beforehand...
Besides the looped solution is pretty fast given the dimensions you mention:
A = rand(4,4,10000);
B = rand(4,1); %# same for all linear systems
tic
X = zeros(4,size(A,3));
for i=1:size(A,3)
X(:,i) = A(:,:,i)\B;
end
toc
Elapsed time is 0.168101 seconds.
Here's the problem:
you're trying to perform a 2D operation (mldivide) on a 3D matrix. No matter how you look at it, you need to reference the matrix by index, which is where the time penalty kicks in... it's not the for loop which is the problem, but how people use them.
If you can structure your problem differently, then perhaps you can find a better option, but right now you have a few options:
1 - mex
2 - parallel processing (write a parfor loop)
3 - CUDA
Here's a rather esoteric solution that takes advantage of MATLAB's peculiar optimizations. Construct an enormous 4k x 4k sparse matrix with your 4x4 blocks down the diagonal. Then solve all simultaneously.
On my machine this gets the same solution up to single precision accuracy as @Amro/Tom's for-loop solution, but faster.
n = size(A,1);
k = size(A,3);
AS = reshape(permute(A,[1 3 2]),n*k,n);
S = sparse( ...
repmat(1:n*k,n,1)', ...
bsxfun(@plus,reshape(repmat(1:n:n*k,n,1),[],1),0:n-1), ...
AS, ...
n*k,n*k);
X = reshape(S\repmat(B,k,1),n,k);
for a random example:
For k = 10000
For loop: 0.122570 seconds.
Giant sparse system: 0.032287 seconds.
If you know that your 4x4 matrices are positive definite then you can use chol on S to improve the accuracy.
This is silly. But so is how slow matlab's for loops still are in 2015, even with JIT. This solution seems to find a sweet spot when k is not too large so everything still fits into memory.
I know this post is years old now, but I'll contribute my two cents anyway. You CAN put all of your A matrices into a bigger block diagonal matrix, with the 4x4 blocks on the diagonal of a big matrix. The right hand side is then your b vector stacked on top of itself, once per block. Once you set this up, it is represented as a sparse system and can be solved efficiently with the algorithms mldivide chooses. The blocks are numerically decoupled, so even if there are singular blocks in there, the answers for the nonsingular blocks should be right when you use mldivide. There is a code that took this approach on MATLAB Central:
http://www.mathworks.com/matlabcentral/fileexchange/24260-multiple-same-size-linear-solver
I suggest experimenting to see if the approach is any faster than looping. I suspect it can be more efficient, especially for large numbers of small systems. In particular, if there are nice formulas for the coefficients of A across the N matrices, you can build the full left hand side using MATLAB vector operations (without looping), which could give you additional cost savings. As others have noted, vectorized operations aren't always faster, but they often are in my experience.
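A small experiment along those lines could look like the sketch below. It builds the block diagonal matrix with blkdiag on sparse blocks rather than with the File Exchange code, which I have not reproduced here, and the sizes and the diagonal shift that keeps the blocks nonsingular are made up. It only checks agreement with the loop; wrap each part in tic/toc to compare timings on your own data.

```matlab
rng(1);
n = 4;
k = 50;                                    % number of systems (small, for the check)
A = rand(n, n, k) + n*eye(n);              % shift keeps every block nonsingular
B = rand(n, 1);                            % same right-hand side for all systems

% Reference: one solve per system
Xloop = zeros(n, k);
for i = 1:k
    Xloop(:, i) = A(:, :, i) \ B;
end

% Block diagonal version: one sparse solve for all systems
blocks = num2cell(A, [1 2]);               % one n-by-n block per cell
blocks = cellfun(@sparse, blocks, 'UniformOutput', false);
S = blkdiag(blocks{:});                    % (n*k)-by-(n*k) sparse matrix
Xblk = reshape(S \ repmat(B, k, 1), n, k);

maxErr = max(abs(Xblk(:) - Xloop(:)));
fprintf('largest difference from the loop: %g\n', maxErr);
```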