Vectorizing nested for loops and pointwise multiplication in MATLAB

Long time reader, first time inquisitor. So I am currently hitting a serious bottleneck from the following code:
for kk = 1:KT
    parfor jj = 1:KT
        umodes(kk,jj,:,:,:) = repmat(squeeze(psi(kk,jj,:,:)), [1,1,no_bands]) .* squeeze(umodes(kk,jj,:,:,:));
    end
end
In plain language, I need to tile the multi-dimensional array 'psi' across another dimension of length 'no_bands' and then perform pointwise multiplication with the array 'umodes'. The issue is that each of the arrays I am working with is large, on the order of 20 GB or more.
What is happening then, I suspect, is that my processors grind to a halt due to cache limits or because data is being paged. I am reasonably convinced there is no practical way to reduce the size of my arrays, so at this point I am trying to reduce computational overhead to a bare minimum.
If that is not possible, it might be time to think of using a proper programming language where I can enforce pass by reference to avoid unnecessary replication of arrays.

Often bsxfun uses less memory than repmat. So you can try:
for kk = 1:KT
    parfor jj = 1:KT
        umodes(kk,jj,:,:,:) = bsxfun(@times, squeeze(psi(kk,jj,:,:)), squeeze(umodes(kk,jj,:,:,:)));
    end
end
Or you can vectorize the two loops. Vectorizing is usually faster, although not necessarily more memory-efficient, so I'm not sure it helps in your case. In any case, bsxfun benefits from multithreading:
umodes = bsxfun(@times, psi, umodes);
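On newer Matlab releases (R2016b and later), implicit expansion makes even the bsxfun unnecessary: element-wise operators expand singleton dimensions automatically, so psi's implicit singleton fifth dimension is matched against the no_bands dimension of umodes. A minimal sketch, assuming the dimensions are as described in the question:
umodes = psi .* umodes;   % R2016b+: psi (KT x KT x N x M) expands across no_bands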

Is Matlab's arrayfun operation more memory-efficient than a pre-allocated for loop?

For example, would
raw = ones(1000, 1);
x = cell(1000, 1);
for i = 1:size(raw, 1)
    x{i} = some_process(raw(i));   % curly braces to fill the cell array
end
require more memory than
raw = ones(1000, 1);
x = arrayfun(@(r) some_process(r), raw, 'UniformOutput', false);
As far as my understanding goes, the pre-allocated for loop would require initialization of both
raw and x before beginning the process, thus requiring more memory overall. Is this correct?
Thank you for your help.
No, it is not more efficient, and in some cases may even perform worse. arrayfun is basically just syntactic sugar for a for loop. Matlab's JIT is able to optimize "in-place" modifications of preallocated arrays. I have not encountered any cases where arrayfun() used with a function handle performs better than a for loop and preallocated arrays on recent-ish versions of Matlab.
In your particular case, they're likely to perform about the same. So use whichever style you prefer.
Note that, strictly speaking, arrayfun does not perform "vectorized" operations. "Vectorized" refers to a built-in Matlab facility that can operate on a whole array inside the Matlab engine, in compiled code. arrayfun isn't able to magically do that.
Exception: if you are using the Matlab Parallel Computing Toolbox, or are operating on tall arrays, then arrayfun actually can parallelize operations across multiple array elements, and may go faster than a for loop.
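For concreteness, here is a minimal sketch of the two styles side by side; f is a stand-in for the undefined some_process, and both fill the same cell array:
f = @(r) r.^2 + 1;                       % stand-in for some_process
raw = ones(1000, 1);
x1 = cell(size(raw));
for i = 1:numel(raw)
    x1{i} = f(raw(i));                   % fill the preallocated cell in place
end
x2 = arrayfun(f, raw, 'UniformOutput', false);
isequal(x1, x2)                          % true: identical results
Timing both with timeit or tic/toc on a recent version will typically give comparable numbers.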

how to speed up Matlab nested for loops when I cannot vectorize the calculations?

I have three big 3D arrays of the same size [41*141*12403], named in the Matlab code below alpha, beta and ni. From them I need to calculate another 3D array of the same size, where each element is obtained from the corresponding elements of the original arrays through a calculation that combines an infinite sum and a definite integral. It therefore seems inevitable to use several nested loops. The code has now been running for several hours and is still in the first iteration of the outer loop (which needs to be performed 41 times); at this rate the program would take more than two years to finish! I don't know how to optimize the code. Please help!
the code I use:
z_len = size(KELDYSH_PARAM_r_z_t, 1); % 41 rows
r_len = size(KELDYSH_PARAM_r_z_t, 2); % 141 columns
t_len = size(KELDYSH_PARAM_r_z_t, 3); % 12403 slices
sumRes = zeros(z_len, r_len, t_len);
for z_ind = 1:z_len
    z_ind % in order to track the advancement of the calculation
    for r_ind = 1:r_len
        for t_ind = 1:t_len
            sumCurrent = 0;
            sumPrevious = inf;
            s = 0;
            while abs(sumPrevious - sumCurrent) > 1e-6
                kapa = kapa_0 + s; % some scalar
                x_of_w = (beta(z_ind,r_ind,t_ind) .* (kapa - ni(z_ind,r_ind,t_ind))).^0.5;
                sumPrevious = sumCurrent;
                sumCurrent = sumCurrent + exp(-alpha(z_ind,r_ind,t_ind) .* ...
                    (kapa - ni(z_ind,r_ind,t_ind))) .* (x_of_w.^(2*abs(m)+1)/2) .* ...
                    w_m_integral(x_of_w, m);
                s = s + 1;
            end
            sumRes(z_ind,r_ind,t_ind) = sumCurrent;
        end
    end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function res = w_m_integral(x_of_w, m)
    res = quad(@integrandFun, 0, 1, 1e-6);
    function y = integrandFun(t)
        y = exp(-x_of_w^2 * t) .* t.^(abs(m)) ./ ((1 - t).^0.5);
    end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Option 1 - more vectorising
It's a pretty complex model you're working with and not all the terms are explained, but some parts can still be further vectorised. Your alpha, beta and ni matrices are presumably static and precomputed? Your s value is a scalar and kapa could be either, so you can probably precompute the x_of_w matrix all in one go too. This would give you a very slight speedup all on its own, though you'd be spending memory to get it - 71 million points is doable these days but will call for an awful lot of hardware. Doing it once for each of your 41 rows would reduce the burden neatly.
That leaves the integral itself. The quad function doesn't accept vector inputs - it would be a nightmare, wouldn't it? - and neither does integral, which MathWorks recommends you use instead. But if your integration limits are the same in each case, why not do the integral the old-fashioned way? Find the antiderivative analytically, compute one matrix of its values at 1 and another at 0, and take the difference.
Then you can write a single loop that computes the integral for the whole input space then tests the convergence for all the matrix elements. Make a mask that notes the ones that have not converged and recalculate those with the increased s. Repeat until all have converged (or you hit a threshold for iterations).
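A rough sketch of that scheme, assuming alpha, beta, ni, kapa_0 and m are as in the question, and with W a hypothetical vectorised replacement for w_m_integral that accepts a whole array of x_of_w values:
sumRes = zeros(z_len, r_len, t_len);
active = true(z_len, r_len, t_len);      % elements that have not yet converged
s = 0;
while any(active(:))
    kapa = kapa_0 + s;
    x_of_w = sqrt(beta .* (kapa - ni));  % whole-array update
    term = exp(-alpha .* (kapa - ni)) ...
        .* (x_of_w.^(2*abs(m)+1) / 2) .* W(x_of_w, m);  % W: hypothetical vectorised integral
    sumRes(active) = sumRes(active) + term(active);
    active(active) = abs(term(active)) > 1e-6;  % increment below tolerance => converged
    s = s + 1;
end
Note the memory trade-off mentioned above: every full-size temporary here is roughly 570 MB in double precision.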
Option 2 - parallelise it
It used to be the case that matlab was much faster with vectorised operations than loops. I can't find a source for it now but I think I've read that it's become a lot faster recently with for loops too, so depending on the resources you have available you might get better results by parallelising the code you currently have. That's going to need a bit of refactoring too - the big problems are overheads while copying in data to the workers (which you can fix by chopping the inputs up into chunks and just feeding the relevant one in) and the parfor loop not allowing you to use certain variables, usually ones which cover the whole space. Again chopping them up helps.
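A sketch of that chunking idea, with the inner work moved into a hypothetical helper computeSlice so that parfor treats the big arrays as sliced variables and ships each worker only the 2-D slices it needs:
sumRes = zeros(z_len, r_len, t_len);
parfor z_ind = 1:z_len
    a = squeeze(alpha(z_ind, :, :));     % sliced inputs: each worker receives
    b = squeeze(beta(z_ind, :, :));      % only its own portion of the arrays
    n = squeeze(ni(z_ind, :, :));
    slice = computeSlice(a, b, n, kapa_0, m);   % hypothetical: wraps the r/t/while loops
    sumRes(z_ind, :, :) = reshape(slice, 1, r_len, t_len);
end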
But if you have a 2 year runtime you will need a factor of at least 100 I'm guessing, so that means a cluster! If you're at a university or somewhere where you might be able to get a few days on a 500-core cluster then go for that...
If you can write the integral in a closed form then it might be amenable to GPU computation. Those things can do certain classes of computation very fast but you have to be able to parallelise the job and reduce the actual computation to something basic comprised mainly of addition and multiplication. The CUDA libraries have done a lot of the legwork and matlab has an interface to them so have a read about those.
Option 3 - reduce the scope
Finally, if neither of the above two results in sufficient speedups, then you may have to reduce the scope of your calculation. Trim the input space as much as you can and perhaps accept a looser convergence threshold. If you know how many iterations you tend to need inside the innermost while loop (the one with the s counter in it), it may turn out that relaxing the convergence criterion significantly reduces the number of iterations, which would speed things up. The profiler can help you see where you're spending your time.
The bottom line though is that 71 million points are going to take some time to compute. You can optimise the computation only so far, the odds are that for a problem of this size you will have to throw hardware at it.

Efficient way to solve for X in AX=B in MATLAB when both A and B are big matrices

I have this problem which requires solving for X in AX=B. A is of the order 15000 x 15000 and is sparse and symmetric. B is 15000 X 7500 and is NOT sparse. What is the fastest way to solve for X?
I can think of 2 ways.
Simplest possible way, X = A\B
Using for loop,
invA = A\speye(size(A));
for i = 1:size(B,2)
    X(:,i) = invA*B(:,i);
end
Is there a better way than the above two? If not, which one is best between the two I mentioned?
First things first - never, ever compute the inverse of A. The inverse is essentially never sparse unless A is diagonal - try it for a simple tridiagonal matrix. That line on its own kills your code, memory-wise and performance-wise, and computing the inverse is numerically less accurate than other methods.
Generally, \ should work for you fine. MATLAB does recognize that your matrix is sparse and executes sparse factorization. If you give a matrix B as the right-hand side, the performance is much better than if you only solve one system of equations with a b vector. So you do that correctly. The only other technical thing you could try here is to explicitly call lu, chol, or ldl, depending on the matrix you have, and perform backward/forward substitution yourself. Maybe you save some time there.
The fact is that the methods to solve linear systems of equations, especially sparse systems, strongly depend on the problem. But in almost any (sparse) case I imagine, factorization of a 15k system should only take a fraction of a second. That is not a large system nowadays. If your code is slow, this probably means that your factor is not that sparse anymore. You need to make sure that your matrix is properly reordered to minimize the fill (added non-zero entries) during sparse factorization. That is the crucial step. Have a look at this page for some tests and explanations on how to reorder your system, and have a brief look at example reorderings in this SO thread.
Since you can answer for yourself which of the two is faster, I'll suggest some further options.
Solve it using a GPU. Plenty of details can be found online, including this SO post, a Matlab benchmark of A\b, etc.
Additionally, there's the MATLAB add-on of LAMG (Lean Algebraic Multigrid). LAMG is a fast graph Laplacian solver. It can solve Ax=b in O(m) time and storage.
If your matrix A is symmetric positive definite, then here's what you can do to solve the system efficiently and stably:
First, compute the Cholesky decomposition, A = L*L'. Since you have a sparse matrix and want to exploit that to speed up the solve, you should not apply chol directly, which would destroy the sparsity pattern through fill-in. Instead, use one of the reordering methods described here.
Then, solve the system by X = L'\(L\B)
Finally, if you are not dealing with potentially complex values, you can replace every L' by L.', which is slightly faster because it only transposes instead of computing the complex conjugate transpose.
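A minimal sketch of that recipe for a sparse SPD system, using amd for the fill-reducing ordering (other orderings such as symamd or dissect work too):
p = amd(A);                      % fill-reducing permutation
L = chol(A(p, p), 'lower');      % sparse Cholesky of the reordered matrix
X = zeros(size(B));
X(p, :) = L.' \ (L \ B(p, :));   % two triangular solves per right-hand side
On R2017b and later, dA = decomposition(A, 'chol'); X = dA \ B; packages the same idea and reuses the factorization across solves.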
Another alternative is the preconditioned conjugate gradient method, pcg in Matlab. It is very popular in practice because you can trade speed for accuracy: give it fewer iterations and it returns a (usually pretty good) approximate solution. You also never need to store A explicitly - it is enough to be able to compute matrix-vector products with A - which helps if the matrix doesn't fit into memory.
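A sketch of the pcg route with an incomplete Cholesky preconditioner; ichol needs a sparse SPD matrix, and pcg handles one right-hand side at a time, so loop over the columns of B:
Lic = ichol(A);                  % no-fill incomplete Cholesky preconditioner
X = zeros(size(B));
for i = 1:size(B, 2)
    X(:, i) = pcg(A, B(:, i), 1e-8, 200, Lic, Lic');
end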
If this takes forever to solve in your tests, you are probably going into virtual memory for the solve. A 15k square (full) matrix will require 1.8 gigabytes of RAM to store in memory.
>> 15000^2*8
ans =
1.8e+09
You will need some serious RAM to solve this, as well as the 64 bit version of MATLAB. NO factorization will help you unless you have enough RAM to solve the problem.
If your matrix is truly sparse, then are you using MATLAB's sparse form to store it? If not, then MATLAB does NOT know the matrix is sparse, and does not use a sparse factorization.
How sparse is A? Many people think that a matrix that is half full of zeros is "sparse". That would be a waste of time. On a matrix that size, you need something that is well over 99% zeros to truly gain from a sparse factorization of the matrix. This is because of fill-in. The resulting factorized matrix is almost always nearly full otherwise.
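Two quick checks worth running before anything else:
issparse(A)          % must be true, or MATLAB will use a dense factorization
nnz(A) / numel(A)    % density: well under 0.01 is where sparse factorization pays off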
If you CANNOT get more RAM (RAM is cheeeeeeeeep you know, certainly once you consider the time you have wasted trying to solve this) then you will need to try an iterative solver. Since these tools do not factorize your matrix, if it is truly sparse, then they will not go into virtual memory. This is a HUGE savings.
Since iterative tools often require a preconditioner to work as well as possible, it can take some study to find the best preconditioner.

How does MATLAB vectorized code work "under the hood"?

I understand how using vectorization in a language like MATLAB speeds up the code by removing the overhead of maintaining a loop variable, but how does the vectorization actually take place in the assembly / machine code? I mean there still has to be a loop somewhere, right?
Matlab's 'vectorization' concept is completely different from the concept of CPU vector instructions, such as SSE. This is a common misunderstanding between two groups of people: Matlab programmers and C/asm programmers. Matlab 'vectorization', as the word is commonly used, is only about expressing loops in the form of (vectors of) matrix indices, and sometimes about writing things in terms of basic matrix/vector operations (BLAS), instead of writing the loop itself. Matlab 'vectorized' code is not necessarily expressed as vectorized CPU instructions. Consider the following code:
A = rand(1000);
B = (A(1:2:end,:)+A(2:2:end,:))/2;
This code computes mean values of two adjacent matrix rows. It is a 'vectorized' Matlab expression. However, since Matlab stores matrices column-wise (columns are contiguous in memory), this operation is not trivially mapped onto SSE vectors: because we operate row-wise, the data that needs to be loaded into the vector registers is not stored contiguously in memory.
This code on the other hand
A = rand(1000);
B = (A(:,1:2:end)+A(:,2:2:end))/2;
can take advantage of SSE and streaming instructions, since we operate on two adjacent columns at a time.
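You can observe the layout effect directly with a rough timing sketch (exact numbers vary by machine and Matlab version, but the column-wise version is typically faster):
A = rand(4000);
tic; B1 = (A(1:2:end,:) + A(2:2:end,:)) / 2; toc   % row-wise: strided access
tic; B2 = (A(:,1:2:end) + A(:,2:2:end)) / 2; toc   % column-wise: contiguous access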
So, Matlab 'vectorization' is not equivalent to using CPU vector instructions. It is just a word used to signify the absence of a loop written in Matlab itself. To add to the confusion, people sometimes even use the word to mean that a loop has been replaced by a built-in function such as arrayfun or bsxfun, which is even more misleading, since those functions can be significantly slower than native Matlab loops. As robince said, not all loops are slow in Matlab nowadays, though you do need to know when they work well and when they don't.
Either way, there is always a loop somewhere; it is just implemented inside Matlab's built-in functions / BLAS instead of in the user's Matlab code.
Yes, there is still a loop, but it runs directly in compiled code. Loops in Fortran (on which Matlab was originally based), C, or C++ are not inherently slow. That they are slow in Matlab is a property of the dynamic runtime (they are also slower in other dynamic languages like Python).
Since Matlab introduced a just-in-time (JIT) compiler, loop performance has increased dramatically, so the old guidance to avoid loops is less important with recent versions than it once was.