Efficient way to perform loops of matrix multiplications - matlab

I have an implementation that involves multiplying matrices, summing them up, and storing them. It goes like this,
A = 0;
b = 0;
for i=1:1225
% load A_i operator
A_i_obj = load([path_temp,'A_',num2str(i),'.mat']);
A_i = (A_i_obj.A);
% z_i is some variable of size Nx1 that I compute in this loop something like
% x is some variable of size Nx1 calculated above this loop
z_i = A_i*x;
% I have to perform some operations like these
y_i = A_i*(z_i + x);
A = A + A_i*A_i'
b = b + A_i*y_i;
end
% A and b will be used here something like
soln = inv(A)*b;
My problem is the large amount of simulation time being consumed by the above code. Even when the operations inside the loop are efficient (let's say ~0.01mins), the entire looped implementation still consumes about ~12-13mins. Can somebody please help me out and suggest an efficient way to do this? Thanks so much!

I don't see the point of loading the MAT-file in every iteration. Load your data only once, outside the for loop. Loading is an expensive operation and, more importantly, once the data is in a variable it doesn't have to be loaded again.
A = 0;
A_i = load('A_i.mat');
for i = 1:1225
% ...
y_i = A_i * (z_i + x);
A = A + A_i * A_i';
end
On a side note, you are assigning the output of the load function directly to a variable using the following overload:
S = load(___) loads data into S, using any of the input arguments in the previous syntax group.
- If filename is a MAT-file, then S is a structure array.
- If filename is an ASCII file, then S is a double-precision array containing data from the file.
So I suppose your file is in ASCII format, otherwise your A_i will not be a matrix but a structure array. Also, don't use the ' operator to transpose a matrix, but .', since the first one corresponds to the complex conjugate transpose:
A = A + A_i * A_i.';
Since you omitted a part of the code running inside the loop, I can't do more in order to improve its performance.

Profile, profile, profile
I would guess that the process of loading the matrices is killing you, but it is hard to say. To get a better idea of which step is killing you, start your code with
profile on
and end it with
profile viewer
Then run it again. When the code completes, it will show you the time taken by each call, which will help you figure out where the problem is.
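If the profiler GUI is inconvenient, the same guess (loading versus arithmetic) can be checked roughly with tic/toc. The following is only a sketch: the temporary folder, the matrix size N and the number of files n_files are made-up stand-ins for the asker's data, and A_sum replaces the accumulator named A in the question so that it does not clash with the variable stored in each file.

```matlab
% Hypothetical stand-in data: write a few small operators to a temp folder
path_temp = [tempname, filesep];
mkdir(path_temp);
N = 50; n_files = 5;                 % assumed sizes, not from the question
for i = 1:n_files
    A = randn(N);
    save([path_temp,'A_',num2str(i),'.mat'],'A');
end
x = randn(N,1);

t_load = 0; t_math = 0;              % accumulated wall-clock times
A_sum = zeros(N); b = zeros(N,1);
for i = 1:n_files
    t0 = tic;                        % time spent loading
    A_i_obj = load([path_temp,'A_',num2str(i),'.mat']);
    A_i = A_i_obj.A;
    t_load = t_load + toc(t0);

    t0 = tic;                        % time spent on the arithmetic
    z_i = A_i*x;
    y_i = A_i*(z_i + x);
    A_sum = A_sum + A_i*A_i';
    b = b + A_i*y_i;
    t_math = t_math + toc(t0);
end
fprintf('load: %.4f s, arithmetic: %.4f s\n', t_load, t_math);
rmdir(path_temp,'s');                % clean up the temporary files
```

Whichever of the two totals dominates tells you where to focus; the profiler gives the same information per line.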

Related

MATLAB code requires too much time for compiling

I am trying to compute this
in MATLAB but the code requires about 8 hours to compile. In particular e, Ft=[h(t);q(t)] and Omega are 2x1 matrices (e' is 1x2), Gamma is a 2x2 matrix and n=30. Can someone help me to optimize this code?
I tried in this way:
aux=[0;0];
for k=0:29
for j=1:k-1
aux=[aux Gamma^j*Omega];
end
E(t,k+1)= e'*(sum(aux,2)+Gamma^k*[h(t);q(t)]);
end
Vix=1/30*sum(E,2);
EDIT
now I changed into this and it is faster, but I am not sure that I am applying correctly the formula in the picture...
for t=2:T
% 1. compute today's volatility
csi(t) = log(SP500(t)/SP500(t-1))-r(t)+0.5*h(t);
q(t+1) = omega+rho*q(t)+phi*((csi(t)-lambda*sqrt(h(t)))^2-h(t));
h(t+1) = q(t)+alpha*((csi(t)-lambda*sqrt(h(t)))^2-q(t))+beta*(h(t)-q(t));
for k=1:30
aux=zeros(2,k);
for j=0:k-1
aux(:,j+1)=Gamma^j*Omega;
end
E(t,k)= e'*(sum(aux,2)+Gamma^k*[h(t);q(t)]);
end
end
Vix(2:end)=1/30*sum(E(2:end,:),2);
(I don't need Vix(1))
Here are some reasons I can think of:
REPEATED COPYING (no preallocation): The main reason for the long run time is the line aux=[aux Gamma^j*Omega], in which an array is concatenated at every loop iteration. MATLAB's Code Analyzer should have flagged this for you in the editor and suggested that you preallocate the array (for example with zeros).
Essentially, when one concatenates arrays this way, MATLAB is internally making copies of the array at every loop iteration, thus, in addition to the math operations copying is taking place. As the array grows, the copying operations become ever more expensive. This is avoided by preallocation, which consists of predefining the size of the storage array (in this case the variable aux) so that MATLAB doesn't have to keep on allocating space on the go. Try:
aux = zeros(2, 406); %Creates a 2 by 406 array. I explain how I get 406 below:
p = 0; %A variable that indexes the columns of aux
for k=0:29
for j=1:k-1
p = p+1; %Update column counter
aux(:,p) = Gamma^j*Omega; % A 2x2 matrix multiplied by a 2x1 matrix yields a 2x1.
end
E(t,k+1)= e'*(sum(aux,2)+Gamma^k*[h(t);q(t)]);
end
Vix=1/30*sum(E,2);
Now, MATLAB simply overwrites the individual elements of aux instead of copying aux, and concatenating it with Gamma^j*Omega, and then overwriting aux. Essentially, the above makes MATLAB allocate space for aux ONCE instead of 406 times. I figured out that aux ends up being a 2 by 406 array for the n=30 case in the end by running this code:
p = 0;
for k = 0:29
for j = 1:k-1
p = p + 1;
end
end
To know the final size of aux for other values of n you should see if a formula for it is available (or derive your own).
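For these particular loop bounds a closed form is easy to derive: the inner loop runs max(k-1, 0) times, so the total is 0 + 0 + 1 + ... + (n-2) = (n-1)(n-2)/2. The sketch below only checks that formula against the counting loop; the variable name p_formula is introduced here for illustration.

```matlab
n = 30;
% Count the columns by brute force, exactly as in the loop above
p = 0;
for k = 0:n-1
    for j = 1:k-1
        p = p + 1;
    end
end
% Closed form for the same loop bounds
p_formula = (n-1)*(n-2)/2;
fprintf('counted: %d, formula: %d\n', p, p_formula);
```

If the loop bounds change (for example j = 0:k-1), the formula has to be rederived.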
LOOPING TRANSPOSITION OF A CONSTANT?
Next, e'. As you may know, ' is the (complex conjugate) transpose operation. From your sample code, the variable e is not edited inside the for loops, yet you have the ' operator inside the outer for loop. If you perform the transpose operation once outside the outer for loop you save yourself the expense of transposing it at every loop iteration.
RUNNING TOTAL
As a final note, I would suggest replacing sum(aux,2) with a variable that keeps a running total. This is because currently, this makes MATLAB sum over the entirety of aux at every loop iteration.
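A minimal sketch of the running-total idea, written against the loop bounds of the asker's edited version (j = 0:k-1, k = 1:30) for a single t. The numeric values of Gamma, Omega, e and Ft are invented for illustration, and whether these bounds match the intended formula is something only the asker can confirm.

```matlab
% Assumed illustrative values (not from the original post)
Gamma = [0.9 0.05; 0.02 0.95];
Omega = [0.01; 0.02];
e     = [1; 0];
Ft    = [0.04; 0.03];      % stands in for [h(t); q(t)] at one t
n     = 30;

et = e';                   % transpose once, outside the loop
S  = zeros(2,1);           % running sum of Gamma^j*Omega for j = 0..k-1
P  = eye(2);               % running power, starts at Gamma^0
E  = zeros(1,n);
for k = 1:n
    S = S + P*Omega;       % add the term j = k-1
    P = P*Gamma;           % now P equals Gamma^k
    E(k) = et*(S + P*Ft);
end
Vix_t = sum(E)/n;
```

This replaces the inner loop, the repeated matrix powers and the call to sum(aux,2) with one matrix-vector product and one 2x2 product per k.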
Hope this helps mate.

Interpolation with pre-lookup

I need to perform many (thousands) of look-up operations, where the break-points in the look-up do not change. A simple example would be,
% create some dummy data
% In practice
% - the grid will be denser, and not as regular
% - each page of v will be different
% - v will have thousands of pages
[x,y] = ndgrid(-5:0.1:5);
n_surfaces = 10;
v = repmat(sin(x.^2 + y.^2) ./ (x.^2 + y.^2),1,1,n_surfaces);
[xq,yq] = ndgrid(-5:0.2:5);
vq = nan([size(xq),n_surfaces]);
for idx = 1:n_surfaces
F = griddedInterpolant(x,y,v(:,:,idx));
vq(:,:,idx) = F(xq,yq);
end
Note that the above code can be sped up slightly by doing,
F = griddedInterpolant(x,y,v(:,:,1));
for idx = 1:n_surfaces
F.Values = v(:,:,idx);
vq(:,:,idx) = F(xq,yq);
end
However, in general interpolation is a two step process,
1. Determining the index and interval fraction for each new point
2. Performing the interpolation to obtain the new value
and in the above code both of these steps are being performed during every loop. However Step 1 will be identical in every loop and hence performing it thousands of times is inefficient. I'm wondering if anyone has a workaround to split the two steps and only perform the first step once, and just perform the second step in the loop?
(For those familiar with Simulink, this is equivalent to using the Prelookup in conjunction with multiple Interpolation Using Prelookup blocks.)
Edit:
The question linked in the comment by @rahnema1 (Precompute weights for multidimensional linear interpolation) is pretty much what I am looking for. However, on converting that code to run on the CPU (rather than a GPU), and using double arithmetic, it is about 3 times slower than using the m-code at the start of my question. That timing seems to hold irrespective of the number of surfaces being interpolated (I have tried values from 10 through to 1000.)
The problem is in performing the indexing operation V(I) used in the linked code. Even when the complete operation sum(W.*V(I),2) is implemented in a mex file, the execution times are slower than the above m-code.
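For reference, the two-step split can be sketched in plain m-code for 2-D linear interpolation as below. This is an illustration of the idea only, not the linked code: the grid vectors xg and yg, the smooth test function and all weight and index names are assumptions, and, as noted above, this style of indexing was measured to be slower than griddedInterpolant, so it should be timed before use.

```matlab
% Dummy data (grid vectors and test function are assumptions)
xg = (-5:0.1:5).';  yg = (-5:0.1:5).';
[x,y] = ndgrid(xg,yg);
n_surfaces = 10;
v = zeros([size(x), n_surfaces]);
for idx = 1:n_surfaces
    v(:,:,idx) = idx*cos(x).*sin(y);       % each page is different
end
[xq,yq] = ndgrid(-5:0.2:5);

% Step 1 (once): interval index and fraction for every query point.
% discretize returns NaN for points outside the grid, so this sketch
% assumes all query points lie inside [xg(1), xg(end)] x [yg(1), yg(end)].
nx = numel(xg);
ix = discretize(xq(:), xg);
iy = discretize(yq(:), yg);
fx = (xq(:) - xg(ix)) ./ (xg(ix+1) - xg(ix));
fy = (yq(:) - yg(iy)) ./ (yg(iy+1) - yg(iy));
i00 = ix + (iy-1)*nx;                      % linear index of the lower corner
w00 = (1-fx).*(1-fy);  w10 = fx.*(1-fy);   % bilinear weights
w01 = (1-fx).*fy;      w11 = fx.*fy;

% Step 2 (per page): weighted sum of the four corner values
vq = nan([size(xq), n_surfaces]);
for idx = 1:n_surfaces
    vp = v(:,:,idx);
    vals = w00.*vp(i00) + w10.*vp(i00+1) + w01.*vp(i00+nx) + w11.*vp(i00+nx+1);
    vq(:,:,idx) = reshape(vals, size(xq));
end
```

Because discretize accepts any increasing vector of edges, the same Step 1 works for an irregular grid.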

How to extract a submatrix without making a copy in Matlab

I have a large matrix, and I need to extract a small matrix from a sliding window that runs over the whole large matrix. During these operations the content of the extracted matrix does not change, so I'd like to extract the submatrix without creating a new copy, so that it just acts like a C pointer to a portion of the large matrix. How can I do this? Please help me, thank you very much :)
I did some benchmarking to test if not using an explicit temporary matrix is faster, and it's probably not:
function move_mean(N)
M = randi(100,N);
window_size = [50 50];
dir_time = timeit(@() direct(M,window_size))
tmp_time = timeit(@() with_tmp(M,window_size))
end
function direct(M,window_size)
m = zeros(size(M)./2);
for r = 1:size(M,1)-window_size(1)
for c = 1:size(M,2)-window_size(2)
m(r,c) = mean(mean(M(r:r+window_size(1),c:c+window_size(2))));
end
end
end
function with_tmp(M,window_size)
m = zeros(size(M)./2);
for r = 1:size(M,1)-window_size(1)
for c = 1:size(M,2)-window_size(2)
tmp = M(r:r+window_size(1),c:c+window_size(2));
m(r,c) = mean(mean(tmp));
end
end
end
for M at size 100*100:
dir_time =
0.22739
tmp_time =
0.22339
So it seems that using a temporary variable only makes your code more readable, not slower.
In this answer I describe what is the 'best' solution in general. For this answer I define 'best' as most readable without a significant performance hit. (Partially shown by the existing answer).
Basically there are 2 situations that you may be in.
1. You use your submatrix several times
In this situation the best solution in general is to create a temporary variable containing the submatrix.
A = M(rmin:rmax, cmin:cmax)
There may be ways around it (defining a function/anonymous function that indexes into the matrix for you), but in general that won't make you happy.
2. You use your submatrix only 1 time
In this case the best solution is typically exactly what you referred to in the comments:
M(rmin:rmax, cmin:cmax)
A specific case of using the submatrix only 1 time, is when it is passed once to a function. Of course the contents of the submatrix may be used in that function several times, but that is irrelevant.

Details in sparse indexing

I have some code which uses sparse indexing (and there's no way that I can get around that). I run this in a function, and use it for two problems, where the sizes of all the variables involved do not change. However, for one problem, the sparse indexing part takes 5 seconds, and for the other, takes 25 seconds.
I checked the size of every variable involved, and they are the same for both problems. I also checked that xv is a full matrix for both problem types.
So, anyone else ever run into something weird like this? Any ideas as to why this would happen? Mainly I am trying to make the code more efficient, and while 5 seconds is ok for my particular application, 25 seconds (especially when I can't explain it) is very bad.
Edit: Here is a link to a photo that profiles this weird behavior. The runtime values were recorded on the third run to ensure that the size of X is also not changing. And I did check that xv is a dense (not sparse) matrix both times.
https://www.dropbox.com/s/i41j6afanzbjdyg/weird_bcd_thing.png?dl=0
Thanks so much for any help!
Code below (runs in a for loop). If I use ptype = 1, then it's 5 seconds, ptype = 3 is 25 seconds.
clvec = cliques{k};
xcurr = full(X(clvec));
xv = reshape(xcurr - Z(offset_index(k) + 1 : offset_index(k) + ncl^2),ncl,ncl);
%these two functions both take a dense symmetric matrix and return a dense symmetric matrix, and in both cases the size is the same for a given k.
if ptype == 1
xv = proj_PSD(xv,0,0);
elseif ptype == 3
xv = proj_Schoenberg(xv,0);
end
Xd = vec(xv) - xcurr;
%THIS IS THE WEIRD LINE
tic
X(clvec) = xv;
toc;
In the 'WEIRD LINE' : X(clvec) = xv;
You are using a random access to a sparse matrix.
The cost of this access is not constant for a sparse matrix: it depends on the stored values and on the indices you are trying to access.
This is not the case for a regular (full) matrix, where access time is usually stable and faster.
To get stable access times, try to change the implementation to suit your specific matrix usage, and avoid assigning values by random access.
See the following code as a reference:
X = sparse(randi(100,50,1),randi(100,50,1),randn(1),100,100);
for i=1:10000
rand_inds{i} = randperm(10000,100);
end
for i=1:100
ti = tic;
X(rand_inds{i}) = 3;
to_X(i) = toc(ti);
end
Xf = full(X);
for i=1:100
ti = tic;
Xf(rand_inds{i}) = 3;
to_Xf(i) = toc(ti);
end
figure;plot(to_X);hold on;plot(to_Xf,'r');
I solved my problem! I'm posting the answer because I think it's interesting.
One thing I didn't mention in the question is that the loop goes from k = 1 to k = L, and for ptype = 3, we add one more step, and that's assigning all the diagonal indices to 0:
X(diag_index) = 0
where diag_index is computed ahead of time.
The problem is that, instead of storing zeros at these positions, MATLAB automatically removes those entries from the sparse matrix, so on the next loop iteration, when the diagonal indices are written again, it has to re-allocate storage for X. So, I changed that line to
X(diag_index) = eps;
and now they both run equally fast! (It's not the best solution, since that's going to be a source of error later, but there's no more mystery!)
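The effect is easy to see on a tiny example by watching nnz, the number of stored entries. This is a sketch with a made-up 3-by-3 matrix; diag_index here is simply the linear indices of its diagonal.

```matlab
X = sparse([1 2 3],[1 2 3],[5 6 7],3,3);   % three stored diagonal entries
diag_index = [1 5 9];                      % linear indices of the diagonal

n_before = nnz(X);                         % stored entries before assignment

X0 = X;
X0(diag_index) = 0;                        % zeros are not stored in a sparse matrix
n_zero = nnz(X0);

Xe = X;
Xe(diag_index) = eps;                      % tiny nonzero keeps the entries stored
n_eps = nnz(Xe);

fprintf('before: %d, after = 0: %d, after = eps: %d\n', n_before, n_zero, n_eps);
```

Assigning 0 drops the entries from storage, so writing to those positions again has to insert them afresh, while eps keeps the sparsity pattern unchanged.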
The answer is never what you think it would be...

Recursive loop optimization

Is there a way to rewrite my code to make it faster?
for i = 2:length(ECG)
u(i) = max([a*abs(ECG(i)) b*u(i-1)]);
end;
My problem is the length of ECG.
You should pre-allocate u like this
>> u = zeros(size(ECG));
or possibly like this
>> u = NaN(size(ECG));
or maybe even like this
>> u = -Inf(size(ECG));
depending on what behaviour you want.
When you pre-allocate a vector, MATLAB knows how big the vector is going to be and reserves an appropriately sized block of memory.
If you don't pre-allocate, then MATLAB has no way of knowing how large the final vector is going to be. Initially it will allocate a short block of memory. If you run out of space in that block, then it has to find a bigger block of memory somewhere, and copy all the old values into the new memory block. This happens every time you run out of space in the allocated block (which may not be every time you grow the array, because the MATLAB runtime is probably smart enough to ask for a bit more memory than it needs, but it is still more than necessary). All this unnecessary reallocating and copying is what takes a long time.
There are several ways to optimize this for loop but, surprisingly, memory pre-allocation is not the part that saves the most time. By far. You're using max to find the largest element of a 1-by-2 vector, and on each iteration you build this vector. However, all you're doing is comparing two scalars. Using the two-argument form of max and passing it two scalars is MUCH faster: 75+ times faster on my machine for large ECG vectors!
% Set the parameters and create a vector with million elements
a = 2;
b = 3;
n = 1e6;
ECG = randn(1,n);
ECG2 = a*abs(ECG); % This can be done outside the loop if you have the memory
u(1,n) = 0; % Fast zero allocation
for i = 2:length(ECG)
u(i) = max(ECG2(i),b*u(i-1)); % Compare two scalars
end
For the single input form of max (not including creation of random ECG data):
Elapsed time is 1.314308 seconds.
For my code above:
Elapsed time is 0.017174 seconds.
FYI, the code above assumes u(1) = 0. If that's not true, then u(1) should be set to its value after preallocation.
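A short sketch of that last point. The starting value u0 and the parameter values below are hypothetical (the question does not say what u(1) should be), and b is chosen below 1 here only so the recursion does not grow without bound.

```matlab
a = 2;
b = 0.9;                        % assumed value for illustration
n = 1000;
ECG  = randn(1,n);
ECG2 = a*abs(ECG);              % computed once, outside the loop

u0 = ECG2(1);                   % hypothetical choice of starting value
u = zeros(size(ECG));           % preallocate first ...
u(1) = u0;                      % ... then set the initial value
for i = 2:n
    u(i) = max(ECG2(i), b*u(i-1));   % two-argument max on scalars
end
```

The loop itself is unchanged; only the element it starts from differs.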