How to speed up multiple vector convolution in MATLAB?

I'm having a problem with finding a faster way to convolve multiple vectors. All the vectors have the same length M, so these vectors can be combined as a matrix (A) with the size (N, M). N is the number of vectors.
Currently I use the code below to convolve all these vectors:
B=1;
for i=1:N
B=conv(B, A(i,:));
end
I found that this piece of code is a bottleneck in my program, since it is called frequently. Is there a way to make this calculation faster? Assume M is a small number (say 2).

It should be quite a lot faster if you implement your convolution as multiplication in the frequency domain.
Look at the way fftfilt is implemented. You can't get optimal performance using fftfilt, because you want to only convert back to time domain after all convolutions are complete, but it nicely illustrates the method.
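A minimal sketch of the frequency-domain idea, assuming the rows of A are the vectors from the question: zero-pad every row to the length of the full convolution, N*(M-1)+1, take the FFT of each row, multiply the spectra together, and transform back once. The names Lout and B_fft are made up for this illustration. For very small M and N the plain conv loop may well be faster, so it is worth timing both (for example with timeit) before switching.

```matlab
M = 2; N = 5;
A = rand(N, M);                 % each row is one vector to convolve

% Length of the full convolution of N vectors of length M
Lout = N*(M-1) + 1;

% Zero-padded FFT of every row, product over the rows, one inverse FFT
B_fft = real(ifft(prod(fft(A, Lout, 2), 1)));

% Reference: repeated conv, as in the question
B = 1;
for i = 1:N
    B = conv(B, A(i,:));
end

% Largest difference between the two results (round-off level expected)
err = max(abs(B - B_fft));
disp(err)
```

The real(...) call only discards round-off in the imaginary part, which is valid here because A is real.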

Convolution is associative. Combine the small kernels, convolve once with the data.
Test data:
M = 2; N = 5; L = 100;
A = rand(N,M);
Bsrc = rand(1,L);
Reference (convolve each kernel with data):
B = Bsrc;
for i=1:N,
B=conv(B, A(i,:));
end
Combined kernels:
A0 = 1;
for ii=1:N,
A0 = conv(A0,A(ii,:));
end
B0 = conv(Bsrc,A0);
Compare:
>> max(abs(B-B0))
ans =
2.2204e-16
If you perform this convolution often, precompute A0 so you can just do one convolution (B0 = conv(Bsrc,A0);).

Related

How to solve a linear system for only one component in MATLAB

I need to solve the linear system
A x = b
which can be done efficiently by
x = A \ b
But now A is very large and I actually only need one component, say x(1). Is there a way to solve this more efficiently than to compute all components of x?
A is not sparse. Here, efficiency is actually an issue because this is done for many b.
Also, storing the inverse of A and multiplying only its first row by b is not possible because A is badly conditioned. Using the \ operator employs the LDL solver in this case, and accuracy is lost when the inverse is used explicitly.
I don't think you'd technically get a speed-up over MATLAB's highly optimized routine. However, if you understand how the system is solved, you can solve for just one part of x. For example, a traditional QR solve uses back substitution, while an LU solve uses both forward and back substitution. The same can be done with LU. Unfortunately, back substitution starts at the end, so the component that comes out first is x(n) rather than x(1). The same is true for LDL, which uses both substitutions. That doesn't preclude the possibility that there are more efficient ways of solving your particular problem.
function [Q,R] = qrcgs(A)
%Classical Gram Schmidt for an m x n matrix
[m,n] = size(A);
% Generates the Q, R matrices
Q = zeros(m,n);
R = zeros(n,n);
for k = 1:n
% Assign the vector for normalization
w = A(:,k);
for j=1:k-1
% Gets R entries
R(j,k) = Q(:,j)'*w;
end
for j = 1:k-1
% Subtracts off orthogonal projections
w = w-R(j,k)*Q(:,j);
end
% Normalize
R(k,k) = norm(w);
Q(:,k) = w./R(k,k);
end
end
function x = backsub(R,b)
% Backsub for upper triangular matrix.
[m,n] = size(R);
p = min(m,n);
x = zeros(n,1);
for i=p:-1:1
% Look from bottom, assign to vector
r = b(i);
for j=(i+1):p
% Subtract off the difference
r = r-R(i,j)*x(j);
end
x(i) = r/R(i,i);
end
end
The mldivide function, usually written as \, can solve many systems with the same A at once.
x = A\[b1 b2 b3 b4] % where bi are column vectors with n rows
This solves the system for each b and returns an n-by-4 matrix in which each column is the solution for the corresponding b. Calling mldivide like this should improve efficiency because the decomposition is only done once.
In many decompositions, such as LU or LDL' (and in particular the one you are interested in), the matrix multiplying x is upper triangular, so the first value to be solved is x(n). However, since the LDL' decomposition has to be computed anyway, a simple backward substitution won't be the bottleneck of the code. The decomposition can therefore be saved to avoid repeating it for every bi. The code would look similar to this:
[LA,DA] = ldl(A);
DA = sparse(DA);
% LA = sparse(LA); %LA can also be converted to sparse matrix
% loop over bi
xi = LA'\(DA\(LA\bi));
% end loop
As you can see in the documentation of mldivide (Algorithms section), it performs some checks on the input matrices. With LA defined as full and DA as sparse, it should go directly to a triangular solver and a tridiagonal solver. If LA were converted to sparse, it would use a triangular solver too, and I don't know whether the conversion to sparse would bring any improvement.
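As a hedged alternative to storing the LDL factors by hand, newer MATLAB releases (R2017b and later, as far as I know) provide a decomposition object that keeps the factorization and reuses it on every call to \. The sketch below assumes a symmetric A; the test matrix and the names Bs, dA and x1 are made up for this illustration.

```matlab
n = 200;
A = randn(n);
A = A + A.';                    % symmetric, generally indefinite test matrix
Bs = randn(n, 5);               % several right-hand sides, one per column

% Factorize once; 'ldl' requests the LDL' factorization for symmetric A
dA = decomposition(A, 'ldl');

% Reuse the stored factorization for every right-hand side
x1 = zeros(1, size(Bs, 2));
for k = 1:size(Bs, 2)
    x = dA \ Bs(:, k);
    x1(k) = x(1);               % keep only the component of interest
end
```

Each solve still computes all of x, but the expensive factorization is done only once.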

Computing only necessary rows of a matrix product

Suppose I have a large (but possibly sparse) matrix A, which is K-by-K in dimension. I have another K-by-1 vector, b.
Let Ax=b. If I am only interested in the first n rows, where n < K, of x, then one way of dealing with this in MATLAB is to calculate x=A\b and take the first n elements.
If the dimension K is so large that the entire computation is infeasible, is there any other way to get these elements?
I guess one way would be to rearrange the columns of A and the rows of x so that the elements you are interested in occur at the end of x. Then you would reduce [A,b] to row echelon form. Finally, to get the components you are after, you take the lower right-hand n-by-n submatrix of the modified A (let's call it An) and solve the reduced system An * xn = bn, where xn denotes the subvector of x that you are interested in, and bn denotes the last n rows of b after the row echelon reduction.
I mean, the conversion here to echelon form is still expensive, but you don't need to solve for the rest of the components in x, which can save you time.
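The reordering idea above can be sketched with an LU factorization standing in for the row echelon reduction (that substitution is my assumption, not something the answer specifies). The wanted unknowns are moved to the end, and only the last n steps of the back substitution are carried out. The names perm, idx and x_n are made up for this illustration.

```matlab
K = 100; n = 10;
A = randn(K); b = randn(K, 1);

% Reorder the unknowns so that x(1:n) become the LAST n unknowns
perm = [n+1:K, 1:n];

% LU with partial pivoting: P*A(:,perm) = L*U
[L, U, P] = lu(A(:, perm));

% Forward substitution still has to be done in full
y = L \ (P*b);

% Back substitution only for the last n unknowns: the last n rows of the
% upper triangular U involve no other unknowns
idx = K-n+1:K;
x_n = U(idx, idx) \ y(idx);

% Compare with the first n components of the full solve
xfull = A \ b;
disp(norm(x_n - xfull(1:n)))
```

The factorization itself still costs O(K^3), so this only saves the part of the back substitution that is not needed.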
Just an idea: You could try to use Block Matrix inversion: if you block your matrix into A = [A11, A12;A21, A22], where A11 is n x n, you can compute the blocks of its inverse B = inv(A) = [B11, B12;B21, B22] via Block Matrix Inversion. There are different versions of it, you could use the one where the Schur complement you use is only of size n x n. I'm not quite sure whether it is possible to avoid any inversion that scales with K, but you could look into it.
Your solution is then x(1:n) = [B11, B12]*b. It saves you from ever computing B21, B22. Still, I'm not sure if it is worth it. Depends on the dimensions I guess.
Here is one version, though this still needs the inverse of A22 which is (K-n)x(K-n):
K = 100;
n = 10;
A = randn(K,K);
b = randn(K,1);
% reference version: full inverse
xfull = inv(A)*b;
% blocks of A
A11 = A(1:n,1:n);A12 = A(1:n,n+1:K);A21 = A(n+1:K,1:n);A22 = A(n+1:K,n+1:K);
% blocks of inverse
A22i = inv(A22); % not sure if this can be avoided
B11 = inv(A11 - A12*A22i*A21);
B12 = -B11*A12*A22i;
% solution
x_n = [B11,B12]*b;
disp(x_n - xfull(1:n))
edit: Of course, this computes the inverse "explicitly" and as such is probably much slower than just solving the linear system of equations. It could be worth it if you have several vectors b that you want to solve for with a fixed A.
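The same block elimination can be written with solves instead of explicit inverses. This is only a sketch under the assumption that A22 and the Schur complement are both nonsingular; T and S are names made up for this illustration. It follows from eliminating x(n+1:K) in the block system, which gives (A11 - A12*inv(A22)*A21)*x(1:n) = b(1:n) - A12*inv(A22)*b(n+1:K).

```matlab
K = 100; n = 10;
A = randn(K, K); b = randn(K, 1);

% Blocks of A
A11 = A(1:n, 1:n);     A12 = A(1:n, n+1:K);
A21 = A(n+1:K, 1:n);   A22 = A(n+1:K, n+1:K);

% One solve with A22 for both A21 and the lower part of b
T = A22 \ [A21, b(n+1:K)];

% n-by-n Schur complement and reduced right-hand side
S   = A11 - A12*T(:, 1:n);
x_n = S \ (b(1:n) - A12*T(:, n+1));

% Compare with the first n components of the full solve
xfull = A \ b;
disp(norm(x_n - xfull(1:n)))
```

The solve with A22 is still of size K-n, so whether this pays off depends on the dimensions and on how often A is reused.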

How can I optimize this integration code in matlab?

Using the Matlab Profiler I found that this line of code is creating a large bottleneck and slowing down my program. w,x,y,z are all 3D matrices containing the same dimensions (A x B x C) where A does not equal B and does not equal C. Is there any way to optimize this line of code to run faster?
dt = .5;
for t = 1: tstop
w(:,:,t+1)= sum( dt*(x(:,:,t:-1:1).*(y(:,:,1:t) - .002).*z(:,:,1:t)),3);
end
If you group some terms outside the for loop, you can get up to a 2x boost:
p = dt*(y - .002).*z;
for t = 1: tstop
w(:,:,t+1)= sum( x(:,:,t:-1:1).*p(:,:,1:t), 3);
end
It is now easier to notice that we are computing convolutions of x and p along the third dimension. If that dimension C (or tstop) is large, you can try to inline or optimize those convolutions.
I would reshape the 3D matrices into 2D ones, grouping the first 2 dimensions and keeping the time dimension as the second one. Then you can try to perform row-wise convolution with conv2 (if possible, as claimed in this answer), or with fft. Find below a solution with fft (and zero-padding), assuming tstop = C:
X = reshape(x, [A*B, C]); % reshape to 2D
Y = reshape(y, [A*B, C]);
Z = reshape(z, [A*B, C]);
P = dt*(Y - .002).*Z; % grouped terms
z__ = zeros(A*B, C); % zero-padding
W = real(ifft(fft([z__, X]').*fft([z__, P]'))'); % column-wise fft
W = [zeros(A*B, 1), W(:, 1:C)]; % first half
w = reshape(W, [A, B, C+1]);
The results are the same, and depending on A, B and C, this can give you a big performance boost. Example with A=13, B=14, C=1155:
original: 1.026312 seconds
grouping terms: 0.509862 seconds
FFT: 0.033699 seconds
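To check the equivalence on your own data, a small self-contained comparison may help. This sketch assumes tstop = C and uses tiny random arrays; it pads through the length argument of fft and transforms along dimension 2, so no transposes are needed. The names w_loop, w_fft and Wf are made up for this illustration.

```matlab
A = 3; B = 4; C = 50; dt = .5;
x = rand(A, B, C); y = rand(A, B, C); z = rand(A, B, C);

% Loop version with the grouped terms
p = dt*(y - .002).*z;
w_loop = zeros(A, B, C+1);
for t = 1:C
    w_loop(:, :, t+1) = sum(x(:, :, t:-1:1).*p(:, :, 1:t), 3);
end

% FFT version: one row per (a,b) pair, time along the second dimension
X = reshape(x, A*B, C);
P = reshape(p, A*B, C);
Wf = real(ifft(fft(X, 2*C, 2).*fft(P, 2*C, 2), [], 2));  % zero-padded to 2*C
w_fft = reshape([zeros(A*B, 1), Wf(:, 1:C)], A, B, C+1);

% Largest difference between the two results (round-off level expected)
disp(max(abs(w_loop(:) - w_fft(:))))
```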

Loopless Gaussian mixture model in Matlab

I have several Gaussian distributions and I want to draw different values from all of them at the same time. Since this is basically what a GMM does, I have looked into Matlab GMM implementation (gmrnd) and I have seen that it performs a simple loop over all the components.
I would like to implement it in a faster way, but the problem is that 3d matrices are involved. A simple code (with loop) would be
n = 10; % number of Gaussians
d = 2; % dimension of each Gaussian
mu = rand(d,n); % init some means
U = rand(d,d,n); % init some covariances with their Cholesky decomposition (Cov = U'*U)
I = repmat(triu(true(d,d)),1,1,n);
U(~I) = 0;
r = randn(d,n); % random values for drawing samples
samples = zeros(d,n);
for i = 1 : n
samples(:,i) = U(:,:,i)' * r(:,i) + mu(:,i);
end
Is it possible to speed it up? I do not know how to deal with the 3d covariances matrix (without using cellfun, which is much slower).
A few improvements (hopefully) could be suggested here.
Part #1 You can replace the following piece of code -
I = repmat(triu(true(d,d)),[1,1,n]);
U(~I) = 0;
with a bsxfun(@times,..) one-liner -
U = bsxfun(@times,triu(true(d,d)),U)
Part #2 You can kill the loopy portion of the code again with bsxfun(@times,..) like so -
samples = squeeze(sum(bsxfun(@times,U,permute(r,[1 3 2])),1)) + mu % sum over dim 1 because the loop multiplies by U(:,:,i)'
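On MATLAB R2016b or later, the same loopless product can be written with implicit expansion instead of bsxfun. The sketch below is only an illustration under that assumption, with a check against the loop from the question; samples_vec is a made-up name. Because the loop multiplies by the transpose U(:,:,i)', the sum runs over the first dimension.

```matlab
n = 10;                               % number of Gaussians
d = 2;                                % dimension of each Gaussian
mu = rand(d, n);                      % means
U = rand(d, d, n).*triu(true(d, d));  % upper triangular factors (Cov = U'*U)
r = randn(d, n);                      % standard normal draws

% Entry k of U(:,:,i)'*r(:,i) is sum_j U(j,k,i)*r(j,i): sum over dim 1
samples_vec = reshape(sum(U.*reshape(r, d, 1, n), 1), d, n) + mu;

% Reference loop from the question
samples = zeros(d, n);
for i = 1:n
    samples(:, i) = U(:, :, i)'*r(:, i) + mu(:, i);
end

% Largest difference between the two results (round-off level expected)
disp(max(abs(samples(:) - samples_vec(:))))
```

Whether this beats the loop depends on n and d, so benchmark before adopting it.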
I'm not fully convinced this is faster, but it gets rid of the loop. It would be interesting to see benchmarking results if you can do that. I also think this makes the code rather ugly and a bit hard to follow, but I'll let you decide between readability and performance.
Anyway, I decided to define a big n*d dimensional Gaussian where each block d of variates are independent of each other (as in the original). This allows defining the covariance as a block diagonal matrix, for which I use blkdiag. From there, it is a matter of applying bsxfun to remove the need for looping.
Using the same random seed, I can recover the same samples as your code:
%// sampling with block diagonal covariance matrix
rng(1) %// set random seed
Ub = mat2cell(U, d, d, ones(n,1)); %// 1-by-1-by-10 cell of 2-by-2 matrices
C = blkdiag(Ub{:});
Ns = 1; %// number of samples
joint_samples = bsxfun(@plus, C'*randn(d*n, Ns), mu(:));
new_samples = reshape(joint_samples, [d n]); %// or [d n Ns] if Ns > 1
%//Compare to original
rng(1) %// set same seed for repeatability
r = randn(d,n); % random values for drawing samples
samples = zeros(d,n);
for i = 1 : n
samples(:,i) = U(:,:,i)' * r(:,i) + mu(:,i);
end
isequal(samples, new_samples) %// true

Multiply an arbitrary number of matrices an arbitrary number of times

I have found several questions/answers for vectorizing and speeding up routines for multiplying a matrix and a vector in a single loop, but I am trying to do something a little more general, namely multiplying an arbitrary number of matrices together, and then performing that operation an arbitrary number of times.
I am writing a general routine for calculating thin-film reflection from an arbitrary number of layers vs optical frequency. For each optical frequency W each layer has an index of refraction N and an associated 2x2 transfer matrix L and 2x2 interface matrix I which depends on the index of refraction and the thickness of the layer. If n is the number of layers, and m is the number of frequencies, then I can vectorize the index into an n x m matrix, but then in order to calculate the reflection at each frequency, I have to do nested loops. Since I am ultimately using this as part of a fitting routine, anything I can do to speed it up would be greatly appreciated.
This should provide a minimum working example:
W = 1260:0.1:1400; %frequency in cm^-1
N = rand(4,numel(W))+1i*rand(4,numel(W)); %dummy complex index of refraction
D = [0 0.1 0.2 0]/1e4; %thicknesses in cm
[n,m] = size(N);
r = zeros(size(W));
for x = 1:m %loop over frequencies
C = eye(2); % first medium is air
for y = 2:n %loop over layers
na = N(y-1,x);
nb = N(y,x);
%I = InterfaceMatrix(na,nb); % calculate the 2x2 interface matrix
I = [1 na*nb;na*nb 1]; % dummy matrix
%L = TransferMatrix(nb) % calculate the 2x2 transfer matrix
L = [exp(-1i*nb*W(x)*D(y)) 0; 0 exp(+1i*nb*W(x)*D(y))]; % dummy matrix
C = C*I*L;
end
a = C(1,1);
c = C(2,1);
r(x) = c/a; % reflectivity, the answer I want.
end
Running this twice for two different polarizations for a three layer (air/stuff/substrate) problem with 2562 frequencies takes 0.952 seconds while solving the exact same problem with the explicit formula (vectorized) for a three layer system takes 0.0265 seconds. The problem is that beyond 3 layers, the explicit formula rapidly becomes intractable and I would have to have a different subroutine for each number of layers while the above is completely general.
Is there hope for vectorizing this code or otherwise speeding it up?
(edited to add that I've left several things out of the code to shorten it, so please don't try to use this to actually calculate reflectivity)
Edit: In order to clarify, I and L are different for each layer and for each frequency, so they change in each loop. Simply taking the exponent will not work. For a real world example, take the simplest case of a soap bubble in air. There are three layers (air/soap/air) and two interfaces. For a given frequency, the full transfer matrix C is:
C = L_air * I_air2soap * L_soap * I_soap2air * L_air;
and I_air2soap ~= I_soap2air. Thus, I start with L_air = eye(2) and then go down successive layers, computing I_(y-1,y) and L_y, multiplying them with the result from the previous loop, and going on until I get to the bottom of the stack. Then I grab the first and third values, take the ratio, and that is the reflectivity at that frequency. Then I move on to the next frequency and do it all again.
I suspect that the answer is going to somehow involve a block-diagonal matrix for each layer as mentioned below.
I'm not near MATLAB right now, so this is only a starting point.
Instead of the double loop, you can write na*nb as Nab=N(1:end-1,:).*N(2:end,:);
The term in the exponent nb*W(x)*D(y) can be written as e=N(2:end,:)*W'*D;
The result of I*L is a 2x2 block matrix that has this form:
M = [1, Nab; Nab, 1]*[e-, 0;0, e+] = [e- , Nab*e+ ; Nab*e- , e+]
with e- as exp(-1i*e), and e+ as exp(1i*e).
See kron for how to get the block matrix form. To vectorize the propagation C=C*I*L, just take M^n.
@Lama put me on the right path by suggesting block matrices, but the ultimate answer ended up being more complicated, so I put it here for posterity. Since the transfer and interface matrices are different for each layer, I leave in the loop over the layers, but construct a large sparse block matrix where each block represents a frequency.
W = 1260:0.1:1400; %frequency in cm^-1
N = rand(4,numel(W))+1i*rand(4,numel(W)); %dummy complex index of refraction
D = [0 0.1 0.2 0]/1e4; %thicknesses in cm
[n,m] = size(N);
r = zeros(size(W));
C = speye(2*m); % first medium is air
even = 2:2:2*m;
odd = 1:2:2*m-1;
for y = 2:n %loop over layers
na = N(y-1,:);
nb = N(y,:);
% get the reflection and transmission coefficients from subroutines as a vector
% of length m, one value for each frequency
%t = Tab(na, nb);
%r = Rab(na, nb);
t = rand(size(W)); % dummy vector for MWE
r = rand(size(W)); % dummy vector for MWE
% create diagonal and off-diagonal elements. each block is [1 r;r 1]/t
Id(even) = 1./t;
Id(odd) = Id(even);
Io(even) = 0;
Io(odd) = r./t;
It = [Io;Id/2].';
I = spdiags(It,[-1 0],2*m,2*m);
I = I + I.';
b = 1i.*(2*pi*D(y).*nb).*W; % thickness of the current layer y
B(even) = -b;
B(odd) = b;
L = spdiags(exp(B).',0,2*m,2*m);
C = C*I*L;
end
a = spdiags(C,0);
a = a(odd).';
c = spdiags(C,-1);
c = c(odd).';
r = c./a; % reflectivity, the answer I want.
With the 3 layer system mentioned above, it isn't quite as fast as the explicit formula, but it's close and probably can get a little faster after some profiling. The full version of the original code clocks at 0.97 seconds, the formula at 0.012 seconds and the sparse diagonal version here at 0.065 seconds.
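Another option, which is not from the answers above, is to keep the loop over layers but store the four entries of the 2x2 matrix C as 1-by-m vectors, one value per frequency, and write the product C*I*L out entry by entry. The sketch below uses the dummy I and L from the question's example, so it only illustrates the bookkeeping and not a real reflectivity calculation; c11, c12, c21, c22, g, em, ep and r_vec are names made up for this illustration.

```matlab
W = 1260:0.1:1400;                          % frequency in cm^-1
N = rand(4,numel(W)) + 1i*rand(4,numel(W)); % dummy complex index of refraction
D = [0 0.1 0.2 0]/1e4;                      % thicknesses in cm
[n, m] = size(N);

% Entries of C for all frequencies at once, starting from eye(2)
c11 = ones(1, m);  c12 = zeros(1, m);
c21 = zeros(1, m); c22 = ones(1, m);

for y = 2:n                                 % loop over layers only
    g  = N(y-1, :).*N(y, :);                % off-diagonal entry of the dummy I
    em = exp(-1i*N(y, :).*W*D(y));          % L(1,1) for every frequency
    ep = exp(+1i*N(y, :).*W*D(y));          % L(2,2) for every frequency
    % C*I*L with I = [1 g; g 1] and L = diag([em ep]), entry by entry
    t11 = (c11 + c12.*g).*em;  t12 = (c11.*g + c12).*ep;
    t21 = (c21 + c22.*g).*em;  t22 = (c21.*g + c22).*ep;
    c11 = t11; c12 = t12; c21 = t21; c22 = t22;
end

r_vec = c21./c11;                           % C(2,1)/C(1,1) at each frequency
```

With the real interface and transfer matrices, only the four update lines change, and the inner loop over frequencies disappears.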