I am calculating the solution of a constrained linear least-squares problem as follows:
lb = zeros(7,1);
ub = ones(7,1);
for i = 1:size(b,2)
x(:,i) = lsqlin(C,b(:,i),[],[],[],[],lb,ub);
end
where C is m x 7 and b is m x n. n is quite large, leading to slow computation times. Is there any way to speed up this procedure and get rid of the slow for loop? I am using lsqlin instead of pinv or \ because I need to constrain my solution to the interval from 0 to 1 (lb and ub).
The for loop is not necessarily the reason for any slowness: you're not pre-allocating, and lsqlin is probably printing a lot of output on each iteration. However, you may be able to speed this up by turning your C matrix into a sparse block-diagonal matrix, C2, with n identical blocks (see here). This solves all n problems in one go. If the new C2 is not sparse, you may use a lot more memory and the computation may take much longer than with the for loop.
n = size(b,2);
C2 = kron(speye(n),C);
b2 = b(:);
lb2 = repmat(lb,n,1); % or zeros(7*n,1);
ub2 = repmat(ub,n,1); % or ones(7*n,1);
opts = optimoptions(@lsqlin,'Algorithm','interior-point','Display','off');
x = lsqlin(C2,b2,[],[],[],[],lb2,ub2,[],opts);
Using optimoptions, I've specified the algorithm and set 'Display' to 'off' to make sure any outputs and warnings don't slow down the calculations.
On my machine this is 6 to 10 times faster than using a for loop (with proper pre-allocation and setting options). This approach assumes that the sparse C2 matrix, with m*n*7 nonzero elements, can fit in memory. If not, a for-loop-based approach will be the only option (other than writing your own specialized version of lsqlin or taking advantage of any other sparsity in the problem).
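For reference, a properly pre-allocated loop baseline with display turned off might look like this (a sketch using the variables above):
opts = optimoptions(@lsqlin,'Algorithm','interior-point','Display','off');
x = zeros(7,size(b,2)); % pre-allocate the solution matrix
for i = 1:size(b,2)
x(:,i) = lsqlin(C,b(:,i),[],[],[],[],lb,ub,[],opts);
end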
Related
I have a special algorithm where, as one of the last steps, I need to carry out a multiplication of a 3-D array with a 2-D array such that each matrix slice of the 3-D array is multiplied with the corresponding column of the 2-D array. In other words, if, say, A is an N x N x N array and B is an N x N matrix, I need to compute a matrix C of size N x N where C(:,i) = A(:,:,i)*B(:,i);.
The naive way to implement this is a loop, i.e.,
C = zeros(N,N);
for i = 1:N
C(:,i) = A(:,:,i)*B(:,i);
end
However, loops aren't the fastest in Matlab and should be avoided. I'm looking for faster ways of doing this. Right now, I use the fact that (MathJax would be great here!):
[A1 b1, A2 b2, ..., AN bN] = [A1, A2, ..., AN]*blkdiag(b1,b2,...,bN)
This gets rid of the loop; however, we have to create a block-diagonal matrix of size N^2 x N. I'm constructing it via sparse to be efficient, i.e., like this:
A_long = reshape(A,N,N^2);
b_cell = mat2cell(B,N,ones(1,N)); % convert matrix to cell array of vectors
b_cell{1} = sparse(b_cell{1}); % make first element sparse, this is enough to trigger blkdiag into sparse mode
B_blk = blkdiag(b_cell{:});
C = A_long*B_blk;
According to my benchmarks, this approach is faster than the loop by a factor of around two (for large N), despite the necessary preparations (the multiplication alone is 3 to 4 times faster than the loop).
Here is a quick benchmark I did, varying the problem size N and measuring the time for the loop and the alternative approach (with and without the preparation steps). For large N the speedup is around 2 to 2.5.
Still, this looks awfully complicated to me. Is there a simpler or better way to achieve this? This looks like it's a quite generic/standard problem so I could imagine that solutions are around, I just don't know what to search for really.
P.S.: blkdiag(A1,...,AN)*B is an obvious alternative, but here the block-diagonal matrix is already N^2 x N^2, so I don't think it can be better than what I did.
edit: Thanks to everyone for commenting! I have carried out a new benchmark on Matlab R2016b. Unfortunately, I do not have both versions on the same computer, so we cannot compare the absolute numbers, but the relative comparison is still interesting, since it has changed a bit. Here it is:
And here is a zoom on the high-N area:
A couple of observations:
SumRepDot is the solution proposed by Divakar, namely, to use squeeze(sum(bsxfun(@times,A,permute(B,[3,1,2])),2)), which on R2016b simplifies to squeeze(sum(A.*permute(B,[3,1,2]),2)). It is faster than the loop for high N by a factor of around 1.2 to 1.4.
The loop is still "slow" in the sense that the multiplication with the sparse block-diagonal matrix is much faster.
For the latter, the preparation overhead seems to become negligible for high N, which makes it overall a factor of 3 to 4 faster than the loop. This is a nice result.
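For completeness, a minimal sketch comparing the SumRepDot one-liner against the loop (hypothetical sizes; R2016b+ implicit expansion):
N = 100;
A = rand(N,N,N);
B = rand(N,N);
C1 = squeeze(sum(A.*permute(B,[3,1,2]),2)); % SumRepDot
C2 = zeros(N,N);
for i = 1:N
C2(:,i) = A(:,:,i)*B(:,i); % loop version
end
max(abs(C1(:)-C2(:))) % should be at round-off level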
I'm writing Matlab code that needs to calculate distances between vectors, and I execute
X = norm(A(:,i)-B(:,j));
%do something with X
%loop over i and j
quite often. Each individual computation is relatively small, so it is not really suitable for parfor; I thought the best idea would be to implement it with the GPU functions.
I found that pagefun and arrayfun do something like what I want, but they execute element-wise operations, not operations on whole vectors.
So my question is, is there a more clever way of calculating norms without for loops? Or if I actually need to use gpu, what is the best way to do it?
If you need the norm between every column of A and every column of B, the fastest way is probably this:
N = 1000; % number of elements
dim = 3; % number of dimensions
A = rand(dim,N, 'gpuArray');
B = rand(dim,1,N, 'gpuArray');
C = sqrt(squeeze(sum(bsxfun(@minus, A, B).^2))); % C(i,j) = norm(A(:,i) - B(:,1,j))
I get the execution time of 0.16 seconds on CPU and 0.13 seconds on the GPU.
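As a quick sanity check (a sketch using the variables above; use a small N), the vectorized result can be compared against the plain double loop:
Ah = gather(A); Bh = gather(B); Ch = gather(C); % pull data back from the GPU
D = zeros(N,N);
for i = 1:N
for j = 1:N
D(i,j) = norm(Ah(:,i) - Bh(:,1,j));
end
end
max(abs(D(:) - Ch(:))) % should be at round-off level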
Does Matlab do a full matrix multiplication when a matrix multiplication is given as an argument to the trace function?
For example, in the code below, does A*B actually happen, or are the columns of B dotted with the rows of A, then summed? Or does something else happen?
A = [2,2;2,2];
B = eye(2);
f = trace(A*B);
Yes, MATLAB calculates the product, but you can avoid it!
First, let's see what MATLAB does if you do f = trace(A*B):
I think the picture from my Performance monitor says it all really. The first bump is when I created a large A = 2*ones(n), the second, very little bump is for the creation of B = eye(n), and the last bump is where f = trace(A*B) is calculated.
Now, let's see what you get if you do it manually:
If you do it manually, you can save a lot of memory, and it's much faster.
tic
n = 6e3;
A = rand(n);
B = rand(n);
f = trace(A*B);
toc
pause(10)
tic
C(n) = 0; % pre-allocate C as a 1-by-n vector in one shot
for ii = 1:n
C(ii) = A(ii,:)*B(:,ii); % ii-th diagonal entry of A*B (a scalar, so no sum needed)
end
g = sum(C);
toc
abs(f-g) < 1e-10
Elapsed time is 11.982804 seconds.
Elapsed time is 0.540285 seconds.
ans =
1
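As an aside, the diagonal-only computation can be collapsed into a one-liner using the identity trace(A*B) = sum_ij A(i,j)*B(j,i), i.e., the sum of the element-wise product of A and B.':
h = sum(sum(A.*B.')); % trace(A*B) without forming the matrix product
abs(f-h) < 1e-10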
Now, as you asked about in the comments: "Is this still true if you use it in a function where optimization can kick in?"
This depends on what you mean here, but as a quick example:
Calculating x = inv(A)*b can be done in a few different ways. If you do:
x = A\b;
MATLAB will choose an algorithm that's best suited for your particular matrix/vector. There are many different alternatives here, depending on the structure of the matrix: is it triangular, Hermitian, sparse...? Often it's an LU (lower/upper triangular) factorization. I can pretty much guarantee that you can't write code in MATLAB that outperforms MATLAB's built-in functions here.
However, if you calculate the same thing this way:
x = inv(A)*b;
MATLAB will actually calculate the inverse of A and then multiply it by b, even though the inverse is not stored in the workspace afterwards. This is much slower and can also be inaccurate. (In the A\b approach, MATLAB will, if necessary, create a permutation matrix to ensure numerical stability.)
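As a rough illustration (hypothetical sizes; timings will vary with machine and matrix structure):
n = 5e3;
A = rand(n) + n*eye(n); % well-conditioned test matrix
b = rand(n,1);
tic; x1 = A\b; toc % factorization-based solve
tic; x2 = inv(A)*b; toc % explicit inverse: slower and less accurate
norm(x1-x2)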
I have a cell array myBasis of sparse matrices B_1, ..., B_n.
I want to evaluate in Matlab the matrix Q(i,j) = trace(B_i' * B_j).
Therefore, I wrote the following code:
for i=1:n
for j=1:n
B=myBasis{i};
C=myBasis{j};
Q(i,j)=trace(B'*C);
end
end
This already takes 68 seconds when n = 1226 and each B_i has 50 rows and 50 columns.
Is there any chance to speed this up? Usually I move for-loops out of my Matlab code into a C++ file, but I have no experience with handling a cell array of sparse matrices in C++.
As noted by Inox, Q is symmetric, and therefore you only need to explicitly compute half the entries.
Computing trace( B.'*C ) is equivalent to B(:).'*C(:):
trace(B.'*C) = sum_i [B.'*C]_ii = sum_i sum_j B_ij * C_ij
which is the sum of element-wise products and therefore equivalent to B(:).'*C(:).
When explicitly computing trace( B.'*C ), you actually compute all the entries of B.'*C only to use its diagonal later on. AFAIK, Matlab does not optimize the calculation to avoid computing all the entries.
Here's a way:
Q = zeros(n); % pre-allocate
for ii = 1:n
B = myBasis{ii};
for jj = ii:n
C = myBasis{jj};
t = full( B(:).'*C(:) ); % equivalent to trace(B'*C)!
Q(ii,jj) = t;
Q(jj,ii) = t;
end
end
PS,
It is best not to use i and j as variable names in Matlab, since they shadow the built-in imaginary unit.
PPS,
You should note that the ' operator in Matlab is not the matrix transpose but the Hermitian (complex conjugate) transpose; for the actual transpose you need to use .'. In most cases complex numbers are not involved and there is no difference between the two operators, but once complex data is introduced, confusing the two makes debugging quite a mess...
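For example, on complex data the two operators give different results:
z = [1+2i; 3-1i];
z' % complex conjugate transpose: [1-2i, 3+1i]
z.' % plain transpose: [1+2i, 3-1i]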
Well, a couple of thoughts
1) Basic stuff: A'*B = (B'*A)' and trace(A) = trace(A'). This trick alone cuts your calculations by almost 50%: your Q(i,j) matrix is symmetric, and you only need to calculate n(n+1)/2 terms (and not n²).
2) To calculate the trace you don't need to compute every term of B'*C, just the diagonal (a sketch follows below). Nevertheless, I don't know if it's easy to write a script in Matlab that is actually faster than just calculating B'*C (Matlab is pretty fast with matrix operations).
But I would definitely implement (1).
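Regarding (2), a sketch: the diagonal of B'*C can be obtained from element-wise products alone, so the trace never requires the full matrix product:
t = full(sum(sum(conj(B).*C))); % equals trace(B'*C), works for sparse B and C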
I want to vectorize the following code (where fun is a custom function):
m = zeros(R,C);
for r = 1:R
for c = 1:C
m(r,c) = fun(r,c);
end
end
Any help would be appreciated.
Just to make it clear, there is no generic "vectorized" solution if fun does not accept vectors (or matrices) for input.
That said, I'll add to nate's answer and say that in case fun does not accept matrices you can go about this with:
[rr, cc] = ndgrid(1:R, 1:C); % rr(r,c) = r, cc(r,c) = c
m = arrayfun(@(r, c) fun(r, c), rr, cc);
However, you should note that this is not a truly vectorized solution, as arrayfun has a for-loop under the hood; so while it may be prettier, it is probably slower.
Use meshgrid:
N = 100; % number of grid points
rangex=linspace(-2,2,N);
rangey=linspace(-2,2,N);
[x,y] = meshgrid(rangex,rangey);
%G=fun(x,y);
G= exp(-(x.^2+y.^2));
imagesc(G)
There are a few ways to do this:
G = @(x,y) exp(-(x.*x+y.*y));
% using meshgrid
% PROS: short, very fast
% CONS: very large memory footprint; works only on functions that accept vector/matrix input
[x,y] = meshgrid(-10:0.1:10);
m = G(x,y);
% using arrayfun
% PROS: shorter notation than loop, works on functions taking only scalars
% CONS: can be prohibitively slow, especially when nested like this
m = cell2mat(...
arrayfun(@(x)...
arrayfun(@(y) G(x,y), (-10:0.1:10).'),... % column output per x, so cell2mat builds a matrix
-10:0.1:10, 'uniformoutput', false));
% using for-loop
% PROS: intuitive to most programmers, works on functions taking scalars only
% CONS: Boilerplate can grow large, can be slow when the function G(x,y)
% is not "inlined" due to limitations in JIT
m = zeros(R, C); % pre-allocate
for ii = 1:R
for jj = 1:C
m(ii,jj) = exp(-(ii*ii+jj*jj)); % inlined
m(ii,jj) = G(ii,jj); % NOT inlined (slower)
end
end
Note that the meshgrid approach is way faster than arrayfun and the loop, but it has the potential to fill up your memory so much that this method becomes impossible to use for higher resolutions in the x or y ranges (without resorting to some sort of block-processing scheme).
I will state here that arrayfun is generally to be avoided, since it is often far slower than the loop counterpart, partly due to JIT acceleration of the loop, and partly because of the overhead of anonymous functions (nested triply, in this case).
So, for the dblquad example you mentioned in a comment: just using a loop is easiest and fastest.
Several Matlab functions can take matrices as inputs and give you matrices as outputs. But if fun is custom, it's even easier! You can often make fun accept matrices as inputs (it depends on what you are doing, of course; sometimes you just can't, but most of the time you can) and it will work. Most of the time, the difference between accepting matrices and accepting only scalars comes down to substituting * with .* (and the same for the other operators). Try:
m = []; % not necessary in this case
r = (1:R).'; % column vector
c = 1:C; % row vector
m = fun(r,c); % R-by-C via implicit expansion (use bsxfun or meshgrid on pre-R2016b)
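For instance, a hypothetical fun written with element-wise operators works directly on such inputs:
fun = @(r,c) exp(-(r.^2 + c.^2)/100); % note .^ instead of ^
m = fun((1:5).',1:4) % 5-by-4 matrix via implicit expansion (R2016b+)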