I'm writing MATLAB code that needs to calculate distances between vectors, and I execute
X = norm(A(:,i)-B(:,j));
%do something with X
%loop over i and j
quite often. Each call is a relatively small computation, so it is not really suited to parfor; I thought the best idea would be to implement it with the GPU functions.
I found that pagefun and arrayfun do something like what I want, but they apply element-wise operations, not operations on whole vectors.
So my question is, is there a more clever way of calculating norms without for loops? Or if I actually need to use gpu, what is the best way to do it?
If you need the norm between every column of A and every column of B, the fastest way is probably this:
N = 1000; % number of elements
dim = 3; % number of dimensions
A = rand(dim,N, 'gpuArray');
B = rand(dim,1,N, 'gpuArray');
C = sqrt(squeeze(sum(bsxfun(@minus, A, B).^2))); % C(i,j) = norm(A(:,i) - B(:,j))
I get an execution time of 0.16 seconds on the CPU and 0.13 seconds on the GPU.
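On newer releases you can do the same without bsxfun: since R2016b the subtraction expands implicitly, and since R2017b vecnorm computes norms along a dimension directly. A minimal sketch, assuming A and B both start out as dim-by-N matrices:
N = 1000; % number of elements
dim = 3; % number of dimensions
A = rand(dim, N);
B = rand(dim, N);
D = A - reshape(B, dim, 1, N); % dim-by-N-by-N array of differences A(:,i) - B(:,j)
C = squeeze(vecnorm(D, 2, 1)); % C(i,j) = norm(A(:,i) - B(:,j))
The same code runs unchanged on gpuArray inputs.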
I'm trying to write code in MATLAB that does basically the same as the built-in fft function, i.e. computes the discrete Fourier transform of any given input vector.
The transform is given by
% X(k) = sum_{n=1}^{N} x(n)*exp(-j*2*pi*(k-1)*(n-1)/N),  1 <= k <= N.
Now I created my own code to do this, but it is about a factor of 200 slower than the built-in when I compare computation times. Obviously I would like to reduce this.
Below is the computational part of my code, where y is the output vector.
N = length(input_vector);
y = zeros(1, N); % preallocate the output
for k = 1:N
    for n = 1:N
        term = input_vector(n)*exp(-2*pi*1i*(n-1)*(k-1)/N);
        y(k) = y(k) + term;
    end
end
Now I think the computation is heavy because of the for loops and the line y(k) = y(k) + term, since this happens on every iteration. I reckon I should be able to speed this up by using vector/matrix notation, or by using functions with dummy variables and then iterating those functions, but I don't know how to start.
Any help or suggestions would be much appreciated.
Using implicit expansion you can greatly reduce the computation time of your algorithm:
% Vector length
N = length(input_vector);
% Vectorized DFT algorithm
y = sum(input_vector.*exp(-2*pi*1i*[0:N-1].'*[0:N-1]/N),2);
There are, however, two downsides:
The vectorization will consume a lot of memory, since an N-by-N matrix has to be created.
It won't be faster than the built-in function.
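For reference, the Signal Processing Toolbox exposes this same N-by-N matrix as dftmtx, which gives a convenient sanity check of the vectorized result against the built-in (assuming that toolbox is available):
x = rand(1, 8);
N = length(x);
y1 = dftmtx(N)*x(:); % DFT as an explicit matrix-vector product
y2 = fft(x(:));      % built-in FFT
max(abs(y1 - y2))    % should be on the order of machine precision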
I am calculating the solution of a constrained linear least-squares problem as follows:
lb = zeros(7,1);
ub = ones(7,1);
for i = 1:size(b,2)
x(:,i) = lsqlin(C,b(:,i),[],[],[],[],lb,ub);
end
where C is m x 7 and b is m x n. n is quite large, leading to a slow computation time. Is there any way to speed up this procedure and get rid of the slow for loop? I am using lsqlin instead of pinv or \ because I need to constrain my solution to the bounds 0 to 1 (lb and ub).
The for loop is not necessarily the reason for any slowness: you're not pre-allocating, and lsqlin is probably printing a lot of output on each iteration. However, you may be able to speed this up by turning your C matrix into a sparse block-diagonal matrix, C2, with n identical blocks (see here). This solves all n problems in one go. If the new C2 is not sparse you may use a lot more memory and the computation may take much longer than with the for loop.
n = size(b,2);
C2 = kron(speye(n),C);
b2 = b(:);
lb2 = repmat(lb,n,1); % or zeros(7*n,1);
ub2 = repmat(ub,n,1); % or ones(7*n,1);
opts = optimoptions(#lsqlin,'Algorithm','interior-point','Display','off');
x = lsqlin(C2,b2,[],[],[],[],lb2,ub2,[],opts);
Using optimoptions, I've specified the algorithm and set 'Display' to 'off' to make sure any outputs and warnings don't slow down the calculations.
On my machine this is 6 to 10 times faster than using a for loop (with proper pre-allocation and options set). This approach assumes that the sparse C2 matrix with m*n*7 nonzero elements can fit in memory. If not, a for-loop-based approach is the only option (other than writing your own specialized version of lsqlin or taking advantage of any other sparseness in the problem).
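For comparison, this is the kind of for-loop baseline referred to above, a sketch with pre-allocation and display suppressed:
n = size(b,2);
x = zeros(7, n); % pre-allocate the solution matrix
opts = optimoptions(@lsqlin, 'Display', 'off');
for i = 1:n
    x(:,i) = lsqlin(C, b(:,i), [], [], [], [], lb, ub, [], opts);
end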
Does Matlab do a full matrix multiplication when a matrix multiplication is given as an argument to the trace function?
For example, in the code below, does A*B actually happen, or are the columns of B dotted with the rows of A, then summed? Or does something else happen?
A = [2,2;2,2];
B = eye(2);
f = trace(A*B);
Yes, MATLAB calculates the product, but you can avoid it!
First, let's see what MATLAB does if you do f = trace(A*B):
I think the picture from my Performance Monitor says it all, really. The first bump is when I created a large A = 2*ones(n), the second, very small bump is the creation of B = eye(n), and the last bump is where f = trace(A*B) is calculated.
Now, let's see what you get if you do it manually:
If you do it manually, you can save a lot of memory, and it's much faster.
tic
n = 6e3;
A = rand(n);
B = rand(n);
f = trace(A*B);
toc
pause(10) % separate the two runs in the performance monitor
tic
C(n) = 0; % pre-allocate C by assigning its last element
for ii = 1:n
    C(ii) = A(ii,:)*B(:,ii); % ii-th diagonal element of A*B
end
g = sum(C);
toc
abs(f-g) < 1e-10
Elapsed time is 11.982804 seconds.
Elapsed time is 0.540285 seconds.
ans =
1
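Incidentally, the manual loop can itself be collapsed into a single vectorized line, since trace(A*B) is just the sum of the element-wise product of A and B.' (a sketch using the same A and B as above):
g = sum(sum(A .* B.')); % trace(A*B) = sum over i,j of A(i,j)*B(j,i)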
Now, as you asked about in the comments: "Is this still true if you use it in a function where optimization can kick in?"
This depends on what you mean here, but as a quick example:
Calculating x = inv(A)*b can be done in a few different ways. If you do:
x = A\b;
MATLAB will choose an algorithm that's best suited to your particular matrix/vector. There are many different alternatives here, depending on the structure of the matrix: is it triangular, Hermitian, sparse...? Often it's an upper/lower triangular factorization. I can pretty much guarantee that you can't write code in MATLAB that outperforms MATLAB's built-in functions here.
However, if you calculate the same thing this way:
x = inv(A)*b;
MATLAB will actually calculate the inverse of A, then multiply it by b, even though the inverse is not stored in the workspace afterwards. This is much slower, and can also be inaccurate. (In the A\b approach, MATLAB will, if necessary, use a permutation matrix to ensure numerical stability.)
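A quick way to see the difference yourself; a small timing sketch (exact numbers will vary by machine):
n = 2e3;
A = rand(n);
b = rand(n, 1);
tic; x1 = A\b; toc      % backslash: factorize and solve
tic; x2 = inv(A)*b; toc % explicit inverse: slower and less accurate
norm(x1 - x2)           % typically tiny, but nonzero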
I have the following minimal code:
N = 30;
P = 200;
x = signal; % x is the known input signal
a = lpc(x, N);
y = zeros(1, P);
y(1:N) = x(1:N);
for ii = (N+1):P
    y(ii) = -sum(a(2:end) .* y((ii-1):-1:(ii-N)));
end
The for loop over y is not efficient; is there a way to vectorize this, maybe with a MATLAB built-in function?
EDIT:
Some more context to the question: I am trying to predict a known periodic signal efficiently using lpc. For a = lpc(signal,3) I found in the MATLAB documentation that y = filter([0 -a(2:end)], 1, x) would do; how do I generalize this to lpc(signal,N)?
I used the Symbolic Math Toolbox to print out the formulas for the later y values. These formulas are very long and require about (ii-N)*N multiplications to compute y(ii) directly. A vectorized solution would have to do all these multiplications, so it would be slower.
Optimizing your loop is about all that can be done:
b = a(end:-1:2); % flip the coefficients once, outside the loop
for ii = (N+1):P
    y(ii) = -sum(b .* y((ii-N):(ii-1)));
end
Indexing backwards is slow.
I don't see an easily feasible way to vectorize it, as each position in y depends on its predecessors, so they have to be calculated step by step.
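That said, since the recursion is exactly the difference equation of an all-pole (AR) filter, you can hand the sequential work to filter, which runs it in compiled code; this also addresses the generalization asked for in the EDIT. A sketch, assuming x is a row vector and the Signal Processing Toolbox is available for filtic:
x = signal;
a = lpc(x, N);
zi = filtic(1, a, x(N:-1:1)); % filter state from the last N known samples, most recent first
y = [x(1:N), filter(0, a, zeros(1, P-N), zi)]; % zero input: pure AR recursion on past outputs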
I have a cell array myBasis of sparse matrices B_1,...,B_n.
I want to evaluate in MATLAB the matrix Q(i,j) = trace(B_i^T * B_j).
Therefore, I wrote the following code:
for i = 1:n
    for j = 1:n
        B = myBasis{i};
        C = myBasis{j};
        Q(i,j) = trace(B'*C);
    end
end
This already takes 68 seconds when n = 1226 and each B_i has 50 rows and 50 columns.
Is there any chance to speed this up? Usually I move for loops out of my MATLAB code into a C++ MEX file, but I have no experience handling a cell array of sparse matrices in C++.
As noted by Inox, Q is symmetric and therefore you only need to explicitly compute half the entries.
Computing trace( B.'*C ) is equivalent to B(:).'*C(:):
trace(B.'*C) = sum_i [B.'*C]_ii = sum_i sum_j B_ij * C_ij
which is the sum of element-wise products and therefore equivalent to B(:).'*C(:).
When explicitly computing trace( B.'*C ) you are actually computing all k^2 entries of the k-by-k matrix B.'*C only to use its diagonal later on. AFAIK, MATLAB does not optimize this calculation to avoid computing all the entries.
Here's a way
Q = zeros(n); % pre-allocate the symmetric result
for ii = 1:n
    B = myBasis{ii};
    for jj = ii:n
        C = myBasis{jj};
        t = full( B(:).'*C(:) ); % equivalent to trace(B.'*C)!
        Q(ii,jj) = t;
        Q(jj,ii) = t;
    end
end
PS,
It is best not to use i and j as variable names in Matlab.
PPS,
You should note that the ' operator in MATLAB is not the matrix transpose but the Hermitian conjugate (complex conjugate transpose); for the actual transpose you need to use .'. In most cases complex numbers are not involved and there is no difference between the two operators, but once complex data is introduced, confusing the two makes debugging quite a mess...
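A one-line illustration of the difference:
z = 1 + 2i;
z'  % ans = 1 - 2i (complex conjugate transpose)
z.' % ans = 1 + 2i (plain transpose)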
Well, a couple of thoughts:
1) Basic stuff: A'*B = (B'*A)' and trace(A) = trace(A'). This trick alone cuts your calculations by almost 50%: your Q(i,j) matrix is symmetric, and you only need to calculate n(n+1)/2 terms (and not n²).
2) To calculate the trace you don't need to compute every term of B'*C, just the diagonal. Nevertheless, I don't know if it's easy to write a script in MATLAB that is actually faster than just calculating B'*C (MATLAB is pretty fast with matrix operations).
But I would definitely implement (1).