I'm working on a piece of software in MATLAB and I believe I've reached the limit of my knowledge when it comes to optimisation and efficiency. Here's where the expertise of the people on StackOverflow might be helpful.
Using MATLAB's profiler, I've found that the last inefficient line of code is a multiplication of the following form:
function [energy] = getEnergy(S,W)
    energy = -(S*W*S');
end
S is a 1xN row vector, W is an NxN matrix (not just a diagonal matrix), and S' is an Nx1 column vector, so the product is a scalar.
I understand that this is a primitive operation, but I was wondering whether there is any way to speed this up.
I tried searching Google etc, but unfortunately I do not know the right keywords to search for. I apologise if this is a duplicate.
Thanks in advance.
Your implementation is correct and already the fastest way to compute this expression.
You can save ~20-30% of the computation time by performing it inline in the main code rather than calling a function.
>> S = randn(1, 500);
>> W = randn(500);
>> tic; for k = 1 : 10000, e = -(S * W * S'); end; toc
Elapsed time is 0.321595 seconds.
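For reference, a minimal sketch of the same loop with the function call left in (assuming getEnergy from the question is on the path); the gap between the two timings is the call overhead:
>> tic; for k = 1 : 10000, e = getEnergy(S, W); end; toc   % with call overhead
>> tic; for k = 1 : 10000, e = -(S * W * S'); end; toc     % inlined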
If the bottleneck stems from the fact that you need to repeat this computation for a LOT of different vectors S, you can use the following vectorization:
% s is a k-by-N matrix of k row vectors
energy = sum( ( s * W ) .* s, 2 ); % note the .* in the middle!
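As a quick sanity check (a sketch; the sizes below are arbitrary), the vectorized form reproduces the row-by-row quadratic forms:
s = randn(1000, 500);                    % 1000 candidate row vectors S
W = randn(500);
energy = sum( (s * W) .* s, 2 );         % all 1000 quadratic forms at once
% compare one row against the original formula (up to the minus sign):
abs( energy(1) - s(1,:) * W * s(1,:)' )  % ~0 up to round-off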
Related
J = 0;
sumTerm = 0;
for i=1:m
    sumTerm = sumTerm + ((theta(1)+theta(2)*X(i))-y(i)).^2;
end
J = (1/(2*m))*sumTerm; % note the parentheses: (1/2*m) would compute m/2
Is this the right way to do the summation?
How about this:
J = 0.5 * sum(((theta(1)*ones(size(X))+theta(2)*X)-y).^2)/m
Or, as @rayryeng pointed out, you can even drop the ones:
J = 0.5 * sum(((theta(1)+theta(2)*X)-y).^2)/m
That's correct, but you'll want to implement it vectorized instead of using loops, taking advantage of linear algebra to compute the sum for you. You can compute theta(1) + theta(2)*X(i) - y(i) for every term at once by first building a matrix X whose first column is all ones and whose second column contains your single feature / data points. The difference between the predicted output and the true output for every data point is then X*theta - y, which produces a vector of differences. This assumes that your array of points and theta are both column vectors; that is the right structure if, as it appears, you're implementing the cost function for univariate linear regression from Andrew Ng's Machine Learning course.
You can then compute the dot product of this vector with itself to get the sum of squared differences, and divide by 2*m when you're done:
vec = [ones(m,1) X]*theta - y;
J = (vec.'*vec) / (2*m);
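A minimal sketch with made-up numbers (m, X, y, and theta below are hypothetical values), showing that this matches the corrected loop version:
m = 5;
X = (1:5).';                    % single feature as a column vector
y = [2; 4; 6; 8; 10];
theta = [0.5; 1.5];             % [intercept; slope]
vec = [ones(m,1) X]*theta - y;
J = (vec.'*vec) / (2*m)         % same value as the for-loop computation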
The reason to pursue a linear algebra solution is that native matrix and vector operations in MATLAB are very fast; if you can express your computation in terms of them, it will usually be the fastest your code can ever run.
For example, see this post on why matrix multiplication in MATLAB is among the fastest when benchmarked against other platforms: Why is MATLAB so fast in matrix multiplication?
Does Matlab do a full matrix multiplication when a matrix multiplication is given as an argument to the trace function?
For example, in the code below, does A*B actually happen, or are the columns of B dotted with the rows of A, then summed? Or does something else happen?
A = [2,2;2,2];
B = eye(2);
f = trace(A*B);
Yes, MATLAB calculates the product, but you can avoid it!
First, let's see what MATLAB does if you do f = trace(A*B):
The picture from my Performance Monitor says it all, really (the memory trace is not reproduced here): the first bump is when I created a large A = 2*ones(n), the second, very small bump is the creation of B = eye(n), and the last bump is where f = trace(A*B) is calculated.
Now, let's see what you get if you do it manually:
If you do it manually, you can save a lot of memory, and it's much faster.
n = 6e3;
A = rand(n);
B = rand(n);

tic
f = trace(A*B);                % builds the full n-by-n product first
toc

pause(10)

tic
C(n) = 0;                      % preallocate C as a 1-by-n vector
for ii = 1:n
    C(ii) = A(ii,:)*B(:,ii);   % i-th diagonal entry of A*B directly
end
g = sum(C);
toc
abs(f-g) < 1e-10
Elapsed time is 11.982804 seconds.
Elapsed time is 0.540285 seconds.
ans =
1
Now, as you asked in the comments: "Is this still true if you use it in a function where optimization can kick in?"
This depends on what you mean here, but as a quick example:
Calculating x = inv(A)*b can be done in a few different ways. If you do:
x = A\b;
MATLAB will choose an algorithm that's best suited for your particular matrix/vector. There are many alternatives here, depending on the structure of the matrix: is it triangular, Hermitian, sparse...? Often it's an LU (lower/upper) factorization. I can pretty much guarantee that you can't write MATLAB code that outperforms MATLAB's built-in functions here.
However, if you calculate the same thing this way:
x = inv(A)*b;
MATLAB will actually calculate the inverse of A and then multiply it by b, even though the inverse is not stored in the workspace afterwards. This is much slower and can also be less accurate. (In the A\b approach, MATLAB will, if necessary, use a permutation matrix to ensure numerical stability.)
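A minimal sketch of that comparison (the size and the diagonally dominant test matrix are arbitrary choices):
n = 2e3;
A = rand(n) + n*eye(n);         % well-conditioned test matrix
b = rand(n, 1);
tic; x1 = A \ b;      toc       % factorization-based solve
tic; x2 = inv(A) * b; toc       % explicit inverse: slower
norm(A*x1 - b)/norm(b)          % relative residuals: backslash is
norm(A*x2 - b)/norm(b)          % typically at least as accurate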
I have a cell array myBasis of sparse matrices B_1, ..., B_n.
I want to evaluate in Matlab the matrix Q(i,j) = trace(B_i' * B_j).
Therefore, I wrote the following code:
for i=1:n
    for j=1:n
        B = myBasis{i};
        C = myBasis{j};
        Q(i,j) = trace(B'*C);
    end
end
This already takes 68 seconds when n = 1226 and each B_i is 50-by-50.
Is there any chance to speed this up? Usually I move for-loops out of my Matlab code into a C++ (MEX) file, but I have no experience handling a cell array of sparse matrices in C++.
As noted by Inox, Q is symmetric, so you only need to explicitly compute half the entries.
Computing trace( B.'*C ) is equivalent to B(:).'*C(:):
trace(B.'*C) = sum_i [B.'*C]_ii = sum_i sum_j B_ji * C_ji
which is the sum of all element-wise products, and therefore equivalent to B(:).'*C(:).
When you explicitly compute trace( B.'*C ), you are actually computing all k-by-k entries of B.'*C only to use the diagonal later on. AFAIK, Matlab does not optimize the calculation to avoid computing all the entries.
Here's a way:
Q = zeros(n);                      % preallocate the result
for ii = 1:n
    B = myBasis{ii};
    for jj = ii:n                  % exploit symmetry: upper triangle only
        C = myBasis{jj};
        t = full( B(:).'*C(:) );   % equivalent to trace(B.'*C)!
        Q(ii,jj) = t;
        Q(jj,ii) = t;
    end
end
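A quick sanity check of the equivalence on random sparse inputs (sizes and density are arbitrary):
B = sprand(50, 50, 0.1);
C = sprand(50, 50, 0.1);
abs( full(B(:).'*C(:)) - full(trace(B.'*C)) )   % ~0 up to round-off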
PS,
It is best not to use i and j as variable names in Matlab, since both are built-in names for the imaginary unit.
PPS,
You should note that the ' operator in Matlab is not the matrix transpose but the Hermitian conjugate; for an actual transpose you need to use .'. In most cases complex numbers are not involved and there is no difference between the two operators, but once complex data is introduced, confusing the two makes debugging quite a mess...
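A quick illustration of the difference on complex data (the values are arbitrary):
z = [1+2i, 3-1i];
z'      % Hermitian conjugate: gives [1-2i; 3+1i]
z.'     % plain transpose:     gives [1+2i; 3-1i]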
Well, a couple of thoughts:
1) Basic stuff: A'*B = (B'*A)' and trace(A) = trace(A'). This trick alone cuts your calculations almost in half: your Q(i,j) matrix is symmetric, so you only need to calculate n(n+1)/2 terms rather than n².
2) To calculate the trace you don't need every entry of B'*C, just the diagonal. Nevertheless, I don't know whether it's easy to write Matlab code for that which is actually faster than just calculating B'*C (Matlab is pretty fast with matrix operations).
But I would definitely implement (1).
I have two matrices A and B and what I want to get is:
trace(A*B)
If I'm not mistaken, this is called the Frobenius inner product.
My concern here is about efficiency. I'm afraid that this straightforward approach will first perform the whole multiplication (my matrices have thousands of rows/columns) and only then take the trace of the product, while the operation I really need is much simpler. Is there a function or syntax to do this efficiently?
Correct: summing the element-wise products will be quicker:
n = 1000;
A = randn(n);
B = randn(n);
tic
sum(sum(A .* B));
toc
tic
sum(diag(A * B'));
toc
Elapsed time is 0.010015 seconds.
Elapsed time is 0.130514 seconds.
sum(sum(A.*B)) avoids the full matrix multiplication: [A*B']_ii = sum_j A(i,j)*B(i,j), so the trace is just the sum of all the element-wise products.
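A one-line check of that identity (sizes arbitrary):
A = randn(200); B = randn(200);
abs( trace(A*B') - sum(sum(A.*B)) )   % ~0 up to round-off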
How about using vector multiplication?
(A(:)')*B(:)
Run time check
Comparing four options with A and B of size 1000-by-1000:
1. vector inner product: A(:)'*B(:) (this answer) took only 0.0011 sec.
2. Using element wise multiplication sum(sum(A.*B)) (John's answer) took 0.0035 sec.
3. Trace trace(A*B') (proposed by OP) took 0.054 sec.
4. Sum of diagonal sum(diag(A*B')) (option rejected by John) took 0.055 sec.
Take-home message: Matlab is extremely efficient when it comes to matrix/vector products. Using the vector inner product is about 3x faster than even the efficient element-wise multiplication solution.
Benchmark code
Code used to provide the run time checks
t = zeros(1,4);
n = 1000;   % size of matrices
it = 100;   % average results over 'it' trials
for ii = 1:it
    % random inputs
    A = rand(n);
    B = rand(n);
    % John's rejected solution
    tic;
    n1 = sum(diag(A*B'));
    t(1) = t(1) + toc;
    % element-wise solution
    tic;
    n2 = sum(sum(A.*B));
    t(2) = t(2) + toc;
    % MOST efficient solution - using vector product
    tic;
    n3 = A(:)'*B(:);
    t(3) = t(3) + toc;
    % using trace
    tic;
    n4 = trace(A*B');
    t(4) = t(4) + toc;
    % make sure everything is correct
    assert(abs(n1-n2) < 1e-8 && abs(n3-n4) < 1e-8 && abs(n1-n4) < 1e-8);
end
t./it
I need to perform lots of evaluations of the form
X(:,i)' * A * X(:,i),    i = 1...n
where X(:,i) is a vector and A is a symmetric matrix. Ostensibly, I can either do this in a loop
for i=1:n
    z(i) = X(:,i)' * A * X(:,i);
end
which is slow, or vectorise it as
z = diag(X' * A * X)
which wastes RAM unacceptably when X has a lot of columns. Currently I am compromising on
Y = A * X;
for i=1:n
    z(i) = Y(:,i)' * X(:,i);
end
which is a little faster/lighter but still seems unsatisfactory.
Is there some Matlab/Scilab idiom or trick to achieve this result more efficiently?
Try this in MATLAB:
z = sum(X.*(A*X));
This gives results equivalent to Federico's suggestion using the function DOT, but should run slightly faster. This is because DOT internally computes the result the same way I did above using the SUM function, but it also performs input-argument checks and extra computation for complex inputs, which is overhead you probably don't want or need.
A note on computational efficiency:
Even though the time difference is small between how fast the two methods run, if you are going to be performing the operation many times over it's going to start to add up. To test the relative speeds, I created two 100-by-100 matrices of random values and timed the two methods over many runs to get an average execution time:
METHOD              AVERAGE EXECUTION TIME
--------------------------------------------
Z = sum(X.*Y);      0.0002595 sec
Z = dot(X,Y);       0.0003627 sec
Using SUM instead of DOT therefore reduces the execution time of this operation by about 28% for matrices with around 10,000 elements. The larger the matrices, the more negligible the difference between the two methods becomes.
To summarize, if this computation represents a significant bottleneck in how fast your code is running, I'd go with the solution using SUM. Otherwise, either solution should be fine.
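As a minimal check that the two methods agree for real inputs (sizes arbitrary; A is symmetrized as in the question):
X = randn(100);
A = randn(100); A = (A + A')/2;   % symmetric A
z1 = sum(X.*(A*X));
z2 = dot(X, A*X);
max(abs(z1 - z2))                 % ~0; for real inputs dot(u,v) is sum(u.*v)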
Try this:
z = dot(X, A*X)
I don't have Matlab here to test, but it works on Octave, and Matlab has an analogous dot() function.
From Octave's help:
-- Function File: dot (X, Y, DIM)
Computes the dot product of two vectors. If X and Y are matrices,
calculate the dot-product along the first non-singleton dimension.
If the optional argument DIM is given, calculate the dot-product
along this dimension.
For completeness, gnovice's answer in Scilab (with Y = A*X) would be
z = sum(X .* Y, 1)'