I have two matrices A and B and what I want to get is:
trace(A*B)
If I'm not mistaken, this is called the Frobenius inner product.
My concern here is efficiency. I'm afraid that this straightforward approach will first do the whole multiplication (my matrices have thousands of rows/cols) and only then take the trace of the product, while the operation I really need is much simpler. Is there a function or a syntax to do this efficiently?
Correct...summing the element-wise products will be quicker:
n = 1000
A = randn(n);
B = randn(n);
tic
sum(sum(A .* B));
toc
tic
sum(diag(A * B'));
toc
Elapsed time is 0.010015 seconds.
Elapsed time is 0.130514 seconds.
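sum(sum(A.*B)) avoids doing the full matrix multiplication. Note that it equals trace(A*B'), which is the Frobenius inner product for real matrices; for trace(A*B) without the transpose, use sum(sum(A.*B.')) instead.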
How about using vector multiplication?
(A(:)')*B(:)
Run time check
Comparing four options with A and B of size 1000-by-1000:
1. vector inner product: A(:)'*B(:) (this answer) took only 0.0011 sec.
2. Using element wise multiplication sum(sum(A.*B)) (John's answer) took 0.0035 sec.
3. Trace trace(A*B') (proposed by OP) took 0.054 sec.
4. Sum of diagonal sum(diag(A*B')) (option rejected by John) took 0.055 sec.
Take-home message: Matlab is extremely efficient when it comes to matrix/vector products. The vector inner product is about 3 times faster even than the efficient element-wise multiplication solution.
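As a quick sanity check that the options above really compute the same number, here is a minimal sketch on small integer matrices (the sizes and the variable names f_trace, f_sum, f_dot, t_plain, t_sum are my own choices for illustration, not from the answers above). It also shows the variant without the transpose.

```matlab
% Small integer matrices so that all results are exact in double precision
A = magic(4);
B = reshape(1:16, 4, 4);

f_trace = trace(A * B');       % forms the full 4-by-4 product first
f_sum   = sum(sum(A .* B));    % element-wise products only
f_dot   = A(:)' * B(:);        % one inner product of the vectorized matrices

% Without the transpose, trace(A*B) pairs A(i,j) with B(j,i)
t_plain = trace(A * B);
t_sum   = sum(sum(A .* B.'));
```

One thing to keep in mind: for complex matrices, ' is the conjugate transpose, so A(:)'*B(:) conjugates A while sum(sum(A.*B)) does not. For real inputs, as in the benchmark, the two agree.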
Benchmark code
Code used to provide the run time checks
t=zeros(1,4);
n=1000; % size of matrices
it=100; % average results over this many trials
for ii=1:it,
% random inputs
A=rand(n);
B=rand(n);
% John's rejected solution
tic;
n1=sum(diag(A*B'));
t(1)=t(1)+toc;
% element-wise solution
tic;
n2=sum(sum(A.*B));
t(2)=t(2)+toc;
% MOST efficient solution - using vector product
tic;
n3=A(:)'*B(:);
t(3)=t(3)+toc;
% using trace
tic;
n4=trace(A*B');
t(4)=t(4)+toc;
% make sure everything is correct
assert(abs(n1-n2)<1e-8 && abs(n3-n4)<1e-8 && abs(n1-n4)<1e-8);
end;
t./it
Related
[V D] = eig(A)
gives eigenvectors whose signs are not consistent: sometimes the first entry is positive, sometimes negative. That is fine for general use, but for my work I need the signs to be consistent, for example across a series of such evaluations for different A. Specifically, I would like the first entry of every eigenvector to be positive. What are some efficient ways to achieve this?
Here is what I think: use an if-else statement to flip the sign (if the first entry is negative, flip). But this does not seem efficient, as I have to evaluate eigenvectors many times.
First of all, in general eigenvalues and eigenvectors can be complex. This should be taken into account when we talk about sign. Here I assume you want the first element of all the eigenvectors to be real and positive.
This could be vectorized using bsxfun in this way:
[V, D] = eig(A);
% get the sign of the first row:
signs = sign(V(1, :));
% multiply all columns by the complex conjugate of sign of the first element:
V = bsxfun(@times, V, conj(signs));
Benchmarking:
If you compare the speed of this method with a loop containing an if statement, you will see that my suggestion is a bit slower. But to be fair, this method should be compared with an equivalent loop that can also handle complex values. These are the results of my test:
% the loop solution:
for ii = 1:size(V, 2)
V(:, ii) = V(:, ii) * conj(sign(V(1, ii)));
end
% A = rand(2);
------------------- With BSXFUN
Elapsed time is 0.744195 seconds.
------------------- With LOOP
Elapsed time is 0.500803 seconds.
% A = rand(10);
------------------- With BSXFUN
Elapsed time is 0.828464 seconds.
------------------- With LOOP
Elapsed time is 0.835429 seconds.
% A = rand(100);
------------------- With BSXFUN
Elapsed time is 1.421716 seconds.
------------------- With LOOP
Elapsed time is 4.286256 seconds.
As you can see, it depends on your application. If you have many small matrices, the loop solution is the better fit. On the other hand, if you are dealing with bigger matrices, the vectorized solution is clearly more efficient.
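To make the complex case concrete, here is a small sketch of the same normalization applied to a matrix whose eigenvectors are complex. The test matrix and the names phase and Vn are my own, and the sketch assumes that no eigenvector has a zero first entry (with a zero first entry, sign returns 0 and the approach above would zero out that column).

```matlab
A = [0 -1; 1 0];              % rotation matrix: eigenvalues are +1i and -1i
[V, D] = eig(A);

% Unit-modulus phase of the first entry of each eigenvector
% (same as sign(V(1,:)) for non-zero complex entries)
phase = V(1, :) ./ abs(V(1, :));

% Rotate every column so that its first entry becomes real and positive
Vn = bsxfun(@times, V, conj(phase));

% The rescaled columns are still eigenvectors of A
residual = norm(A * Vn - Vn * D);
```

Scaling an eigenvector by a unit-modulus constant leaves it an eigenvector, which is why the residual stays at rounding level.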
Only timing will tell what performs better, but complexity wise it is very efficient to look at the first element, and only operate on the whole vector if you find that it has the wrong sign.
So if you would have
E = rand(n,1)-0.5; % a single vector with n elements
Then this solution:
if E(1)<0
E = -E;
end
would operate on 1+n/2 elements on average
Whilst something like
E = E * sign(E(1))
would operate on 1+n elements.
That being said, I would be surprised if you find a speed difference that is worth the optimization, so feel free to go for the most intuitive solution.
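You can use the first element of each eigenvector to determine the sign and flip that column. Flipping the sign of A itself does not help, because eig(-A) only negates the eigenvalues:
[V, D] = eig(A);
V = bsxfun(@times, V, sign(V(1, :))); % real eigenvectors with non-zero first entries assumed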
Hope this helps.
If A is an n-by-n double matrix and B is an n-by-n single matrix (n is large), and we want to compute A*B, I know the resulting matrix is of type single. My concerns are:
1) Will Matlab implicitly create a temporary single matrix to store the values of A? In other words, does this kind of mixed-type computation entail larger memory usage?
2) Is this mixed-type computation slower than a same-type computation? In other words, does it slow down the program?
Should we explicitly convert the data to a single type before computing? I believe that if we knew exactly how Matlab works here, we could predict our code's behavior more accurately.
I would agree with Ander and proceed with timing to validate any claim about what to prefer (single or double precision). Here is an example for benchmarking the two approaches:
N = 1e3;
A1 = single(rand(N,N));
A2 = double(rand(N,N));
B = double(rand(N,N));
Now we can proceed to timing the two approaches. I usually do multiple repetitions of the same calculation (here I do it 100 times):
tic; for ii = 1: 100 ; C1 = A1 * B; end; toc % mixed single and double
Elapsed time is 0.600353 seconds.
tic; for ii = 1: 100 ; C2 = A2 * B; end; toc % both doubles
Elapsed time is 1.500283 seconds.
So it seems that when A is single precision (A1), the multiplication is about 2.5 times as fast in this test.
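Besides timing, it may be worth checking what the mixed product actually returns. The following is only a sketch (the size and the variable names are mine): it confirms the result class and measures how far the mixed result is from the all-double one. It does not tell you anything about temporaries Matlab may create internally.

```matlab
N = 200;
A_single = single(rand(N));
B_double = rand(N);

C_mixed = A_single * B_double;           % single * double
C_same  = double(A_single) * B_double;   % double * double on the same values

class_mixed = class(C_mixed);            % 'single'
class_same  = class(C_same);             % 'double'

% Relative difference caused by carrying out the product in single precision
rel_diff = norm(double(C_mixed) - C_same, 'fro') / norm(C_same, 'fro');
```

So the speed gain comes with single-precision accuracy in the result, which is something to weigh against the timing numbers.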
I would like to multiply each sub-block of an m-by-n matrix A by a p-by-q matrix B. Here A is divided into k sub-blocks, each of size m-by-p (so n = k*p):
A = [A_1 A_2 ... A_k]
The resulting matrix will be C = [A_1*B A_2*B ... A_k*B] and I would like to do it efficiently.
What I have tried until now is:
C = A*kron(eye(k),B)
Edited: Daniel, I think you are right. I tried 3 different ways. Computing a Kronecker product seems to be a bad idea. Even the solution with reshape runs faster than the more compact kron solution.
C1 = zeros(m,q*k); % preallocate the result so the loop does not grow it
tic
for i=1:k
C1(:,(i-1)*q+1:i*q) = A(:,(i-1)*p+1:i*p)*B;
end
toc
tic
C2 = A*kron(eye(k),B);
toc
tic
As = reshape(permute(reshape(A,m,p,[]),[1 3 2]),m*k,[]); % stack the k blocks vertically, leaving A untouched
C3 = As*B;
C3 = reshape(permute(reshape(C3,m,k,[]),[1 3 2]),m,[]); % unstack back to [A_1*B ... A_k*B]
toc
Looking at your matrix multiplication code, the code inside the loop is already perfectly optimized. You can't beat matrix multiplication. All you could cut down is the overhead of the iteration, but compared to the long runtime of a matrix multiplication that overhead is negligible.
What you attempted would be the right strategy when the operation inside the loop is trivial but the loop is iterated many times. With the following parameters, you will notice that your permute solution actually has its strengths, just not for your problem dimensions:
q=1;p=1;n=1;m=1;
k=10^6
kron totally fails. Your permute solution takes 0.006 s, while the loop takes 1.512 s.
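Before comparing timings it is worth confirming that the three variants produce the same C. Here is a minimal sketch with small, arbitrarily chosen dimensions and integer data (so the comparison is exact); the name As for the stacked matrix is mine.

```matlab
m = 3; p = 2; q = 4; k = 5;
A = reshape(1:m*p*k, m, p*k);    % A = [A_1 A_2 ... A_k], each block m-by-p
B = reshape(1:p*q, p, q);

% 1) Loop over the blocks, with the output preallocated
C1 = zeros(m, q*k);
for i = 1:k
    C1(:, (i-1)*q+1:i*q) = A(:, (i-1)*p+1:i*p) * B;
end

% 2) Multiply by the block-diagonal matrix kron(eye(k), B)
C2 = A * kron(eye(k), B);

% 3) Stack the blocks vertically, multiply once, then unstack
As = reshape(permute(reshape(A, m, p, []), [1 3 2]), m*k, []);
C3 = reshape(permute(reshape(As * B, m, k, []), [1 3 2]), m, []);
```

Note that kron(eye(k), B) is a (k*p)-by-(k*q) matrix that is almost entirely zeros, so variant 2 does far more arithmetic than the other two as k grows.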
I'm working on a piece of software in MATLAB and I believe I've reached the limit of my knowledge when it comes to optimisation and efficiency. Here's where the expertise of the people on StackOverflow might be helpful.
Using MATLAB's profiler, I've found that the last inefficient line of code is a multiplication of the following form:
function [energy] = getEnergy(S,W)
energy = -(S*W*S');
end
S is a 1xN row vector, W is an NxN matrix (it's not just a diagonal matrix though), and S' is a Nx1 column vector, whose multiplication returns a number.
I understand that this is a primitive operation, but I was wondering whether there is any way to speed this up.
I tried searching Google etc, but unfortunately I do not know the right keywords to search for. I apologise if this is a duplicate.
Thanks in advance.
Your implementation is correct, and the fastest.
You can save ~20-30% of the computation time by performing it inline in the main code, without the function call.
>> S = randn(1, 500);
>> W = randn(500);
>> tic; for k = 1 : 10000, e = -(S * W * S'); end; toc
Elapsed time is 0.321595 seconds.
If the bottleneck stems from the fact that you need to repeat this computation for a LOT of different vectors S, then you can do the following vectorization:
% s is k-by-N matrix of k row vectors
energy = -sum( ( s * W ) .* s, 2 ); % note the .* in the middle, and the minus sign as in getEnergy
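To convince yourself that the vectorized form matches calling the original expression once per vector, here is a small sketch with made-up integer data (the sizes and the names energy_loop and energy_vec are only for illustration):

```matlab
N = 4; k = 3;
W = magic(N);
s = reshape(1:k*N, k, N);        % k row vectors of length N, one per row

% Reference: evaluate -(S*W*S') separately for every row of s
energy_loop = zeros(k, 1);
for r = 1:k
    energy_loop(r) = -(s(r, :) * W * s(r, :)');
end

% Vectorized: row r of (s*W) times row r of s, summed along the row
energy_vec = -sum((s * W) .* s, 2);
```

The vectorized version computes only the k diagonal entries of s*W*s', instead of the full k-by-k product.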
I'm hoping someone can review my code below and offer hints how to speed up the section between tic and toc. The function below attempts to perform an IFFT faster than Matlab's built-in function since (1) almost all of the fft-coefficient bins are zero (i.e. 10 to 1000 bins out of 10M to 300M bins are non-zero), and (2) only the central third of the IFFT results are retained (the first and last third are discarded -- so no need to compute them in the first place).
The input variables are:
fftcoef = complex fft-coef 1D array (10 to 1000 pts long)
bins = index of fft coefficients corresponding to fftcoef (10 to 1000 pts long)
DATAn = # of pts in data before zero padding and fft (in range of 10M to 260M)
FFTn = DATAn + # of pts used to zero pad before taking fft (in range of 16M to 268M) (e.g. FFTn = 2^nextpow2(DATAn))
Currently, this code takes a few orders of magnitude longer than Matlab's ifft function approach which computes the entire spectrum then discards 2/3's of it. For example, if the input data for fftcoef and bins are 9x1 arrays (i.e. only 9 complex fft coefficients per sideband; 18 pts when considering both sidebands), and DATAn=32781534, FFTn=33554432 (i.e. 2^25), then the ifft approach takes 1.6 seconds whereas the loop below takes over 700 seconds.
I've avoided using a matrix to vectorize the nn loop since sometimes the array size for fftcoef and bins could be up to 1000 pts long, and a 260Mx1K matrix would be too large for memory unless it could be broken up somehow.
Any advice is much appreciated! Thanks in advance.
function fn_fft_v1p0(fftcoef, bins, DATAn, FFTn)
fftcoef = [fftcoef; (conj(flipud(fftcoef)))]; % fft coefficients
bins = [bins; (FFTn - flipud(bins) +2)]; % corresponding fft indices for fftcoef array
ttrend = zeros( (round(2*DATAn/3) - round(DATAn/3) + 1), 1); % preallocate
start = round(DATAn/3)-1;
tic;
for nn = start+1 : round(2*DATAn/3) % loop over desired time indices
% sum over all fft indices having non-zero coefficients
arg = 2*pi*(bins-1)*(nn-1)/FFTn;
ttrend(nn-start) = sum( fftcoef.*( cos(arg) + 1j*sin(arg)) );
end
toc;
end
Keep in mind that Matlab uses a compiled FFT library (http://www.fftw.org/) for its fft functions, which not only runs much faster than a Matlab script but is also well optimized for many use cases. So a first step might be writing your code in C/C++ and compiling it as a mex file that you can call from Matlab. That should speed up your code by at least an order of magnitude (probably more).
Besides that, one simple optimization you can do is by considering 2 things:
You assume your time series is real valued, so you can use the symmetry of the fft coefficients.
Your time series is typically much longer than your fft coefficient vector, so it is better to iterate over bins instead of time points (thus vectorizing over the longer vector).
These two points are translated to the following loop:
nn=(start+1 : round(2*DATAn/3))';
ttrend2 = zeros( (round(2*DATAn/3) - round(DATAn/3) + 1), 1);
tic;
for bn = 1:length(bins)
arg = 2*pi*(bins(bn)-1)*(nn-1)/FFTn;
ttrend2 = ttrend2 + 2*real(fftcoef(bn) * exp(1i*arg));
end
toc;
Note you have to use this loop before you expand bins and fftcoef, since the symmetry is already taken into account. This loop takes 8.3 seconds to run with the parameters from your question, while it takes on my pc 141.3 seconds to run with your code.
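If you want to check the bin loop against the full ifft on a problem small enough to inspect, something like the following sketch should do. The sizes, bins and coefficients are made up for illustration, and the reference is multiplied by FFTn because ifft divides by the transform length while the loops above do not.

```matlab
FFTn = 64; DATAn = 48;
bins = [3; 5];                       % one-sided bin indices (not DC or Nyquist)
fftcoef = [1+2i; -0.5+0.25i];        % made-up coefficients

% Reference: build the full conjugate-symmetric spectrum and invert it
X = zeros(FFTn, 1);
X(bins) = fftcoef;
X(FFTn - bins + 2) = conj(fftcoef);
x_full = real(ifft(X)) * FFTn;       % undo the 1/FFTn scaling of ifft

% Bin loop over the central third only
start = round(DATAn/3) - 1;
nn = (start+1 : round(2*DATAn/3))';
ttrend2 = zeros(numel(nn), 1);
for bn = 1:length(bins)
    arg = 2*pi*(bins(bn)-1)*(nn-1)/FFTn;
    ttrend2 = ttrend2 + 2*real(fftcoef(bn) * exp(1i*arg));
end

max_err = max(abs(ttrend2 - x_full(nn)));
```

The factor 2*real(...) is what accounts for the mirrored negative-frequency bin, which is why bins and fftcoef must not be expanded before this loop.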
I have posted a question/answer at Accelerating FFTW pruning to avoid massive zero padding which solves the problem for the C++ case using FFTW. You can use this solution by exploiting mex-files.