Understanding how to count FLOPs - matlab

I am having a hard time grasping how to count FLOPs. One moment I think I get it, and the next it makes no sense to me. Some help explaining this would greatly be appreciated. I have looked at all other posts about this topic and none have completely explained in a programming language I am familiar with (I know some MATLAB and FORTRAN).
Here is an example, from one of my books, of what I am trying to do.
For the following piece of code, the total number of flops can be written as (n*(n-1)/2)+(n*(n+1)/2) which is equivalent to n^2 + O(n).
[m,n]=size(A)
nb=n+1;
Aug=[A b];
x=zeros(n,1);
x(n)=Aug(n,nb)/Aug(n,n);
for i=n-1:-1:1
x(i) = (Aug(i,nb)-Aug(i,i+1:n)*x(i+1:n))/Aug(i,i);
end
I am trying to apply the same principle above to find the total number of FLOPs as a function of the number of equations n in the following code (MATLAB).
% e = subdiagonal vector
% f = diagonal vector
% g = superdiagonal vector
% r = right hand side vector
% x = solution vector
n=length(f);
% forward elimination
for k = 2:n
factor = e(k)/f(k­‐1);
f(k) = f(k) – factor*g(k‐1);
r(k) = r(k) – factor*r(k‐1);
end
% back substitution
x(n) = r(n)/f(n);
for k = n‐1:­‐1:1
x(k) = (r(k)‐g(k)*x(k+1))/f(k);
end

I'm by no means expert at MATLAB but I'll have a go.
I notice that none of the lines of your code index ranges of your vectors. Good, that means that every operation I see before me is involving a single pair of numbers. So I think the first loop is 5 FLOPS per iteration, and the second is 3 per iteration. And then there's that single operation in the middle.
However, MATLAB stores everything by default as a double. So the loop variable k is itself being operated on once per loop and then every time an index is calculated from it. So that's an extra 4 for the first loop and 2 for the second.
But wait - the first loop has 'k-1' twice, so in theory one could optimise that a bit by calculating and storing that, reducing the number of FLOPs by one per iteration. The MATLAB interpreter is probably able to spot that sort of optimisation for itself. And for all I know it can work out that k could in fact be an integer and everything is still okay.
So the answer to your question is that it depends. Do you want to know the number of FLOPs the CPU does, or the minimum number expressed in your code (ie the number of operations on your vectors alone), or the strict number of FLOPs that MATLAB would perform if it did no optimisation at all? MATLAB used to have a flops() function to count this sort of thing, but it's not there anymore. I'm not an expert in MATLAB by any means, but I suspect that flops() has gone because the interpreter has gotten too clever and does a lot of optimisation.
I'm slightly curious to know why you wish to know. I used to use flops() to count how many operations a piece of maths did as a crude way of estimating how much computing grunt I'd need to make it work in real time written in C.
Nowadays I look at the primitives themselves (eg there's a 1k complex FFT, that'll be 7us on that CPU according to the library datasheet, there's a 2k vector multiply, that'll be 2.5us, etc). It gets a bit tricky because one has to consider cache speeds, data set sizes, etc. The maths libraries (eg fftw) themselves are effectively opaque so that's all one can do.
So if you're counting the FLOPs for that reason you'll probably not get a very good answer.

Related

Matlab: Solve for a single variable in a linear system of equations

I have a linear system of about 2000 sparse equations in Matlab. For my final result, I only really need the value of one of the variables: the other values are irrelevant. While there is no real problem in simply solving the equations and extracting the correct variable, I was wondering whether there was a faster way or Matlab command. For example, as soon as the required variable is calculated, the program could in principle stop running.
Is there anyone who knows whether this is at all possible, or if it would just be easier to keep solving the entire system?
Most of the computation time is spent inverting the matrix, if we can find a way to avoid completely inverting the matrix then we may be able to improve the computation time. Lets assume I'm only interested in the solution for the last variable x(N). Using the standard method we compute
x = A\b;
res = x(N);
Assuming A is full rank, we can instead use LU decomposition of the augmented matrix [A b] to get x(N) which looks like this
[~,U] = lu([A b]);
res = U(end,end-1)/U(end,end);
This is essentially performing Gaussian elimination and then solving for x(N) using back-substitution.
We can extend this to find any value of x by swapping the columns of A before LU decomposition,
x_index = 123; % the index of the solution we are interested in
A(:,[x_index,end]) = A(:,[end,x_index]);
[~,U] = lu([A b]);
res = U(end,end)/U(end,end-1);
Bench-marking performance in MATLAB2017a with 10,000 random 200 dimensional systems we get a slight speed-up
Total time direct method : 4.5401s
Total time LU method : 3.9149s
Note that you may experience some precision issues if A isn't well conditioned.
Also, this approach doesn't take advantage of the sparsity of A. In my experiments even with 2000x2000 sparse matrices everything significantly slowed down and the LU method is significantly slower. That said full matrix representation only requires about 30MB which shouldn't be a problem on most computers.
If you have access to theory manuals on NASTRAN, I believe (from memory) there is coverage of partial solutions of linear systems. Also try looking for iterative or tri diagonal solvers for A*x = b. On this page, review the pqr solution answer by Shantachhani. Another reference.

Stein-haff shrinkage formula in MATLAB

I have a data matrix X of size p*n where p=10 and n=30.
Assume the covariance matrix S = XX'/n and its eigenvalues by the vector l.
I want to compute this formula in MATLAB:
I began by writing the code but I don't know how to write the sum and especially for j different to i.
clear all;
close all;
clc
p=10;
n=30;
X=rand(p,n);
S=(X*X')/n;
l=eig(S);
Any help will be very appreciated!
The simplest way is to iterate through your values in a for loop adding them to an accumulator variable. The twist in this case - excluding the entries where i==j - would be to use an if statement.
innersum = 0; % initialise it first
for j = 1:p
if i ~= j % matlab''s way of saying 'not equal to'
innersum = innersum + 1/(l(i)-l(j));
end
end
That only takes care of the big sigma term. You would need to involve the other n, p and l terms as normal.
It looks like you'll need to produce an output for each value of i so again you will need to iterate through the is in a loop, possibly storing the results in an array
for i = 1:p
% do the big sum
results(i) = output_of_quoted_equation;
end
Note that there are many other ways of doing this, most of them involving a programming concept called 'vectorisation'. If you're not familiar with it then don't worry about it, but as it's one of Matlab's major features you'll come across it a lot, and likely be given lots of answers that use it. It's conceptually more difficult than for loops, both to learn and to read, but it's widely used because it's usually much faster for massive datasets, and it's generally part of the 'Matlab way of doing things', such as there is one. However for small and even medium datasets the performance advantage is marginal.
Note also that both i and j have special meanings in Matlab (both are shortcuts to the imaginary unit, sqrt(-1)) and although it's convenient to use them as variable names it may break other things. ii and jj are good substitutes.

Vectorizing code

I dont quite get the vectorizing way of thinking of matlab, mostly due to the simple examples provided in the documentation, and i hope someone can help me understand it a little better.
So, what i'm trying to accomplish is to take a sample of NxN from a matrix of ncols x nrows x ielements and compute the average for each ielement and store the maximum of the averages. Using for loops, the code would look like this:
for x = 1+margin : nrows-margin
for y = 1+margin : ncols-margin
for i=1:ielem
% take a NxN sample
sample = input_matrix(y-margin:y+margin,x-margin:x+margin,i)
% compute the average of all elements
result(i) = mean2(sample);
end %for i
% store the max of the computed averages
output_matrix(y,x)=max(result);
end %for y
end %for x
can anyone do a good vectorization of this example of a situation ? T
First of all, vectorization is not as important as it once was, due to enhancements in compiling the code before it is ran, but it's still a very common practice and can lead to some enhancements. Older Matlab version executed one line at a time, which would leave a for loop much slower than a vectorized version of the same code.
The part of your matrix that could be vectorized is the inner more for loop. I'll show a simple example of what you are trying to do, I'll let you take the example and put it into your code.
input=randn(5,5,3);
max(mean(mean(input,1),2))
Basically, the inner two mean take the mean of the input array, and the outer max will find the maximum value over the range. If you want, you can break it out step by step, and see what it does. The mean(input,1) will take the mean over the first dimension, mean(input,2) over the second, etc. After the first two means are done, all that is left is a vector, which the max function will easily work. It should be noted that the size of the vector pre-max is [1 1 3], the dimensions are preserved when doing this operation.

Solving multiple linear systems using vectorization

Sorry if this is obvious but I searched a while and did not find anything (or missed it).
I'm trying to solve linear systems of the form Ax=B with A a 4x4 matrix, and B a 4x1 vector.
I know that for a single system I can use mldivide to obtain x: x=A\B.
However I am trying to solve a great number of systems (possibly > 10000) and I am reluctant to use a for loop because I was told it is notably slower than matrix formulation in many MATLAB problems.
My question is then: is there a way to solve Ax=B using vectorization with A 4x4x N and B a matrix 4x N ?
PS: I do not know if it is important but the B vector is the same for all the systems.
You should use a for loop. There might be a benefit in precomputing a factorization and reusing it, if A stays the same and B changes. But for your problem where A changes and B stays the same, there's no alternative to solving N linear systems.
You shouldn't worry too much about the performance cost of loops either: the MATLAB JIT compiler means that loops can often be just as fast on recent versions of MATLAB.
I don't think you can optimize this further. As explained by #Tom, since A is the one changing, there is no benefit in factoring the various A's beforehand...
Besides the looped solution is pretty fast given the dimensions you mention:
A = rand(4,4,10000);
B = rand(4,1); %# same for all linear systems
tic
X = zeros(4,size(A,3));
for i=1:size(A,3)
X(:,i) = A(:,:,i)\B;
end
toc
Elapsed time is 0.168101 seconds.
Here's the problem:
you're trying to perform a 2D operation (mldivide) on a 3d matrix. No matter how you look at it, you need reference the matrix by index which is where the time penalty kicks in... it's not the for loop which is the problem, but it's how people use them.
If you can structure your problem differently, then perhaps you can find a better option, but right now you have a few options:
1 - mex
2 - parallel processing (write a parfor loop)
3 - CUDA
Here's a rather esoteric solution that takes advantage of MATLAB's peculiar optimizations. Construct an enormous 4k x 4k sparse matrix with your 4x4 blocks down the diagonal. Then solve all simultaneously.
On my machine this gets the same solution up to single precision accuracy as #Amro/Tom's for-loop solution, but faster.
n = size(A,1);
k = size(A,3);
AS = reshape(permute(A,[1 3 2]),n*k,n);
S = sparse( ...
repmat(1:n*k,n,1)', ...
bsxfun(#plus,reshape(repmat(1:n:n*k,n,1),[],1),0:n-1), ...
AS, ...
n*k,n*k);
X = reshape(S\repmat(B,k,1),n,k);
for a random example:
For k = 10000
For loop: 0.122570 seconds.
Giant sparse system: 0.032287 seconds.
If you know that your 4x4 matrices are positive definite then you can use chol on S to improve the accuracy.
This is silly. But so is how slow matlab's for loops still are in 2015, even with JIT. This solution seems to find a sweet spot when k is not too large so everything still fits into memory.
I know this post is years old now, but I'll contribute my two cents anyway. You CAN put all of your A matricies into a bigger block diagonal matrix, where there will be 4x4 blocks on the diagonal of a big matrix. The right hand side will be all of your b vectors stacked on top of each other over and over. Once you set this up, it is represented as a sparse system, and can be efficiently solved with the algorithms mldivide chooses. The blocks are numerically decoupled, so even if there are singular blocks in there, the answers for the nonsingular blocks should be right when you use mldivide. There is a code that took this approach on MATLAB Central:
http://www.mathworks.com/matlabcentral/fileexchange/24260-multiple-same-size-linear-solver
I suggest experimenting to see if the approach is any faster than looping. I suspect it can be more efficient, especially for large numbers of small systems. In particular, if there are nice formulas for the coefficients of A across the N matricies, you can build the full left hand side using MATLAB vector operations (without looping), which could give you additional cost savings. As others have noted, vectorized operations aren't always faster, but they often are in my experience.

MATLAB: Convolution of Matrix Valued Function

I've written this code to perform the 1-d convolution of a 2-d matrix valued function (k is my time index, kend is on the order of 10e3). Is there a faster or cleaner way to do this, perhaps using built in functions?
for k=1:kend
C(:,:,k)=zeros(3);
for l=0:k-1
C(:,:,k)=C(:,:,k)+A(:,:,k-l)*B(:,:,l+1);
end
end
NEW SOLUTION:
This is a newer solution built on the older solution, which solved the previously given formula. The code in the question is actually a modification of that formula, in which the overlap between the two matrices in the third dimension is repeatedly shifted (it's akin to a convolution along the third dimension of the data). The previous solution I gave only computed the result for the last iteration of the code in the question (i.e. k = kend). So, here's a full solution that should be much more efficient than the code in the question for kend on the order of 1000:
kend = size(A,3); %# Get the value for kend
C = zeros(3,3,kend); %# Preallocate the output
Anew = reshape(flipdim(A,3),3,[]); %# Reshape A into a 3-by-3*kend matrix
Bnew = reshape(permute(B,[1 3 2]),[],3); %# Reshape B into a 3*kend-by-3 matrix
for k = 1:kend
C(:,:,k) = Anew(:,3*(kend-k)+1:end)*Bnew(1:3*k,:); %# Index Anew and Bnew so
end %# they overlap in steps
%# of three
Even when using just kend = 100, this solution came out to be about 30 times faster for me than the one in the question and about 4 times faster than a pure for-loop-based solution (which would involve 5 loops!). Note that the discussion below of floating-point accuracy still applies, so it is normal and expected that you will see slight differences between the solutions on the order of the relative floating-point accuracy.
OLD SOLUTION:
Based on this formula you linked to in a comment:
it appears that you actually want to do something different than the code you provided in the question. Assuming A and B are 3-by-3-by-k matrices, the result C should be a 3-by-3 matrix and the formula from your link written out as a set of nested for loops would look like this:
%# Solution #1: for loops
k = size(A,3);
C = zeros(3);
for i = 1:3
for j = 1:3
for r = 1:3
for l = 0:k-1
C(i,j) = C(i,j) + A(i,r,k-l)*B(r,j,l+1);
end
end
end
end
Now, it is possible to perform this operation without any for loops by reshaping and reorganizing A and B appropriately:
%# Solution #2: matrix multiply
Anew = reshape(flipdim(A,3),3,[]); %# Create a 3-by-3*k matrix
Bnew = reshape(permute(B,[1 3 2]),[],3); %# Create a 3*k-by-3 matrix
C = Anew*Bnew; %# Perform a single matrix multiply
You could even rework the code you have in your question to create a solution with a single loop that performs a matrix multiply of your 3-by-3 submatrices:
%# Solution #3: mixed (loop and matrix multiplication)
k = size(A,3);
C = zeros(3);
for l = 0:k-1
C = C + A(:,:,k-l)*B(:,:,l+1);
end
So now the question: Which one of these approaches is faster/cleaner?
Well, "cleaner" is very subjective, and I honestly couldn't tell you which of the above pieces of code makes it any easier to understand what the operation is doing. All the loops and variables in the first solution make it a little hard to track what's going on, but it clearly mirrors the formula. The second solution breaks it all down into a simple matrix operation, but it's difficult to see how it relates to the original formula. The third solution seems like a middle-ground between the two.
So, let's make speed the tie-breaker. If I time the above solutions for a number of values of k, I get these results (in seconds needed to perform 10,000 iterations of the given solution, MATLAB R2010b):
k | loop | matrix multiply | mixed
-----+--------+-----------------+--------
5 | 0.0915 | 0.3242 | 0.1657
10 | 0.1094 | 0.3093 | 0.2981
20 | 0.1674 | 0.3301 | 0.5838
50 | 0.3181 | 0.3737 | 1.3585
100 | 0.5800 | 0.4131 | 2.7311 * The matrix multiply is now fastest
200 | 1.2859 | 0.5538 | 5.9280
Well, it turns out that for smaller values of k (around 50 or less) the for-loop solution actually wins out, showing once again that for loops are not as "evil" as they used to be considered in older versions of MATLAB. Under certain circumstances, they can be more efficient than a clever vectorization. However, when the value of k is larger than around 100, the vectorized matrix-multiply solution starts to win out, scaling much more nicely with increasing k than the for-loop solution does. The mixed for-loop/matrix-multiply solution scales atrociously for reasons that I'm not exactly sure of.
So, if you expect k to be large, I'd go with the vectorized matrix-multiply solution. One thing to keep in mind is that the results you get from each solution (the matrix C) will differ ever so slightly (on the level of the floating-point precision) since the order of additions and multiplications performed for each solution are different, thus leading to a difference in accumulation of rounding errors. In short, the difference between the results for these solutions should be negligible, but you should be aware of it.
Have you looked into Matlab's conv method?
I can't compare it against your provided code, because what you provided gives me a problem with trying to access the zeroth element of A. (When k=1, k-1=0.)
Have you considered using FFTs to convolve? A convolution operation is simply a point-wise multiplication in the frequency domain. You'll have to take some precaution with finite sequences, as you'll end up with circular convolution if you're not careful (but this is trivial to take care of).
Here's a simple example for a 1D case.
>> a=rand(4,1);
>> b=rand(3,1);
>> c=conv(a,b)
c =
0.1167
0.3133
0.4024
0.5023
0.6454
0.3511
The same using FFTs
>> A=fft(a,6);
>> B=fft(b,6);
>> C=real(ifft(A.*B))
C =
0.1167
0.3133
0.4024
0.5023
0.6454
0.3511
A convolution of an M point vector and an N point vector results in an M+N-1 point vector. So, I've padded each of the vectors a and b with zeros before taking the FFT (this is automatically taken care of when I take the 4+3-1=6 point FFT of it).
EDIT
Although the equation that you showed is similar to a circular convolution, it's not exactly it. So you can ditch the FFT approach, and the built-in conv* functions. To answer your question, here's the same operation done without explicit loops:
dim1=3;dim2=dim1;
dim3=10;
a=rand(dim1,dim2,dim3);
b=rand(dim1,dim2,dim3);
mIndx=cellfun(#(x)(1:x),num2cell(1:dim3),'UniformOutput',false);
fun=#(x)sum(reshape(cell2mat(cellfun(#(y,z)a(:,:,y)*b(:,:,z),num2cell(x),num2cell(fliplr(x)),'UniformOutput',false)),[dim1,dim2,max(x)]),3);
c=reshape(cell2mat(cellfun(#(x)fun(x),mIndx,'UniformOutput',false)),[dim1,dim2,dim3]);
mIndx here is a cell, where the ith cell contains a vector 1:i. This is your l index (as others have noted, please don't use l as a variable name).
The next line is an anonymous function that does the convolution operation, making use of the fact that the k index is just the l index flipped around. The operations are carried out on individual cells, and then assembled.
The last line actually performs the operations on the matrices.
The answer is the same as that obtained with the loops. However, you'll find that the looped solution is actually an order of magnitude faster (I averaged 0.007s for my code and 0.0006s for the loop). This is because the loop is pretty straightforward, whereas with this sort of nested construction, there's plenty of function call overheads and repeated reshaping that slow it down.
MATLAB's loops have come a long way since the early days when loops were dreaded. Certainly, vectorized operations are blazing fast; but not everything can be vectorized, and sometimes, loops are more efficient than such convoluted anonymous functions. I could probably shave off a few more tenths here and there by optimizing my construction (or maybe taking a different approach), but I'm not going to do that.
Remember that good code should be readable, as well as efficient and minor optimization at the cost of readability serves no one. Although I wrote the code above, I certainly will not be able to decipher what it does if I revisited it a month later. Your looped code was clear, readable and fast and I would suggest that you stick with it.