Trace of inverse-matrix product in scipy

I'm looking to calculate the trace of an inverse-matrix product efficiently, i.e. Tr(A^-1 B). I use the inverse of A in many other places, so I have access to either the Cholesky decomposition or the explicit inverse.
The naive thing to do would be one of the following two:
np.trace(np.dot(invA, B))
np.trace(linalg.cho_solve(choA, B))
However, this seems wasteful, as it computes the full matrix product A^-1 B at O(N^3) cost, before using only its diagonal to calculate the trace.
Given the explicit inverse, the O(N^2) solution would be to do:
np.sum(invA.T * B)
Though this requires an explicit inverse, which is undesirable.
I think the ideal way to do it, would be to only calculate the diagonal elements of A^-1 B given the Cholesky decomposition and then simply sum. Is this possible using scipy? Or is there another way of calculating Tr(invA*B) in a numerically stable way, given the Cholesky decomposition?
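For what it's worth, one route that avoids the explicit inverse (a sketch of my own, not an established scipy recipe): writing A = L L^T, we have Tr(A^-1 B) = Tr(L^-T L^-1 B), which equals the elementwise sum of the product of L^-1 and L^-1 B. So two triangular solves suffice; still O(N^3) overall, but with the cheaper triangular constant and no explicit inverse of A:
import numpy as np
from scipy.linalg import solve_triangular
# assuming L is the lower-triangular Cholesky factor, A = L @ L.T
X = solve_triangular(L, B, lower=True)                   # X = inv(L) @ B
W = solve_triangular(L, np.eye(L.shape[0]), lower=True)  # W = inv(L)
trace_val = np.sum(W * X)                                # = Tr(inv(A) @ B)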

The einsum suggested by @DSM intuitively seems like the fastest way, as it only calculates the required terms. Doing this manually in Python would be slower, and unless there is a specialised routine in numpy/scipy, a mathematical trick, or something written in a lower-level language, I can't see a better way. To check timings, we set up dummy arrays,
import numpy as np
from scipy import linalg
A = np.random.rand(100,100)
A = np.dot(A, A.T) + 100*np.eye(100)  # make A symmetric positive definite so the Cholesky factorization exists
invA = linalg.inv(A)
B = np.random.rand(100,100)
choA = linalg.cho_factor(A)
Trying the naive suggestion,
%timeit np.trace(np.dot(invA, B))
10000 loops, best of 3: 38.5 us per loop
The method using the inverse
%timeit np.sum(invA.T * B)
10000 loops, best of 3: 39.1 us per loop
Seems about the same. Finally, using einsum,
%timeit np.einsum("ij,ji", invA, B)
100000 loops, best of 3: 17.4 us per loop
which seems to be approximately twice as quick.
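As a quick sanity check (my addition, using the setup above with choA from cho_factor), all three expressions should agree:
t1 = np.trace(linalg.cho_solve(choA, B))
t2 = np.sum(invA.T * B)
t3 = np.einsum("ij,ji", invA, B)
print(np.allclose([t1, t2], t3))  # expect True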


Circular convolution of binary vectors (mod 2) using NTT

Let x, y be vectors of length n, with entries either 1 or 0. I want to efficiently compute the circular convolution
(x * y) mod 2
Where each component of the result is taken mod 2.
I know how to do it using a Fast Fourier Transform
(multiply the Fourier transforms of x and y, transform back, then take each component mod 2)
However, this uses floating point calculations to solve a discrete problem and for large n (I'm interested in n ~ 10^7) it might lead to rounding errors. I expect there is a better way to do this using the number theoretic transform (NTT) but unfortunately I'm not familiar with number theory or NTT.
I looked at this website. Following the procedure there,
let's say n = 10^7. I need
a modulus M (use 10^7).
a prime N=kn+1 for some k. (use N = 3 * 10^7 + 1)
a root ω ≡ g^k mod N, where g is a generator (e.g. ω = 2744)
Do the transform, etc.
Question
This seems promising. However, I would need 32-bit integers to store each bit during this calculation?
Also, this is not making use of the fact that I only need results modulo 2.
Is there a way to make use of this to simplify the procedure?
Since I don't know the number theory, this is not obvious to me.
I'm not asking for a full solution, only for an argument as to whether my "mod 2" significantly simplifies the implementation (both in terms of the difficulty of implementing the necessary algorithms and of computational resources).
Another question: If it's not possible to simplify using "mod 2", do you think it would still pay off to use NTT, as opposed to just throwing a well-known FFT library at the floating point problem?
For the NTT, your procedure looks correct. Yes, you would need 32-bit integers for each bit in your original vector. Unfortunately, there's not a lot you can do there to make use of the fact that the end result is mod 2, since you need a root of order 10^7. You may be able to shrink that number by a couple factors of two (and doing the standard DFT for a few base levels of recursion), but it wouldn't change much, relatively speaking.
Note, for your FFT implementation, I believe you could use integer arithmetic since it's mod 2, but I'm not convinced it would be at all efficient. See this math stackexchange answer for details.
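As an aside, here is an exact alternative (my own sketch, not from the answer): pack the 0/1 vectors into Python big integers, multiply once, fold the linear convolution into a circular one, and reduce mod 2. This avoids floating point entirely, though for n ~ 10^7 the single big multiplication is exact but not fast; it is mainly useful as a correctness reference.
def circ_conv_mod2(x, y):
    # Exact circular convolution of two 0/1 sequences, reduced mod 2.
    # Each coefficient of the linear convolution is at most n, so w bits
    # per slot with 2**w > n prevents carries between slots.
    n = len(x)
    w = n.bit_length() + 1
    X = sum(b << (i * w) for i, b in enumerate(x))
    Y = sum(b << (i * w) for i, b in enumerate(y))
    Z = X * Y                                   # packed linear convolution
    mask = (1 << w) - 1
    full = [(Z >> (i * w)) & mask for i in range(2 * n - 1)]
    # fold index k + n back onto index k, then take mod 2
    return [(full[k] + (full[k + n] if k + n < 2 * n - 1 else 0)) % 2
            for k in range(n)]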

Matlab: Solve for a single variable in a linear system of equations

I have a linear system of about 2000 sparse equations in Matlab. For my final result, I only really need the value of one of the variables: the other values are irrelevant. While there is no real problem in simply solving the equations and extracting the correct variable, I was wondering whether there was a faster way or Matlab command. For example, as soon as the required variable is calculated, the program could in principle stop running.
Is there anyone who knows whether this is at all possible, or if it would just be easier to keep solving the entire system?
Most of the computation time is spent inverting the matrix; if we can find a way to avoid completely inverting it, we may be able to improve the computation time. Let's assume I'm only interested in the solution for the last variable, x(N). Using the standard method we compute
x = A\b;
res = x(N);
Assuming A is full rank, we can instead use an LU decomposition of the augmented matrix [A b] to get x(N), which looks like this
[~,U] = lu([A b]);
res = U(end,end)/U(end,end-1);
This is essentially performing Gaussian elimination and then solving for x(N) using back-substitution.
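For readers outside Matlab, here is the same trick sketched in Python/SciPy (my example, assuming a dense, full-rank A):
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(0)
N = 200
A = rng.random((N, N))
b = rng.random(N)

_, _, U = lu(np.column_stack([A, b]))  # U is upper triangular, shape (N, N+1)
res = U[-1, -1] / U[-1, -2]            # x(N) from the last row of U

print(np.isclose(res, np.linalg.solve(A, b)[-1]))  # expect True
Note that row pivoting inside lu does not affect the last-row formula, since permuting the rows of [A b] does not change the solution x.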
We can extend this to find any value of x by swapping the columns of A before LU decomposition,
x_index = 123; % the index of the solution we are interested in
A(:,[x_index,end]) = A(:,[end,x_index]);
[~,U] = lu([A b]);
res = U(end,end)/U(end,end-1);
Benchmarking performance in MATLAB R2017a with 10,000 random 200-dimensional systems, we get a slight speed-up:
Total time direct method : 4.5401s
Total time LU method : 3.9149s
Note that you may experience some precision issues if A isn't well conditioned.
Also, this approach doesn't take advantage of the sparsity of A. In my experiments, even with 2000x2000 sparse matrices everything slowed down considerably, and the LU method was significantly slower. That said, the full matrix representation only requires about 30MB, which shouldn't be a problem on most computers.
If you have access to the theory manuals for NASTRAN, I believe (from memory) there is coverage of partial solutions of linear systems. Also try looking for iterative or tridiagonal solvers for A*x = b; see, for example, the pqr-solution answer by Shantachhani.

Efficient implementation of a sequence of matrix-vector products / specific "tensor"-matrix product

I have a special algorithm where, as one of the last steps, I need to carry out a multiplication of a 3-D array with a 2-D array such that each matrix slice of the 3-D array is multiplied with the corresponding column of the 2-D array. In other words, if, say, A is an N x N x N array and B is an N x N matrix, I need to compute a matrix C of size N x N where C(:,i) = A(:,:,i)*B(:,i);.
The naive way to implement this is a loop, i.e.,
C = zeros(N,N);
for i = 1:N
C(:,i) = A(:,:,i)*B(:,i);
end
However, loops aren't the fastest in Matlab and should be avoided. I'm looking for faster ways of doing this. Right now, what I do is to use the fact that (MathJax would be great here!):
[A1 b1, A2 b2, ..., AN bN] = [A1, A2, ..., AN]*blkdiag(b1,b2,...,bN)
This allows me to get rid of the loop; however, we have to create a block-diagonal matrix of size N^2 x N. I'm building it as sparse to be efficient, i.e., like this:
A_long = reshape(A,N,N^2);
b_cell = mat2cell(B,N,ones(1,N)); % convert matrix to cell array of vectors
b_cell{1} = sparse(b_cell{1}); % make first element sparse, this is enough to trigger blkdiag into sparse mode
B_blk = blkdiag(b_cell{:});
C = A_long*B_blk;
According to my benchmarks, this approach is faster than the loop by a factor of around two (for large N), despite the necessary preparations (the multiplication alone is 3 to 4-fold faster than the loop).
Here is a quick benchmark I did, varying the problem size N and measuring the time for the loop and the alternative approach, with and without the preparation steps (plot omitted). For large N the speedup is around 2...2.5.
Still, this looks awfully complicated to me. Is there a simpler or better way to achieve this? This looks like it's a quite generic/standard problem so I could imagine that solutions are around, I just don't know what to search for really.
P.S.: blkdiag(A1,...,AN)*B is an obvious alternative but here the block diagonal is already N^2 x N^2 so I don't think it can be better than what I did.
edit: Thanks to everyone for commenting! I have carried out a new benchmark on Matlab R2016b. Unfortunately, I do not have both versions on the same computer, so we cannot compare the absolute numbers, but the relative comparison is still interesting, since it has changed a bit (benchmark plots, including a zoom on the high-N area, omitted). A couple of observations:
SumRepDot is the solution proposed by Divakar, namely, to use squeeze(sum(bsxfun(@times,A,permute(B,[3,1,2])),2)), which on R2016b simplifies to squeeze(sum(A.*permute(B,[3,1,2]),2)). It is faster than the loop for high N by a factor of around 1.2...1.4.
The loop is still "slow" in the sense that the multiplication with the sparse block-diagonal matrix is much faster.
For the latter, the preparation overhead seems to become negligible for high N, which makes it overall a factor of 3...4 faster than the loop. This is a nice result.
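For comparison, in Python/NumPy the whole contraction is a one-liner (my sketch, not from the post):
import numpy as np

N = 100
A = np.random.rand(N, N, N)
B = np.random.rand(N, N)

C = np.einsum('ijk,jk->ik', A, B)  # C[:, k] = A[:, :, k] @ B[:, k]

# check against the explicit loop
C_loop = np.stack([A[:, :, k] @ B[:, k] for k in range(N)], axis=1)
print(np.allclose(C, C_loop))  # expect True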

What is the difference between 'qr' and 'svd' in Matlab to get the singular vectors of a matrix?

Specifically, the following two pieces of code should ideally produce the same S and V. However, the second is usually faster than the first in Matlab. Can someone tell me the reason?
Moreover, which method is more numerically stable?
Thanks.
[~,S,V] = svd(B,'econ');
[Qc,Rc] = qr(B',0);
[U,S,~] = svd(Rc,'econ');
V = Qc*U;
The second method is not necessarily faster. For almost square matrices it can be slower. Consider as an example the Golub-Reinsch SVD algorithm:
Its work depends on the output you want to calculate (only S; S and V; or S, V and U).
If you want to calculate S and V without performing any preprocessing, the required work is 4mn^2 + 8n^3.
If you perform a QR decomposition first, the Householder transformation needs 2/3 n^3 + n^2 + 1/3 n - 2 operations. Now if your matrix is almost square, i.e. m = n, you will not have gained much, as R is still m x n. However, if m is larger than n, you can reduce R to an n x n matrix (this is called the thin QR factorization). Now you want to calculate U and S, which adds 12n^3 to your SVD algorithm.
So, SVD only: 4mn^2 + 8n^3
SVD with QR: (12 + 2/3)n^3 + n^2 + 1/3 n - 2
However, most SVD algorithms include some (R-)bidiagonalization, which reduces the work to 2mn^2 + 11n^3.
You can also apply QR, then the R-bidiagonalization, and then the SVD to make it even faster, but it all depends on your matrix dimensions.
Matlab uses the LAPACK libraries for its SVD. You can look up the exact runtimes in the LAPACK documentation; they are approximately the same as for the algorithm above.
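For concreteness, here is the same comparison sketched in NumPy (my example, not from the answer; B is wide so that B' is tall and Rc is small, and the columns of V are only determined up to sign):
import numpy as np

B = np.random.rand(50, 2000)  # m x n with n >> m

# Route 1: direct economy SVD of B
_, S1, Vt1 = np.linalg.svd(B, full_matrices=False)

# Route 2: thin QR of B', then SVD of the small 50 x 50 factor Rc
Qc, Rc = np.linalg.qr(B.T)    # B.T = Qc @ Rc
U, S2, _ = np.linalg.svd(Rc)
V = Qc @ U                    # right singular vectors of B

print(np.allclose(S1, S2))                                  # expect True
print(np.allclose(np.abs(np.sum(Vt1.T * V, axis=0)), 1.0))  # columns agree up to sign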
Hope this helps.

Cholesky factorization

Within a Matlab code of mine, I have to deal with the Cholesky factorization of a certain given matrix. I generally call chol(A,'lower') to generate the lower triangular factor.
Now, checking my code with the profiler, it is evident that the function chol is really time-consuming, especially if the size of the input matrix becomes large.
Therefore, I would like to know if there is any valuable alternative to the built-in chol function.
I have been thinking of the LAPACK library, namely its spptrf function. Is it available in MATLAB or not?
Any hint or support is more than welcome.
EDIT
Just as an example, the profiler shows (screenshot omitted) that chol dominates the run time, where Coh_u has size 1395 x 1395. It should also be remarked that chol is called 4000 times, since I need the Cholesky factor for 4000 different configurations.
I'm not sure what version of Matlab you are using, but I found a discussion suggesting that in older versions Cholesky factorization was very slow, as you're describing.
One of the answers there says to use the CHOLMOD package from SuiteSparse, which has a chol2 function that is supposed to be faster.
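As a side note, the underlying LAPACK routines are also exposed directly in other environments, e.g. SciPy; here is a sketch (my own, with a synthetic SPD matrix) calling dpotrf, which skips some of the input checking a high-level wrapper performs:
import numpy as np
from scipy.linalg import cholesky
from scipy.linalg.lapack import dpotrf

n = 1395
M = np.random.rand(n, n)
A = np.dot(M, M.T) + n * np.eye(n)  # synthetic symmetric positive definite matrix

L1 = cholesky(A, lower=True)   # high-level wrapper
L2, info = dpotrf(A, lower=1)  # raw LAPACK call; info == 0 means success

print(info == 0 and np.allclose(L1, L2))  # expect True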
Can you confirm whether the correct expression for Coh_u is
a) Coh_u = exp(-a.*sqrt((f(ii)/Uhub).^2 + (0.12/Lc).^2)).*(df.*psd(ii,1));
or
b) Coh_u = exp(-a.*dist*sqrt((f(ii)/Uhub).^2 + (0.12/Lc).^2)).*(df.*psd(ii,1));
The difference between a) and b) is that in b) dist has been added, which is the distance between the two matrices Y and Z, such that
dist = pdist2([Y(:) Z(:)],[Y(:) Z(:)]);
But it leads to the "Matrix not positive definite" error from the chol() function.