I need to perform lots of evaluations of the form
X(:,i)' * A * X(:,i) i = 1...n
where X(:,i) is a vector and A is a symmetric matrix. Ostensibly, I can either do this in a loop
for i=1:n
z(i) = X(:,i)' * A * X(:,i)
end
which is slow, or vectorise it as
z = diag(X' * A * X)
which wastes RAM unacceptably when X has a lot of columns. Currently I am compromising on
Y = A * X
for i=1:n
z(i) = Y(:,i)' * X(:,i)
end
which is a little faster/lighter but still seems unsatisfactory.
I was hoping there might be some matlab/scilab idiom or trick to achieve this result more efficiently?
Try this in MATLAB:
z = sum(X.*(A*X));
This gives results equivalent to Federico's suggestion using the function DOT, but should run slightly faster. This is because the DOT function internally computes the result the same way as I did above using the SUM function. However, DOT also has additional input argument checks and extra computation for cases where you are dealing with complex numbers, which is extra overhead you probably don't want or need.
A note on computational efficiency:
Even though the time difference is small between how fast the two methods run, if you are going to be performing the operation many times over it's going to start to add up. To test the relative speeds, I created two 100-by-100 matrices of random values and timed the two methods over many runs to get an average execution time:
METHOD AVERAGE EXECUTION TIME
--------------------------------------------
Z = sum(X.*Y); 0.0002595 sec
Z = dot(X,Y); 0.0003627 sec
Using SUM instead of DOT therefore reduces the execution time of this operation by about 28% for matrices with around 10,000 elements. The larger the matrices, the more negligible this difference will be between the two methods.
To summarize, if this computation represents a significant bottleneck in how fast your code is running, I'd go with the solution using SUM. Otherwise, either solution should be fine.
Try this:
z = dot(X, A*X)
I don't have Matlab here to test, but it works on Octave, so I expect Matlab to have an analogous dot() function.
From Octave's help:
-- Function File: dot (X, Y, DIM)
Computes the dot product of two vectors. If X and Y are matrices,
calculate the dot-product along the first non-singleton dimension.
If the optional argument DIM is given, calculate the dot-product
along this dimension.
For completeness, gnovice's answer in Scilab would be
z = sum(X .* Y, 1)'
Related
I'm solving a pair of non-linear equations for each voxel in a dataset of a ~billion voxels using fsolve() in MATLAB 2016b.
I have done all the 'easy' optimizations that I'm aware of. Memory localization is OK, I'm using parfor, the equations are in fairly numerically simple form. All discontinuities of the integral are fed to integral(). I'm using the Levenberg-Marquardt algorithm with good starting values and a suitable starting damping constant, it converges on average with 6 iterations.
I'm now at ~6ms per voxel, which is good, but not good enough. I'd need a order of magnitude reduction to make the technique viable. There's only a few things that I can think of improving before starting to hammer away at accuracy:
The splines in the equation are for quick sampling of complex equations. There are two for each equation, one is inside the 'complicated nonlinear equation'. They represent two equations, one which is has a large amount of terms but is smooth and has no discontinuities and one which approximates a histogram drawn from a spectrum. I'm using griddedInterpolant() as the editor suggested.
Is there a faster way to sample points from pre-calculated distributions?
parfor i=1:numel(I1)
sols = fsolve(#(x) equationPair(x, input1, input2, ...
6 static inputs, fsolve options)
output1(i) = sols(1); output2(i) = sols(2)
end
When calling fsolve, I'm using the 'parametrization' suggested by Mathworks to input the variables. I have a nagging feeling that defining a anonymous function for each voxel is taking a large slice of the time at this point. Is this true, is there a relatively large overhead for defining the anonymous function again and again? Do I have any way to vectorize the call to fsolve?
There are two input variables which keep changing, all of the other input variables stay static. I need to solve one equation pair for each input pair so I can't make it a huge system and solve it at once. Do I have any other options than fsolve for solving pairs of nonlinear equations?
If not, some of the static inputs are the fairly large. Is there a way to keep the inputs as persistent variables using MATLAB's persistent, would that improve performance? I only saw examples of how to load persistent variables, how could I make it so that they would be input only once and future function calls would be spared from the assumedly largish overhead of the large inputs?
EDIT:
The original equations in full form look like:
Where:
and:
Everything else is known, solving for x_1 and x_2. f_KN was approximated by a spline. S_low (E) and S_high(E) are splines, the histograms they are from look like:
So, there's a few things I thought of:
Lookup table
Because the integrals in your function do not depend on any of the parameters other than x, you could make a simple 2D-lookup table from them:
% assuming simple (square) range here, adjust as needed
[x1,x2] = meshgrid( linspace(0, xmax, N) );
LUT_high = zeros(size(x1));
LUT_low = zeros(size(x1));
for ii = 1:N
LUT_high(:,ii) = integral(#(E) Fhi(E, x1(1,ii), x2(:,ii)), ...
0, E_high, ...
'ArrayValued', true);
LUT_low(:,ii) = integral(#(E) Flo(E, x1(1,ii), x2(:,ii)), ...
0, E_low, ...
'ArrayValued', true);
end
where Fhi and Flo are helper functions to compute those integrals, vectorized with scalar x1 and vector x2 in this example. Set N as high as memory will allow.
Those lookup tables you then pass as parameters to equationPair() (which allows parfor to distribute the data). Then just use interp2 in equationPair():
F(1) = I_high - interp2(x1,x2,LUT_high, x(1), x(2));
F(2) = I_low - interp2(x1,x2,LUT_low , x(1), x(2));
So, instead of recomputing the whole integral every time, you evaluate it once for the expected range of x, and reuse the outcomes.
You can specify the interpolation method used, which is linear by default. Specify cubic if you're really concerned about accuracy.
Coarse/Fine
Should the lookup table method not be possible for some reason (memory limitations, in case the possible range of x is too big), here's another thing you could do: split up the whole procedure in 2 parts, which I'll call coarse and fine.
The intent of the coarse method is to improve your initial estimates really quickly, but perhaps not so accurately. The quickest way to approximate that integral by far is via the rectangle method:
do not approximate S with a spline, just use the original tabulated data (so S_high/low = [S_high/low#E0, S_high/low#E1, ..., S_high/low#E_high/low]
At the same values for E as used by the S data (E0, E1, ...), evaluate the exponential at x:
Elo = linspace(0, E_low, numel(S_low)).';
integrand_exp_low = exp(x(1)./Elo.^3 + x(2)*fKN(Elo));
Ehi = linspace(0, E_high, numel(S_high)).';
integrand_exp_high = exp(x(1)./Ehi.^3 + x(2)*fKN(Ehi));
then use the rectangle method:
F(1) = I_low - (S_low * Elo) * (Elo(2) - Elo(1));
F(2) = I_high - (S_high * Ehi) * (Ehi(2) - Ehi(1));
Running fsolve like this for all I_low and I_high will then have improved your initial estimates x0 probably to a point close to "actual" convergence.
Alternatively, instead of the rectangle method, you use trapz (trapezoidal method). A tad slower, but possibly a bit more accurate.
Note that if (Elo(2) - Elo(1)) == (Ehi(2) - Ehi(1)) (step sizes are equal), you can further reduce the number of computations. In that case, the first N_low elements of the two integrands are identical, so the values of the exponentials will only differ in the N_low + 1 : N_high elements. So then just compute integrand_exp_high, and set integrand_exp_low equal to the first N_low elements of integrand_exp_high.
The fine method then uses your original implementation (with the actual integral()s), but then starting at the updated initial estimates from the coarse step.
The whole objective here is to try and bring the total number of iterations needed down from about 6 to less than 2. Perhaps you'll even find that the trapz method already provides enough accuracy, rendering the whole fine step unnecessary.
Vectorization
The rectangle method in the coarse step outlined above is easy to vectorize:
% (uses R2016b implicit expansion rules)
Elo = linspace(0, E_low, numel(S_low));
integrand_exp_low = exp(x(:,1)./Elo.^3 + x(:,2).*fKN(Elo));
Ehi = linspace(0, E_high, numel(S_high));
integrand_exp_high = exp(x(:,1)./Ehi.^3 + x(:,2).*fKN(Ehi));
F = [I_high_vector - (S_high * integrand_exp_high) * (Ehi(2) - Ehi(1))
I_low_vector - (S_low * integrand_exp_low ) * (Elo(2) - Elo(1))];
trapz also works on matrices; it will integrate over each column in the matrix.
You'd call equationPair() then using x0 = [x01; x02; ...; x0N], and fsolve will then converge to [x1; x2; ...; xN], where N is the number of voxels, and each x0 is 1×2 ([x(1) x(2)]), so x0 is N×2.
parfor should be able to slice all of this fairly easily over all the workers in your pool.
Similarly, vectorization of the fine method should also be possible; just use the 'ArrayValued' option to integral() as shown above:
F = [I_high_vector - integral(#(E) S_high(E) .* exp(x(:,1)./E.^3 + x(:,2).*fKN(E)),...
0, E_high,...
'ArrayValued', true);
I_low_vector - integral(#(E) S_low(E) .* exp(x(:,1)./E.^3 + x(:,2).*fKN(E)),...
0, E_low,...
'ArrayValued', true);
];
Jacobian
Taking derivatives of your function is quite easy. Here is the derivative w.r.t. x_1, and here w.r.t. x_2. Your Jacobian will then have to be a 2×2 matrix
J = [dF(1)/dx(1) dF(1)/dx(2)
dF(2)/dx(1) dF(2)/dx(2)];
Don't forget the leading minus sign (F = I_hi/lo - g(x) → dF/dx = -dg/dx)
Using one or both of the methods outlined above, you can implement a function to compute the Jacobian matrix and pass this on to fsolve via the 'SpecifyObjectiveGradient' option (via optimoptions). The 'CheckGradients' option will come in handy there.
Because fsolve usually spends the vast majority of its time computing the Jacobian via finite differences, manually computing a value for it manually will normally speed the algorithm up tremendously.
It will be faster, because
fsolve doesn't have to do extra function evaluations to do the finite differences
the convergence rate will increase due to the improved precision of the Jacobian
Especially if you use the rectangle method or trapz like above, you can reuse many of the computations you've already done for the function values themselves, meaning, even more speed-up.
Rody's answer was the correct one. Supplying the Jacobian was the single largest factor. Especially with the vectorized version, there were 3 orders of magnitude of difference in speed with the Jacobian supplied and not.
I had trouble finding information about this subject online so I'll spell it out here for future reference: It is possible to vectorize independant parallel equations with fsolve() with great gains.
I also did some work with inlining fsolve(). After supplying the Jacobian and being smarter about the equations, the serial version of my code was mostly overhead at ~1*10^-3 s per voxel. At that point most of the time inside the function was spent passing around a options -struct and creating error-messages which are never sent + lots of unused stuff assumedly for the other optimization functions inside the optimisation function (levenberg-marquardt for me). I succesfully butchered the function fsolve and some of the functions it calls, dropping the time to ~1*10^-4s per voxel on my machine. So if you are stuck with a serial implementation e.g. because of having to rely on the previous results it's quite possible to inline fsolve() with good results.
The vectorized version provided the best results in my case, with ~5*10^-5 s per voxel.
I'm trying to code a MATLAB program and I have arrived at a point where I need to do the following. I have this equation:
I must find the value of the constant "Xcp" (greater than zero), that is the value that makes the integral equal to zero.
In order to do so, I have coded a loop in which the the value of Xcp advances with small increments on each iteration and the integral is performed and checked if it's zero, if it reaches zero the loop finishes and the Xcp is stored with this value.
However, I think this is not an efficient way to do this task. The running time increases a lot, because this loop is long and has the to perform the integral and the integration limits substitution every time.
Is there a smarter way to do this in Matlab to obtain a better code efficiency?
P.S.: I have used conv() to multiply both polynomials. Since cl(x) and (x-Xcp) are both polynomials.
EDIT: Piece of code.
p = [1 -Xcp]; % polynomial (x-Xcp)
Xcp=0.001;
i=1;
found=false;
while(i<=x_te && found~=true) % Xcp is upper bounded by x_te
int_cl_p = polyint(conv(cl,p));
Cm_cp=(-1/c^2)*diff(polyval(int_cl_p,[x_le,x_te]));
if(Cm_cp==0)
found=true;
else
Xcp=Xcp+0.001;
end
end
This is the code I used to run this section. Another problem is that I have to do it for different cases (different cl functions), for this reason the code is even more slow.
As far as I understood, you need to solve the equation for X_CP.
I suggest using symbolic solver for this. This is not the most efficient way for large polynomials, but for polynomials of degree 20 it takes less than 1 second. I do not claim that this solution is fastest, but this provides generic solution to the problem. If your polynomial does not change every iteration, then you can use this generic solution many times and not spend time for calculating integral.
So, generic symbolic solution in terms of xLE and xTE is obtained using this:
syms xLE xTE c x xCP
a = 1:20;
%//arbitrary polynomial of degree 20
cl = sum(x.^a.*randi([-100,100],1,20));
tic
eqn = -1/c^2 * int(cl * (x-xCP), x, xLE, xTE) == 0;
xCP = solve(eqn,xCP);
pretty(xCP)
toc
Elapsed time is 0.550371 seconds.
You can further use matlabFunction for finding the numerical solutions:
xCP_numerical = matlabFunction(xCP);
%// we then just plug xLE = 10 and xTE = 20 values into function
answer = xCP_numerical(10,20)
answer =
19.8038
The slight modification of the code can allow you to use this for generic coefficients.
Hope that helps
If you multiply by -1/c^2, then you can rearrange as
and integrate however you fancy. Since c_l is a polynomial order N, if it's defined in MATLAB using the usual notation for polyval, where coefficients are stored in a vector a such that
then integration is straightforward:
MATLAB code might look something like this
int_cl_p = polyint(cl);
int_cl_x_p = polyint([cl 0]);
X_CP = diff(polyval(int_cl_x_p,[x_le,x_te]))/diff(polyval(int_cl_p,[x_le,x_te]));
I have a cell array myBasis of sparse matricies B_1,...,B_n.
I want to evaluate with Matlab the matrix Q(i,j) = trace (B^T_i * B_j).
Therefore, I wrote the following code:
for i=1:n
for j=1:n
B=myBasis{i};
C=myBasis{j};
Q(i,j)=trace(B'*C);
end
end
Which takes already 68 seconds when n=1226 and B_i has 50 rows, and 50 colums.
Is there any chance to speed this up? Usually I exclude for-loops from my matlab code in a c++ file - but I have no experience how to handle a sparse cell array in C++.
As noted by Inox Q is symmetric and therefore you only need to explicitly compute half the entries.
Computing trace( B.'*C ) is equivalent to B(:).'*C(:):
trace(B.'*C) = sum_i [B.'*C]_ii = sum_i sum_j B_ij * C_ij
which is the sum of element-wise products and therefore equivalent to B(:).'*C(:).
When explicitly computing trace( B.'*C ) you are actually pre-computing all k-by-k entries of B.'*C only to use the diagonal later on. AFAIK, Matlab does not optimize its calculation to save it from computing all the entries.
Here's a way
for ii = 1:n
B = myBasis{ii};
for jj = ii:n
C = myBasis{jj};
t = full( B(:).'*C(:) ); % equivalent to trace(B'*C)!
Q(ii,jj) = t;
Q(jj,ii) = t;
end
end
PS,
It is best not to use i and j as variable names in Matlab.
PPS,
You should notice that ' operator in Matlab is not matrix transpose, but hermitian conjugate, for actual transpose you need to use .'. In most cases complex numbers are not involved and there is no difference between the two operators, but once complex data is introduced, confusing between the two operators makes debugging quite a mess...
Well, a couple of thoughts
1) Basic stuff: A'*B = (B'*A)' and trace(A) = trace(A'). Well, only this trick cut your calculations by almost 50%. Your Q(i,j) matrix is symmetric, and you only need to calculate n(n+1)/2 terms (and not n²)
2) To calculate the trace you don't need to calculate every term of B'*C, just the diagonal. Nevertheless, I don't know if it's easy to create a script in Matlab that is actually faster then just calculating B'*C (MatLab is pretty fast with matrix operations).
But I would definitely implement (1)
I'm looking to duplication a 784x784 matrix in matlab along a 3rd axis. The following code seems to work:
mat = reshape(repmat(mat, 1,10000),784,784,10000);
Unfortunately, it takes so long to run it's worthless (changing the 10,000s to 1000 makes it take a few minutes, and using 10,000 makes my whole machine freeze up practically). is there a faster way to do this?
For reference, I'm looking to use mvnpdf on 10,000 vectors each of length 784, using the same covariance matrix for each. So my final call looks like
mvnpdf(X,mu,mat)
%size(X) = (10000,784), size(mu) = (10000,784), size(mat) = 784,784,10000
If there's a way to do this that's not repeating the covariance matrix 10,000 times, that'd be helpful too. Thanks!
For replication in more than 2 dimensions, you need to supply the replication counts as an array:
out = repmat(mat,[1,1,10000])
Creating a 784x784 matrix 10,000 times isn't going to take advantage of the vectorization in MATLAB, which is going to be more useful for small arrays. Avoiding a for loop also won't help too much, given the following:
The main speedup you can gain here is by computing the inverse of the covariance matrix once, and then computing the pdf yourself. The inverse of sigma takes O(n^3), and you are needlessly doing that 10,000 times. (Also, the square root determinant can be precomputed.) For reference, the PDF of the multivariate normal distribution is computed as follows:
http://en.wikipedia.org/wiki/Multivariate_normal_distribution#Properties
Better to just compute the inverse once, and then compute z = x - mu for each value, then doing z'Sz for each pdf value, and applying a simple function and a constant. But wait! You can vectorize that, too.
I don't have MATLAB in front of me, but this is basically what you need to do, and it'll run in an instant.
s = inv(sigma);
c = -0.5*log(det(s)) - (k/2)*log(2*pi);
z = x - mu; % 10000 x 784 matrix
ans = exp( c - 0.5 .* dot(z*s, z, 2) ); % 10000 x 1 vector
I do not have much experience with Matlab, which is why I am asking this question here to get some directions to start looking.
I have this code:
A = A - A(:,i)*(A(i,:)/(delta + A(i,i)));
The A matrix is 216x31285, which means this computation is quite expensive for the number of times it is executed. It executes for all rows (216) for each data set (28) so naturally the 0.192 seconds it takes is quite a bit. Any ideas on how I can speed this up?
The easiest way to speed up code in MATLAB is to take advantage of vectorization to get rid of loops. Instead of looping each line or column, attempt to apply whole operations instead, e.g., to multiply all lines of a matrix M with appropriate vector V do:
M = magic(5)
V = rand(5,1)
M = M.*repmat(V,[size(M,1) 1])
will in general work much faster than the equivalent for loop.
The actual implementation of vectorization is specific to each problem but very useful operators are the elementwise operators, e.g.: .* ./ .^, etc. Also the repmat function is extremely useful.
In your case however you are applying a recursive operation to matrix A:
A = A - f(i) = A - (prevA - f(i-1)) = ...
which means that you cannot apply all iterations at once as you would normally do when vectorizing code.
In other words, at each iteration i, matrix A depends on matrix A at the previous iteration and therefore it is impossible to operate all iterations at once using the equation you provide.
An addendum to #Pedro's answer. If you have such a recursive equation you can usually trade space for time.
In the first step you update a matrix, say B (following Pedro's notation):
B = A - f(i)
in the second step you update A:
A = B - f(i)
in the third step, well that's the same as the first step. And so on.