This question might be too broad to be posted here but I'll try to be as specific as possible. If you still consider it to be too broad, I'll simply delete it.
Have a look at the EDIT at the bottom for my final thoughts on the subject.
Also have a look at Ander Biguri's answer if you have access to the Parallel Computing Toolbox and have an NVIDIA GPU.
My problem :
I'm solving dynamic equations by using a Newmark scheme (2nd order implicit), which involves solving a lot of linear systems of the form A*x=b for x.
I've already optimized all the code that doesn't involve solving linear systems. As it stands now, solving the linear systems takes up to 70% of the calculation time in the process.
I've thought of using MATLAB's linsolve, but my matrix A doesn't have any of the properties that could be passed as the opts input to linsolve.
The idea :
As seen in the documentation of linsolve :
If A has the properties in opts, linsolve is faster than mldivide,
because linsolve does not perform any tests to verify that A has the
specified properties
As far as I know, by using mldivide, MATLAB will use LU decomposition, as my matrix A doesn't have any specific property except for being square.
My question :
So I'm wondering if I'd gain some time by first decomposing A using MATLAB's lu, and then feeding the factors to linsolve in order to solve x = U\(L\b), with opts set to upper and lower triangular respectively.
That way I'd prevent MATLAB from doing all the property checking that takes place during the mldivide process.
Note: I'm absolutely not expecting a huge time gain. But on calculations that take up to a week, even 2% matters.
Now why don't I try this myself, you may ask? Well, I've got calculations running until Tuesday approximately, and I wanted to ask whether someone has already tried this and gained time by getting rid of the overhead of matrix property checking in mldivide.
Toy example :
A=randn(2500);
% Getting A to be non singular
A=A.'*A;
x_=randn(2500,1);
b=A*x_;
clear x_
% Case 1 : mldivide
tic
for ii=1:100
x=A\b;
end
out=toc;
disp(['Case 1 time per iteration :' num2str((out)/100)]);
% Case 2 : LU+linsolve
opts1.LT=true;
opts2.UT=true;
tic;
for ii=1:100
[L,U,P]=lu(A); % with only two outputs, L would be a row-permuted (not truly lower triangular) factor, so request P explicitly
% These calls could be replaced by U\(L\(P*b)), as mldivide checks for triangularity first
Tmp=linsolve(L,P*b,opts1);
x=linsolve(U,Tmp,opts2);
end
out2=toc;
disp(['Case 2 time per iteration :' num2str((out2)/100)]);
EDIT
So I just had the possibility to try a few things.
I missed earlier in the documentation of linsolve that if you don't specify any opts input, it defaults to using the LU solver, which is what I want. Doing a bit of time testing with it (and taking into account @rayryeng's remark to "timeit that bad boy"), it saves around 2-3% of processing time compared to mldivide, as shown below. It's not a huge deal in terms of time gain, but it's non-negligible on calculations that take up to a week.
timeit results on a 1626*1626 linear system:
mldivide :
t1 =
0.102149773097083
linsolve :
t2 =
0.099272037768204
relative : 0.028171725121151
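For reference, the timing was done along these lines (sketch only; the real A and b come from the Newmark scheme, here a stand-in matrix of the same size is used):
A = randn(1626); A = A.'*A;        % stand-in non-singular system, same size as the real one
b = randn(1626,1);
t1 = timeit(@() A\b);              % mldivide
t2 = timeit(@() linsolve(A,b));    % linsolve without opts (defaults to its LU solver)
relative = (t1 - t2)/t1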
I know you do not have an NVIDIA GPU and the Parallel Computing Toolbox, but if you had, this would work:
If you replace the second test in your code by:
tic;
for ii=1:10
A2=gpuArray(A); % so we account for memory management
b2=gpuArray(b);
x=A2\b2;
end
out2=toc;
My PC says (CPU vs GPU)
Case 1 time per iteration :0.011881
Case 2 time per iteration :0.0052003
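If you then need the result back on the CPU, or want stricter timing (gpuArray operations run asynchronously), something like this sketch would do it:
wait(gpuDevice); % make sure all queued GPU work has finished before stopping the timer
x=gather(x);     % copy the solution back to host memory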
First off, I'm not sure if this is the best place to post this, but since there isn't a dedicated MATLAB community I'm posting it here.
To give a little background, I'm currently prototyping a plasma physics simulation which involves a triple integration. The innermost integral can be done analytically, but for the outer two this is just impossible. I always thought it best to work with values close to unity and thus normalized my innermost integral so that it is unitless and usually takes values close to unity. However, compared to an earlier version of the code, where this innermost integral evaluated to values of the order of 1e-50, the numerical double integration, which uses the native MATLAB function integral2 with a target relative tolerance of 1e-6, now requires around 1000 times more function evaluations to converge. As a consequence my simulation now takes roughly 12 hours instead of the previous 20 minutes.
Question
So my questions are:
Is it possible that the faster convergence in the older version is simply due to the additional evaluations vanishing as roundoff errors, and that the results thus aren't trustworthy even though they pass the 1e-6 relative tolerance? In the few tests I ran, the results seemed to be the same in both versions though.
What is the best practice concerning the normalization of the integrand for numerical integration?
Is there some way to improve the convergence of numerical integrals, especially if the integrand might have singularities?
I'm thankful for any help or insight, especially since I don't fully understand the inner workings of MATLAB's integral2 function and what should be paid attention to when using it.
If I didn't know any better, I would actually conclude that an integrand of the order of 1e-50 works way better than one of, say, the order of 1e+0, but that doesn't seem to make sense. Is there some numerical reason why this could actually be the case?
TL;DR: when the function to be integrated numerically by MATLAB's integral2 is multiplied by a factor of 1e-50 and the result is in turn multiplied by 1e+50, the integral gives the same result but converges way faster, and I don't understand why.
edit:
I prepared a short script to illustrate the problem. Here the relative difference between the two results was of the order of 1e-4, which already exceeds the requested relative tolerance of integral2. In my original problem, however, the difference was even smaller.
fun = @(x,y,l) l./(sqrt(1-x.*cos(y)).^5).*((1-x).*sin(y));
x = linspace(0,1,101);
y = linspace(0,pi,101).';
figure
surf(x,y,fun(x,y,1));
l = linspace(0,1,101); l=l(2:end);
v1 = zeros(1,100); v2 = v1;
tval = tic;
for i=1:100
fun1 = @(x,y) fun(x,y,l(i));
v1(i) = integral2(fun1,0,1,0,pi,'RelTol',1e-6);
end
t1 = toc(tval)
tval = tic;
for i=1:100
fun1 = @(x,y) 1e-50*fun(x,y,l(i));
v2(i) = 1e+50*integral2(fun1,0,1,0,pi,'RelTol',1e-6);
end
t2 = toc(tval)
figure
hold all;
plot(l,v1);
plot(l,v2);
plot(l,abs((v2-v1)./v1));
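One more thing worth trying (untested sketch): integral2 stops as soon as either the relative or the absolute tolerance is met (the default AbsTol is 1e-10), so pinning AbsTol explicitly makes the scaled and unscaled runs comparable, and the 'iterated' method is what the documentation suggests for non-smooth integrands with discontinuities in the integration region:
fun1 = @(x,y) fun(x,y,l(end));
v = integral2(fun1,0,1,0,pi,'Method','iterated','AbsTol',1e-12,'RelTol',1e-6);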
I have a linear system of about 2000 sparse equations in Matlab. For my final result, I only really need the value of one of the variables: the other values are irrelevant. While there is no real problem in simply solving the equations and extracting the correct variable, I was wondering whether there was a faster way or Matlab command. For example, as soon as the required variable is calculated, the program could in principle stop running.
Is there anyone who knows whether this is at all possible, or if it would just be easier to keep solving the entire system?
Most of the computation time is spent factorizing the matrix; if we can find a way to avoid a full solve of the system, we may be able to improve the computation time. Let's assume I'm only interested in the solution for the last variable, x(N). Using the standard method we compute
x = A\b;
res = x(N);
Assuming A is full rank, we can instead use the LU decomposition of the augmented matrix [A b] to get x(N), which looks like this:
[~,U] = lu([A b]);
res = U(end,end)/U(end,end-1);
This is essentially performing Gaussian elimination and then solving for x(N) using back-substitution.
We can extend this to find any value of x by swapping the columns of A before LU decomposition,
x_index = 123; % the index of the solution we are interested in
A(:,[x_index,end]) = A(:,[end,x_index]);
[~,U] = lu([A b]);
res = U(end,end)/U(end,end-1);
Benchmarking performance in MATLAB R2017a with 10,000 random 200-dimensional systems, we get a slight speed-up:
Total time direct method : 4.5401s
Total time LU method : 3.9149s
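The benchmark was along these lines (sketch; exact timings will vary by machine):
N = 200; trials = 10000;
t_direct = 0; t_lu = 0;
for t = 1:trials
A = randn(N); b = randn(N,1);
tic; x = A\b; r1 = x(N); t_direct = t_direct + toc;
tic; [~,U] = lu([A b]); r2 = U(end,end)/U(end,end-1); t_lu = t_lu + toc;
end
fprintf('Total time direct method : %.4fs\n',t_direct);
fprintf('Total time LU method : %.4fs\n',t_lu);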
Note that you may experience some precision issues if A isn't well conditioned.
Also, this approach doesn't take advantage of the sparsity of A. In my experiments, even with 2000x2000 sparse matrices, everything slowed down significantly and the LU method ended up significantly slower. That said, the full matrix representation only requires about 30 MB, which shouldn't be a problem on most computers.
If you have access to the theory manuals for NASTRAN, I believe (from memory) there is coverage of partial solutions of linear systems. Also try looking for iterative or tridiagonal solvers for A*x = b. On this page, review the pqr solution answer by Shantachhani. Another reference.
I am having a hard time grasping how to count FLOPs. One moment I think I get it, and the next it makes no sense to me. Some help explaining this would greatly be appreciated. I have looked at all other posts about this topic and none have completely explained in a programming language I am familiar with (I know some MATLAB and FORTRAN).
Here is an example, from one of my books, of what I am trying to do.
For the following piece of code, the total number of flops can be written as (n*(n-1)/2)+(n*(n+1)/2) which is equivalent to n^2 + O(n).
[m,n]=size(A)
nb=n+1;
Aug=[A b];
x=zeros(n,1);
x(n)=Aug(n,nb)/Aug(n,n);
for i=n-1:-1:1
x(i) = (Aug(i,nb)-Aug(i,i+1:n)*x(i+1:n))/Aug(i,i);
end
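(In that formula, n*(n+1)/2 counts the multiplications and divisions and n*(n-1)/2 counts the additions and subtractions: row i needs n-i multiplications and n-i-1 additions for the dot product, plus one subtraction and one division, and there is one extra division for x(n).)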
I am trying to apply the same principle above to find the total number of FLOPs as a function of the number of equations n in the following code (MATLAB).
% e = subdiagonal vector
% f = diagonal vector
% g = superdiagonal vector
% r = right hand side vector
% x = solution vector
n=length(f);
% forward elimination
for k = 2:n
factor = e(k)/f(k-1);
f(k) = f(k) - factor*g(k-1);
r(k) = r(k) - factor*r(k-1);
end
% back substitution
x(n) = r(n)/f(n);
for k = n-1:-1:1
x(k) = (r(k)-g(k)*x(k+1))/f(k);
end
I'm by no means an expert at MATLAB but I'll have a go.
I notice that none of the lines of your code index ranges of your vectors. Good, that means that every operation I see before me involves a single pair of numbers. So I think the first loop is 5 FLOPs per iteration, and the second is 3 per iteration. And then there's that single operation in the middle.
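(Putting those together, the vector-level count alone is 5*(n-1) + 1 + 3*(n-1) = 8*n - 7 FLOPs, i.e. linear in n.)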
However, MATLAB stores everything by default as a double. So the loop variable k is itself being operated on once per loop and then every time an index is calculated from it. So that's an extra 4 for the first loop and 2 for the second.
But wait - the first loop has 'k-1' twice, so in theory one could optimise that a bit by calculating and storing that, reducing the number of FLOPs by one per iteration. The MATLAB interpreter is probably able to spot that sort of optimisation for itself. And for all I know it can work out that k could in fact be an integer and everything is still okay.
So the answer to your question is that it depends. Do you want to know the number of FLOPs the CPU does, or the minimum number expressed in your code (ie the number of operations on your vectors alone), or the strict number of FLOPs that MATLAB would perform if it did no optimisation at all? MATLAB used to have a flops() function to count this sort of thing, but it's not there anymore. I'm not an expert in MATLAB by any means, but I suspect that flops() has gone because the interpreter has gotten too clever and does a lot of optimisation.
I'm slightly curious to know why you wish to know. I used to use flops() to count how many operations a piece of maths did as a crude way of estimating how much computing grunt I'd need to make it work in real time written in C.
Nowadays I look at the primitives themselves (eg there's a 1k complex FFT, that'll be 7us on that CPU according to the library datasheet, there's a 2k vector multiply, that'll be 2.5us, etc). It gets a bit tricky because one has to consider cache speeds, data set sizes, etc. The maths libraries (eg fftw) themselves are effectively opaque so that's all one can do.
So if you're counting the FLOPs for that reason you'll probably not get a very good answer.
I am trying to implement a routine for fitting electrophoretic data from my experiments.
The aim is to derive kinetic parameters for the interaction of biomolecules from the relative areas of the peaks in the electropherogram.
Since all relevant differential equations are known and since the set of equations has an analytical solution, as described here:
Analytical solution manuscript
I set about entering the relevant equations (6, 8, 13, ... from the referenced manuscript) in MATLAB.
The function thus created works, and I can use it to simulate electropherograms of interacting species.
Obviously, I now would like to use the function to fit experimental data and retrieve the parameters (8 in total: Va, Vc, MUa, MUc, k, A0, C0, baseline noise).
Some of these will obviously be correlated. Example values might be (to give an idea of their magnitude):
params0 = [ ...
8.44E-02; ... % Va
1.25E-01; ... % Vc
5.32E-05; ... % MUa
8.87E-05; ... % MUc
4.48E-03; ... % k
6.06E-01; ... % A0
3.00E-00; ... % C0
4.64E-03 ... % noise
];
My problem is, if I supply experimental data and try something like lsqcurvefit:
[x,resnorm,residual] = lsqcurvefit(@(param,xdata) Electropherogram2(param,xdata,column), params0, time, ydata, lb, ub);
I often get very poor results because I either run out of iterations or hit some (obviously poorly fitting) local minimum.
Only if I tinker a lot with the starting values and the allowed intervals (i.e. because I know likely values from other experiments) do I end up with more or less decent fits, but even then the fits are not as good as those reported in the original manuscript (fig. 3).
The authors of that manuscript used the Excel solver and were kind enough to provide the original data used in Fig. 3, but I still cannot seem to end up with fits as good as theirs without all but supplying the correct starting values.
I am not experienced enough to know what I could tweak to make this process less trial-and-error.
Would something like the global optimization toolbox help me?
Any tips are welcome...
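In case it helps frame the question: if I went the Global Optimization Toolbox route, I imagine it would look roughly like this untested sketch, wrapping the lsqcurvefit call above in MultiStart (the 50 start points are an arbitrary choice):
problem = createOptimProblem('lsqcurvefit', ...
    'objective',@(param,xdata) Electropherogram2(param,xdata,column), ...
    'x0',params0,'lb',lb,'ub',ub,'xdata',time,'ydata',ydata);
ms = MultiStart;
[bestParams,resnorm] = run(ms,problem,50);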
In the mentioned paper ("Analytical solution manuscript") it is implied that there are only five free optimization parameters (Va, Vc, MUa, MUc, k) and not eight, because the Aeq/Ceq ratio can be computed from their respective equations: eq. 8 for Aeq and (obviously) eq. 6 for Ceq.
In my opinion, what's even more troubling is the appearance of the following products of free optimization parameters in the model:
k and Va in eq. 12
MUc and Va in the equation for epsilon_A in eq. 12
MUa and Vc in the equation for epsilon_A in eq. 12
In general, non-linear optimization algorithms have legitimate trouble optimizing free parameters when pairs of them appear as products in the non-linear model.
Sorry if this is obvious but I searched a while and did not find anything (or missed it).
I'm trying to solve linear systems of the form Ax=B with A a 4x4 matrix, and B a 4x1 vector.
I know that for a single system I can use mldivide to obtain x: x=A\B.
However I am trying to solve a great number of systems (possibly > 10000) and I am reluctant to use a for loop because I was told it is notably slower than matrix formulation in many MATLAB problems.
My question is then: is there a way to solve Ax=B using vectorization, with A a 4x4xN array and B a 4xN matrix?
PS: I do not know if it is important but the B vector is the same for all the systems.
You should use a for loop. There might be a benefit in precomputing a factorization and reusing it, if A stays the same and B changes. But for your problem where A changes and B stays the same, there's no alternative to solving N linear systems.
You shouldn't worry too much about the performance cost of loops either: the MATLAB JIT compiler means that loops can often be just as fast on recent versions of MATLAB.
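For completeness, factorization reuse for that opposite case (A fixed, many right-hand sides collected in a hypothetical 4-by-N matrix Bmany) would look something like this sketch:
[L,U,P] = lu(A); % factor once
X = zeros(4,size(Bmany,2));
for j = 1:size(Bmany,2)
X(:,j) = U\(L\(P*Bmany(:,j))); % two cheap triangular solves per right-hand side
end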
I don't think you can optimize this further. As explained by @Tom, since A is the one changing, there is no benefit in factoring the various A's beforehand...
Besides, the looped solution is pretty fast given the dimensions you mention:
A = rand(4,4,10000);
B = rand(4,1); %# same for all linear systems
tic
X = zeros(4,size(A,3));
for i=1:size(A,3)
X(:,i) = A(:,:,i)\B;
end
toc
Elapsed time is 0.168101 seconds.
Here's the problem:
you're trying to perform a 2D operation (mldivide) on a 3D array. No matter how you look at it, you need to reference the matrix by index, which is where the time penalty kicks in... it's not the for loop that's the problem, but how people use it.
If you can structure your problem differently, then perhaps you can find a better option, but right now you have a few options:
1 - mex
2 - parallel processing (write a parfor loop)
3 - CUDA
Here's a rather esoteric solution that takes advantage of MATLAB's peculiar optimizations. Construct an enormous 4k-by-4k sparse matrix (where k is the number of 4x4 systems) with your 4x4 blocks down the diagonal, then solve all of them simultaneously.
On my machine this gets the same solution up to single precision accuracy as @Amro/@Tom's for-loop solution, but faster.
n = size(A,1);   % block size (4)
k = size(A,3);   % number of systems
% Stack the k blocks vertically: AS((j-1)*n+i, c) = A(i,c,j)
AS = reshape(permute(A,[1 3 2]),n*k,n);
% Row/column index patterns that place each 4x4 block on the diagonal of S
S = sparse( ...
repmat(1:n*k,n,1)', ...
bsxfun(@plus,reshape(repmat(1:n:n*k,n,1),[],1),0:n-1), ...
AS, ...
n*k,n*k);
% Solve all k systems at once and unstack the solutions
X = reshape(S\repmat(B,k,1),n,k);
for a random example:
For k = 10000
For loop: 0.122570 seconds.
Giant sparse system: 0.032287 seconds.
If you know that your 4x4 matrices are positive definite then you can use chol on S to improve the accuracy.
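For example (sketch, reusing n, k, B, and S from above; R is upper triangular with S = R'*R):
R = chol(S);
X = reshape(R\(R'\repmat(B,k,1)),n,k);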
This is silly. But so is how slow MATLAB's for loops still are in 2015, even with the JIT. This solution seems to find a sweet spot when k is not too large, so everything still fits into memory.
I know this post is years old now, but I'll contribute my two cents anyway. You CAN put all of your A matrices into a bigger block-diagonal matrix, where there will be 4x4 blocks on the diagonal of a big matrix. The right-hand side will be all of your b vectors stacked on top of each other over and over. Once you set this up, it is represented as a sparse system and can be solved efficiently with the algorithms mldivide chooses. The blocks are numerically decoupled, so even if there are singular blocks in there, the answers for the non-singular blocks should be right when you use mldivide. There is a code that took this approach on MATLAB Central:
http://www.mathworks.com/matlabcentral/fileexchange/24260-multiple-same-size-linear-solver
I suggest experimenting to see whether this approach is any faster than looping. I suspect it can be more efficient, especially for large numbers of small systems. In particular, if there are nice formulas for the coefficients of A across the N matrices, you can build the full left-hand side using MATLAB vector operations (without looping), which could give you additional cost savings. As others have noted, vectorized operations aren't always faster, but they often are in my experience.