How to compare algorithms solved with lsqnonlin() - MATLAB

I have multiple algorithms trying to solve the same problem using lsqnonlin. The last 3 have one parameter fixed. How do I read the output here?
a) What does FunCount mean?
b) Does a lower step size mean a better result?
c) If FirstOrderOpt is closer to 0, is it a better result?
Algo   Iterations   FunCount   StepSize   FirstOrderOpt
1      10           69         4.00E-10   3.00E-07
2      10           68         2.00E-09   2.00E-07
3      12           65         6.00E-11   1.00E-08
4      10           69         4.00E-10   3.00E-07
5      10           68         2.00E-09   2.00E-07
6      12           65         6.00E-11   1.00E-08

From the documentation, the fields of the lsqnonlin output structure are:
Field name      Meaning
firstorderopt   Measure of first-order optimality
iterations      Number of iterations taken
funcCount       Number of function evaluations
cgiterations    Total number of PCG iterations (trust-region-reflective algorithm only)
stepsize        Final displacement in x
algorithm       Optimization algorithm used
message         Exit message
Specifically addressing your questions,
a) funcCount is the number of times your input function was evaluated to obtain the result.
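For reference, here is a minimal sketch (with a made-up one-term exponential residual and synthetic data, for illustration only) showing where those output fields come from:
xdata = (0:0.1:1)';                                  % synthetic data (assumption, just for the example)
ydata = 2*exp(-1.5*xdata);
resFun = @(p) p(1)*exp(-p(2)*xdata) - ydata;         % residual vector; lsqnonlin minimises sum(resFun(p).^2)
[p, resnorm, residual, exitflag, output] = lsqnonlin(resFun, [1; 1]);
output.funcCount                                     % number of times resFun was evaluated
output.firstorderopt                                 % first-order optimality measure at the solution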
b) As you approach the optimal solution, a smaller step size may be needed to avoid "jumping" straight over it. It isn't really a good measure of a "better result": you should expect it to be small (otherwise you could be skipping over optima), but not so small that you get into the noise of numerical-precision errors within your function.
A smaller stepsize will also slow the solver down, and likely lead to more iterations - you can see this reflected in your results table as the two rows with stepsize of order 1e-11 have more iterations than the others.
The StepSize is somewhat problem dependent; the related MathWorks documentation on Tolerances and Stopping Criteria may be helpful.
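If you want to experiment with this stopping criterion, the step tolerance can be set explicitly (called StepTolerance in current releases, TolX in older ones). A small sketch, with resFun and p0 standing in for your own residual function and start point:
opts = optimoptions('lsqnonlin', 'StepTolerance', 1e-10);    % documented lsqnonlin option
[p, ~, ~, ~, output] = lsqnonlin(resFun, p0, [], [], opts);  % lb and ub left empty here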
c) Please read the MathWorks documentation on the First Order Optimality Measure.
First-order optimality is a measure of how close a point x is to optimal. Most Optimization Toolbox™ solvers use this measure, though it has different definitions for different algorithms. First-order optimality is a necessary condition, but it is not a sufficient condition. In other words:
The first-order optimality measure must be zero at a minimum.
A point with first-order optimality equal to zero is not necessarily a minimum.
So a smaller FirstOrderOpt indicates a better result, but it does not by itself tell you how far you are from the true optimum - if we had that, we would likely already know the true answer without the need for an optimiser!
There is an OptimalityTolerance option within lsqnonlin, so you have control over how small the first order optimality must be for the solver to stop. Again, please see the docs.
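For example (a hedged sketch; resFun and p0 again stand in for your own problem):
opts = optimoptions('lsqnonlin', 'OptimalityTolerance', 1e-10);  % stop once first-order optimality drops below 1e-10
[p, ~, ~, ~, output] = lsqnonlin(resFun, p0, [], [], opts);
output.firstorderopt                                             % at or below the tighter tolerance if the solver converged normally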

Related

Explain the intuition for the tol parameter in scipy differential evolution

I am using the differential evolution optimizer in scipy and I don't understand the intuition behind the tol argument. Specifically, the documentation says:
tol: float, optional
When the mean of the population energies, multiplied by tol, divided
by the standard deviation of the population energies is greater than 1
the solving process terminates:
convergence = mean(pop) * tol / stdev(pop) > 1
What does setting tol represent from a user perspective?
Maybe the formula in the documentation is easier to understand in the following form (see lines 508 and 526 in the code):
std(population_energies) / mean(population_energies) < tol
It means that convergence is reached when the standard deviation of the energies for each individual in the population, normed by the average, is smaller than the given tolerance value.
The optimization algorithm is iterative. At every iteration a better solution is found. The tolerance parameter is used to define a stopping condition. The stopping condition is actually that all the individuals (parameter sets) have approximately the same energy, i.e. the same cost function value. Then, the parameter set giving the lowest energy is returned as the solution.
It also implies that all the individuals are relatively close to each other in the parameter space. So, no better solution can be expected on the following generations.
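As a quick numeric illustration of the criterion (hypothetical final-generation energies; written in MATLAB only because that is the language used elsewhere on this page, the same arithmetic applies in Python):
E = [10.2, 10.1, 10.3, 10.15, 10.25];   % hypothetical population energies, close to each other
std(E) / mean(E)                        % ~0.008, so with tol = 0.01 the solver would stop here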

Anomaly in accuracy calculation

I am classifying a dataset with four classes using pretrained VGG19. To calculate accuracy, I used this formula:
accuracy = sum(predictedLabels==testLabels)/numel(predictedLabels) --Eq 1
Then I calculated the confusion matrix using:
confMat = confusionmat(testLabels, predictedLabels) --Eq 2
From which I got a matrix with 4 rows and 4 columns since I had 4 classes.
Now, we know that the accuracy formula is also:
Accuracy = (TP+TN)/(TP+TN+FP+FN) --Eq 3
So I also calculated Accuracy from my confusion matrix formed through Eq. 2 above, where
TP=value in (row==column),
FP=sum of column-TP,
FN=sum of row-TP,
TN=sum of the diagonal-TP
If I am doing the above steps right, then my confusion is that I am getting different accuracies from the two methods, Eq 1 and Eq 3. The accuracy I am getting with Eq. 1 is equivalent to the formula TP/(TP+TN). So, if this is the case, then Eq. 1 is the wrong formula for calculating accuracy. But this formula has been used across all MATLAB deep learning code.
So, either MATLAB is doing something wrong (which has probability 0, I know) or I am doing something wrong. But, unfortunately, I am unable to pinpoint my mistake.
Now, the question is,
Am I doing it wrong? Where am I missing the step? How to correct it? What is the logical explanation of this anomaly?
EDIT
This anomaly in accuracy calculation happens due to a class imbalance problem, i.e. when there are different numbers of samples in each class. Therefore, the regular accuracy formula in Eq. 3 will not work in such cases.
The main issue is that negative and positive apply to a binary prediction (is this a cat or not), while you are doing classification with more than two categories. The classifier doesn't give you positive and negative (for an "is it a cat" prediction), so it is not possible to treat answers as true positive or false positive etc. Therefore equation 3 is meaningless, and so is the method for computing TP, TN etc. For example, if TP is row==column as you defined, then these are the accurate values on the diagonal of confMat. But what is TN? According to your definition it is the sum of the diagonal (which is just the TP values of all classes) minus TP, which does not correspond to true negatives. I hope this helps put things on the right track.
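For completeness, the overall multi-class accuracy that corresponds to Eq. 1 can be read directly off the confusion matrix as the sum of the diagonal divided by the total number of samples; a short check (same variable names as above):
confMat = confusionmat(testLabels, predictedLabels);                            % as in Eq. 2
accFromConfMat = sum(diag(confMat)) / sum(confMat(:));                          % trace / total samples
accFromLabels  = sum(predictedLabels == testLabels) / numel(predictedLabels);   % Eq. 1
% accFromConfMat and accFromLabels are the same number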

step size tolerance violated using fmincon

I'm trying to solve a non-linear constrained optimization problem using MATLAB's fmincon function with the SQP algorithm. This solver has been successfully applied to my problem, as I found out during my literature research.
I know my problem's solution, but fmincon struggles to find it reliably. When running the optimization 100 times with randomly generated start values within my boundaries, I got about 40% good results. 'Good' means that the results are so close to the optimum that I would accept them, although those 'good' results correspond to different ExitFlags. Most common are ExitFlags -2 and 2:
ExitFlag = 2
Local minimum possible. Constraints satisfied.
fmincon stopped because the size of the current step is less than the selected value of the step size tolerance and constraints are satisfied to within the selected value of the constraint tolerance.
ExitFlag = -2
No feasible solution found.
fmincon stopped because the size of the current step is less than the selected value of the step size tolerance but constraints are not satisfied to within the selected value of the constraint tolerance.
The 'non-good' results deviate by about 2% from the optimal solution and correspond to ExitFlags 2 and -2 as well.
I played around with the tolerances, but without success. When relaxing the constraint tolerance, the number of ExitFlag -2 cases decreases and some ExitFlag 1 cases occur, but consequently the deviation from the optimal solution rises.
A big problem seems to be the step size, which violates its tolerance. Often the solver exits after 2 or 3 iterations because of a too-small step size / norm of the step (relative change in X is below TolX). Is there a way to counteract these problems? I'd like to tune the solver in a way that gives appropriate results reliably.
For your information, the options used:
options=optimset('fmincon');
options=optimset(options,...
'Algorithm','sqp',...
'ScaleProblem','obj-and-constr',...
'TypicalX',[3, 50, 3, 40, 50, 50],...
'TolX',1e-12,...%12
'TolFun',1e-8,...%6
'TolCon',1e-3,...%6
'MaxFunEvals',1000,... %1000
'DiffMinChange',1e-10);
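One thing that often helps with unreliable local convergence is to run the solver from many random start points and keep the best feasible result; a rough sketch, where objFun, nonlcon, lb and ub stand in for your objective, nonlinear constraints and bounds:
bestFval = Inf;
for k = 1:20
    x0 = lb + rand(size(lb)).*(ub - lb);                              % random start within the bounds
    [x, fval, exitflag] = fmincon(objFun, x0, [],[],[],[], lb, ub, nonlcon, options);
    if exitflag > 0 && fval < bestFval                                % keep the best run that converged
        bestFval = fval;
        bestX = x;
    end
end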

Why is eigs('lm') much faster than eigs('sm')?

I use eigs to calculate the eigenvectors of sparse square matrices which are large (tens of thousands).
What I want are the eigenvectors corresponding to the smallest eigenvalues.
But
eigs(A, 10, 'sm') % Note: A is the matrix
runs very slow.
However, using eigs(A, 10, 'lm') gives me the answer relatively faster.
And, as I tried, replacing 10 with A_width in eigs(A, 10, 'lm') so that it includes all the eigenvectors doesn't solve the problem, because that makes it as slow as using 'sm'.
So, I want to know why calculating the smallest eigenvectors (using 'sm') is much slower than calculating the largest?
BTW, if you have any idea about how to use eigs with 'sm' as fast as with 'lm', please tell me that.
The algorithm used in pretty much any standard eigs function is (some variation of) the Lanczos algorithm. It is iterative and the first iterations give you the largest eigenvalues. This explains pretty much every observation you make:
Largest eigenvalues take the least amount of iterations,
Smallest eigenvalues take the maximum amount of iterations,
All eigenvalues also take the maximum amount of iterations.
There are tricks to "fool" eigs into calculating the smallest eigenvalues by actually making them the largest eigenvalues of another problem. This is usually accomplished by a shift parameter. Skimming over the Matlab documentation for eigs, I see that they have a sigma parameter, which might help you. Note the same documentation recommends proper eig if the matrix fits into memory, as eigs has its numerical quirks.
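For example, a hedged sketch of using the sigma argument (a numeric sigma makes eigs work in shift-invert mode, which needs a factorization of A, so it is only cheap if that factorization is affordable):
d = eigs(A, 10, 0);               % 10 eigenvalues of A closest to zero (shift-invert with sigma = 0)
% d = eigs(A, 10, 'smallestabs'); % equivalent named option in recent MATLAB releases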
Since eigs is actually an m-file function, we can profile it. I have run a couple of basic tests, and it depends very much on the nature of the data in the matrix. If we run the profiler separately on the following two lines of code:
eigs(eye(1000), 10, 'lm'), and
eigs(eye(1000), 10, 'sm'),
then in the first instance it calls arpackc (the main function that does the work - according to the comments in eigs it's probably from here) a total of 22 times. In the second instance it is called 103 times.
On the other hand, trying it with
eigs(rand(1000), 10, 'lm'), and
eigs(rand(1000), 10, 'sm'),
I get results where the 'lm' option consistently calls arpackc many more times than the sm option.
I'm afraid I don't know the details of the algorithm, and so can't explain it in any deeper mathematical sense, but the page that I linked suggests ARPACK is best for matrices with some structure. Since matrices generated by rand have little structure, it is probably safe to assume the latter behaviour I described is not what you'd expect under normal operating conditions.
In short: it simply takes the algorithm more iterations to converge when you ask it for the smallest eigenvalues of a structured matrix. This being an iterative process, however, it very much depends on the actual data you give it.
Edit: There is a wealth of information and references about this method here, and the key to understanding exactly why this happens is surely contained somewhere therein.
The reason is actually much more simple and due to the basics of solving large sparse eigenvalue problems. These are all based on solving:
(1) A x = lam x
Most solution methods are based on some form of power iteration (e.g. building a Krylov subspace, as in both the Lanczos and Arnoldi methods).
The thing is that such power iterations converge to the largest eigenvalue of (1). Therefore the largest eigenvalues are found from the subspace spanned by K^k = {A*r0, ..., A^k*r0}, which requires only matrix-vector multiplications (cheap).
To find the smallest, we have to reformulate (1) as follows:
(2) A^(-1) x = (1/lam) x, i.e. A^(-1) x = invlam x with invlam = 1/lam
Now solving for the largest eigenvalue of (2) is equivalent to finding the smallest eigenvalue of (1). In this case the subspace is spanned by K^k = {A^(-1)*r0, ..., A^(-k)*r0}, which requires solving several linear systems (expensive!).
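A bare-bones illustration of the cost difference (plain power iteration versus inverse iteration; eigs itself runs ARPACK's Lanczos/Arnoldi, but the per-step cost is analogous):
x = rand(size(A,1), 1);
for k = 1:50
    x = A*x;          % power iteration: one matrix-vector product per step -> largest |eigenvalue|
    x = x / norm(x);
end
y = rand(size(A,1), 1);
for k = 1:50
    y = A \ y;        % inverse iteration: one linear solve per step -> smallest |eigenvalue|
    y = y / norm(y);
end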

Minimization of L1-Regularized system, converging on non-minimum location?

This is my first post to Stack Overflow, so if this isn't the correct area I apologize. I am working on minimizing an L1-regularized system.
This weekend is my first dive into optimization. I have a basic linear system Y = X*B, where X is an n-by-p matrix, B is a p-by-1 vector of model coefficients and Y is an n-by-1 output vector.
I am trying to find the model coefficients, and I have implemented both gradient descent and coordinate descent algorithms to minimize the L1-regularized system. To find my step size I am using the backtracking algorithm, and I terminate the algorithm by looking at the norm-2 of the gradient, stopping if it is 'close enough' to zero (for now I'm using 0.001).
The function I am trying to minimize is (0.5)*(norm((Y - X*B),2)^2) + lambda*norm(B,1). (Note: by norm(Y,2) I mean the norm-2 value of the vector Y.) My X matrix is 150-by-5 and is not sparse.
If I set the regularization parameter lambda to zero I should converge on the least squares solution, I can verify that both my algorithms do this pretty well and fairly quickly.
If I start to increase lambda, my model coefficients all tend towards zero. This is what I expect, but my algorithms never terminate, because the norm-2 of the gradient is always a positive number. For example, a lambda of 1000 will give me coefficients in the 10^(-19) range, but the norm-2 of my gradient is ~1.5; this is after several thousand iterations. While my gradient values all converge to something in the 0 to 1 range, my step size becomes extremely small (10^(-37) range). If I let the algorithm run for longer the situation does not improve; it appears to have gotten stuck somehow.
Both my gradient and coordinate descent algorithms converge on the same point and give the same norm2(gradient) number for the termination condition. They also work quite well with a lambda of 0. If I use a very small lambda (say 0.001) I get convergence; a lambda of 0.1 looks like it would converge if I ran it for an hour or two; with any greater lambda the convergence rate is so small it's useless.
I had a few questions that I think might relate to the problem?
In calculating the gradient I am using a finite difference method, (f(x+h) - f(x-h))/(2h), with an h of 10^(-5). Any thoughts on this value of h?
Another thought was that at these very tiny steps it is traveling back and forth in a direction nearly orthogonal to the minimum, making the convergence rate so slow it is useless.
My last thought was that perhaps I should be using a different termination method, perhaps looking at the rate of convergence, if the convergence rate is extremely slow then terminate. Is this a common termination method?
The 1-norm isn't differentiable. This will cause fundamental problems with a lot of things, notably the termination test you chose; the gradient will change drastically around your minimum and fail to exist on a set of measure zero.
The termination test you really want will be along the lines of "there is a very short vector in the subgradient."
It is fairly easy to find the shortest vector in the subgradient of ||Ax-b||_2^2 + lambda ||x||_1. Choose, wisely, a tolerance eps and do the following steps:
Compute v = grad(||Ax-b||_2^2).
If x[i] < -eps, then subtract lambda from v[i]. If x[i] > eps, then add lambda to v[i]. If -eps <= x[i] <= eps, then add the number in [-lambda, lambda] to v[i] that minimises |v[i]|.
You can do your termination test here, treating v as the gradient. I'd also recommend using v for the gradient when choosing where your next iterate should be.
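A hedged MATLAB sketch of that test (A, b, x and lambda as above; eps_tol is the tolerance called eps in the steps, renamed so it doesn't shadow MATLAB's built-in eps; the gradient of the smooth part is computed analytically rather than by finite differences):
v = 2*A'*(A*x - b);                             % gradient of ||Ax - b||_2^2
for i = 1:numel(x)
    if x(i) < -eps_tol
        v(i) = v(i) - lambda;
    elseif x(i) > eps_tol
        v(i) = v(i) + lambda;
    else
        v(i) = v(i) + max(-lambda, min(lambda, -v(i)));   % element of [-lambda, lambda] minimising |v(i)|
    end
end
smallSubgradient = norm(v) < 1e-3;              % terminate when the shortest subgradient vector is tiny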