I am trying to find a minimum using fmincon in MATLAB, and I am facing the following problem:
Optimization completed because the size of the gradient at the initial point
is less than the default value of the function tolerance.
My objective function's surface shows "steps", and therefore it has the same value over certain ranges of the input variables (so the size of the gradient is zero there, if I am correct).
When moving from the initial point, the solver doesn't see any changes in the objective function's value, and finishes the optimization:
Iteration Func-count f(x) Step-size optimality
0 3 581.542 0
Initial point is a local minimum.
Optimization completed because the size of the gradient at the initial point
is less than the default value of the function tolerance.
Is there any way to make the solver keep moving while the objective function's value stays unchanged (until the objective function starts to increase)?
Thanks for your help.
I post my extended comment as an answer in the hope that it will be easier for future answer seekers to find the solution:
Probably you would get reasonable results with a non-gradient-based solver, e.g. ga, if evaluating the objective function is not costly. These solvers do not depend on the gradient and perform well on non-smooth functions. It is also worth reading the following guide before selecting a solver algorithm: How to choose solver.
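For example, a minimal sketch of calling ga in place of fmincon (myStepObjective, nvars, lb and ub are placeholders, not taken from the question):

% Minimal sketch: a non-gradient-based solver on a stepped objective.
nvars = 2;                         % number of decision variables (assumed)
lb = [0 0];  ub = [10 10];         % box bounds (assumed)
opts = optimoptions('ga', 'Display', 'iter');
[x, fval] = ga(@myStepObjective, nvars, [], [], [], [], lb, ub, [], opts);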
The answer is right there:
Initial point is a local minimum.
The point you are giving as the initial point is already a local minimum. So the algorithm finds that minimum and sticks there.
In order to find another local minimum, or maybe the global one, change the initial point to something far away from this local minimum.
In order to find the global minimum, use a global optimization technique.
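If you have the Global Optimization Toolbox, one way to automate the "try several initial points" idea is MultiStart, which re-runs fmincon from many start points; a minimal sketch (myObjective, x0, lb and ub are placeholders):

% Re-run fmincon from 25 random start points and keep the best result.
problem = createOptimProblem('fmincon', 'objective', @myObjective, ...
    'x0', x0, 'lb', lb, 'ub', ub);
ms = MultiStart;
[xBest, fBest] = run(ms, problem, 25);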
Related
I am coding lrCostFunction.m in Octave for the Coursera Machine Learning course (Neural Networks, "ex3"). I don't get why we need to obtain "grad". Does anybody have a clue?
Thx in advance
Grad refers to the 'gradient' of the cost function.
Your objective is to minimize the cost function. In order to do that, most optimisation algorithms also need to know the equation that gives its gradient at each point, so that they can use it to move the next search in a direction that makes it more likely that the cost function will be at a lower value.
Specifically, since the gradient at a point is defined as the direction of maximal rate of 'increase' in the underlying function, typically optimisation algorithms use the current point and take a small step in the reverse direction to that indicated by the gradient.
In any case, since you're asking an abstract optimisation algorithm to optimise parameters such that a cost function is minimized by making use of its gradient at each step, you need to provide all of those inputs to the algorithm. Hence why you need to calculate the 'grad' value as well as the value of the cost function itself at each point.
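As a rough sketch of what the optimizer expects (not the exact regularized lrCostFunction from the course exercise), the cost function, e.g. in a file costFunction.m, returns both the cost and its gradient, and the optimizer is told a gradient is available:

% costFunction returns the cost J and its gradient grad for logistic regression.
function [J, grad] = costFunction(theta, X, y)
    m = length(y);
    h = 1 ./ (1 + exp(-X * theta));                          % sigmoid hypothesis
    J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));      % cross-entropy cost
    grad = (1/m) * X' * (h - y);                             % gradient w.r.t. theta
end

% The optimizer then consumes both outputs:
% options = optimset('GradObj', 'on', 'MaxIter', 400);
% theta = fminunc(@(t) costFunction(t, X, y), initial_theta, options);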
What is the concept behind taking the derivative? It's interesting that in order to somehow teach a system, we have to adjust its weights. But why do we do this using the derivative of the transfer function? What is it about the derivative that helps us? I know the derivative is the slope of a continuous function at a given point, but what does it have to do with the problem?
You must already know that the cost function is a function with the weights as the variables.
For now consider it as f(W).
Our main motive here is to find a W for which we get the minimum value for f(W).
One way of doing this is to plot the function f on one axis and W on another... but remember that here W is not just a single variable but a collection of variables.
So what can be the other way?
It can be as simple as changing the values of W and seeing whether we get a lower value of f(W) than before.
But taking random values for all the variables in W can be a tedious task.
So what we do is first take random values for W, look at the output of f(W), and compute the slope with respect to each variable (we get this by partially differentiating the function with respect to the i-th variable and plugging in its current value).
Now, once we know the slope at that point in space, we move a little further towards the lower side of the slope (this little factor is termed alpha in gradient descent), and this goes on until the slope flips sign, indicating that we have already reached the lowest point in the graph (a graph with n dimensions, function vs. W, W being a collection of n variables).
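A minimal sketch of this loop on a toy cost f(W) = sum(W.^2), whose slope is known to be 2*W (the names and numbers are only illustrative):

f     = @(W) sum(W.^2);         % cost as a function of the weight vector W
gradf = @(W) 2 * W;             % slope (partial derivatives) at W
W     = randn(3, 1);            % start from random weights
alpha = 0.1;                    % the "little factor" (learning rate)
for iter = 1:100
    W = W - alpha * gradf(W);   % move a little towards the lower side of the slope
end
disp(f(W))                      % the cost is now close to its minimum (0)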
The reason is that we are trying to minimize the loss. Specifically, we do this by a gradient descent method. It basically means that from our current point in the parameter space (determined by the complete set of current weights), we want to go in a direction which will decrease the loss function. Visualize standing on a hillside and walking down the direction where the slope is steepest.
Mathematically, the direction that gives you the steepest descent from your current point in parameter space is the negative gradient. And the gradient is nothing but the vector made up of all the derivatives of the loss function with respect to each single parameter.
Backpropagation is an application of the Chain Rule to neural networks. If the forward pass involves applying a transfer function, the gradient of the loss function with respect to the weights will include the derivative of the transfer function, since the derivative of f(g(x)) is f’(g(x))g’(x).
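As an illustration (with made-up numbers, not taken from the question), here is that chain rule written out for a single sigmoid neuron with squared error, L = 0.5*(a - y)^2 where a = sigma(w*x + b):

x = 1.5;  y = 0;  w = 0.8;  b = -0.2;     % input, target, weight, bias (assumed)
z = w*x + b;
a = 1 / (1 + exp(-z));                    % transfer (activation) function
dL_da = a - y;                            % outer derivative
da_dz = a * (1 - a);                      % derivative of the transfer function
dz_dw = x;
dL_dw = dL_da * da_dz * dz_dw;            % chain rule: f'(g(x)) * g'(x)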
Your question is a really good one! Why should I move the weight more in one direction when the slope of the error w.r.t. the weight is high? Does that really make sense? In fact it does make sense if the error function w.r.t. the weight is a parabola, but it is a wild guess to assume it is a parabola. As rcpinto says, assuming the error function is a parabola makes the derivation of the updates simple with the Chain Rule.
However, there are some other parameter update rules that actually address this non-intuitive assumption. You can make an update rule that moves the weight a fixed-size step in the down-slope direction, and then perhaps decrease the step size logarithmically later in the training. (I'm not sure if this method has a formal name.)
There are also some alternative error functions that can be used. Look up Cross Entropy in your neural network textbook. It is an adjustment to the error function such that the derivative-of-the-transfer-function factor in the update rule cancels out. Just remember to pick the right cross-entropy function based on your output transfer function.
When I first started getting into Neural Nets, I had this question too.
The other answers here have explained the math which makes it pretty clear that a derivative term will appear in your calculations while you are trying to update the weights. But all of those calculations are being done in order to implement Back-propagation, which is just one of the ways of updating weights! Now read on...
You are correct in assuming that at the end of the day, all a neural network tries to do is update its weights to fit the data you feed into it. Within this statement lies your answer too. What you are getting confused with here is the idea of the Back-propagation algorithm. Many textbooks use backprop to update neural nets by default but do not mention that there are other ways to update weights too. This leads to the confusion that neural nets and backprop are the same thing and are inherently connected. This also leads to the false belief that neural nets need backprop to train.
Please remember that Back-propagation is just ONE of the ways out there to train your neural network (although it is the most famous one). Now, you must have seen the math involved in backprop, and hence you can see where the derivative term comes in from (some other answers have also explained that). It is possible that other training methods won't need the derivatives, although most of them do. Read on to find out why....
Think about this intuitively: we are talking about CHANGING weights, and the mathematical operation directly related to change is the derivative, so it makes sense that you need to evaluate derivatives in order to change the weights.
Do let me know if you are still confused and I'll try to modify my answer to make it better. Just as a parting piece of information, another common misconception is that gradient descent is a part of backprop, just like it is assumed that backprop is a part of neural nets. Gradient descent is just one way to minimize your cost function, there are plenty of others you can use. One of the answers above makes this wrong assumption too when it says "Specifically Gradient Descent". This is factually incorrect. :)
Training a neural network means minimizing an associated "error" function w.r.t. the network's weights. Now there are optimization methods that use only function values (the simplex method of Nelder and Mead, Hooke and Jeeves, etc.), methods that in addition use first derivatives (steepest descent, quasi-Newton, conjugate gradient), and Newton methods that use second derivatives as well. So if you want to use a derivative-based method, you have to calculate the derivatives of the error function, which in turn involves the derivatives of the transfer or activation function.
Back propagation is just a nice algorithm to calculate the derivatives, and nothing more.
Yes, the question is really good; it also came into my head while I was trying to understand backpropagation. After doing forward propagation on the neural network, we do backpropagation through the network to minimize the total error, and there are also many other ways to minimize the error. Your question is why we take derivatives in backpropagation. The reason is that, as we all know, the derivative gives the slope of a function, or in other words the change in one quantity with respect to another. So here we take derivatives in order to minimize the total error with respect to the corresponding weights of the network.
By taking the derivative of the total error with respect to the weights we find its slope, or in other words how much the total error changes for a small change in a weight, so that we can update the weight to reduce the error with the gradient descent formula: weight = weight - alpha * (del(total error)/del(weight)). Or in other words: new weights = old weights - learning rate x partial derivatives of the loss function w.r.t. the parameters.
Here alpha is the learning rate, which controls the weight update: because of the minus sign in the formula, a negative derivative produces a positive update and a positive derivative produces a negative one, so the weight always moves in the direction that reduces the total error. Also, since the derivative is multiplied by alpha, the effective step shrinks as the weight converges towards its optimal value (minimum error). That is why we take the derivative to minimize the error.
In my model each agent solves a system of ODEs at each tick. I have employed Euler's method (similar to the system dynamics modeler in NetLogo) to solve these first-order ODEs. However, for a stable solution, I am forced to use a very small time step (dt), which means the simulation proceeds very slowly with this method. I'm curious if anyone has advice on a method to solve the ODEs more quickly? I am considering implementing Runge-Kutta (with a larger time step?) as was done here (http://academic.evergreen.edu/m/mcavityd/netlogo/Bouncing_Ball.html). I would also consider using the R extension and an ODE solver in R. But again, the ODEs are solved by each agent, so I don't know if this is an efficient method.
I'm hoping someone has a feel for the performance of these methods and could offer some advice. If not, I will try to share what I find out.
In general your idea is correct. For a method of order p to reach a global error level tol over an integration interval of length T you will need a step size in the magnitude range
h=pow(tol/T,1.0/p).
However, it is not only the discretization error that accumulates over the N=T/h steps, but also the floating-point error. This gives a lower bound for useful step sizes of magnitude h=pow(T*mu,1.0/(p+1)).
Example: For T=1, mu=1e-15 and tol=1e-6
the Euler method of order 1 would need a step size of about h=1e-6 and thus N=1e+6 steps and function evaluations. The range of step sizes where reasonable results can be expected is bounded below by h=3e-8.
the improved Euler or Heun method has order 2, which implies a step size of 1e-3, N=1000 steps and 2N=2000 function evaluations; the lower bound for useful step sizes is 1e-5.
the classical Runge-Kutta method has order 4, which gives a required step size of about h=3e-2 with about N=30 steps and 4N=120 function evaluations. The lower bound is 1e-3.
So there is a significant gain to be had by using higher-order methods. At the same time, the range of step sizes over which reducing the step size still reduces the global error becomes significantly narrower as the order increases, even though the achievable accuracy improves. So one has to know when the point is reached to leave well enough alone.
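For reference, the numbers above can be reproduced directly from the two quoted formulas:

T = 1;  tol = 1e-6;  mu = 1e-15;
p = [1 2 4];                        % orders: Euler, Heun, classical RK4
h_required = (tol/T).^(1./p)        % ~[1e-6, 1e-3, 3e-2]
N_steps    = T ./ h_required        % ~[1e+6, 1e+3, 32]
h_lower    = (T*mu).^(1./(p+1))     % ~[3e-8, 1e-5, 1e-3]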
The implementation of RK4 in the ball example, as in general for the numerical integration of ODEs, is for an ODE system x'=f(t,x), where x is the, possibly very large, state vector.
A second order ODE (system) is transformed to a first order system by making the velocities members of the state vector. x''=a(x,x') gets transformed to [x',v']=[v, a(x,v)]. The big vector of the agent system is then composed of the collection of the pairs [x,v] or, if desired, as the concatenation of the collection of all x components and the collection of all v components.
In an agent based system it is reasonable to store the components of the state vector belonging to the agent as internal variables of the agent. Then the vector operations are performed by iterating over the agent collection and computing the operation tailored to the internal variables.
Taking into consideration that in the LOGO language there are no explicit parameters for function calls, the evaluation of dotx = f(t,x) needs to first fix the correct values of t and x before calling the function evaluation of f:
save t0=t, x0=x
evaluate k1 = f_of_t_x
set t=t0+h/2, x=x0+h/2*k1
evaluate k2=f_of_t_x
set x=x0+h/2*k2
evaluate k3=f_of_t_x
set t=t0+h, x=x0+h*k3
evaluate k4=f_of_t_x
set x=x0+h/6*(k1+2*(k2+k3)+k4)
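For comparison, the same RK4 step written as a plain MATLAB function rather than NetLogo-style pseudocode (a sketch; f is assumed to be a function handle returning the derivative of the state vector):

% One classical RK4 step for x' = f(t, x).
function [t, x] = rk4_step(f, t, x, h)
    k1 = f(t,       x);
    k2 = f(t + h/2, x + h/2 * k1);
    k3 = f(t + h/2, x + h/2 * k2);
    k4 = f(t + h,   x + h   * k3);
    x  = x + h/6 * (k1 + 2*(k2 + k3) + k4);
    t  = t + h;
end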
pow = fsolve(@eqns, pop);
This is the code I am using to solve a 2x2 non-linear system of equations, defined in the function eqns.m.
pop is a 2x1 initialisation vector pretty close to the solution. When I run it, the output says
No solution found.

fsolve stopped because the relative size of the current step is less than the default value of the step size tolerance squared, but the vector of function values is not near zero as measured by the default value of the function tolerance. <stopping criteria details>
Any way out? I tried intentionally moving the initial point further away from the solution, but it is still not working. How do I set the tolerance or some other parameter? Some posts gave me the impression that supplying the Jacobian to MATLAB can be helpful, but how do I do that? Please note that I need the solution in the form of code that I can put in a function file to be called repeatedly; I believe the interactive optimtool toolbox would not help here. Any help, please?
Also, from the documentation, fsolve can employ three different algorithms. Is any of them more helpful than the others for certain problem structures? Where can I get a comparative study of them, suitable for a non-expert in optimisation?
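For what it's worth, a hedged sketch of how tolerances and a user-supplied Jacobian are typically passed to fsolve (these option names are from newer MATLAB releases; older releases use optimset with 'Jacobian','on', 'TolFun' and 'TolX'; the actual system inside eqns is not reproduced here):

opts = optimoptions('fsolve', ...
    'Algorithm', 'trust-region-dogleg', ...   % default; 'trust-region' and 'levenberg-marquardt' are the alternatives
    'SpecifyObjectiveGradient', true, ...     % eqns must then return [F, J]
    'FunctionTolerance', 1e-10, ...
    'StepTolerance', 1e-12, ...
    'Display', 'iter');
pow = fsolve(@eqns, pop, opts);

% eqns has to return both the residual vector F and the Jacobian J:
% function [F, J] = eqns(x)
%     F = [...];   % 2x1 vector of equation residuals
%     J = [...];   % 2x2 matrix of partial derivatives dF_i/dx_j
% end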
I'm trying to solve a problem using MATLAB's genetic algorithm and fmincon functions where the variables' values do not have single upper and lower bounds. Instead, each variable should be allowed either to take the value x=0 or to lie within lb<=x<=ub. This is a turbine allocation problem, where a turbine can either be turned off (x=0) or operate within its lower and upper cavitation limits (lb and ub). Of course I can trick the problem by creating a constraint which is violated for values between 0 and lb, but I'm finding that the problem has a hard time converging like this. Is there an easier way to do this, which will trim down the search space?
If the number of variables is small enough (say, 10 or 15 or fewer), then you can try every subset of variables that are allowed to be non-zero and see which subset gives you the optimal value. If you can't make assumptions about the structure of your optimization problem (e.g. you have penalties for non-zero variables but your main objective function is "exotic"), this is essentially the best you can do. If you are willing to settle for an approximate solution, you can add a so-called "L1" penalty to your objective function, i.e. a constant times the sum of the absolute values of the variables. This encourages some variables to be exactly zero, and if your main objective function is convex then the resulting objective function stays convex, because the absolute value is itself a convex function and a sum of convex functions is convex. It is much easier to minimize convex functions, because any local minimum of a convex function is also a global minimum, and you can reach it using any number of optimization routines (including the ones implemented in MATLAB).
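A minimal sketch of the L1-penalty idea with fmincon (baseObjective, ub and lambda are placeholders for the actual allocation problem):

% Add lambda*sum(abs(x)) to the original objective and relax the lower
% bound to 0, so turbines are encouraged to switch off entirely.
lambda = 10;                                      % penalty weight (to be tuned)
penObj = @(x) baseObjective(x) + lambda * sum(abs(x));
x0     = ub / 2;                                  % some feasible start point
[x, fv] = fmincon(penObj, x0, [], [], [], [], zeros(size(ub)), ub);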