I am a beginner in Deep Learning, and I came across the concept of 'Gradient Checking'.
I just want to know: what is it, and how can it help improve the training process?
Why do we need Gradient Checking?
Backprop as an algorithm has a lot of details and can be a little tricky to implement, and one unfortunate property is that there are many ways to introduce subtle bugs. If you then run it with gradient descent or some other optimization algorithm, it can actually look like it's working: your cost function J(theta) may end up decreasing on every iteration of gradient descent, even though there is a bug in your implementation of backprop. So J(theta) looks like it is decreasing, but you may wind up with a neural network that has a higher level of error than a bug-free implementation would give you, and you might never know that this subtle bug was costing you performance. So, what can we do about it? There's an idea called gradient checking that eliminates almost all of these problems.
What is Gradient Checking?
We describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct. Carrying out this derivative-checking procedure significantly increases your confidence in the correctness of your code.
In short, Gradient Checking is a way of debugging your backprop algorithm: it carries out this derivative-checking procedure.
How to implement Gradient Checking?
You can find this procedure here.
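In essence, the check compares your analytic gradient against a centered-difference approximation. Here is a minimal sketch in Python/NumPy; the toy cost function and the 1e-7 scale are illustrative assumptions, not part of any particular course:

    import numpy as np

    def numerical_gradient(J, theta, eps=1e-7):
        # Centered difference: (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e[i] = eps
            grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
        return grad

    # Toy cost J(theta) = theta . theta, whose true gradient is 2*theta;
    # in practice, `analytic` is whatever your backprop code returns.
    J = lambda theta: theta @ theta
    theta = np.array([1.0, -2.0, 3.0])
    analytic = 2 * theta
    numeric = numerical_gradient(J, theta)

    # Relative difference; values around 1e-7 or smaller suggest a correct backprop.
    denom = np.linalg.norm(analytic) + np.linalg.norm(numeric)
    print(np.linalg.norm(analytic - numeric) / denom)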
I have a question regarding the results we obtain from ODE solvers. I will try my best to explain it briefly. If, for example, we run a simulation with ANSYS or any other FEA package, there are many parameters for checking the quality of the final results before we draw conclusions from them.
But in a numerical simulation, we are the ones who give relTol, absTol and other parameter values to the solver to improve the accuracy of the calculation. Take, for example, solve_ivp, the highly customisable solver available in SciPy.
Q1) How exactly do we make sure the results of the solver are acceptable?
Q2) In what ways can we check the quality of the final results before we draw conclusions from them?
Q3) How can we further improve the accuracy of the results by changing solver options?
I would highly appreciate it if you could share your ideas with sample code.
IMO, Q1 and Q2 are the same question. The reliability of the results will depend on the accuracy of the mathematical model with respect to the simulated phenomenon (for instance, assuming linearity when linearity is questionable) and on the precision of the algorithm. You need to check that the method converges and, if it converges, that it converges to a correct solution.
Ideally, you should compare your results to "ground truth" on typical problems. Ground truth can be obtained from a lab experiment, or by using an alternative method known to yield correct results. Without this, you will never be sure that your numerical method is valid, other than by an act of faith.
To understand the effect of the parameters and address Q3, you can solve the same problem with different parameter settings and observe their effect, one by one. After a while, you should get a better understanding of the convergence properties in relation to the parameter settings.
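To make that concrete, here is a minimal solve_ivp sketch of the idea: solve a problem with a known exact solution (dy/dt = -y, y(0) = 1) at several tolerance settings and watch the error shrink. On your real problem, where the exact solution is unknown, you would instead watch whether the answer stops changing as the tolerances tighten:

    import numpy as np
    from scipy.integrate import solve_ivp

    def f(t, y):
        return -y                      # exact solution: y(t) = exp(-t)

    for tol in [1e-3, 1e-6, 1e-9]:
        sol = solve_ivp(f, (0.0, 5.0), [1.0], method="RK45", rtol=tol, atol=tol)
        err = abs(sol.y[0, -1] - np.exp(-5.0))
        print(f"rtol = atol = {tol:.0e}   error at t=5: {err:.2e}")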
As I usually do when isolated at home for too long, I was thinking about back-propagation.
If my thought process is correct, we never actually need to compute the cost in order to compute the weight updates. We only ever need to compute the derivative of the cost.
Is this correct?
I imagine that the only reason to compute the cost would be to check whether the network is actually learning.
I really believe I am correct, but from what I can find on the internet, no one seems to make this observation. So maybe I am wrong. If I am, I have a deep misunderstanding of backpropagation that I need to fix.
You are correct.
The cost function is what tells you how much the solution costs. The gradient is what carries the information about how to make it cost less.
You could shift the cost by any constant addition or subtraction and it wouldn't make a difference, because there is no way to make that part of the cost go down.
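A five-line check of that claim, using finite differences on a made-up one-dimensional cost:

    # d/dx [J(x) + c] = dJ/dx: the constant c contributes nothing to the gradient.
    def grad_fd(J, x, eps=1e-6):
        return (J(x + eps) - J(x - eps)) / (2 * eps)

    J = lambda x: x**2
    J_shifted = lambda x: x**2 + 1000.0    # same cost plus an arbitrary constant

    print(grad_fd(J, 3.0), grad_fd(J_shifted, 3.0))   # both ~6.0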
Yes. Back-propagation (auto-differentiation) needs gradients, not the loss value. Once the forward pass is formulated, everything needed to formulate the gradients is available.
Another justification is that the back-propagation formula is just the chain rule, in which the loss value itself never appears.
I really believe I am correct, but from what I can find on the internet, no one seems to make this observation.
Indeed. NN articles and textbooks always talk about the loss, but they don't make it clear that all we need for back-propagation are the gradients in the chain rule, with which we can do gradient descent.
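As a concrete illustration (a toy linear-regression loop, purely for demonstration): the update below uses only the gradient of the MSE, 2 X'(Xw - y)/N, and the loss value itself is never evaluated.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    y = X @ np.array([1.0, -2.0, 0.5])      # targets from known true weights
    w = np.zeros(3)

    for _ in range(200):
        residual = X @ w - y                # forward pass: predictions minus targets
        grad = 2 * X.T @ residual / len(y)  # gradient of the MSE; the MSE itself is never computed
        w -= 0.1 * grad                     # the update needs only the gradient

    print(w)                                # converges toward [1.0, -2.0, 0.5]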
The QP problem is convex. Per Wikipedia, the problem can be solved in polynomial time.
But what exactly is the order?
That is an interesting question with (in my opinion) no clear answer. I am going to assume your problem is convex and you are interested in run-time complexity (as opposed to Iteration complexity).
As you may know, quadprog is not one algorithm but rather a generic name for something that solves quadratic problems. It uses a set of algorithms underneath, viz. Interior Point (the default), Trust-Region, and Active-Set. Source.
Depending upon which one you choose, each of these algorithms will have its own complexity analysis. For Trust-Region and Active-Set methods, the complexity analysis is extremely hard. In fact, Active-Set methods are not polynomial to begin with: counterexamples exist where Active-Set methods take exponential "time" to converge (this is also true of the Simplex Method for Linear Programs). Source.
Now, assuming that you choose Interior Point methods, the answer is still not straightforward, because there are various flavours of these methods. When Karmarkar first proposed this method, it was the first known polynomial algorithm for solving Linear Programs, with a complexity of O(n^3.5). Source. These bounds were improved considerably later. However, all of this is for Linear Programs.
Finally, to answer your question, Ye and Tse proved in 1989 that we can have an Interior Point method with complexity O(n^3). However, whether MATLAB uses this exact flavor of Interior Point method is a little tricky to know but O(n^3) would be my best guess.
Of course, my answer is rather theoretical; if you want to test it empirically, you can do so by gradually increasing the number of variables and plotting the CPU time required, to get an estimate.
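For instance, a rough sketch of that experiment in Python (here timing the dense linear solve that dominates each interior-point iteration, as a stand-in for a full quadprog call; substitute your actual solver, since the measured exponent reflects whatever you time):

    import time
    import numpy as np

    sizes, times = [200, 400, 800, 1600], []
    rng = np.random.default_rng(0)
    for n in sizes:
        A = rng.standard_normal((n, n))
        H = A.T @ A + n * np.eye(n)    # symmetric positive definite Hessian
        f = rng.standard_normal(n)
        t0 = time.perf_counter()
        np.linalg.solve(H, -f)         # Newton/KKT-style solve, O(n^3)
        times.append(time.perf_counter() - t0)

    # Fit the slope of log(time) vs log(n) to estimate the order of growth.
    slope = np.polyfit(np.log(sizes), np.log(times), 1)[0]
    print(f"empirical exponent ~ {slope:.2f}")   # expect something near 3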
I have a program that uses the PSO algorithm with a penalty function for constraint satisfaction. But whenever I run the program, the output of the algorithm for every iteration is:
"Iteration 1: Best Cost = Inf"
...
Does anyone know why I always get inf answer?
There could be many reasons for that, and none of our guesses will be accurate if you don't provide an MWE with the code you have already tried, or the context of the function you are analysing.
For instance, while studying the PSO algorithm, you might first use it on functions which have analytical solutions. By doing this you can study the behaviour of the algorithm and fine-tune its parameters before applying it to a problem like yours.
My guess is that you might not be providing the right function (I have done that already; getting a sign wrong is easy!), or the right constraints (the same logic applies), or that your weights for the penalty function and velocity update are way off.
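One concrete failure mode worth ruling out (the objective and constraint here are made up, purely for illustration): if the penalty returns Inf for any constraint violation and no particle in the initial swarm happens to be feasible, then the best cost is Inf from iteration 1 and gradient-free search has nothing to follow. A finite quadratic penalty still ranks infeasible particles, so the swarm can move toward feasibility:

    import numpy as np

    rng = np.random.default_rng(0)
    particles = rng.uniform(-5.0, 4.0, size=(30, 2))  # initial swarm positions

    def objective(x):
        return np.sum(x**2)

    def violation(x):
        # Constraint x[0] + x[1] >= 9: no initial particle can satisfy it,
        # since each coordinate is at most 4 at initialisation.
        return max(0.0, 9.0 - (x[0] + x[1]))

    # "Death penalty": Inf whenever infeasible -> Best Cost = Inf
    hard = [objective(p) if violation(p) == 0.0 else np.inf for p in particles]
    print("hard penalty best cost:", min(hard))

    # Finite quadratic penalty: infeasible particles are still comparable
    mu = 100.0
    soft = [objective(p) + mu * violation(p)**2 for p in particles]
    print("soft penalty best cost:", min(soft))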
Are there any faster and more efficient solvers than fmincon? I'm using fmincon for a specific problem and I run out of memory for a modestly sized vector variable. I don't have any supercomputers or cloud-computing options at my disposal, either. I know that any alternative solver will probably still run out of memory, but I'm just trying to see where the problem is.
P.S. I don't want a solution that would change the way I'm approaching the actual problem. I know convex optimization is the way to go and I have already done enough work to get up until here.
P.P.S. I saw the other question regarding open-source alternatives. That's not what I'm looking for. I'm looking for more efficient ones, in case someone faced the same problem and shifted to a better solver.
Hmmm...
Without further information, I'd guess that fmincon runs out of memory because it needs the Hessian, which, given that your decision variable has 10^4 elements, will be 10^4 x 10^4 entries large.
It also takes a lot of time to determine the values of the Hessian, because fmincon normally uses finite differences for that if you don't specify derivatives explicitly.
There's a couple of things you can do to speed things up here.
If you know beforehand that there will be a lot of zeros in your Hessian, you can pass sparsity patterns of the Hessian matrix via HessPattern. This saves a lot of memory and computation time.
If it is fairly easy to come up with explicit formulae for the Hessian of your objective function, create a function that computes the Hessian and pass it on to fmincon via the HessFcn option in optimset.
The same holds for the gradients: the GradConstr (for your non-linear constraint functions) and/or GradObj (for your objective function) options apply here. A SciPy-flavoured sketch of the same idea follows below.
There are probably a few options I forgot here that could also help you; just go through all the options in the Optimization Toolbox's optimset and see whether they could help.
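For what it's worth, here is the same idea sketched in SciPy terms rather than fmincon itself (the quadratic test function is made up; the point is that passing an analytic gradient via jac and a sparse Hessian via hess spares the solver dense finite differencing, much like GradObj and HessPattern/HessFcn do):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.sparse import diags

    n = 5_000
    d = np.linspace(1.0, 10.0, n)          # diagonal of the (sparse) Hessian
    b = np.ones(n)

    def f(x):
        return 0.5 * x @ (d * x) - b @ x   # minimiser: x* = b / d

    def grad(x):
        return d * x - b                   # analytic gradient (GradObj analogue)

    def hess(x):
        return diags(d)                    # sparse Hessian (HessPattern/HessFcn analogue)

    res = minimize(f, x0=np.zeros(n), method="trust-constr",
                   jac=grad, hess=hess, bounds=[(-2.0, 2.0)] * n)
    print(res.fun)                         # -0.5 * sum(b**2 / d) at the optimum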
If all this doesn't help, you'll really have to switch optimizers. Given that fmincon is the pride and joy of MATLAB's optimization toolbox, there really isn't anything much better readily available, and you'll have to search elsewhere.
TOMLAB is a very good commercial solution for MATLAB. If you don't mind going to C or C++, there's SNOPT (which is what TOMLAB/SNOPT is based on), and there are a bunch of things you could try in the GSL (although I haven't seen anything quite as advanced as SNOPT in there).
I don't know what version of MATLAB you have, but I know for a fact that in R2009b (and possibly also later) fmincon has a few real weaknesses for certain types of problems. I know this very well, because I once lost a very prestigious competition (the GTOC) because of it. Our approach turned out to be exactly the same as that of the winners, except that they had access to SNOPT, which made their few-million-variable optimization problem converge in a couple of iterations, whereas fmincon could not be brought to converge at all, whatever we tried (and trust me, WE TRIED). To this day I still don't know exactly why this happens, but I verified it myself when I had access to SNOPT. One day, when I have an infinite amount of time, I'll figure this out and report it to The MathWorks. But until then... I've lost a bit of trust in fmincon :)