MSE Cost Function for Training Neural Network - neural-network

In an online textbook on neural networks and deep learning, the author illustrates neural net basics in terms of minimizing a quadratic cost function which he says is synonymous with mean squared error. Two things have me confused about his function, though (pseudocode below).
MSE≡(1/2n)*∑‖y_true-y_pred‖^2
Instead of dividing the sum of squared errors by the number of training examples n why is it instead divided by 2n? How is this the mean of anything?
Why is double bar notation used instead of parentheses? This had me thinking there was some other calculation going on, such as of an L2-norm, that is not shown explicitly. I suspect this is not the case and that term is meant to express plain old sum of squared errors. Super confusing though.
Any insight you can offer is greatly appreciated!

The factor of 0.5 in front of the cost function is not important. You could multiply the cost by any positive constant and the learning would be the same, because the constant can be absorbed into the learning rate. It is only there so that the derivative of the cost function with respect to the output is simply $$y - y_{t}$$, which is convenient in some applications, such as backpropagation.
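To make the cancellation concrete, here is the one-line derivation for a single training example, with y the network output and y_t the target as above (the symbol C for the per-example cost is just notation assumed for this sketch):

$$C = \tfrac{1}{2}\lVert y - y_{t}\rVert^{2} \quad\Rightarrow\quad \frac{\partial C}{\partial y} = \tfrac{1}{2}\cdot 2\,(y - y_{t}) = y - y_{t}$$

Without the 1/2 the derivative would be 2(y - y_t): the same direction, only twice as long.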

The notation ∥v∥ just denotes the usual length function for a vector v (quoted from the online textbook you referenced).
Find more info on the double bars here. From what I understand, you can basically view it as the absolute value generalized to vectors.
I'm not sure why it says 2n, but it's not always 2n. Wikipedia, for example, writes the function as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$
Googling "mean squared error" also turns up a lot of sources that use the Wikipedia form instead of the one from the online textbook.

The double bar is a distance measure (the norm of the error vector), and plain parentheses would be incorrect if y is multi-dimensional.
For mean squared error there is no 2 next to the n, but it is unimportant: it will be absorbed by the learning rate.
However, it is often there to cancel the factor of 2 that comes down from the exponent when evaluating the derivative.
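A minimal sketch of both points in plain Python, with made-up numbers: the double bars are the Euclidean length of each error vector, and the extra 2 only rescales the cost. The names quadratic_cost and grad_w, the toy model pred = w * x, and the sample data are assumptions for illustration, not anything taken from the textbook.

```python
def quadratic_cost(y_true, y_pred, halve=True):
    """(1/2n) * sum ||y_true - y_pred||^2 if halve, else (1/n) * sum ||...||^2."""
    n = len(y_true)
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # ||t - p||^2: squared Euclidean length of one error vector
        total += sum((ti - pi) ** 2 for ti, pi in zip(t, p))
    return total / (2 * n if halve else n)


def grad_w(w, xs, ts, halve=True):
    """Derivative of the cost w.r.t. w for the toy scalar model pred = w * x."""
    n = len(xs)
    total = sum(2.0 * (w * x - t) * x for x, t in zip(xs, ts))
    return total / (2 * n if halve else n)


y_true = [[1.0, 0.0], [0.0, 1.0]]
y_pred = [[0.8, 0.1], [0.3, 0.6]]
half_cost = quadratic_cost(y_true, y_pred, halve=True)
mse = quadratic_cost(y_true, y_pred, halve=False)

# Same weight update from either cost once the learning rate is rescaled by 2
xs, ts, w, lr = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 0.5, 0.1
w_half = w - lr * grad_w(w, xs, ts, halve=True)
w_mse = w - (lr / 2.0) * grad_w(w, xs, ts, halve=False)
```

Here mse is exactly twice half_cost, and w_half and w_mse coincide, which is what "absorbed by the learning rate" means in practice.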

Why do we take the derivative of the transfer function in calculating back propagation algorithm?

What is the concept behind taking the derivative? It's interesting that to teach a system we have to adjust its weights. But why do we do this using the derivative of the transfer function? What is it about the derivative that helps us? I know the derivative is the slope of a continuous function at a given point, but what does it have to do with the problem?
You must already know that the cost function is a function with the weights as the variables.
For now consider it as f(W).
Our main motive here is to find a W for which we get the minimum value for f(W).
One way to do this is to plot f on one axis and W on another, but remember that here W is not a single variable but a collection of variables.
So what can be the other way?
It can be as simple as changing the values of W and seeing whether we get a lower value than with the previous W.
But trying random values for all the variables in W is a tedious task.
So instead we start with random values for W, evaluate f(W), and compute the slope with respect to each variable (we get this by partially differentiating the function with respect to the i-th variable and substituting the current values).
Once we know the slope at that point in space, we move a small step toward the downhill side (the step-size factor is called alpha in gradient descent). This goes on until the slope vanishes or changes sign, which indicates that we have reached a low point, i.e. a local minimum, of the graph (a graph in n dimensions, function vs. W, with W being a collection of n variables).
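The procedure described above can be sketched in a few lines of Python. Everything here is a hypothetical toy: the cost f, its minimum at (3, -1), the finite-difference slope estimate, and the values of alpha and steps are assumptions chosen only to show the loop.

```python
def f(w):
    # Hypothetical cost function with its minimum at w = (3, -1)
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def partial_derivative(func, w, i, h=1e-6):
    # Central-difference estimate of the slope along the i-th variable
    up = list(w)
    up[i] += h
    down = list(w)
    down[i] -= h
    return (func(up) - func(down)) / (2.0 * h)


def gradient_descent(func, w, alpha=0.1, steps=200):
    for _ in range(steps):
        # Slope with respect to every variable at the current point
        grad = [partial_derivative(func, w, i) for i in range(len(w))]
        # Move a little (alpha) toward the downhill side of each slope
        w = [wi - alpha * gi for wi, gi in zip(w, grad)]
    return w


w_min = gradient_descent(f, [0.0, 0.0])
```

A real network would get the partial derivatives analytically (that is what backpropagation does) rather than by finite differences, but the update loop has the same shape.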
The reason is that we are trying to minimize the loss. Specifically, we do this by a gradient descent method. It basically means that from our current point in the parameter space (determined by the complete set of current weights), we want to go in a direction which will decrease the loss function. Visualize standing on a hillside and walking down the direction where the slope is steepest.
Mathematically, the direction that gives you the steepest descent from your current point in parameter space is the negative gradient. And the gradient is nothing but the vector made up of all the derivatives of the loss function with respect to each single parameter.
Backpropagation is an application of the Chain Rule to neural networks. If the forward pass involves applying a transfer function, the gradient of the loss function with respect to the weights will include the derivative of the transfer function, since the derivative of f(g(x)) is f’(g(x))g’(x).
Your question is a really good one! Why should I move the weight more in one direction when the slope of the error with respect to the weight is high? Does that really make sense? In fact it does make sense if the error function with respect to the weight is a parabola. However, it is a wild guess to assume it is a parabola. As rcpinto says, assuming the error function is a parabola makes the derivation of the updates simple with the Chain Rule.
However, there are other parameter update rules that address this non-intuitive assumption. You can make an update rule that moves the weight a fixed-size step in the down-slope direction, and then perhaps decrease the step size logarithmically later in training. (I'm not sure if this method has a formal name.)
There are also some alternative error functions that can be used. Look up cross entropy in your neural network textbook. This is an adjustment to the error function such that the derivative (of the transfer function) factor in the update rule cancels out. Just remember to pick the right cross-entropy function based on your output transfer function.
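As a rough illustration of that cancellation, here is a sketch for a single sigmoid output unit. The function names and the sample values of z and target are made up for this example; the two formulas are the standard derivatives of squared error and binary cross entropy with respect to the unit's input z.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def delta_squared_error(z, target):
    # d/dz of 0.5 * (o - target)^2 with o = sigmoid(z):
    # the sigmoid derivative o * (1 - o) stays in the expression
    o = sigmoid(z)
    return (o - target) * o * (1.0 - o)


def delta_cross_entropy(z, target):
    # d/dz of -(target*log(o) + (1 - target)*log(1 - o)) with o = sigmoid(z):
    # the o * (1 - o) factor cancels, leaving only the output error
    return sigmoid(z) - target


z, target = -4.0, 1.0  # a confidently wrong output
d_sq = delta_squared_error(z, target)
d_ce = delta_cross_entropy(z, target)
```

For this saturated, wrong output the squared-error signal d_sq is tiny because o * (1 - o) is nearly zero, while the cross-entropy signal d_ce stays close to -1.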
When I first started getting into Neural Nets, I had this question too.
The other answers here have explained the math which makes it pretty clear that a derivative term will appear in your calculations while you are trying to update the weights. But all of those calculations are being done in order to implement Back-propagation, which is just one of the ways of updating weights! Now read on...
You are correct in assuming that at the end of the day, all a neural network tries to do is update its weights to fit the data you feed into it. Within this statement lies your answer too. What you are getting confused with here is the idea of the Back-propagation algorithm. Many textbooks use backprop to update neural nets by default but do not mention that there are other ways to update weights too. This leads to the confusion that neural nets and backprop are the same thing and are inherently connected. This also leads to the false belief that neural nets need backprop to train.
Please remember that Back-propagation is just ONE of the ways out there to train your neural network (although it is the most famous one). Now, you must have seen the math involved in backprop, and hence you can see where the derivative term comes in from (some other answers have also explained that). It is possible that other training methods won't need the derivatives, although most of them do. Read on to find out why....
Think about this intuitively: we are talking about CHANGING weights, and the mathematical operation most directly related to change is the derivative, so it makes sense that you need to evaluate derivatives to change weights.
Do let me know if you are still confused and I'll try to modify my answer to make it better. As a parting piece of information, another common misconception is that gradient descent is a part of backprop, just as it is assumed that backprop is a part of neural nets. Gradient descent is just one way to minimize your cost function; there are plenty of others you can use. One of the answers above makes this wrong assumption too when it says "Specifically Gradient Descent". This is factually incorrect. :)
Training a neural network means minimizing an associated "error" function with respect to the network's weights. Now there are optimization methods that use only function values (the simplex method of Nelder and Mead, Hooke and Jeeves, etc.), methods that in addition use first derivatives (steepest descent, quasi-Newton, conjugate gradient), and Newton methods that use second derivatives as well. So if you want to use a derivative method, you have to calculate the derivatives of the error function, which in turn involves the derivatives of the transfer or activation function.
Back propagation is just a nice algorithm to calculate the derivatives, and nothing more.
Yes, the question is a really good one; it also came to my mind while I was trying to understand backpropagation. After doing forward propagation through the network, we do backpropagation to minimize the total error (there are also many other ways to minimize the error). Your question is why we take derivatives in backpropagation. The reason is that, as we all know, the derivative gives the slope of a function, or in other words the change of one quantity with respect to another. So here we take the derivative of the total error with respect to the corresponding weights of the network.
By differentiating the total error with respect to a weight we find its slope, that is, how much the total error changes for a small change in that weight, so that we can update the weight to reduce the error with the gradient descent formula: weight = weight - Alpha*(del(Total error)/del(weight)). In other words: new weights = old weights - learning rate x partial derivatives of the loss function w.r.t. the parameters.
Here Alpha is the learning rate, which controls the size of the weight update. Because of the minus sign in the formula, a negative derivative moves the weight in the positive direction and a positive derivative moves it in the negative direction, so the update always goes toward lower total error. Also, since the derivative is multiplied by Alpha, the step shrinks as the weight converges to its optimal value (minimum error), where the derivative approaches zero. That's why we take derivatives to minimize the error.

SVM in Matlab: Meaning of Parameter 'box constraint' in function fitcsvm

I'm new to SVMs in Matlab and need a little bit of help with it.
I want to train a support vector machine using the built-in function fitcsvm of the Statistics Toolbox.
Of course there are many parameter choices which control how the SVM will be trained.
The Matlab help is a little bit vague about how the parameters achieve a better training result. In particular, the parameter 'Box Constraint' seems to have an important influence on the number of chosen support vectors and on generalization quality.
The Help (http://de.mathworks.com/help/stats/fitcsvm.html#bt8v_z4-1) says
A parameter that controls the maximum penalty imposed on margin-violating observations, and aids in preventing overfitting (regularization).
If you increase the box constraint, then the SVM classifier assigns fewer support vectors. However, increasing the box constraint can lead to longer training times.
How exactly is this parameter used?
Is it the same or something like the soft margin factor C in the Wikipedia reference?
Or something completely different?
Thanks for your help.
You were definitely on the right path. While the description in the documentation of fitcsvm (as you posted in the question) is very short, you should have a look at the Understanding Support Vector Machines page in the MATLAB documentation.
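In the non-separable case (often called Soft-Margin SVM), one allows misclassifications, at the cost of a penalty factor C. The mathematical formulation of the SVM then becomes:
$$\min_{w,\,b,\,s}\ \left(\tfrac{1}{2}\langle w, w\rangle + C\sum_i s_i\right) \quad \text{such that} \quad y_i\,(\langle w, x_i\rangle + b) \ge 1 - s_i, \quad s_i \ge 0$$
with the slack variables s_i, which cause a penalty term that is weighted by C.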
Making C large increases the weight of misclassifications, which leads to a stricter separation.
This factor C is called box constraint.
The reason for this name is that, in the formulation of the dual optimization problem, the Lagrange multipliers are bounded to be within the range [0,C].
C thus poses a box constraint on the Lagrange multipliers.
tl;dr your guess was right, it is the C in the soft margin SVM.
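To see where the box comes from, here is a sketch of the dual problem for the linear soft-margin SVM in its standard textbook form. The symbols alpha_i (Lagrange multipliers), x_i (training points) and y_i in {-1, +1} (labels) are notation assumed here and may not match the MATLAB documentation exactly:

$$\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j\, y_i y_j \langle x_i, x_j\rangle \quad \text{subject to} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$$

The only place C appears is the upper bound on each alpha_i, so every multiplier is confined to the box [0, C].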

about backpropagation and sigmoid function

I have been reading this ebook about ANNs: https://www4.rgu.ac.uk/files/chapter3%20-%20bp.pdf
and have a doubt about the effect of the sigmoid function when calculating ErrorB. The text says that if I have a threshold neuron I can use:
Target-Output
but because I have a sigmoid function involved I should add:
Output(1-Output)
and end up with:
ErrorB=OutputB(1-OutputB)(TargetB-OutputB)
I mean, why should I add the O(1-O) part? I have tried different values, but I really do not get the intuition for why it should be that way.
Any help?
Thanks
As Kelu stated, that part of the equation is based on derivatives of your transfer function (in this case sigmoid). To understand why you need derivatives, you need to understand how the delta rule works(*):
Your overall goal is to minimize the error in the network's output using gradient descent. Gradient descent tries to find a minimum of the error function (E) by taking steps proportional to the negative of the gradient. The gradient is simply the vector of partial derivatives, and the reason you work with derivatives mathematically is that the gradient points in the direction of the greatest rate of increase of the (error) function. Conclusion: since you want to minimize the error, you go in the direction opposite to the gradient.
This is the intuitive reason for using gradients. If you want the mathematical derivation, you should check this basic wiki article (additional comment as it's not mentioned anywhere: the g'(x) in the article is the first derivative of g(x))
Other transfer functions can be used, e.g. linear (in this case there is no g'(x) term as the derivative is simply a constant) or hyperbolic tangent in which case the derivative is something different again.
(*) Equation is derived from following equation where you start by minimizing the error of the output:
It is like that because Output(1-Output) is the derivative of the sigmoid function (simplified). In general, this part is based on derivatives; you can try functions other than the sigmoid, but then you have to use their derivatives too for the learning rule to be correct.
If you want, you can take a look at my implementation (it's far from perfect, but maybe you will get some idea from it ;)). It's a simple project I made at my university - https://github.com/kelostrada/neuron-network
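A quick numerical check of that claim, as a sketch in Python: the slope of the sigmoid at x, estimated by finite differences, matches Output(1-Output). The helper names and the sample values are made up for illustration; output_error is just the ErrorB formula from the question.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def output_error(output, target):
    # ErrorB = OutputB * (1 - OutputB) * (TargetB - OutputB)
    return output * (1.0 - output) * (target - output)


x = 0.5
out = sigmoid(x)

# Slope of the sigmoid at x, estimated numerically ...
h = 1e-6
numeric_slope = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
# ... and via the closed form written in terms of the output alone
analytic_slope = out * (1.0 - out)

error_b = output_error(0.5, 1.0)
```

The factor Output(1-Output) is largest when the output is 0.5 and shrinks toward 0 as the output saturates at 0 or 1, so it scales the raw error (Target-Output) by how sensitive the neuron currently is to its input.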

Naive bayes classifier calculation

I'm trying to use a naive Bayes classifier to classify my dataset. My questions are:
1- Usually when we try to calculate the likelihood we use the formula:
P(c|x)= P(c|x1) * P(c|x2)*...P(c|xn)*P(c) . But some examples say that, in order to avoid getting very small results, we use P(c|x)= exp(log(c|x1) + log(c|x2)+...log(c|xn) + logP(c)). Can anyone explain the difference between these two formulas? Are they both used to calculate the "likelihood", or is the second one used to calculate something called "information gain"?
2- In some cases when we try to classify our datasets some joint probabilities are null (zero). Some people use the "Laplace smoothing" technique in order to avoid null joint probabilities. Doesn't this technique influence the accuracy of our classification?
Thanks in advance for all your time. I'm new to this algorithm and trying to learn more about it, so are there any recommended papers I should read? Thanks a lot.
I'll take a stab at your first question, assuming you lost most of the P's in your second equation. I think the equation you are ultimately driving towards is:
log P(c|x) = log P(c|x1) + log P(c|x2) + ... + log P(c)
If so, the examples are pointing out that in many statistical calculations, it's often easier to work with the logarithm of a distribution function, as opposed to the distribution function itself.
Practically speaking, it's related to the fact that many statistical distributions involve an exponential function. For example, you can find where the maximum of a Gaussian distribution K*exp(-s_0*(x-x_0)^2) occurs by solving the mathematically less complex problem (if we're going through the whole formal process of taking derivatives and finding equation roots) of finding where the maximum of its logarithm log(K)-s_0*(x-x_0)^2 occurs.
This leads to many places where "take the logarithm of both sides" is a standard step in an optimization calculation.
Also, computationally, when you are optimizing likelihood functions that may involve many multiplicative terms, adding logarithms of small floating-point numbers is less likely to cause numerical problems than multiplying small floating point numbers together is.
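Here is a small sketch of that numerical point, with made-up numbers: two hypothetical classes, each with 100 per-feature probabilities and a prior of 0.5. Note that in the usual naive Bayes formulation the per-feature terms are P(xi|c) rather than P(c|xi); the underflow argument is the same either way.

```python
import math

# Hypothetical per-feature probabilities for two classes (100 features each)
probs_a = [1e-5] * 100
probs_b = [2e-5] * 100
prior = 0.5


def product_score(prior, probs):
    score = prior
    for p in probs:
        score *= p  # repeated multiplication of small numbers
    return score


def log_score(prior, probs):
    # Same quantity in log space: a sum instead of a product
    return math.log(prior) + sum(math.log(p) for p in probs)


prod_a = product_score(prior, probs_a)
prod_b = product_score(prior, probs_b)
log_a = log_score(prior, probs_a)
log_b = log_score(prior, probs_b)
```

Both products underflow to 0.0 in double precision, so the two classes can no longer be told apart, while the log scores stay finite and still rank class b above class a. Since the logarithm is monotonic, comparing log scores picks the same class as comparing the true products would.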

Looking for ODE integrator/solver with a relaxed attitude to derivative precision

I have a system of (first order) ODEs with fairly expensive to compute derivatives.
However, the derivatives can be computed considerably cheaper to within given error bounds, either because the derivatives are computed from a convergent series and bounds can be placed on the maximum contribution from dropped terms, or through use of precomputed range information stored in kd-tree/octree lookup tables.
Unfortunately, I haven't been able to find any general ODE solvers which can benefit from this; they all seem to just give you coordinates and want an exact result back. (Mind you, I'm no expert on ODEs; I'm familiar with Runge-Kutta, the material in the Numerical Recipes book, LSODE and the GNU Scientific Library's solver.)
I.e., for all the solvers I've seen, you provide a derivs callback function accepting a t and an array of xs and returning an array of dx/dt; ideally, though, I'm looking for one which gives the callback t, the xs, and an array of acceptable errors, and receives dx/dt_min and dx/dt_max arrays back, with the derivative range guaranteed to be within the required precision. (There are probably numerous equally useful variations possible.)
Any pointers to solvers which are designed with this sort of thing in mind, or alternative approaches to the problem (I can't believe I'm the first person wanting something like this) would be greatly appreciated.
Roughly speaking, if you know f' up to absolute error eps, and integrate from x0 to x1, the error of the integral coming from the error in the derivative is going to be <= eps*(x1 - x0). There is also discretization error, coming from your ODE solver. Consider how big eps*(x1 - x0) can be for you and feed the ODE solver with f' values computed with error <= eps.
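A minimal sketch of that bound in Python, under a simplifying assumption: the derivative depends only on t, so derivative errors simply add up along the way. For a general dx/dt = f(t, x) the errors can additionally be amplified or damped by the dynamics. The functions, eps, and step count below are made up for illustration.

```python
import math


def euler(deriv, x0, t0, t1, steps):
    # Fixed-step explicit Euler for dx/dt = deriv(t)
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        x += h * deriv(t)
        t += h
    return x


eps = 1e-3


def exact_deriv(t):
    return math.cos(t)


def cheap_deriv(t):
    # Hypothetical cheap approximation: off by at most eps in absolute value
    return math.cos(t) + eps * math.sin(37.0 * t)


t0, t1 = 0.0, 5.0
x_exact = euler(exact_deriv, 0.0, t0, t1, 1000)
x_cheap = euler(cheap_deriv, 0.0, t0, t1, 1000)

gap = abs(x_exact - x_cheap)  # error caused by the sloppy derivative
bound = eps * (t1 - t0)       # the eps*(x1 - x0) bound from above
```

Both runs share the same discretization error, so gap isolates the effect of the derivative error, and it stays below bound.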
I'm not sure this is a well-posed question.
In many algorithms, e.g. nonlinear equation solving, f(x) = 0, an estimate of a derivative f'(x) is all that's required for use in something like Newton's method, since you only need to go in the "general direction" of the answer.
However, in this case, the derivative is a primary part of the (ODE) equation you're solving - get the derivative wrong, and you'll just get the wrong answer; it's like trying to solve f(x) = 0 with only an approximation for f(x).
As another answer has suggested, if you set up your ODE with the right-hand side written as f(x) + g(x), where g(x) is an error term, you should be able to relate errors in your derivatives to errors in your inputs.
Having thought about this some more, it occurred to me that interval arithmetic is probably key. My derivs function basically returns intervals. An integrator using interval arithmetic would maintain the xs as intervals. All I'm interested in is obtaining a sufficiently small error bound on the xs at a final t. An obvious approach would be to iteratively re-integrate, improving the quality of the sample introducing the most error on each iteration until we finally get a result with acceptable bounds (although that sounds like it could be a "cure worse than the disease" with regard to overall efficiency). I suspect adaptive step size control could fit nicely into such a scheme, with the step size chosen to keep the "implicit" discretization error comparable with the "explicit" error (i.e. the interval range).
Anyway, googling "ode solver interval arithmetic" or just "interval ode" turns up a load of interesting new and relevant stuff (VNODE and its references in particular).
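To make the idea concrete, here is a deliberately naive sketch of an interval-valued Euler integrator for the toy problem dx/dt = -x. It is not how VNODE works and it is not rigorous: it ignores discretization error and floating-point rounding, and only propagates the derivative uncertainty. The callback signature derivs_interval and all parameter values are hypothetical.

```python
def derivs_interval(t, x_lo, x_hi, eps):
    # Hypothetical callback for dx/dt = -x: returns lower and upper bounds on
    # the derivative over the whole state interval [x_lo, x_hi], widened by
    # the evaluation tolerance eps
    return -x_hi - eps, -x_lo + eps


def interval_euler(x0, t0, t1, steps, eps):
    h = (t1 - t0) / steps
    lo = hi = x0
    t = t0
    for _ in range(steps):
        d_lo, d_hi = derivs_interval(t, lo, hi, eps)
        # Advance each end of the state interval with the matching bound
        lo, hi = lo + h * d_lo, hi + h * d_hi
        t += h
    return lo, hi


def plain_euler(x0, t0, t1, steps):
    # Reference run with the exact derivative, same step size
    h = (t1 - t0) / steps
    x = x0
    for _ in range(steps):
        x += h * (-x)
    return x


lo, hi = interval_euler(1.0, 0.0, 1.0, 100, eps=1e-4)
x_ref = plain_euler(1.0, 0.0, 1.0, 100)
width = hi - lo
```

The exact-derivative Euler result x_ref lands inside [lo, hi], and width is the bound one would compare against the acceptable error at the final t. Note that the interval keeps widening even though the underlying ODE is contracting, which is one reason naive interval integration tends to be pessimistic.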
If you have a stiff system, you will be using some form of implicit method in which case the derivatives are only used within the Newton iteration. Using an approximate Jacobian will cost you strict quadratic convergence on the Newton iterations, but that is often acceptable. Alternatively (mostly if the system is large) you can use a Jacobian-free Newton-Krylov method to solve the stages, in which case your approximate Jacobian becomes merely a preconditioner and you retain quadratic convergence in the Newton iteration.
Have you looked into using odeset? It allows you to set options for an ODE solver, then you pass the options structure as the fourth argument to whichever solver you call. The error control properties (RelTol, AbsTol, NormControl) may be of most interest to you. Not sure if this is exactly the sort of help you need, but it's the best suggestion I could come up with, having last used the MATLAB ODE functions years ago.
In addition: For the user-defined derivative function, could you just hard-code tolerances into the computation of the derivatives, or do you really need error limits to be passed from the solver?
Not sure I'm contributing much, but in the pharma modeling world, we use LSODE, DVERK, and DGPADM. DVERK is a nice, fast, simple order 5/6 Runge-Kutta solver. DGPADM is a good matrix-exponential solver. If your ODEs are linear, the matrix exponential is best by far. But your problem is a little different.
BTW, the T argument is only in there for generality. I've never seen an actual system that depended on T.
You may be breaking into new theoretical territory. Good luck!
Added: If you're doing orbital simulations, seems to me I heard of special methods used for that, based on conic-section curves.
Check into a finite element method with linear basis functions and midpoint quadrature. Solving the following ODE requires only one evaluation each of f(x), k(x), and b(x) per element:
-k(x)u''(x) + b(x)u'(x) = f(x)
The answer will have pointwise error proportional to the error in your evaluations.
If you need smoother results, you can use quadratic basis functions with 2 evaluations of each of the above functions per element.