I have an unknown function, say, F(x), which I use a back-propagation neural network to approximate. Surely this can be done, as it is in the standard repertoire of neural networks.
F(x) does not explicitly exist. It is learned from a training set of data points.
Say, the NN learns a function G(x) which approximates F(x).
AFTER the learning of G is finished, I want to find the global maximum value of G(x), and the value of x at which it occurs.
Given that G is implicitly realized by the NN, I don't have the explicit form of G.
Is there any quick algorithm that allows me to find arg-max(x) of G(x) ?
Neural networks can give rise to discontinuous functions if the neurons use hard-threshold activations (a neuron that fires only above a certain threshold introduces a jump discontinuity). But if in your application it makes sense to think of G(x) as (approximately) continuous or even differentiable, you could use hill-climbing techniques: start at a random point, estimate the derivative (or gradient, if x is a vector rather than a scalar), move a short step in the direction of steepest increase, and repeat until no more improvement is found. This gives you an approximate local maximum. You can repeat the process with different random starting values. If you always get the same result, you can be reasonably confident (though not certain) that it is in fact the global maximum.
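A minimal MATLAB sketch of that idea, assuming the trained network is wrapped in a function handle G (e.g. G = @(x) myTrainedNet(x), a name invented here) that takes a dim-by-1 vector and returns a scalar:

% Minimal sketch (not the poster's code): numerical-gradient hill climbing with
% random restarts on a black-box function G. G is assumed to be a function
% handle wrapping the trained network, taking a dim-by-1 vector and returning a scalar.
function [bestX, bestVal] = argmaxG(G, dim, nRestarts, stepSize, nSteps)
    bestVal = -Inf;
    bestX   = zeros(dim, 1);
    h = 1e-5;                                   % finite-difference step
    for r = 1:nRestarts
        x = randn(dim, 1);                      % random starting point
        for k = 1:nSteps
            g = zeros(dim, 1);
            for i = 1:dim                       % central-difference gradient estimate
                e = zeros(dim, 1);  e(i) = h;
                g(i) = (G(x + e) - G(x - e)) / (2*h);
            end
            x = x + stepSize * g;               % move uphill along the estimated gradient
        end
        if G(x) > bestVal                       % keep the best restart
            bestVal = G(x);
            bestX   = x;
        end
    end
end

Something like [xStar, gMax] = argmaxG(G, 10, 20, 0.05, 500) would run 20 restarts of 500 uphill steps each; all of the parameter values here are purely illustrative.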
Without any assumptions on G(x) it is hard to say anything definite. If x is chosen randomly then G(x) is a random variable. You can use statistical methods to estimate e.g. its 99th percentile. You could also try using an evolutionary algorithm in which G(x) plays the role of a fitness function.
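As a rough illustration of the sampling idea (reusing the hypothetical G and dim from the sketch above):

% Sketch: Monte Carlo estimate of the 99th percentile of G(x) under randomly drawn inputs.
N    = 100000;
vals = zeros(N, 1);
for n = 1:N
    vals(n) = G(randn(dim, 1));    % sample x from an input distribution of your choice
end
vals = sort(vals);
p99  = vals(ceil(0.99 * N));       % empirical 99th percentile of G(x)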
I understand that if a number gets closer to zero than realmin, then MATLAB converts the double to a denormal. I am noticing that this causes a significant performance cost. In particular, I am using a gradient descent algorithm where, near convergence, the gradients (in backprop for my bespoke neural network) drop below realmin, so the algorithm incurs a heavy performance cost (due to, I am assuming, type conversion behind the scenes). I have used the following code to validate my gradient matrices so that no number falls below realmin:
function mat = validateSmallDoubles(obj, mat, threshold)
    % Flush entries whose magnitude is below threshold to exactly zero,
    % so that later arithmetic never operates on denormal values.
    mat = mat .* (abs(mat) > threshold);
end
Is this usual practice, and what value should threshold take? (Obviously you want it as close to realmin as possible, but not too close, otherwise any additional division operations will send some elements of mat below realmin after validation.) Also, specifically for neural networks, where are the best places to do gradient validation without ruining the network's ability to learn? I would be grateful to know what solutions people with experience in training neural networks have found. I am sure this is a problem in all languages. Tentative threshold values have ruined my network's learning.
I do not know if it is related to your problem, but I had a similar problem with underflows while computing an exponentially weighted average of gradients (say, while implementing Momentum or Adam).
In particular, at some point you do something like:
v := 0.9*v + 0.1*gradient, where v is the exponentially weighted average of your gradient g. If the same element of your g matrix remains 0 over many successive iterations, your v quickly becomes very small and you hit denormals.
So the question is: why all those zeros? In my case the culprit was the ReLU units, which output a lot of zeros (if x < 0, relu(x) is zero). When ReLU outputs zero on a given neuron, the related weight has no effect, which means the corresponding partial derivative will be zero in g. So it happened to me that over many successive iterations that particular neuron was not fired.
To avoid having zero activations (and derivatives), I used a "leaky ReLU", so as to have a very small derivative instead.
Another solution is to use gradient clipping before applying your weighted average, to threshold your gradients to a minimum magnitude, which is quite similar to what you did.
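For what it's worth, a small MATLAB sketch of these two workarounds; v and g are assumed to be existing, same-sized moving-average and gradient matrices, and the leaky-ReLU slope of 0.01 is just a typical choice:

% Illustrative sketch (variable names are mine, not from the posts above).
beta           = 0.9;
flushThreshold = 10 * realmin;                  % keep well clear of the denormal range

v = beta * v + (1 - beta) * g;                  % exponentially weighted average of the gradient
v(abs(v) < flushThreshold) = 0;                 % flush tiny entries so they never go denormal

leakyRelu     = @(z) max(0.01 * z, z);          % small slope instead of a hard zero for z < 0
leakyReluGrad = @(z) (z > 0) + 0.01 * (z <= 0); % derivative is never exactly zero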
I traced the diminishing-gradient occurrences to the Adam SGD optimiser: the biased moving-average matrix calculations in the Adam optimiser were causing MATLAB to carry out the denormal operations. I simply thresholded the matrix elements for each layer after these calculations to zero, with threshold = 10*realmin, without any effect on learning. I have yet to investigate why my moving averages were getting so close to zero, as my architecture and weight-initialisation priors would normally mitigate this.
What is the concept behind taking the derivative? It's interesting that for somehow teaching a system, we have to adjust its weights. But why are we doing this using a derivative of the transfer function? What is it about the derivative that helps us? I know the derivative is the slope of a continuous function at a given point, but what does it have to do with the problem?
You must already know that the cost function is a function with the weights as the variables.
For now consider it as f(W).
Our main motive here is to find a W for which we get the minimum value for f(W).
One of the ways of doing this is to plot the function f on one axis and W on another... but remember that here W is not just a single variable but a collection of variables.
So what can be the other way?
It can be as simple as changing the values of W and seeing whether we get a lower value of f(W) than before.
But taking random values for all the variables in W can be a tedious task.
So what we do is: we first take random values for W, and we compute the output f(W) and the slope with respect to each variable (we get this by partially differentiating the function with respect to the i-th variable and plugging in the current value of the i-th variable).
Now, once we know the slope at that point in the space, we move a little further toward the lower side of the slope (this little factor is termed alpha in gradient descent), and this goes on until the slope changes sign, indicating that we have already reached the lowest point in the graph (a graph with n dimensions, the function versus W, W being a collection of n variables).
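A minimal MATLAB sketch of this procedure, assuming f is a function handle that takes a weight vector W and returns the scalar cost f(W) (the names and constants below are illustrative):

% Gradient descent with numerically estimated partial derivatives.
alpha = 0.1;                          % the "little factor" (learning rate)
h     = 1e-6;                         % finite-difference step
W     = randn(5, 1);                  % start from random values of W
for iter = 1:1000
    grad = zeros(size(W));
    for i = 1:numel(W)                % slope with respect to the i-th variable
        e = zeros(size(W));  e(i) = h;
        grad(i) = (f(W + e) - f(W - e)) / (2*h);
    end
    W = W - alpha * grad;             % move a little toward the lower side of the slope
end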
The reason is that we are trying to minimize the loss. Specifically, we do this by a gradient descent method. It basically means that from our current point in the parameter space (determined by the complete set of current weights), we want to go in a direction which will decrease the loss function. Visualize standing on a hillside and walking down the direction where the slope is steepest.
Mathematically, the direction that gives you the steepest descent from your current point in parameter space is the negative gradient. And the gradient is nothing but the vector made up of all the derivatives of the loss function with respect to each single parameter.
Backpropagation is an application of the Chain Rule to neural networks. If the forward pass involves applying a transfer function, the gradient of the loss function with respect to the weights will include the derivative of the transfer function, since the derivative of f(g(x)) is f’(g(x))g’(x).
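As a concrete (illustrative) instance: for a single weight w feeding a neuron with pre-activation z = w*x and transfer function f, with the loss L computed from the output f(z), the Chain Rule gives
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial f(z)} \cdot f'(z) \cdot \frac{\partial z}{\partial w} = \frac{\partial L}{\partial f(z)} \cdot f'(z) \cdot x,$$
which is exactly where the derivative of the transfer function enters the weight update.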
Your question is a really good one! Why should I move the weight more in one direction when the slope of the error w.r.t. the weight is high? Does that really make sense? In fact it does make sense if the error function w.r.t. the weight is a parabola. However, it is a wild guess to assume it is a parabola. As rcpinto says, assuming the error function is a parabola makes the derivation of the updates simple with the Chain Rule.
However, there are some other parameter update rules that actually address this non-intuitive assumption. You can make an update rule that moves the weight a fixed-size step in the down-slope direction, and then perhaps later in the training decreases the step size logarithmically as you train. (I'm not sure if this method has a formal name.)
There are also some alternative error functions that can be used. Look up Cross Entropy in your neural network textbook. This is an adjustment to the error function such that the derivative (of the transfer function) factor in the update rule cancels out. Just remember to pick the right cross-entropy function based on your output transfer function.
When I first started getting into Neural Nets, I had this question too.
The other answers here have explained the math which makes it pretty clear that a derivative term will appear in your calculations while you are trying to update the weights. But all of those calculations are being done in order to implement Back-propagation, which is just one of the ways of updating weights! Now read on...
You are correct in assuming that at the end of the day, all a neural network tries to do is update its weights to fit the data you feed into it. Within this statement lies your answer too. What you are getting confused with here is the idea of the Back-propagation algorithm. Many textbooks use backprop to update neural nets by default but do not mention that there are other ways to update weights too. This leads to the confusion that neural nets and backprop are the same thing and are inherently connected. This also leads to the false belief that neural nets need backprop to train.
Please remember that Back-propagation is just ONE of the ways out there to train your neural network (although it is the most famous one). Now, you must have seen the math involved in backprop, and hence you can see where the derivative term comes in from (some other answers have also explained that). It is possible that other training methods won't need the derivatives, although most of them do. Read on to find out why....
Think about this intuitively: we are talking about CHANGING weights, and the mathematical operation most directly related to change is the derivative, so it makes sense that you need to evaluate derivatives in order to change the weights.
Do let me know if you are still confused and I'll try to modify my answer to make it better. Just as a parting piece of information, another common misconception is that gradient descent is a part of backprop, just like it is assumed that backprop is a part of neural nets. Gradient descent is just one way to minimize your cost function, there are plenty of others you can use. One of the answers above makes this wrong assumption too when it says "Specifically Gradient Descent". This is factually incorrect. :)
Training a neural network means minimizing an associated "error" function w.r.t. the network's weights. Now there are optimization methods that use only function values (the simplex method of Nelder and Mead, Hooke and Jeeves, etc.), methods that in addition use first derivatives (steepest descent, quasi-Newton, conjugate gradient), and Newton methods that use second derivatives as well. So if you want to use a derivative method, you have to calculate the derivatives of the error function, which in turn involves the derivatives of the transfer or activation function.
Back propagation is just a nice algorithm to calculate the derivatives, and nothing more.
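To make the "function values only" option concrete, here is a sketch using MATLAB's built-in Nelder-Mead simplex search; errFun and numWeights are assumed to exist already (errFun mapping a weight vector to the network's scalar error on the training set):

% Derivative-free training sketch: simplex search instead of backprop.
W0   = randn(numWeights, 1);                               % random initial weight vector
Wopt = fminsearch(@(W) errFun(W), W0, optimset('MaxFunEvals', 1e5));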
Yes, the question is really good; it also came to my mind while I was trying to understand backpropagation. After doing forward propagation on the neural network, we do backpropagation to minimize the total error. There are also many other ways to minimize the error. Your question is: why do we take the derivative in backpropagation? The reason is that, as we all know, the derivative gives the slope of a function, or in other words, the change of one particular thing with respect to another. So here we take derivatives of the total error with respect to the corresponding weights of the network.
By taking the derivative of the total error with respect to a weight, we find its slope, or in other words, how the total error changes with a small change in that weight, so that we can update the weight to minimize the error using the gradient descent formula: Weight = weight - Alpha*(del(Total error)/del(weight)). Or in other words: New Weights = Old Weights - learning-rate x Partial derivatives of loss function w.r.t. parameters.
Here Alpha is the learning rate, which controls the weight update. If the derivative is negative, the minus sign in the formula turns the update positive, and if it is positive, the update is negative, so the weight always moves in the direction that reduces the total error. Also, because the derivative is multiplied by Alpha, the effective step shrinks as the weight converges to its optimal value (minimum error). That is why we take the derivative in order to minimize the error.
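A made-up numerical illustration of the sign behaviour: if Alpha = 0.1 and, for one particular weight, del(Total error)/del(weight) = -2, then
$$w_{\text{new}} = w - 0.1 \times (-2) = w + 0.2,$$
so the weight increases, which is the direction that lowers the error; with a positive derivative the same formula decreases the weight instead.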
When using the sigmoid activation function I understand that the derivative is calculated by output*(1-output). But how is this determined? How do I get from the sigmoid function 1/(1+e^(-x)) to determining that the derivative should be output*(1-output)?
For example, if I want to determine the derivative of atan(x), or of atan(x) with its output scaled to the range 0-1 (atan(x)*0.3183098861837907 + 0.5), how do I determine this derivative for use in training the neural net?
Well it seems to me like this is more of a maths related question than a coding one, but here you go anyways.
For the sigmoid function:
$$f(x) = \frac{1}{1+e^{-x}}$$
where
$$1 - f(x) = \frac{e^{-x}}{1+e^{-x}}$$
If you compute its derivative:
$$f'(x) = \frac{e^{-x}}{(1+e^{-x})^2}$$
and
$$f'(x) = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}$$
Thus:
$$f'(x) = f(x)\,(1 - f(x))$$
Remember, x is the input, and f is the output. Which is why you get your "output*(1-output)"
For other activation functions, you'll just have to compute the derivative first and then code it. Usually though, it won't have a nice form like the one above.
For the other part of your question, what you have is something of this form:
$$g(x) = a\,u(x) + b$$
If you compute its derivative (and this will work for any function u(x) that is scaled and offset), you get:
$$g'(x) = a\,u'(x)$$
Put simply, the b part is a constant so it disappears when derived, and the a is a constant coefficient so it remains unchanged when derived.
In your case, since
$$u(x) = \arctan(x), \qquad a = 0.3183098861837907 = \tfrac{1}{\pi}, \qquad b = 0.5, \qquad u'(x) = \frac{1}{1+x^2},$$
the derivative you're looking for is:
$$g'(x) = \frac{1}{\pi\,(1+x^2)}$$
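As a small illustration, here is how both derivatives might be coded in MATLAB for use in training (the sigmoid derivative written in terms of the sigmoid's own output, as in the question):

% Sketch: activation functions and their derivatives as anonymous functions.
sigmoid        = @(x) 1 ./ (1 + exp(-x));
sigmoidGrad    = @(out) out .* (1 - out);                   % expects out = sigmoid(x)

scaledAtan     = @(x) atan(x) * 0.3183098861837907 + 0.5;   % i.e. atan(x)/pi + 0.5
scaledAtanGrad = @(x) 1 ./ (pi * (1 + x.^2));               % note: a function of the input x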
On a personal note, this is pretty simple maths and I would strongly suggest you focus on understanding these before you start using neural networks ;)
Cheers
Could you please suggest some cost functions that I can use for prediction with neural-network backpropagation?
I have a prediction task to train with backpropagation, but I don't know whether I can use just any cost function.
Are cost functions dependent on the activation function that we use?
If I use sin(x) as the activation function, what should the cost function be?
1) In the Machine Learning course on Coursera, Andrew Ng provides this cost function: J(theta) = -(1/m) * sum( y.*log(h(X*theta)) + (1-y).*log(1-h(X*theta)) ), with the sum running over the m training examples (a short code sketch of this cost appears at the end of this answer),
where log is a natural logarithm, h(X*theta) is output from NN. Apparently, this comes from Kullback–Leibler divergence and Cross-entropy.
2) The general idea is that your cost function should be convex, so that the classifier's optimization routine finds the single global minimum. If you use the above equation for classification, with your y values being 0 or 1, this function will be convex.
As a side note, the cost function J(theta) = 1/2*( h(x) - y )^2 is not convex when you use NN for classification.
3) Yes, especially for the gradient calculation, whose form differs depending on the activation function.
4) First of all, why would you reinvent the wheel if it's already invented?
It's a tricky question, but I can give some clues. Ideally, your cost function has to give 0 when the output is exactly what it is supposed to be:
h(x) = 1 given x, when y is 1
h(x) = 0 given x, when y is 0
Otherwise the cost function has to give a very large value, e.g. h(x) = 0 when y = 1, etc.
The cost function has to be convex.
Taking these features into account, you would have to work out the cost function yourself. Besides, it looks like if you used sin(x), you would probably have to limit your argument X*theta to -pi ... pi.
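As mentioned in point 1), a minimal MATLAB sketch of that cost, assuming h is the m-by-1 vector of network outputs in (0,1) and y the m-by-1 vector of 0/1 labels:

% Cross-entropy cost averaged over the m training examples.
m = numel(y);
J = -(1/m) * sum( y .* log(h) + (1 - y) .* log(1 - h) );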
Actually, ZikO, you should check the Coursera materials again. The first cost function is still non-convex. Perhaps you are confusing neural networks with logistic regression. In the latter, using the entropic cost function does indeed make the problem convex, whereas ordinary least squares does not. In neural networks, both cost functions are non-convex. According to Prof. Ng, however, this is not a big problem for neural networks.
I'm looking for a function that generates significant errors in numerical integration using Gaussian quadrature or Simpson quadrature.
Since Simpson's and Gaussian quadrature methods try to fit a supposedly smooth function with pieces of simple smooth functions, such as 2nd-order polynomials, and otherwise make use of low-order polynomials and other simple algebraic functions such as $$a+5/6$$, it makes sense that the biggest challenges would be functions that are not 2nd-order polynomials and do not resemble those simple functions.
Step functions, or more generally functions that are constant for short runs then jump to another value. A staircase, or the Walsh functions (used for a kind of binary Fourier transform) should be interesting. Just a plain simple single step does not fit any polynomial approximation very well.
Try a high-order polynomial. Just x^n for a large n should be interesting. Maybe subtract x^n - x^(n-1) for some large n. How large is "large"? For Simpson, perhaps 4 or more. For Gaussian using k points, n>k. (Don't go nuts trying n beyond modest two digit numbers; that just becomes nasty calculation apart from any integration.)
Few numerical integration methods cope well with poles, that is, with functions resembling 1/(x-a) in some neighborhood of a. Since it may be troublesome to deal with an actual infinity, try pushing the pole off the real line into a complex-conjugate pair. Make a big but finite spike using 1/((x-a)^2 + b), where b > 0 is small. Or try the square root of that expression, or its sine or exponential. You could replace the "2" with a bigger power; I bet that'll be nasty.
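A small MATLAB sketch of this kind of torture test: a fixed-panel composite Simpson's rule versus the exact integral of the spike 1/((x-a)^2 + b) on [0, 1], with a and b chosen here purely for illustration:

% Composite Simpson's rule on a spike that is much narrower than the panel width.
a = 0.3;  b = 1e-6;
f = @(x) 1 ./ ((x - a).^2 + b);

n = 100;                                % number of panels (must be even)
x = linspace(0, 1, n + 1);
w = ones(1, n + 1);
w(2:2:n)   = 4;                         % Simpson weights 1,4,2,4,...,4,1
w(3:2:n-1) = 2;
simpson = (1/n) / 3 * sum(w .* f(x));

% Exact value from the antiderivative atan((x-a)/sqrt(b)) / sqrt(b).
exact    = ( atan((1 - a)/sqrt(b)) - atan((0 - a)/sqrt(b)) ) / sqrt(b);
relError = abs(simpson - exact) / abs(exact);

With b = 1e-6 the spike is far narrower than the panel width, so the relative error is enormous; shrinking b makes it worse.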
Once upon a time I wanted to test a numerical integration routine. I started with a stairstep function, or train of rectangular pulses, sampled on some set of points.
I computed an approximate derivative using a Savitzky-Golay filter. SG can differentiate numerical data using a finite window of neighboring points, though normally it's used for smoothing. It takes a window size (number of points), polynomial order (2 or 4 in practice, but you may want to go nuts with higher), and differentiation order (normally 0 to smooth, 1 to get derivatives).
The result was a series of pulses, which I then integrated. A good routine will recreate the original stairstep or rectangular pulses. I imagine if the SG parameters are chosen right, you will make Simpson and Gauss roll over in their graves.
If you are looking for a difficult function to integrate as a test case, you could consider the one in this CS Stack Exchange question:
Method for numerical integration of difficult oscillatory integral
In that question, one of the answers suggests using the chebfun library for Matlab, which contains an implementation of a basic Levin-type method. This suggests to me that the integration would fail with a simpler method such as Simpson's rule.