derivative of a function fitted using a neural network - neural-network

This question is quite general and mathematical.
I'm using a neural network with a single hidden layer (10 neurons) to approximate a function with 2 input and 2 output variables.
The sigmoid function is used as activation function.
Later, I'm using the derivative for something else. I approximate the derivative with a numerical method. This method uses outputs of the neural network.
I was wondering if the derivative of the approximated function can be obtained easily as the learning algorithm uses derivatives for every neuron.
I could get rid of the numerical method for the derivation if there was a simple way to get the derivative out of the neural net.
I'm thinking of the chain rule, but I'm not sure if this is the right way to go and how to use it correctly.

Related

Can a single input single output neural network with y=x as activation function reflect non-linear behavior?

I am currently learning a little bit about neural networks. One question I can't really get behind is about how neural networks reflect non-linear behavior. From my understanding there is no possibility to reflect non-linear behavior inside a compact set using a neural network.
For example if I would take the function from this question:
y = x^2
and I would use a neural network with a single input and single output the best the neural network could do for each compact set [x0...xn] is a linear function spanning from one end of the set to the other, as at the end all calculations inside the net are linear.
Do I have some misunderstanding about this concept?
The ANN's capabilties to model non-linear behaviour arise from the (usually) non-linear activation function.
If the activation function is linear, then the process of training the network is just another way to create a linear (or multi-linear) fit of input and output data.
Activation function in neural networks is exactly the part, that brings non-linearity. If you use linear activation function, then you cannot train non-linear model (thus fit quadratic or other non-linear functions).
The part, I guess, you are interested in is Universal Approximation Theorem, which says that any continuous function can be approximated with a neural network with a single hidden layer (some assumptions on activation function are applied thou). Take into account, that this theorem does not say anything on optimization of such a network (it does not guarantee you can train such a network with a specific algorithm, but only that such a network exists). Also it does not say anything on the number of neurons you should use.
You can check following links, to get more details:
Original proof with sigmoid activation function: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.441.7873&rep=rep1&type=pdf
And a more friendly derivation: http://mcneela.github.io/machine_learning/2017/03/21/Universal-Approximation-Theorem.html

Activation function for output layer for regression models in Neural Networks

I have been experimenting with neural networks these days. I have come across a general question regarding the activation function to use. This might be a well known fact to but I couldn't understand properly. A lot of the examples and papers I have seen are working on classification problems and they either use sigmoid (in binary case) or softmax (in multi-class case) as the activation function in the out put layer and it makes sense. But I haven't seen any activation function used in the output layer of a regression model.
So my question is that is it by choice we don't use any activation function in the output layer of a regression model as we don't want the activation function to limit or put restrictions on the value. The output value can be any number and as big as thousands so the activation function like sigmoid to tanh won't make sense. Or is there any other reason? Or we actually can use some activation function which are made for these kind of problems?
for linear regression type of problem, you can simply create the Output layer without any activation function as we are interested in numerical values without any transformation.
more info :
https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
for classification :
You can use sigmoid, tanh, Softmax etc.
If you have, say, a Sigmoid as an activation function in output layer of your NN you will never get any value less than 0 and greater than 1.
Basically if the data your're trying to predict are distributed within that range you might approach with a Sigmoid function and test if your prediction performs well on your training set.
Even more general, when predict a data you should come up with the function that represents your data in the most effective way.
Hence if your real data does not fit Sigmoid function well you have to think of any other function (e.g. some polynomial function, or periodic function or any other or a combination of them) but you also should always care of how easily you will build your cost function and evaluate derivatives.
Just use a linear activation function without limiting the output value range unless you have some reasonable assumption about it.

Step function versus Sigmoid function

I don't quite understand why a sigmoid function is seen as more useful (for neural networks) than a step function... hoping someone can explain this for me. Thanks in advance.
The (Heaviside) step function is typically only useful within single-layer perceptrons, an early type of neural networks that can be used for classification in cases where the input data is linearly separable.
However, multi-layer neural networks or multi-layer perceptrons are of more interest because they are general function approximators and they are able to distinguish data that is not linearly separable.
Multi-layer perceptrons are trained using backpropapagation. A requirement for backpropagation is a differentiable activation function. That's because backpropagation uses gradient descent on this function to update the network weights.
The Heaviside step function is non-differentiable at x = 0 and its derivative is 0 elsewhere. This means gradient descent won't be able to make progress in updating the weights and backpropagation will fail.
The sigmoid or logistic function does not have this shortcoming and this explains its usefulness as an activation function within the field of neural networks.
It depends on the problem you are dealing with. In case of simple binary classification, a step function is appropriate. Sigmoids can be useful when building more biologically realistic networks by introducing noise or uncertainty. Another but compeletely different use of sigmoids is for numerical continuation, i.e. when doing bifurcation analysis with respect to some parameter in the model. Numerical continuation is easier with smooth systems (and very tricky with non-smooth ones).

How to calculate the derivate of activation functions for artificial neural nets

When using the sigmoid activation function I understand that the derivative is calculated by output*(1-output). But how is this determined? How do I get from the sigmoid function 1/(1+e^(-x)) to determining that the derivative should be output*(1-output)?
For example if I want to determine the derivative of atan(x) or atan(x) with output scaled to the range 0-1 (atan(x)*0.3183098861837907+0.5), how do I determine this derivative for use in training the neural net?
Well it seems to me like this is more of a maths related question than a coding one, but here you go anyways.
For the sigmoid function:
where
If you compute its derivative:
and
Thus:
Remember, x is the input, and f is the output. Which is why you get your "output*(1-output)"
For other activation functions, you'll just have to compute the derivative first and then code it. Usually though, it won't have a nice form like the one above.
For the other part of your question, what you have is something of this form:
If you compute its derivative (and this will work for any function u(x) that is scaled and offset), you get:
Put simply, the b part is a constant so it disappears when derived and the a is a constant coefficient so it remains unchanged when derived.
In your case, since:
the derivative you're looking for is:
On a personal note, this is pretty simple maths and I would strongly suggest you focus on understanding these before you start using neural networks ;)
Cheers

Does it make sense to use an "activation function cocktail" for approximating an unknown function through a feed-forward neural network?

I just started playing around with neural networks and, as I would expect, in order to train a neural network effectively there must be some relation between the function to approximate and activation function.
For instance, I had good results using sin(x) as an activation function when approximating cos(x), or two tanh(x) to approximate a gaussian. Now, to approximate a function about which I know nothing I am planning to use a cocktail of activation functions, for instance a hidden layer with some sin, some tanh and a logistic function. In your opinion does this make sens?
Thank you,
Tunnuz
While it is true that different activation functions have different merits (mainly for either biological plausibility or a unique network design like radial basis function networks), in general you be able to use any continuous squashing function and expect to be able to approximate most functions encountered in real world training data.
The two most popular choices are the hyperbolic tangent and the logistic function, since they both have easily calculable derivatives and interesting behavior around the axis.
If neither if those allows you to accurately approximate your function, my first response wouldn't be to change activation functions. Rather, you should first investigate your training set and network training parameters (learning rates, number of units in each pool, weight decay, momentum, etc.).
If your still stuck, step back and make sure your using the right architecture (feed forward vs. simple recurrent vs. full recurrent) and learning algorithm (back-propagation vs. back-prop through time vs. contrastive hebbian vs. evolutionary/global methods).
One side note: Make sure you never use a linear activation function (except for output layers or crazy simple tasks), as these have very well documented limitations, namely the need for linear separability.