In this post I read that ReLU should only be used in hidden layers. Why is that?
I have a neural network for a regression task. It outputs a number between 0 and 10. I thought ReLU would be a good choice here since it doesn't return numbers smaller than 0. What would be the best activation function for the output layer here?
Generally, you can still use activation functions in your output layer. I have frequently used a sigmoid activation function to squash my output into the 0-1 range, and that worked wonderfully.
One thing you should consider when using ReLUs is that they can produce dead neurons. That means that under certain circumstances your network can end up in regions where it won't update and the output is always 0.
Essentially, if a ReLU in your output layer dies (its input is negative, so it always outputs 0), you will have no gradient at all; see here for more details.
If you are careful during initialization, I don't see why it shouldn't work, though.
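A minimal sketch of the sigmoid-then-rescale idea for the 0-10 target, assuming tf.keras; the layer sizes, feature count, and the helper name build_model are illustrative, not from the original posts:

```python
import tensorflow as tf

def build_model(n_features=8):                    # n_features is a placeholder value
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(n_features,)),  # ReLU stays in the hidden layer
        tf.keras.layers.Dense(1, activation="sigmoid"),                           # squash the raw output into (0, 1)
        tf.keras.layers.Lambda(lambda x: 10.0 * x),                               # rescale to the (0, 10) target range
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```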
Related
I have been experimenting with neural networks these days. I have come across a general question regarding the activation function to use. This might be a well known fact to but I couldn't understand properly. A lot of the examples and papers I have seen are working on classification problems and they either use sigmoid (in binary case) or softmax (in multi-class case) as the activation function in the out put layer and it makes sense. But I haven't seen any activation function used in the output layer of a regression model.
So my question is that is it by choice we don't use any activation function in the output layer of a regression model as we don't want the activation function to limit or put restrictions on the value. The output value can be any number and as big as thousands so the activation function like sigmoid to tanh won't make sense. Or is there any other reason? Or we actually can use some activation function which are made for these kind of problems?
For a linear regression type of problem, you can simply create the output layer without any activation function, as we are interested in numerical values without any transformation (see the sketch below).
More info: https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
For classification:
You can use sigmoid, tanh, softmax, etc.
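A hedged sketch contrasting the common output heads, assuming tf.keras (the unit counts are illustrative):

```python
import tensorflow as tf

# Regression: no activation (i.e. linear) on the output, one unit per target value.
regression_head = tf.keras.layers.Dense(1)                      # activation=None -> identity

# Binary classification: sigmoid squashes the output into (0, 1).
binary_head = tf.keras.layers.Dense(1, activation="sigmoid")

# Multi-class classification: softmax turns the outputs into a probability distribution.
multiclass_head = tf.keras.layers.Dense(10, activation="softmax")
```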
If you have, say, a sigmoid as the activation function in the output layer of your NN, you will never get any value less than 0 or greater than 1.
Basically, if the data you're trying to predict is distributed within that range, you might approach it with a sigmoid function and test whether your prediction performs well on your training set.
More generally, when predicting data you should come up with the function that represents your data in the most effective way.
Hence, if your real data does not fit a sigmoid function well, you have to think of another function (e.g. some polynomial function, a periodic function, or a combination of them), but you should also always consider how easily you can build your cost function and evaluate its derivatives.
Just use a linear activation function without limiting the output value range unless you have some reasonable assumption about it.
This question already has answers here:
Why use softmax only in the output layer and not in hidden layers?
I have read the answer given here. My exact question pertains to the accepted answer:
Variables independence : a lot of regularization and effort is put to keep your variables independent, uncorrelated and quite sparse. If you use softmax layer as a hidden layer - then you will keep all your nodes (hidden variables) linearly dependent which may result in many problems and poor generalization.
What complications arise from forgoing variable independence in the hidden layers? Please provide at least one example. I know that hidden-variable independence helps a lot in codifying the backpropagation, but backpropagation can be codified for softmax as well (please verify whether I am correct in this claim; I seem to have gotten the equations right, hence the claim).
Training issue: try to imagine that, to make your network work better, you have to make a part of the activations from your hidden layer a little bit lower. Then, automatically, you are making the rest of them have a higher mean activation, which might in fact increase the error and harm your training phase.
I don't understand how you achieve that kind of flexibility even with a sigmoid hidden neuron, where you can fine-tune the activation of a particular neuron, which is precisely gradient descent's job. So why are we even worried about this issue? If you can implement the backprop, the rest will be taken care of by gradient descent. Fine-tuning the weights so as to make the activations proper is not something you would want to do by hand, even if you could (which you can't). (Kindly correct me if my understanding is wrong here.)
Mathematical issue: by creating constraints on the activations of your model you decrease its expressive power without any logical explanation. The strive for having all activations the same is not worth it in my opinion.
Kindly explain what is being said here
Batch normalization: I understand this; no issues here.
1/2. I don't think you have grasped what the author is trying to say. Imagine a layer with 3 nodes, 2 of which have an error responsibility of 0 with respect to the output error, so there is one node that should be adjusted. Because a softmax couples the nodes, if you want to improve the output of node 0 you immediately affect nodes 1 and 2 in that layer, possibly making the output even more wrong (see the numpy sketch after point 3).
Fine-tuning the weights so as to make the activations proper is not something you would want to do by hand, even if you could (which you can't). (Kindly correct me if my understanding is wrong here.)
That is the definition of backpropagation. That is exactly what you want. Neural networks rely on activations (which are non-linear) to map a function.
3. You're basically saying to every neuron: 'hey, your output cannot be higher than x, because some other neuron in this layer already has value y'. Because all neurons in a softmax layer must have a total activation of 1, no single neuron can exceed a certain value. For small layers that is a small problem, but for big layers it is a big problem. Imagine a layer with 100 neurons whose total output must be 1: their average value will be 0.01, so activations stay very low on average, whereas other activation functions output (or take as input) values in the range (0, 1) or (-1, 1). The numpy sketch below illustrates both points.
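A hedged numpy sketch of both arguments (the helper name softmax and the numbers are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

# Coupling (points 1/2): nudging one node's pre-activation changes every other node's output.
z = np.array([1.0, 0.5, 0.2])
before = softmax(z)
z[0] += 1.0                           # adjust only node 0
after = softmax(z)
print(before, after)                  # nodes 1 and 2 drop even though only node 0 was changed

# Scale (point 3): with 100 neurons the activations must average 1/100 = 0.01.
rng = np.random.default_rng(0)
big = softmax(rng.normal(size=100))
print(big.sum(), big.mean())          # sums to 1.0, mean ~0.01 -> hidden activations stay tiny
```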
If the output is a tanh function, then I get a number between -1 and 1.
How do I go about converting the output to the scale of my y values (which happens to be around 15 right now, but will vary depending on the data)?
Or am I restricted to functions which vary within some kind of known range...?
Just remove the tanh, and your output will be an unrestricted number. Your error function should probably be squared error.
You might have to change the gradient calculation for your back-prop, if this isn't done automatically by your framework.
Edit to add: You almost certainly want to keep the tanh (or some other non-linearity) between the recurrent connections, so remove it only for the output connection.
In most RNNs for classification, people use a softmax layer on top of their LSTM or tanh layers, so I think you can replace the softmax with just a linear output layer. This is what some people do for regular neural networks as well as convolutional neural networks. You will still have the non-linearity from the hidden layers, but your outputs will not be restricted to a certain range such as -1 to 1. The cost function would probably be the squared error, like larspars mentioned. A minimal sketch of this setup is below.
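A minimal sketch of an LSTM with a linear output head, assuming tf.keras (the unit count and feature dimension are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # The tanh non-linearity stays inside the recurrent cell, as suggested above.
    tf.keras.layers.LSTM(32, input_shape=(None, 4)),
    # Linear output: predictions are not squashed into (-1, 1).
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")   # squared error for the regression target
```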
I have a neural network with 2 input variables, 1 hidden layer with 2 neurons, and an output layer with one output neuron. When I start with some randomly generated weights (from 0 to 1), the network learns the XOR function very fast and well, but in other cases the network NEVER learns the XOR function! Do you know why this happens and how I can overcome this problem? Could some chaotic behaviour be involved? Thanks!
This is quite a normal situation, because the error function of a multilayer NN is not convex, and optimization can converge to a local minimum.
You can just keep the initial weights that resulted in successful optimization, or run the optimizer multiple times starting from different weights and keep the best solution (see the sketch below). The optimization algorithm and learning rate also play a certain role; for example, backpropagation with momentum and/or stochastic gradient descent sometimes works better. Also, adding more neurons, beyond the minimum needed to learn XOR, helps.
There exist methodologies designed to find the global minimum, such as simulated annealing, but in practice they are not commonly used for NN optimization, except in some specific cases.
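A hedged sketch of the random-restart idea, using scikit-learn's MLPClassifier (the original post does not name a framework, so the library and hyperparameters here are assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                       # XOR targets

best_model, best_loss = None, np.inf
for seed in range(10):                           # different random initial weights each run
    net = MLPClassifier(hidden_layer_sizes=(2,), activation="logistic",
                        solver="lbfgs", max_iter=2000, random_state=seed)
    net.fit(X, y)
    if net.loss_ < best_loss:                    # keep the restart that reached the lowest loss
        best_model, best_loss = net, net.loss_

print(best_model.predict(X))                     # ideally [0 1 1 0]
```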
I have run some classification tests in MATLAB with a feed-forward network. Using the standard tansig function, the results were better when using more neurons in the hidden layer.
But when I switched to purelin, I was surprised to see that the results were better when I set a smaller number of neurons in the hidden layer.
Can you help me with an argument for this situation?
The tansig activation function essentially makes it possible for a neuron to become inactive due to saturation. A linear neuron is always active; therefore, if one linear neuron has bad parameters, it will always affect the outcome of the classification. A higher number of neurons yields a higher probability of bad behavior in this scenario.
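A small numpy illustration of the saturation argument (the input values are illustrative; MATLAB's tansig is equivalent to tanh):

```python
import numpy as np

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])   # pre-activations, including "bad" extreme values

tansig = np.tanh(z)                            # tansig ~ tanh
tansig_grad = 1.0 - tansig ** 2                # derivative of tanh

linear = z                                     # purelin: output equals input
linear_grad = np.ones_like(z)                  # derivative is always 1 -- the neuron never switches off

print(tansig_grad)   # ~0 at z = +/-10: a saturated neuron barely influences the output any more
print(linear_grad)   # always 1: a badly parameterized linear neuron affects every prediction
```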