How to avoid Softmax activation failure when using ReLU in hidden layers? - neural-network

I'm training a my own implementation of an ANN with the MNIST dataset to identify numbers.
In other cases ReLU activation in the hidden layers has given me the best results for regression problems, however if I use softmax activation for the output layer together with ReLU in the hidden layers in a classification problem, softmax fails due to the high values obtained by ReLU, so I'm forced to use sigmoid or another bounded activation function.
I'm wondering if this is unavoidable or is there some technique I could use to avoid this?

Related

Can a single input single output neural network with y=x as activation function reflect non-linear behavior?

I am currently learning a little bit about neural networks. One question I can't really get behind is about how neural networks reflect non-linear behavior. From my understanding there is no possibility to reflect non-linear behavior inside a compact set using a neural network.
For example if I would take the function from this question:
y = x^2
and I would use a neural network with a single input and single output the best the neural network could do for each compact set [x0...xn] is a linear function spanning from one end of the set to the other, as at the end all calculations inside the net are linear.
Do I have some misunderstanding about this concept?
The ANN's capabilties to model non-linear behaviour arise from the (usually) non-linear activation function.
If the activation function is linear, then the process of training the network is just another way to create a linear (or multi-linear) fit of input and output data.
Activation function in neural networks is exactly the part, that brings non-linearity. If you use linear activation function, then you cannot train non-linear model (thus fit quadratic or other non-linear functions).
The part, I guess, you are interested in is Universal Approximation Theorem, which says that any continuous function can be approximated with a neural network with a single hidden layer (some assumptions on activation function are applied thou). Take into account, that this theorem does not say anything on optimization of such a network (it does not guarantee you can train such a network with a specific algorithm, but only that such a network exists). Also it does not say anything on the number of neurons you should use.
You can check following links, to get more details:
Original proof with sigmoid activation function: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.441.7873&rep=rep1&type=pdf
And a more friendly derivation: http://mcneela.github.io/machine_learning/2017/03/21/Universal-Approximation-Theorem.html

Is it necessary to use a linear bottleneck layer for autoencoder?

I'm currently trying to use an autoencoder network for dimensionality reduction.
(i.e. using the bottleneck activation as the compressed feature)
I noticed that a lot of studies that used autoencoder for this task uses a linear bottleneck layer.
By intuition, I think this makes sense since the usage of non-linear activation function may reduce the bottleneck feature's capability to represent the principle information contained within the original feature.
(e.g., ReLU ignores the negative values and sigmoid suppresses values too high or too low)
However, is this correct? And is using linear bottleneck layer for autoencoder necessary?
If it's possible to use a non-linear bootleneck layer, what activation function would be the best choice?
Thanks.
No, you are not limited to linear activation functions. An example of that is this work, where they use the hidden state of the GRU layers as an embedding for the input. The hidden state is obtained by using non-linear tanh and sigmoid functions in its computation.
Also, there is nothing wrong with 'ignoring' the negative values. The sparsity may, in fact, be beneficial. It can enhance the representation. The noise that can be created by other functions such as identity or sigmoid function may introduce false dependencies where there are none. By using ReLU we can represent the lack of dependency properly (as a zero) as opposed to some near zero value which is likely for e.g. sigmoid function.

why is tanh performing better than relu in simple neural network

Here is my scenario
I have used EMNIST database of capital letters of english language.
My neural network is as follows
Input layer has 784 neurons which are pixel values of image 28x28 grey scaled image divided by 255 so value will be in range[0,1]
Hidden layer has 49 neuron fully connected to previous 784.
Output layer has 9 neurons denoting class of image.
Loss function is defined as cross entropy of softmax of output layer.
Initialized all weights as random real number from [-1,+1].
Now I did training with 500 fixed samples for each class.
Simply, passed 500x9 images to train function which uses backpropagation and does 100 iterations changing weights by learning_rate*derivative_of_loss_wrt_corresponding_weight.
I found that when I use tanh activation on neuron then network learns faster than relu with learning rate 0.0001.
I concluded that because accuracy on fixed test dataset was higher for tanh than relu . Also , loss value after 100 epochs was slightly lower for tanh.
Isn't relu expected to perform better ?
Isn't relu expected to perform better ?
In general, no. RELU will perform better on many problems but not all problems.
Furthermore, if you use an architecture and set of parameters that is optimized to perform well with one activation function, you may get worse results after swapping in a different activation function.
Often you will need to adjust the architecture and parameters like learning rate to get comparable results. This may mean changing the number of hidden nodes and/or the learning rate in your example.
One final note: In the MNIST example architectures I have seen, hidden layers with RELU activations are typically followed by Dropout layers, whereas hidden layers with sigmoid or tanh activations are not. Try adding dropout after the hidden layer and see if that improves your results with RELU. See the Keras MNIST example here.

Are there cases where it is better to use sigmoid activation over ReLu

I am training a complex neural network architecture where I use a RNN for encoding my inputs then, A deep neural network with a softmax output layer.
I am now optimizing my architecture deep neural network part (number of units and number of hidden layers).
I am currently using sigmoid activation for all the layers. This seems to be ok for few hidden layer but as the number of layers grow, it seems that sigmoid is not the best choice.
Do you think I should do hyper-parameter optimization for sigmoid first then ReLu or, it is better to just use ReLu directly ?
Also, do you think that having Relu in the first hidden layers and sigmoid only in the last hidden layer makes sense given that I have a softmax output.
You can't optimize hyperparameters independently, no. Just because the optimal solution in the end happens to be X layers and Y nodes, doesn't mean that this will be true for all activation functions, regulazation strategies, learning rates, etc. This is what makes optimizing parameters tricky. That is also why there are libraries for hyperparameter optimization. I'd suggest you start out by reading up on the concept of 'random search optimization'.

How to convert RNN into a regression net?

If the output is a tanh function, then I get a number between -1 and 1.
How do I go about converting the output to the scale of my y values (which happens to be around 15 right now, but will vary depending on the data)?
Or am I restricted to functions which vary within some kind of known range...?
Just remove the tanh, and your output will be an unrestricted number. Your error function should probably be squared error.
You might have to change the gradient calculation for your back-prop, if this isn't done automatically by your framework.
Edit to add: You almost certainly want to keep the tanh (or some other non-linearity) between the recurrent connections, so remove it only for the output connection.
In most RNNs for classification, most people use a softmax layer on top of their LSTM or tanh layers so I think you can replace the softmax with just a linear output layer. This is what some people do for regular neural networks as well as convolutional neural networks. You will still have the nonlinearity from the hidden layers, but your outputs will not be restricted within a certain range such as -1 and 1. The cost function would probably be the squared error like larspars mentioned.