Neural network converges more slowly with htan - neural-network

I implemented a neural network with sigmoid and hyperbolic tangent activation functions. One thing I have noticed is that sigmoid converges much faster than htan. They both converge and I have checked my code and am confident no coding bugs exist. Are neural networks using sigmoid naturally faster than htan, or might I be doing something wrong?

Related

Supervised Neural Networks

I have been reading a lot of articles about neural networks and have found conflicting information. I understand that a supervised neural network can be used for both regression and classification. In both cases I can use the sigmoid function, but what is the difference?
A single-layer neural network with a linear (identity) activation is essentially the same thing as linear regression. That's because of how neural networks work: each input is multiplied by a weight to produce an output, and the weights are chosen iteratively so that the error (the discrepancy between the outputs produced by the model and the correct outputs for the given inputs) is minimised. Linear regression does the same thing. But in a neural network, you can stack several such layers on top of each other.
Classification is one possible use case for neural networks, but by far not the only one. Conversely, there are classification algorithms that don't use neural networks (e.g. K-nearest neighbours). The sigmoid function is often used as the activation function for the last layer in a classifier neural network.
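To make the comparison with linear regression concrete, here is a minimal sketch. It assumes a single layer with no activation, a mean-squared-error loss, and plain gradient descent; the function names, learning rate, and synthetic data are illustrative choices, not anything from the question.

```python
import numpy as np

def fit_linear_layer(X, y, lr=0.1, epochs=2000):
    """Train one linear layer (no activation) by gradient descent on the MSE."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        error = X @ w + b - y                           # prediction minus target
        w -= lr * (2.0 / n_samples) * (X.T @ error)     # dMSE/dw
        b -= lr * 2.0 * error.mean()                    # dMSE/db
    return w, b

def fit_least_squares(X, y):
    """Ordinary linear regression via the least-squares solution."""
    X1 = np.column_stack([X, np.ones(len(X))])          # append a bias column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef[:-1], coef[-1]

# Synthetic data (made up for illustration): a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.7 + 0.01 * rng.normal(size=200)

w_gd, b_gd = fit_linear_layer(X, y)
w_ls, b_ls = fit_least_squares(X, y)
print(w_gd, b_gd)
print(w_ls, b_ls)
```

Under these assumptions the two fits should agree to within numerical tolerance, which is the sense in which the single linear layer "is" linear regression.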

Can a single input single output neural network with y=x as activation function reflect non-linear behavior?

I am currently learning a little bit about neural networks. One question I can't quite work out is how neural networks reflect non-linear behavior. From my understanding, there is no way to reflect non-linear behavior inside a compact set using a neural network.
For example, if I take the function from this question:
y = x^2
and use a neural network with a single input and a single output, the best the neural network could do for each compact set [x0...xn] is a linear function spanning from one end of the set to the other, since in the end all calculations inside the net are linear.
Do I have some misunderstanding about this concept?
The ANN's capability to model non-linear behaviour arises from the (usually) non-linear activation function.
If the activation function is linear, then training the network is just another way to create a linear (or multi-linear) fit of the input and output data.
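The point about linear activations can be checked numerically. The sketch below assumes a toy network with one input, four hidden units, and one output, with random weights (all of these are illustrative). With the identity activation y = x, the two layers collapse into a single affine map; with tanh, the output is curved.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 1)), rng.normal(size=4)   # hidden layer: 1 input -> 4 units
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # output layer: 4 units -> 1 output

def two_layer(X, activation):
    """Forward pass of a two-layer network for a batch X of shape (n, 1)."""
    hidden = activation(X @ W1.T + b1)
    return hidden @ W2.T + b2

def identity(z):
    return z

X = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)           # evenly spaced inputs

# With y = x as the activation, both layers reduce to one affine map W x + b.
W_collapsed = W2 @ W1
b_collapsed = W2 @ b1 + b2
linear_out = two_layer(X, identity)
collapsed_out = X @ W_collapsed.T + b_collapsed

# Second differences on an even grid vanish for a straight line.
tanh_out = two_layer(X, np.tanh)
linear_curvature = np.diff(linear_out, n=2, axis=0)
tanh_curvature = np.diff(tanh_out, n=2, axis=0)
print(np.allclose(linear_out, collapsed_out))
print(linear_curvature.ravel(), tanh_curvature.ravel())
```

However many linear layers are stacked, the same collapse applies, so such a network can only ever produce a straight-line fit to something like y = x^2.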
The activation function is exactly the part of a neural network that brings in non-linearity. If you use a linear activation function, you cannot train a non-linear model (and thus cannot fit quadratic or other non-linear functions).
The part I guess you are interested in is the Universal Approximation Theorem, which says that any continuous function on a compact set can be approximated arbitrarily well by a neural network with a single hidden layer (subject to some assumptions on the activation function). Bear in mind that this theorem says nothing about optimizing such a network: it does not guarantee that you can train it with a specific algorithm, only that such a network exists. It also says nothing about the number of neurons you should use.
You can check the following links for more details:
Original proof with sigmoid activation function: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.441.7873&rep=rep1&type=pdf
And a more friendly derivation: http://mcneela.github.io/machine_learning/2017/03/21/Universal-Approximation-Theorem.html

Are there cases where it is better to use sigmoid activation over ReLU?

I am training a complex neural network architecture in which I use an RNN to encode my inputs, followed by a deep neural network with a softmax output layer.
I am now optimizing the deep neural network part of my architecture (number of units and number of hidden layers).
I am currently using sigmoid activation for all the layers. This seems to be OK for a few hidden layers, but as the number of layers grows, it seems that sigmoid is not the best choice.
Do you think I should do hyper-parameter optimization for sigmoid first and then ReLU, or is it better to just use ReLU directly?
Also, do you think that having ReLU in the first hidden layers and sigmoid only in the last hidden layer makes sense, given that I have a softmax output?
No, you can't optimize hyperparameters independently. Just because the optimal solution in the end happens to be X layers and Y nodes doesn't mean that this will be true for all activation functions, regularization strategies, learning rates, etc. This is what makes optimizing hyperparameters tricky. That is also why there are libraries for hyperparameter optimization. I'd suggest you start by reading up on the concept of 'random search optimization'.
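As a rough illustration of random search, the sketch below samples the activation, depth, width, and learning rate jointly instead of tuning one at a time. Everything named here is hypothetical: the search space values are arbitrary, and `placeholder_evaluate` is only a made-up stand-in so the code runs. In practice it would be replaced by a function that trains the network with the given configuration and returns a validation score.

```python
import math
import random

# Hypothetical search space; the ranges are arbitrary examples.
SEARCH_SPACE = {
    "activation": ["sigmoid", "relu"],
    "n_layers": [1, 2, 3, 4, 5],
    "n_units": [32, 64, 128, 256],
    "learning_rate": (1e-4, 1e-1),   # sampled log-uniformly
}

def sample_config(rng):
    """Draw every hyperparameter jointly, so their interactions get explored."""
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "activation": rng.choice(SEARCH_SPACE["activation"]),
        "n_layers": rng.choice(SEARCH_SPACE["n_layers"]),
        "n_units": rng.choice(SEARCH_SPACE["n_units"]),
        "learning_rate": 10 ** rng.uniform(math.log10(lo), math.log10(hi)),
    }

def random_search(evaluate, n_trials, seed=0):
    """Return the best-scoring configuration among n_trials random samples."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)          # higher is better
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def placeholder_evaluate(config):
    """Made-up score, NOT a real model: swap in training plus validation."""
    return -abs(math.log10(config["learning_rate"]) + 2.0) - 0.1 * config["n_layers"]

best_config, best_score = random_search(placeholder_evaluate, n_trials=20)
print(best_config, best_score)
```

Because the activation function is just another sampled field, sigmoid and ReLU configurations compete directly, each with its own depth and learning rate.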

Step function versus Sigmoid function

I don't quite understand why a sigmoid function is seen as more useful (for neural networks) than a step function... hoping someone can explain this for me. Thanks in advance.
The (Heaviside) step function is typically only useful within single-layer perceptrons, an early type of neural network that can be used for classification in cases where the input data is linearly separable.
However, multi-layer neural networks or multi-layer perceptrons are of more interest because they are general function approximators and they are able to distinguish data that is not linearly separable.
Multi-layer perceptrons are trained using backpropagation. A requirement for backpropagation is a differentiable activation function. That's because backpropagation applies the chain rule through the activation function to obtain the gradient of the error, which gradient descent then uses to update the network weights.
The Heaviside step function is non-differentiable at x = 0 and its derivative is 0 elsewhere. This means gradient descent won't be able to make progress in updating the weights and backpropagation will fail.
The sigmoid or logistic function does not have this shortcoming and this explains its usefulness as an activation function within the field of neural networks.
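A small sketch of why the step function stalls gradient descent. It assumes a single neuron with a squared-error loss; the input, target, weight, and learning rate are arbitrary illustrative values. The weight update is proportional to the derivative of the activation, so it is always zero for the step function.

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

def step_grad(z):
    # Zero wherever it is defined; at z = 0 it is undefined, and 0 is used here by convention.
    return np.zeros_like(z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def weight_update(activation, activation_grad, w, x_in, target, lr=0.5):
    """One gradient-descent step for a single neuron with loss (output - target)^2."""
    z = w * x_in
    error = activation(z) - target
    return -lr * 2.0 * error * activation_grad(z) * x_in   # chain rule

w, x_in, target = np.array(-1.0), np.array(2.0), 1.0       # neuron currently outputs the wrong class
step_update = weight_update(step, step_grad, w, x_in, target)
sigmoid_update = weight_update(sigmoid, sigmoid_grad, w, x_in, target)
print(step_update, sigmoid_update)
```

With the step activation the update is zero even though the output is wrong, whereas the sigmoid neuron receives a non-zero update that pushes the weight in the right direction.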
It depends on the problem you are dealing with. In the case of simple binary classification, a step function is appropriate. Sigmoids can be useful when building more biologically realistic networks by introducing noise or uncertainty. Another, completely different, use of sigmoids is for numerical continuation, i.e. when doing bifurcation analysis with respect to some parameter in the model. Numerical continuation is easier with smooth systems (and very tricky with non-smooth ones).

Why do sigmoid functions work in Neural Nets?

I have just started programming neural networks. I am currently working on understanding how a backpropagation (BP) neural net works. While the algorithm for training BP nets is quite straightforward, I was unable to find any text on why the algorithm works. More specifically, I am looking for some mathematical reasoning to justify using sigmoid functions in neural nets, and what makes them mimic almost any data distribution thrown at them.
Thanks!
The sigmoid function introduces non-linearity into the network. Without a non-linear activation function, the net can only learn functions that are linear combinations of its inputs. With one, a network with a single hidden layer and enough units can approximate any continuous function on a compact set arbitrarily well. This result is called the universal approximation theorem, or Cybenko's theorem, after George Cybenko, who proved it for sigmoid activations in 1989. Wikipedia is a good place to start, and it has a link to the original paper (the proof is somewhat involved, though). The reason why you would use a sigmoid as opposed to something else is that it is continuous and differentiable, its derivative is very cheap to compute from the forward output, since sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)), and it has a limited range (from 0 to 1, exclusive). Note that tanh shares these properties: its derivative, 1 - tanh(x)^2, is just as cheap to compute, and its range is (-1, 1).
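The derivative properties can be checked with a short sketch. It assumes the standard logistic sigmoid and compares the closed-form derivatives, written in terms of the forward output, against a central finite difference; the helper names are illustrative. Note that the tanh derivative can also be obtained from its forward output, as 1 - tanh(x)^2.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad_from_output(s):
    """Derivative of the sigmoid, given its forward output s = sigmoid(x)."""
    return s * (1.0 - s)

def tanh_grad_from_output(t):
    """Derivative of tanh, given its forward output t = tanh(x)."""
    return 1.0 - t ** 2

def numerical_grad(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x = np.linspace(-3.0, 3.0, 13)
sigmoid_analytic = sigmoid_grad_from_output(sigmoid(x))
tanh_analytic = tanh_grad_from_output(np.tanh(x))
sigmoid_numeric = numerical_grad(sigmoid, x)
tanh_numeric = numerical_grad(np.tanh, x)

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
print(np.max(np.abs(sigmoid_analytic - sigmoid_numeric)))
print(np.max(np.abs(tanh_analytic - tanh_numeric)))
```

The sigmoid derivative peaks at 0.25 (at x = 0) while the tanh derivative peaks at 1, so the two activations pass gradients of different magnitude through each layer. This is one reason the same learning rate can behave differently for each.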