Should I use regularization with the loss function or in the NN layer? - neural-network

I'm confused about where regularization should be applied. In theory, I have seen regularization used as a term in the loss function.
But in Keras implementations, I have seen regularization attached to the neural network layers.
from keras import regularizers
from keras.layers import Dense
from keras.models import Sequential
model = Sequential()
model.add(Dense(64, input_dim=64, kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(28, kernel_regularizer=regularizers.l1(0.05)))
Here I used L1 and L2 regularization in different layers. So how will the final loss function be calculated?

Taken from Keras documentation:
Regularizers allow you to apply penalties on layer parameters or layer activity during optimization. These penalties are summed into the loss function that the network optimizes.
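Indeed, the l1/l2 penalty of every regularized layer is added to the single loss that the network minimizes. In your example the final loss is the data loss, plus 0.01 times the sum of the squared weights of the first layer, plus 0.05 times the sum of the absolute weights of the second layer. Backpropagation then differentiates that total, so each penalty only affects the gradients of the weights it is attached to.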
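To make the summation concrete, here is a minimal sketch in plain Python. The weight values and the data loss of 0.30 are made up for illustration, and the helper names l2_penalty and l1_penalty are hypothetical; only the factors 0.01 and 0.05 come from the question.

```python
# Hypothetical weight matrices for the two layers (illustrative values only).
W1 = [[0.5, -1.0], [2.0, 0.0]]   # layer regularized with l2(0.01)
W2 = [[1.5, -0.5], [0.0, 3.0]]   # layer regularized with l1(0.05)


def l2_penalty(weights, factor):
    """factor * sum of squared weights."""
    return factor * sum(w * w for row in weights for w in row)


def l1_penalty(weights, factor):
    """factor * sum of absolute weights."""
    return factor * sum(abs(w) for row in weights for w in row)


# Assumed value of the unregularized loss (e.g. cross entropy) on one batch.
data_loss = 0.30

# One scalar is optimized: data loss plus every layer's penalty.
total_loss = data_loss + l2_penalty(W1, 0.01) + l1_penalty(W2, 0.05)
print(total_loss)
```

With these numbers the L2 term contributes 0.0525 and the L1 term 0.25, so the optimizer sees roughly 0.6025 rather than 0.30.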


Is it necessary to use a linear bottleneck layer for autoencoder?

I'm currently trying to use an autoencoder network for dimensionality reduction.
(i.e. using the bottleneck activation as the compressed feature)
I noticed that a lot of studies that use an autoencoder for this task use a linear bottleneck layer.
Intuitively, I think this makes sense, since a non-linear activation function may reduce the bottleneck feature's capability to represent the principal information contained in the original feature.
(e.g., ReLU ignores negative values, and sigmoid squashes values that are too high or too low)
However, is this correct? And is a linear bottleneck layer necessary for an autoencoder?
If it's possible to use a non-linear bottleneck layer, what activation function would be the best choice?
No, you are not limited to linear activation functions. An example of that is this work, where they use the hidden state of the GRU layers as an embedding for the input. The hidden state is obtained by using non-linear tanh and sigmoid functions in its computation.
Also, there is nothing wrong with 'ignoring' the negative values. The sparsity may, in fact, be beneficial. It can enhance the representation. The noise that can be created by other functions such as identity or sigmoid function may introduce false dependencies where there are none. By using ReLU we can represent the lack of dependency properly (as a zero) as opposed to some near zero value which is likely for e.g. sigmoid function.
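As a small illustration of the sparsity point, the sketch below compares what ReLU and sigmoid produce for the same pre-activations. The input values are made up for the example; nothing here is specific to any particular autoencoder.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical bottleneck pre-activations for one input sample.
pre_activations = [-3.0, -0.5, 0.0, 2.0]

relu_code = [relu(x) for x in pre_activations]
sigmoid_code = [sigmoid(x) for x in pre_activations]

print(relu_code)     # non-positive inputs become exactly 0.0
print(sigmoid_code)  # every entry is strictly between 0 and 1, never exactly 0
```

ReLU maps every non-positive pre-activation to an exact zero, while sigmoid always returns a strictly positive value, so a "no signal" unit is never exactly zero in a sigmoid code.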

Why is tanh performing better than ReLU in a simple neural network?

Here is my scenario.
I have used the EMNIST database of capital letters of the English language.
My neural network is as follows:
The input layer has 784 neurons, which are the pixel values of a 28x28 grayscale image divided by 255, so the values are in the range [0, 1].
The hidden layer has 49 neurons, fully connected to the previous 784.
The output layer has 9 neurons denoting the class of the image.
The loss function is defined as the cross entropy of the softmax of the output layer.
All weights are initialized as random real numbers from [-1, +1].
I trained with 500 fixed samples for each class.
That is, I passed 500x9 images to a train function which uses backpropagation and runs 100 iterations, changing each weight by learning_rate*derivative_of_loss_wrt_corresponding_weight.
I found that when I use tanh activation on the neurons, the network learns faster than with ReLU at learning rate 0.0001.
I concluded this because accuracy on a fixed test dataset was higher for tanh than for ReLU. Also, the loss value after 100 epochs was slightly lower for tanh.
Isn't ReLU expected to perform better?
In general, no. ReLU will perform better on many problems but not all problems.
Furthermore, if you use an architecture and set of parameters that is optimized to perform well with one activation function, you may get worse results after swapping in a different activation function.
Often you will need to adjust the architecture and parameters like learning rate to get comparable results. This may mean changing the number of hidden nodes and/or the learning rate in your example.
One final note: In the MNIST example architectures I have seen, hidden layers with ReLU activations are typically followed by Dropout layers, whereas hidden layers with sigmoid or tanh activations are not. Try adding dropout after the hidden layer and see if that improves your results with ReLU. See the Keras MNIST example here.

Are there cases where it is better to use sigmoid activation over ReLU?

I am training a complex neural network architecture in which I use an RNN to encode my inputs, followed by a deep neural network with a softmax output layer.
I am now optimizing the deep neural network part of my architecture (number of units and number of hidden layers).
I am currently using sigmoid activation for all the layers. This seems to be OK for a few hidden layers, but as the number of layers grows, it seems that sigmoid is not the best choice.
Do you think I should do hyper-parameter optimization for sigmoid first and then for ReLU, or is it better to just use ReLU directly?
Also, do you think that having ReLU in the first hidden layers and sigmoid only in the last hidden layer makes sense, given that I have a softmax output?
You can't optimize hyperparameters independently, no. Just because the optimal solution in the end happens to be X layers and Y nodes doesn't mean that this will be true for all activation functions, regularization strategies, learning rates, etc. This is what makes optimizing hyperparameters tricky. That is also why there are libraries for hyperparameter optimization. I'd suggest you start out by reading up on the concept of 'random search optimization'.
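A minimal sketch of random search, written in plain Python, might look like the following. The search space, the helper names, and especially toy_score are assumptions made for illustration: toy_score is a made-up stand-in for "train the network with this configuration and return validation accuracy", not a real measurement.

```python
import math
import random

# Hypothetical search space: the activation is sampled jointly with the
# architecture and the learning rate instead of being tuned separately.
SEARCH_SPACE = {
    "activation": ["sigmoid", "relu", "tanh"],
    "num_layers": [1, 2, 3, 4],
    "units": [32, 64, 128, 256],
}


def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    # Learning rate sampled log-uniformly between 1e-4 and 1e-1.
    config["learning_rate"] = 10 ** rng.uniform(-4, -1)
    return config


def toy_score(config):
    """Stand-in for validation accuracy; replace with real training."""
    depth_penalty = 0.05 * config["num_layers"] if config["activation"] == "sigmoid" else 0.0
    lr_penalty = 0.1 * abs(math.log10(config["learning_rate"]) + 2.5)
    return 1.0 - depth_penalty - lr_penalty


def random_search(num_trials, seed=0):
    """Evaluate num_trials random configurations and return the best one."""
    rng = random.Random(seed)
    scored = [(toy_score(c), c) for c in (sample_config(rng) for _ in range(num_trials))]
    best_score, best_config = max(scored, key=lambda pair: pair[0])
    return best_score, best_config, scored


best_score, best_config, scored = random_search(num_trials=20)
print(best_score, best_config)
```

The point of the design is that each trial samples all hyperparameters together, so interactions such as "sigmoid works with few layers but not many" are explored rather than assumed away.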

Activation function after pooling layer or convolutional layer?

The theory from these links shows that the order in a convolutional network is: Convolutional Layer - Non-linear Activation - Pooling Layer.
Neural networks and deep learning (equation (125))
Deep learning book (page 304, 1st paragraph)
Lenet (the equation)
The source in this headline
But in the last implementation from those sites, the order is: Convolutional Layer - Pooling Layer - Non-linear Activation
The sourcecode, LeNetConvPoolLayer class
I've also tried to explore the syntax of the Conv2D operation, but there is no activation function; it's only a convolution with a flipped kernel. Can someone help explain why this happens?
Well, max-pooling and monotonically increasing non-linearities commute. This means that MaxPool(Relu(x)) = Relu(MaxPool(x)) for any input, so the result is the same in that case. Since the results are equal, it is technically better to first subsample through max-pooling and then apply the non-linearity (if it is costly, such as the sigmoid). In practice it is often done the other way round; it doesn't seem to change much in performance.
As for conv2D, it does not flip the kernel. It implements exactly the definition of convolution. This is a linear operation, so you have to add the non-linearity yourself in the next step, e.g. theano.tensor.nnet.relu.
In many papers people use conv -> pooling -> non-linearity. It does not mean that you can't use another order and get reasonable results. In the case of a max-pooling layer and ReLU the order does not matter (both calculate the same thing): MaxPool(Relu(x)) = Relu(MaxPool(x)).
You can prove that this is the case by remembering that ReLU is an element-wise operation and a non-decreasing function, so max(Relu(a), Relu(b)) = Relu(max(a, b)).
The same thing happens for almost every activation function (most of them are non-decreasing). But it does not work for a general pooling layer (e.g. average-pooling).
Although both orders produce the same result, Activation(MaxPool(x)) is faster because it performs fewer operations. For a pooling layer of size k (non-overlapping k x k windows), it makes k^2 times fewer calls to the activation function.
Sadly this optimization is negligible for CNN, because majority of the time is used in convolutional layers.
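The commutation claim can be checked numerically. The sketch below uses plain Python lists and a hypothetical helper pool2d with non-overlapping k x k windows; the 4x4 input is made up for the example. It also shows that the same swap changes the result for average pooling.

```python
def relu(x):
    return max(0.0, x)


def pool2d(image, k, reduce_fn):
    """Non-overlapping k x k pooling over a list-of-lists image."""
    rows, cols = len(image), len(image[0])
    return [
        [reduce_fn([image[r + i][c + j] for i in range(k) for j in range(k)])
         for c in range(0, cols, k)]
        for r in range(0, rows, k)
    ]


def apply_elementwise(fn, image):
    return [[fn(v) for v in row] for row in image]


def mean(values):
    return sum(values) / len(values)


# Hypothetical 4x4 feature map with mixed signs.
x = [[-1.0, 2.0, -3.0, -4.0],
     [0.5, -2.0, -1.0, -0.5],
     [3.0, 1.0, -2.0, 4.0],
     [-1.0, 0.0, 2.0, -6.0]]

# Max pooling: both orders agree (4 ReLU calls vs. 16 ReLU calls).
max_then_relu = apply_elementwise(relu, pool2d(x, 2, max))
relu_then_max = pool2d(apply_elementwise(relu, x), 2, max)

# Average pooling: the two orders generally differ.
avg_then_relu = apply_elementwise(relu, pool2d(x, 2, mean))
relu_then_avg = pool2d(apply_elementwise(relu, x), 2, mean)

print(max_then_relu, relu_then_max)
print(avg_then_relu, relu_then_avg)
```

For this input both max-pooling orders give [[2.0, 0.0], [3.0, 4.0]], while the two average-pooling orders disagree in every window except the all-negative one.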
Max pooling is a sample-based discretization process. The objective is to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensionality and allowing for assumptions to be made about the features contained in the binned sub-regions.

Step function versus Sigmoid function

I don't quite understand why a sigmoid function is seen as more useful (for neural networks) than a step function... hoping someone can explain this for me. Thanks in advance.
The (Heaviside) step function is typically only useful within single-layer perceptrons, an early type of neural networks that can be used for classification in cases where the input data is linearly separable.
However, multi-layer neural networks or multi-layer perceptrons are of more interest because they are general function approximators and they are able to distinguish data that is not linearly separable.
Multi-layer perceptrons are trained using backpropagation. A requirement for backpropagation is a differentiable activation function. That's because backpropagation computes the gradient of the loss through this function, and gradient descent uses that gradient to update the network weights.
The Heaviside step function is non-differentiable at x = 0 and its derivative is 0 elsewhere. This means gradient descent won't be able to make progress in updating the weights and backpropagation will fail.
The sigmoid or logistic function does not have this shortcoming and this explains its usefulness as an activation function within the field of neural networks.
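The effect on training can be seen in a single gradient descent step for one neuron. This is a sketch under stated assumptions: a squared-error loss, one weight, made-up values for the input, target, and learning rate, and the convention that the step function's derivative is taken as 0 everywhere (it is undefined at x = 0).

```python
import math


def step(x):
    return 1.0 if x >= 0 else 0.0


def step_grad(x):
    # Derivative is 0 wherever it exists; we also return 0 at x = 0 by convention.
    return 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def update_weight(w, x_in, target, act, act_grad, learning_rate=0.5):
    """One gradient descent step on L = 0.5 * (act(w * x_in) - target)**2."""
    z = w * x_in
    dloss_dw = (act(z) - target) * act_grad(z) * x_in  # chain rule
    return w - learning_rate * dloss_dw


# Hypothetical setup: the neuron outputs a value near 1 but the target is 0.
w0, x_in, target = 0.5, 1.0, 0.0

w_step = update_weight(w0, x_in, target, step, step_grad)
w_sigmoid = update_weight(w0, x_in, target, sigmoid, sigmoid_grad)

print(w_step)     # unchanged: the zero derivative blocks the update
print(w_sigmoid)  # moved towards a smaller output
```

The step neuron is wrong (it outputs 1 for a target of 0), yet its weight does not move, whereas the sigmoid neuron's weight decreases.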
It depends on the problem you are dealing with. In the case of simple binary classification, a step function is appropriate. Sigmoids can be useful when building more biologically realistic networks by introducing noise or uncertainty. Another, completely different use of sigmoids is for numerical continuation, i.e. when doing bifurcation analysis with respect to some parameter in the model. Numerical continuation is easier with smooth systems (and very tricky with non-smooth ones).