When to use layernorm/batch norm? - neural-network

Where should you splice the normalization when designing a network? E.g. if you have a stacked Transformer or Attention network, does it make sense to normalize any time after you have a dense layer?

What the original paper explains is how to reduce overfitting by using Batch Normalization.
Where should you splice the normalization when designing a network?
Set the normalization early, on the inputs: unbalanced, extreme input values can cause instability.
If you normalize only the outputs, that will not prevent the inputs from causing the instability all over again.
Here is a small snippet that shows what BN does:
import torch
import torch.nn as nn
m = nn.BatchNorm1d(100, affine=False)
input = 1000*torch.randn(3, 100)
print(input)
output = m(input)
print(output)
print(output.mean()) # should be ~ 0
print(output.std()) # should be ~ 1
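Since the question also mentions Transformers: there the usual choice is LayerNorm, which normalizes over the feature dimension of each sample instead of over the batch. A minimal sketch, analogous to the BatchNorm snippet above:
import torch
import torch.nn as nn

ln = nn.LayerNorm(100, elementwise_affine=False)
x = 1000 * torch.randn(3, 100)  # batch of 3 samples with extreme values
y = ln(x)
print(y.mean(dim=1))  # ~0 for every sample, independent of batch size
print(y.std(dim=1))   # ~1 for every sample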
Does it make sense to normalize any time after you have a dense layer?
Yes, you may do so, since matrix multiplication can produce extreme values. The same holds after convolution layers, which are also matrix multiplications: a similar but less intense effect compared to a dense (nn.Linear) layer. If you print the resnet model, for instance, you will see that a batch norm is placed after every conv layer, like this:
(conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
To print the full resnet you may use this:
import torchvision.models as models
r = models.resnet18()
print(r)

Related

Is a fully connected layer equivalent to Flatten + Dense in Tensorflow?

A fully-connected layer, also known as a dense layer, is a layer whose neurons are connected to every neuron in the preceding layer (see Wikipedia).
In the MATLAB Deep Learning Toolbox, when defining a fullyConnectedLayer(n), the output will always be (borrowing the terminology from Tensorflow) a "tensor" of shape 1×1×n.
However, defining a dense layer in Keras via tf.keras.layers.Dense(n) will not necessarily result in a rank-1 tensor per sample; the output shape depends on the input, as explained in the Keras documentation:
For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units).
Am I correct in assuming that what MATLAB does in fullyConnectedLayer(n) is equivalent to cascading a Flatten() layer and a Dense(n) layer in Tensorflow? By equivalent I mean that exactly the same operation is performed.
It would appear that this is the case, based on the number of weights that MATLAB requires for a fullyConnectedLayer. The weights are in fact n×M, where M is the dimension of the input (see the MATLAB documentation: "At training time, Weights is an OutputSize-by-InputSize matrix"). Snooping around the internals of this MATLAB function, it seems to me that InputSize is precisely the size of the input as if it were "flattened", i.e. M = a*b*c if the input tensor has shape (a,b,c) (and I verified this experimentally by multiplying out the dimensions).
The layer I'm trying to build is towards the final stages of a categorical classifiers, so I need the final output of the Keras model to be of shape (None, n) where n is the number of labels in the training data.
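A quick way to sanity-check the shapes of this assumed equivalence in Keras (a toy sketch with an (a, b, c) = (4, 5, 6) input, not my real data):
import tensorflow as tf

# Flatten + Dense(n) applies a single (a*b*c, n) kernel to the
# flattened input, matching MATLAB's OutputSize-by-InputSize weight
# matrix up to transposition.
inputs = tf.keras.Input(shape=(4, 5, 6))
x = tf.keras.layers.Flatten()(inputs)       # shape (None, 120)
outputs = tf.keras.layers.Dense(10)(x)      # shape (None, 10)
model = tf.keras.Model(inputs, outputs)
print(model.output_shape)                   # (None, 10)
print(model.layers[-1].kernel.shape)        # (120, 10)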

Predict a 2D matrix from an image with keras keeping spatial information

I want to train a CNN to predict 100x100x1 matrices (heatmaps) from 224x224x3 images using Keras. My idea is to fine-tune the pretrained networks that Keras provides (ResNet, Xception, VGG16, etc.).
The first step is then to substitute the pretrained top layers with ones that meet my problem's constraints. I am trying to predict 100x100x1 heatmap images whose values range from 0 to 1, so I want the output of my network to be a 100x100x1 matrix. I believe that if I use Flatten and then a Dense layer of 1000x1x1 I will be losing spatial information, which I don't want (right?).
I want my code to be flexible, to be able to run independent from which pretrained architecture is being used (I have to run many experiments). Therefore I want to stack a Dense layer that connects to every unit of whatever kind of layer is before it (which will depend on the pretrained architecture I will be using).
Some answers relate to the fully convolutional approach, but that is not what I mean here. Both my X and Y have fixed shapes (224x224x3 and 100x100x1 respectively).
My problem is that I don't know how to stack the new layer(s) in such a way that the predictions/outputs of the net are 100x100x1 matrices.
As it has been suggested in the answers, I am trying to add a 100x100x1 Dense layer. But I don't seem to get it working:
If, for example, I do it like this:
x = self.base_model.output
predictions = keras.layers.Dense(input_shape = (None, 100,100), units= 1000, activation='linear')(x)
self.model = keras.models.Model(inputs=self.base_model.input, outputs=predictions)
Then I got this when I start training:
ValueError: Error when checking target: expected dense_1 to have 4 dimensions, but got array with shape (64, 100, 100)
The network targets Y are indeed batches of shape (64, 100, 100).
Any suggestions?
Also, which loss function should I use? As suggested in the answers, I could use mse, but I wonder: is there any loss function able to capture the spatial structure of my desired 100x100x1 output?
Thanks in advance.
EDIT:
I semi-solved my problem thanks to @ncasas's answer:
I just added some deconvolutional layers until I got an output close to 100x100x1. This is not what I wanted in the first place, since this implementation is not agnostic to the pretrained architecture it sits on top of. For Xception with input_shape = (224, 224, 3), these top layers give an output of 80x80x1:
x = self.base_model.output
x = keras.layers.Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = keras.layers.UpSampling2D((3, 3))(x)
x = keras.layers.Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = keras.layers.UpSampling2D((2, 2))(x)
x = keras.layers.Conv2D(16, (3, 3), activation='relu')(x)
x = keras.layers.UpSampling2D((2, 2))(x)
predictions = keras.layers.Conv2D(filters=1,
                                  kernel_size=(3, 3),
                                  activation='sigmoid',
                                  padding='same')(x)
self.model = keras.models.Model(inputs=self.base_model.input, outputs=predictions)
where self.base_model is keras.applications.Xception(weights = 'imagenet', include_top = False, input_shape = (224, 224, 3))
I am finally using mse as loss function and it works just fine.
What you are describing is multidimensional linear regression plus transfer learning.
In order to reuse the first layers of a trained Keras model, you can follow this post from the Keras blog, in the section "Using the bottleneck features of a pre-trained network: 90% accuracy in a minute". For your case, the only differences are that:
For the layer before the last one, you should probably use something larger than 256 units.
The last layer should be a Dense layer with 10000 units and linear activation (i.e. no activation at all). You can either reshape your expected outputs from 100x100 to 10000, or add an extra Reshape layer to the network to get a 100x100 output.
Keep in mind that between the convolutional part of the network and the multilayer perceptron part (i.e. the final Dense layer(s)) there must be a Flatten layer to place the obtained activation patterns in a single vector (search the linked post for "Flatten"); the error you receive is because of that.
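A minimal sketch of that recipe (layer sizes are illustrative, and any Keras application loaded with include_top=False would do as the base model):
import tensorflow as tf

# Hypothetical head: flatten the conv features, regress all 10000
# heatmap values at once, then reshape back to 100x100x1.
base_model = tf.keras.applications.Xception(
    weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = tf.keras.layers.Flatten()(base_model.output)
x = tf.keras.layers.Dense(1024, activation='relu')(x)   # "larger than 256"
x = tf.keras.layers.Dense(100 * 100, activation='linear')(x)
outputs = tf.keras.layers.Reshape((100, 100, 1))(x)
model = tf.keras.Model(base_model.input, outputs)
model.compile(optimizer='adam', loss='mse')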
If you don't want to flatten the activation patterns, you may want to directly use deconvolutions in your last layers. For that, you can take a look at the keras autoencoder tutorial, at section "Convolutional autoencoder".
The usual loss function used for regression problems is mean squared error (MSE). It does not make sense to use cross entropy for regression, as explained here.
You should find this paper helpful. Simply replace the fully connected layers with convolutional layers. Instead of a single prediction for the entire image, the result will be a heatmap of predictions for smaller portions of the image.
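A sketch of that idea: a 1x1 convolution stands in for the Dense head, so the spatial size of the resulting heatmap is determined by the backbone rather than chosen by you.
x = base_model.output                          # e.g. (None, 7, 7, 2048)
heatmap = keras.layers.Conv2D(1, (1, 1), activation='sigmoid')(x)
model = keras.models.Model(inputs=base_model.input, outputs=heatmap)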
You should use the categorical_crossentropy loss function.

Correctly applying dropout in CNTK

I'm applying dropout as follows in a three-hidden-layer feed-forward network, using the Python API. My results are not very good and I wonder if I'm misapplying the dropout layer: is it better to apply it to the input of the dense layer, or internally, to the output of the first linear layer?
def dense_layer(input, output_dim, nonlinearity):
    r = linear_layer(input, output_dim)
    r = dropout(r, 0.25)
    r = nonlinearity(r)
    return r
If zero dropout works better, why do you believe you need dropout at all? Does your network overfit? Do you have other regularization? It would help to have more detail on the network architecture and the data.
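For comparison, a sketch of the same helper with dropout moved after the nonlinearity, which is the more common placement (reusing the question's assumed linear_layer and dropout helpers):
def dense_layer(input, output_dim, nonlinearity):
    r = linear_layer(input, output_dim)
    r = nonlinearity(r)
    r = dropout(r, 0.25)  # drop activations, not pre-activations
    return r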

2-Dense neural network accuracy optimization

Below you can see the dataset I use ("sorted_output") to construct an ANN with two hidden dense layers, one input layer, and one output layer. My question is: why am I getting extremely low accuracy (62.5%)? I have the feeling that, since both my input data (columns A-U) and my output data (column V) are in binary form, this should lead to 100% accuracy. Am I wrong?
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
dataset = numpy.loadtxt("sorted_output.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:21]
Y = dataset[:,21]
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
# create model
model = Sequential()
model.add(Dense(12, input_dim=21, kernel_initializer='orthogonal', activation='relu'))
model.add(Dense(10, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='orthogonal', activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=150, batch_size=10)
The accuracy of a network depends on many factors, so it is difficult to say why it is so low in your case. It really depends on the underlying data distribution and on how well your network captures the relevant information during training.
I suggest monitoring the loss to see whether the model is overfitting the training data. If it is, you may need some kind of regularization to improve generalization. Otherwise, you can increase the depth of the network and check whether the results improve.
This is by no means an exhaustive list of approaches. Sometimes changing the optimizer also helps, depending on how your data is distributed.
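As a sketch of that monitoring step, reusing the model from the question (the history keys assume the Keras 2 API):
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=150, batch_size=10)
# A widening gap between these two curves (val_loss rising while loss
# keeps falling) indicates overfitting; Dropout or L2 would then help.
print(history.history['loss'][-1], history.history['val_loss'][-1])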

Is it desirable to scale data for skflow.TensorFlowDNNClassifier?

My colleagues and this question on Cross Validated say you should transform data to zero mean and unit variance for neural networks. However, my performance was slightly worse with scaling than without.
I tried using:
import random
import skflow
import tensorflow as tf
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

steps = 5000

def exp_decay(global_step):
    return tf.train.exponential_decay(
        learning_rate=0.1, global_step=global_step,
        decay_steps=steps, decay_rate=0.01)

random.seed(42)  # to sample data the same way
classifier = skflow.TensorFlowDNNClassifier(
    hidden_units=[150, 150, 150],
    n_classes=2,
    batch_size=128,
    steps=steps,
    learning_rate=exp_decay)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Did I do something wrong or is scaling not necessary?
Scaling usually matters most for linear models and for models without regularization. For example, a simple mean squared error loss (as in TensorFlowLinearRegressor) without regularization won't work well on unscaled data.
In your case you are using a classifier that ends in a softmax, and it is a DNN, so scaling is not strictly needed. DNNs can themselves model rescaling (via the bias and the weight on each feature in the first layer) if that is a useful thing to do.
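A tiny numeric sketch of that last point: a first-layer neuron can absorb standardization into its own parameters, since w*(x - mu)/sigma + b == (w/sigma)*x + (b - w*mu/sigma).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=1000)  # unscaled feature
mu, sigma = x.mean(), x.std()
w, b = 0.7, -1.2                                 # some first-layer neuron
z = (x - mu) / sigma                             # standardized feature
w2, b2 = w / sigma, b - w * mu / sigma           # rescaled parameters
assert np.allclose(w * z + b, w2 * x + b2)       # identical pre-activations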