Is BatchNormalizationLayer considered a layer in a neural network?
For example, if we say, Resnet50 has 50 layers, does that mean that some of those layers may be batchnormalization layers?
When building models in Keras, I considered it an extra, similar to a dropout layer or an "Activation" layer. But BatchNormalization has trainable parameters, so... I am confused.
In deep learning literature, an "X-layer network" refers to the learnable layers that constitute the representational capacity of the network.
Activation layers, normalization layers (such as LRN, BatchNorm, etc.) and downsampling layers (such as max pooling) are not counted.
Layers such as convolutional, recurrent and fully connected layers, which are responsible for the representational capacity of the network, are counted.
It really depends on how precisely you define what a "layer" is. This may vary between authors.
For your ResNet example it is pretty clear: in Section 3.4 (Implementation) you'll find a description of the network, where it says:
We adopt batch normalization (BN) right after each convolution and
before activation, [...].
So convolution and batch normalization are considered a single layer. Figure 3 in the paper shows a picture of ResNet34 where the batch normalization layers are not even explicitly shown and the layers sum up to 34.
So in conclusion, the ResNet paper does not count batch normalization as an extra layer.
Furthermore, Keras makes it really easy to check these things for many pretrained models, e.g.:
import tensorflow as tf
resnet = tf.keras.applications.ResNet50()
resnet.summary()  # lists every layer, including the BatchNormalization layers
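If you want to see which layer types show up on top of the counted ones, here is a small sketch that builds on the snippet above (tallying over resnet.layers is just one way to do it):

import collections
import tensorflow as tf

resnet = tf.keras.applications.ResNet50()

# Tally the layers by type: the BatchNormalization layers appear in addition
# to the convolution and dense layers, they are not part of the "50".
counts = collections.Counter(type(layer).__name__ for layer in resnet.layers)
for layer_type, n in counts.most_common():
    print(layer_type, n)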
Related
I understand that implementing a fully connected layer as a convolutional layer reduces the number of parameters, but does it increase computational speed? If yes, then why do people still use fully connected layers?
Convolutional layers are used for low-level reasoning like feature extraction. At this level, a fully connected layer would be wasteful of resources, because it has so many more parameters to learn. If you have an image of size 32x32x3, each neuron in a fully connected layer would need 32*32*3 = 3072 weights. That many parameters are not required for low-level reasoning. Features tend to have spatial locality in images, and local connectivity is sufficient for feature extraction. If you used a convolutional layer with 12 filters of size 3x3, you would only need 12*3*3*3 = 324 weights, shared across all spatial positions.
Fully connected layers are used for high-level reasoning. These are the layers in the network which determine the final output of the convolutional network. As the reasoning becomes more complex, local connectivity is no longer sufficient, which is why fully connected layers are used in later stages of the network.
Please read this for a more detailed and visual explanation
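To make the parameter arithmetic concrete, here is a minimal Keras sketch (the 32x32x3 input and the 12 filters are just the numbers from the example above):

import tensorflow as tf

inputs = tf.keras.Input(shape=(32, 32, 3))

# Convolutional path: 12 filters of size 3x3 over 3 input channels
# -> 12*3*3*3 = 324 weights (+ 12 biases), shared across all positions.
conv = tf.keras.layers.Conv2D(12, (3, 3))(inputs)

# Fully connected path: a single neuron on the flattened image
# -> 32*32*3 = 3072 weights (+ 1 bias).
flat = tf.keras.layers.Flatten()(inputs)
dense = tf.keras.layers.Dense(1)(flat)

tf.keras.Model(inputs, [conv, dense]).summary()  # compare the "Param #" column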
I am working on the Caffe SqueezeNet prototxt: link.
I am just wondering where the FC layer is? (I only see the types data, conv, relu, pooling, concat, SoftmaxWithLoss and accuracy.)
The reason is that FC layers have a ton of parameters, accounting for the majority of the network's parameters in some architectures. The authors of SqueezeNet removed the FCs, replacing them with a convolutional layer and global average pooling.
The conv layer has a number of filters equal to the number of classes, processing the output of the previous layer into (roughly) one map per class. The pooling averages the response of each of these maps. They end up with a flattened vector whose dimension equals the number of classes, which is then fed to the SoftMax layer.
With these modifications (not forgetting the Fire modules they proposed) they were able to significantly reduce the memory footprint.
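As a rough illustration of that classifier head, here is a sketch in Keras rather than the Caffe prototxt from the question; the 13x13x512 feature map size and the 1000 classes are assumptions matching the ImageNet setup:

import tensorflow as tf

NUM_CLASSES = 1000  # assumed ImageNet-sized label set

# 'features' stands in for the output of the last Fire module.
features = tf.keras.Input(shape=(13, 13, 512))
x = tf.keras.layers.Conv2D(NUM_CLASSES, (1, 1))(features)   # one map per class
x = tf.keras.layers.GlobalAveragePooling2D()(x)              # average each class map
outputs = tf.keras.layers.Softmax()(x)                       # class probabilities

head = tf.keras.Model(features, outputs)
head.summary()  # no fully connected layer anywhere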
I strongly recommend that you read the SqueezeNet paper.
SqueezeNet doesn't have fully connected layers, it uses global average pooling instead.
In a neural network, there are three main parts: the input layer, the hidden layer(s) and the output layer. Is there any correlation between the units of a hidden layer? For example, are the 1st and 2nd neurons of a hidden layer independent of each other, or is there a relation between them? Is there any source that explains this issue?
The answer depends on many factors. From a probabilistic perspective they are independent given the inputs and before training. If the input is not fixed then they are heavily correlated (as two "almost linear" functions of the same input signal). Finally, after training they will be strongly correlated, and the exact correlations will depend on initialisation and the training itself.
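If you want to check this empirically, here is a minimal sketch (the network size, tanh activation and random inputs are arbitrary assumptions):

import numpy as np
import tensorflow as tf

# Tiny untrained network; we tap the activations of its hidden layer.
inputs = tf.keras.Input(shape=(4,))
hidden_act = tf.keras.layers.Dense(8, activation="tanh")(inputs)  # hidden layer of a small MLP
hidden = tf.keras.Model(inputs, hidden_act)

x = np.random.randn(1000, 4).astype("float32")   # inputs that are not fixed
h = hidden.predict(x, verbose=0)                 # activations of the 8 hidden units
corr = np.corrcoef(h[:, 0], h[:, 1])[0, 1]       # Pearson correlation of units 0 and 1
print(f"correlation between hidden units 0 and 1: {corr:.3f}")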
Related to: How to Create CaffeDB training data for siamese networks out of image directory
If I have N labels, how can I enforce that the feature vector of size N right before the contrastive loss layer represents some kind of probability for each class? Or does that come automatically with the siamese net design?
If you only use contrastive loss in a Siamese network, there is no way of forcing the net to classify into the correct label - because the net is only trained using "same/not same" information and does not know the semantics of the different classes.
What you can do is train with multiple loss layers.
You should aim at training a feature representation that is rich enough for your domain, so that looking at the trained feature vector of some input (in some high dimension) you should be able to easily classify that input into the correct class. Moreover, given the feature representations of two inputs, one should be able to easily say whether they are "same" or "not same".
Therefore, I recommend that you train your deep network with two loss layers whose "bottom" is the output of one of the "InnerProduct" layers. One loss is the contrastive loss. The other loss should have another "InnerProduct" layer with num_output: N and a "SoftmaxWithLoss" layer.
A similar concept was used in this work:
Sun, Chen, Wang and Tang, "Deep Learning Face Representation by Joint Identification-Verification", NIPS 2014.
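As a rough sketch of that two-headed setup, written in Keras rather than the Caffe prototxt the question uses (the input shape, embedding size and margin are assumptions):

import tensorflow as tf

NUM_CLASSES = 10       # assumed number of labels N
EMBEDDING_DIM = 64     # assumed size of the shared feature vector

def make_base():
    # Shared network producing the feature representation (the "InnerProduct" output).
    inp = tf.keras.Input(shape=(28, 28, 1))
    x = tf.keras.layers.Conv2D(32, 3, activation="relu")(inp)
    x = tf.keras.layers.Flatten()(x)
    return tf.keras.Model(inp, tf.keras.layers.Dense(EMBEDDING_DIM)(x))

base = make_base()
a = tf.keras.Input(shape=(28, 28, 1))
b = tf.keras.Input(shape=(28, 28, 1))
fa, fb = base(a), base(b)

# Head 1: Euclidean distance between the two embeddings, trained with a contrastive loss.
dist = tf.keras.layers.Lambda(
    lambda t: tf.sqrt(tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True) + 1e-9),
    name="distance")([fa, fb])

# Head 2: extra classification layer on one branch, trained with a softmax loss.
probs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax", name="class_probs")(fa)

model = tf.keras.Model([a, b], [dist, probs])

def contrastive_loss(y_true, d, margin=1.0):
    # y_true is 1 for "same" pairs and 0 for "not same" pairs
    y_true = tf.cast(y_true, d.dtype)
    return tf.reduce_mean(y_true * tf.square(d) +
                          (1.0 - y_true) * tf.square(tf.maximum(margin - d, 0.0)))

model.compile(optimizer="adam",
              loss={"distance": contrastive_loss,
                    "class_probs": "sparse_categorical_crossentropy"})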
I've seen some tutorial examples, like the UFLDL convolutional net, where they use features obtained by unsupervised learning, or some others where kernels are engineered by hand (using Sobel and Gabor detectors, different sharpness/blur settings, etc.). Strangely, I can't find a general guideline on how one should choose a good kernel for something more than a toy network. For example, considering a deep network with many convolutional-pooling layers, are the same kernels used at each layer, or does each layer have its own kernel subset? If so, where do these deeper layers' filters come from - should I learn them using some unsupervised learning algorithm on data passed through the first convolution-and-pooling layer pair?
I understand that this question doesn't have a singular answer, so I'd be happy with just the general approach (some review article would be fantastic).
The current state of the art suggests learning all the convolutional layers from the data using backpropagation (ref).
Also, this paper recommends small kernels (3x3) and pooling (2x2). You should train different filters for each layer.
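For example, a minimal Keras sketch of that recipe (the input size and filter counts are arbitrary); every Conv2D layer learns its own filters via backpropagation:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                       # assumed input size
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),   # first layer's filters
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),   # separate filters, learned jointly
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")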
Kernels in deep networks are mostly trained all at the same time in a supervised way (known inputs and outputs of the network) using backpropagation (which computes the gradients) and some version of stochastic gradient descent (the optimization algorithm).
Kernels in different layers are usually independent. They can have different sizes and their numbers can differ as well. How to design a network is an open question and it depends on your data and the problem itself.
If you want to work with your own dataset, you should start with an existing pre-trained network [Caffe Model Zoo] and fine-tune it on your dataset. This way, the architecture of the network would be fixed, as you would have to respect the architecture of the original network. The networks you can download are trained on very large problems, which makes them able to generalize well to other classification/regression problems. If your dataset is at least partly similar to the original dataset, the fine-tuned network should work very well.
A good place to get more information is the Caffe @ CVPR2015 tutorial.
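As a rough illustration of that fine-tuning workflow, here is a sketch in Keras rather than Caffe (the number of classes and the frozen base are assumptions):

import tensorflow as tf

NUM_CLASSES = 10  # assumed number of classes in your own dataset

# Reuse the pretrained convolutional kernels and only train a new classifier head.
base = tf.keras.applications.ResNet50(include_top=False, pooling="avg", weights="imagenet")
base.trainable = False                       # freeze the pretrained filters

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(train_images, train_labels, epochs=5)   # your own data goes here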