caffe SqueezeNet: where is the fully connected FC layer in prototxt

I am working on the caffe SqueezeNet prototxt (link).
I am just wondering where the FC layer is. (I only see the types data, conv, relu, pooling, concat, SoftmaxWithLoss and accuracy.)

The reason is that FC layers have a ton of parameters, accounting for the majority of the network's parameters in some architectures. The authors of SqueezeNet removed the FCs, replacing them with a convolutional layer and global average pooling.
The conv layer has a number of filters equal to the number of classes, processing the output of the previous layer into (roughly) one map per class. The pooling then averages the response of each of these maps. The result is a flat vector whose dimension equals the number of classes, which is then fed to the softmax layer.
With these modifications (not forgetting the Fire modules they proposed) they were able to significantly reduce the memory footprint.
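For illustration, here is a minimal tf.keras sketch of such a head; the 13x13x512 feature-map shape and the 1000 classes are assumptions based on the ImageNet version of SqueezeNet, not taken from the prototxt:
import tensorflow as tf

num_classes = 1000
features = tf.keras.Input(shape=(13, 13, 512))         # e.g. output of the last Fire module
x = tf.keras.layers.Conv2D(num_classes, 1)(features)   # 1x1 conv: one activation map per class
x = tf.keras.layers.GlobalAveragePooling2D()(x)        # average each map -> vector of length num_classes
probs = tf.keras.layers.Softmax()(x)
head = tf.keras.Model(features, probs)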
I strongly recommend that you read the SqueezeNet paper.

SqueezeNet doesn't have fully connected layers; it uses global average pooling instead.

Related

Does BatchNormalization count as a layer in a network?

Is BatchNormalizationLayer considered a layer in a neural network?
For example, if we say ResNet50 has 50 layers, does that mean that some of those layers may be batch normalization layers?
When building models in Keras I considered it an extra, similar to a Dropout layer or an explicit Activation layer. But BatchNormalization has trainable parameters, so... I am confused.
In the deep learning literature, an "X-layer network" simply refers to the learnable layers that constitute the representational capacity of the network.
Activation layers, normalization layers (such as LRN, BatchNorm, etc.) and downsampling layers (such as max pooling) are not counted.
Layers such as convolutional, recurrent and fully connected layers, which are responsible for the representational capacity of the network, are counted.
It really depends on how precisely you define what a "layer" is. This may vary between authors.
For your ResNet example it is pretty clear: in Section 3.4 (Implementation) you'll find a description of the network, where it says:
We adopt batch normalization (BN) right after each convolution and
before activation, [...].
So convolution and batch normalization are considered a single layer. Figure 3 in the paper shows a picture of ResNet34 where the batch normalization layers are not even explicitly shown and the layers sum up to 34.
So, in conclusion, the ResNet paper does not count batch normalization as an extra layer.
Further, Keras makes it really easy to check such things for many pretrained models, e.g.:
import tensorflow as tf

resnet = tf.keras.applications.ResNet50()
resnet.summary()
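Building on that, a quick sketch (assuming the standard tf.keras layer classes) that separates the conv/dense layers usually counted from the BatchNormalization layers:
import tensorflow as tf

resnet = tf.keras.applications.ResNet50()
# BatchNorm layers are present and carry trainable parameters ...
bn_layers = [l for l in resnet.layers
             if isinstance(l, tf.keras.layers.BatchNormalization)]
# ... but the "50" in the name refers to conv/dense layers only (the Keras model
# also contains the shortcut-projection convolutions, so this count is a bit above 50).
counted = [l for l in resnet.layers
           if isinstance(l, (tf.keras.layers.Conv2D, tf.keras.layers.Dense))]
print(len(bn_layers), len(counted))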

Implementing FC layers as Conv layers

I understand that implementing a fully connected layer as a convolutional layer reduces parameters, but does it increase computational speed? If yes, then why do people still use fully connected layers?
Convolutional layers are used for low-level reasoning like feature extraction. At this level, a fully connected layer would be wasteful of resources, because far more parameters have to be learned. For an image of size 32x32x3, each neuron in a fully connected first layer would need 32*32*3 = 3072 weights. That many parameters are not required for low-level reasoning: features tend to have spatial locality in images, so local connectivity is sufficient for feature extraction. With a convolutional layer of 12 filters of size 3x3, you only need 12*3*3 = 108 weights per input channel (324 in total for the 3 channels), shared across all spatial positions.
Fully connected layers are used for high-level reasoning. These are the layers in the network which determine the final output of the convolutional network. As the reasoning becomes more complex, local connectivity is no longer sufficient, which is why fully connected layers are used in later stages of the network.
Please read this for a more detailed and visual explanation.
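A small tf.keras sketch of the weight counts discussed above (the 10 output units of the dense layer are an arbitrary choice for illustration):
import tensorflow as tf

# Fully connected first layer: every one of the 32*32*3 = 3072 inputs
# is connected to each of the (arbitrarily chosen) 10 units.
fc = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, use_bias=False),        # 3072 * 10 weights
])

# Convolutional first layer: 12 filters of size 3x3 over 3 channels,
# shared across all spatial positions.
conv = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(12, 3, use_bias=False),    # 12 * 3 * 3 * 3 weights
])

print(fc.count_params(), conv.count_params())         # 30720 vs 324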

units of neural network layer are independent?

In a neural network there are three main parts: the input layer, the hidden layer(s) and the output layer. Is there any correlation between the units of a hidden layer? For example, are the 1st and 2nd neurons of a hidden layer independent of each other, or is there a relation between them? Is there any source that explains this issue?
The answer depends on many factors. From a probabilistic perspective, they are independent given the inputs and before training. If the input is not fixed, they are heavily correlated (as two "almost linear" functions of the same input signal). Finally, after training they will be strongly correlated, and the exact correlations will depend on initialisation and on the training itself.
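A toy numpy sketch of the middle case (random initialisation, varying input; all sizes and the tanh non-linearity are arbitrary choices): two hidden units are separate functions of the same input, so their activations are correlated once the input varies.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(100, 50))   # input -> hidden weights, random init
x = rng.normal(size=(10000, 100))           # a batch of random inputs
h = np.tanh(x @ W)                          # hidden-layer activations

# Correlation between the activations of the 1st and 2nd hidden unit
# across the batch; generally non-zero because both depend on the same input.
print(np.corrcoef(h[:, 0], h[:, 1])[0, 1])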

Activation function after pooling layer or convolutional layer?

The theory from these links shows that the order in a convolutional network is: Convolutional Layer - Non-linear Activation - Pooling Layer.
Neural networks and deep learning (equation (125))
Deep learning book (page 304, 1st paragraph)
Lenet (the equation)
The source in this headline
But, in the last implementation from those sites, it said that the order is: Convolutional Layer - Pooling Layer - Non-linear Activation
network3.py
The sourcecode, LeNetConvPoolLayer class
I've also tried to explore the conv2d operation's syntax, but there is no activation function; it's only convolution with a flipped kernel. Can someone explain why this happens?
Well, max-pooling and monotonically increasing non-linearities commute. This means that MaxPool(Relu(x)) = Relu(MaxPool(x)) for any input, so the result is the same in that case. It is therefore technically better to first subsample through max-pooling and then apply the non-linearity (if it is costly, such as a sigmoid). In practice it is often done the other way round, and it doesn't seem to change performance much.
As for conv2D, it does not flip the kernel. It implements exactly the definition of convolution. This is a linear operation, so you have to add the non-linearity yourself in the next step, e.g. theano.tensor.nnet.relu.
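For example, in Theano that separate step looks roughly like this (filter shapes are arbitrary, chosen only for illustration):
import numpy as np
import theano
import theano.tensor as T

x = T.tensor4('x')                                                   # (batch, channels, rows, cols)
w = theano.shared(np.random.randn(12, 3, 3, 3).astype('float32'))    # 12 filters, 3 channels, 3x3
conv_out = T.nnet.conv2d(x, w)       # linear convolution only
activ_out = T.nnet.relu(conv_out)    # non-linearity added explicitly as the next step
f = theano.function([x], activ_out)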
In many papers people use conv -> pooling -> non-linearity. That does not mean you can't use another order and get reasonable results. In the case of a max-pooling layer and ReLU, the order does not matter (both calculate the same thing):
You can prove that this is the case by remembering that ReLU is an element-wise operation and a non-decreasing function, so MaxPool(ReLU(x)) = ReLU(MaxPool(x)).
The same thing happens for almost every activation function (most of them are non-decreasing), but it does not work for a general pooling layer such as average-pooling.
Although both orders produce the same result, Activation(MaxPool(x)) does it significantly faster because it performs fewer operations: for a pooling layer of size k, it uses k^2 times fewer calls to the activation function.
Sadly this optimization is negligible for a CNN, because the majority of the time is spent in the convolutional layers.
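A quick numerical check of the claim (a toy numpy example; the 2x2 non-overlapping pooling is an arbitrary choice):
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))

def relu(a):
    return np.maximum(a, 0)

def maxpool2x2(a):
    # non-overlapping 2x2 max pooling on an 8x8 map
    return a.reshape(4, 2, 4, 2).max(axis=(1, 3))

def avgpool2x2(a):
    return a.reshape(4, 2, 4, 2).mean(axis=(1, 3))

print(np.allclose(relu(maxpool2x2(x)), maxpool2x2(relu(x))))   # True: ReLU and max-pool commute
print(np.allclose(relu(avgpool2x2(x)), avgpool2x2(relu(x))))   # False here: average-pooling does not commute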
Max pooling is a sample-based discretization process. The objective is to down-sample an input representation (an image, a hidden-layer output matrix, etc.), reducing its dimensionality and allowing assumptions to be made about the features contained in the binned sub-regions.

Theanets: Removing individual connections

How do you remove connections in Theanets? I'd like to create custom connectivity between an input layer, a single hidden layer, and an output layer. But the only defaults are feedforward all-to-all architectures or recurrent architectures. I'd like to remove specific connections from the all-to-all connectivity and then train the network.
Thanks in advance.
(Developer of theanets here.)
This is currently not directly possible with theanets. For computational efficiency the underlying computations in feedforward networks are implemented as simple matrix operations, which are fast and can be executed on a GPU for sometimes dramatic speedups.
You can, however, initialize the weights in a layer so that some (or many) of the weights are zero. To do this, just pass a dictionary in the layers list, and include a sparsity key:
import theanets

net = theanets.Autoencoder(
    layers=(784, dict(size=1000, sparsity=0.9), 784))
This initializes the weights for the layer so that the given fraction of the weights is zero. The weights are, however, still eligible for change during the training process, so this is only an initialization trick.
Alternatively, you can implement a custom Layer subclass that does whatever you like, as long as you stay within the Theano boundaries. You could, for instance, implement a type of feedforward layer that uses a mask to ensure that some weights remain zero during the feedforward computation, as in the sketch below.
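As a rough illustration of that masking idea (plain Theano rather than the actual theanets Layer API; sizes and the sigmoid non-linearity are arbitrary):
import numpy as np
import theano
import theano.tensor as T

n_in, n_out = 784, 1000
rng = np.random.default_rng(0)

# Trainable weights plus a fixed 0/1 mask; masked connections stay zero in the
# forward pass no matter how W is updated during training.
W = theano.shared(rng.normal(scale=0.01, size=(n_in, n_out)).astype('float32'), name='W')
b = theano.shared(np.zeros(n_out, dtype='float32'), name='b')
mask = theano.shared((rng.random((n_in, n_out)) > 0.9).astype('float32'), name='mask')

x = T.matrix('x')
h = T.nnet.sigmoid(T.dot(x, W * mask) + b)
feedforward = theano.function([x], h)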
For more details you might want to ask on the theanets mailing list.