I am trying to implement Batch Normalization (http://arxiv.org/pdf/1502.03167.pdf) in my convolutional neural network, but I am really confused about which axis I should calculate the mean and variance over.
If an input to the conv layer is of shape 3 * 224 * 224 * 32,
where:
3: number of input channels
224 * 224: shape of a single channel
32: minibatch size
What should be the axis in the following formula
Mean = numpy.mean(input_layer, axis= ? )
And, if an input to the fully connected layer is of shape 100 * 32
where:
100: number of inputs
32: minibatch size
Again, what should be the axis in the following formula
Mean = numpy.mean(input_layer, axis= ? )
# 1. For the convolutional input of shape 3 * 224 * 224 * 32: axis = (1, 2, 3)
numpy.mean(input_layer, axis=(1, 2, 3))
# 2. For the fully connected input of shape 100 * 32: axis = 1
numpy.mean(input_layer, axis=1)
Keras uses feature-wise normalization for convolutional layers with shared weights, and sample-wise normalization for fully connected layers.
Code of the BN layer of the Keras library for reference: https://github.com/fchollet/keras/blob/0daec53acbf4c3df6c054b36ece5c1ae2db55d86/keras/layers/normalization.py
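A minimal numpy sketch of both cases (my own illustration, not the Keras code linked above; the learnable scale and shift from the paper are omitted), assuming the shapes given in the question:

import numpy as np

eps = 1e-5

# Convolutional input: (channels, height, width, batch) = (3, 224, 224, 32)
conv_input = np.random.randn(3, 224, 224, 32)
mean = np.mean(conv_input, axis=(1, 2, 3), keepdims=True)  # one mean per channel
var = np.var(conv_input, axis=(1, 2, 3), keepdims=True)    # one variance per channel
conv_normalized = (conv_input - mean) / np.sqrt(var + eps)

# Fully connected input: (features, batch) = (100, 32)
fc_input = np.random.randn(100, 32)
mean = np.mean(fc_input, axis=1, keepdims=True)            # one mean per feature
var = np.var(fc_input, axis=1, keepdims=True)              # one variance per feature
fc_normalized = (fc_input - mean) / np.sqrt(var + eps)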
I'm reading the Inception paper by Szegedy et al. (https://arxiv.org/abs/1512.00567) and I'm having trouble understanding how they reduce the amount of computation by replacing a single 5x5 filter with two layers of 3x3 filters (section 3.1).
In particular, this passage:
If we would naively slide a network without reusing the computation between neighboring grid tiles, we would increase the computational cost. Sliding this network can be represented by two 3x3 convolutional layers which reuses the activations between adjacent tiles.
I don't understand how we can reuse those activations.
I have been working through this confusion as well, seemingly every time I need to revisit the Inception papers.
The proper setting for the comparison is a toy example input image that is 5x5 in shape. To produce a 5x5 output image with a 5x5 convolution, you need to pad the original image with 2 extra pixels of padding on the top, bottom, and sides, and then proceed with your usual 5x5 convolution. There are 25 weight parameters for the convolution filter, and every pixel of the output requires a weighted sum of 25 values of the input.
Now, instead of the 5x5 filter, we will do two stages. First we will pad the original image with 1 extra pixel along the top, bottom and sides, to make it conformable for a standard 3x3 convolution at every point.
This produces an intermediate image, the same shape as the input because of the padding, but where each pixel is the result of a 3x3 convolution (so each pixel required a weighted sum of 9 values).
Now we will repeat this again for a final stage 3x3 convolution, starting out from our intermediate image from the first 3x3 convolution. Again, we pad by 1 pixel on the top, bottom and sides, and each item of the output was achieved by a weighted sum of 9 items of the input.
The diagram you provided in your question demonstrates how this allows aggregating the same span of spatial information of a 5x5 convolution, but just calculated by a different set of two weighted sums for the two 3x3 convolutions. To be clear, the calculation is not the same, because the two 9-d filters of coefficients don't have to be learned to be the same as the 25-d filter of coefficients. They are likely different weights, but they can sequentially span the same distance of the original image as the 5x5 convolution could.
In the end, we can see that each unit of the output in the 5x5 case required 25 multiply-add ops. Each unit of the final output in the sequential 3x3 case required first 9 multiply-adds to make the units of the first 3x3 convolution and then 9 multiply-adds to make the units of the final output.
The specific comment about "activation sharing" refers to the fact that you only calculate the values of the intermediate 3x3 convolution one time. You spend 9 ops for each of those units, but once they are created, you only spend 9 more ops to get to the final output cells. You don't repeat the effort of creating the first 3x3 convolution over and over for each unit of the final output.
This is why it's not counted as requiring 81 operations per each output unit. When you bump to the next (i, j) position of the final 3x3 convolution output, you are re-using a bunch of pixels of the intermediate 3x3 convolution, so you're only doing 9 more ops to get to the final output each time.
The number of ops for a 5x5 padded convolution of a 5x5 input is 25 * 25.
The number of ops for the first 3x3 padded convolution is 25 * 9, and from there you add the cost of another padded 3x3 convolution, so overall it becomes 25 * 9 + 25 * 9 = 25 * 18.
This is how they get to the ratio of (25 * 25) / (25 * 18) = 25/18.
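As a rough sanity check of these counts, here is a tiny sketch of my own (not from the paper) that tallies the multiply-adds for the padded 5x5 toy example:

# Multiply-adds for a padded 5x5 input (25 output pixels in both cases)
single_5x5 = 25 * 25             # 25 ops per output pixel
stacked_3x3 = 25 * 9 + 25 * 9    # 9 ops per intermediate pixel, then 9 per output pixel
print(single_5x5, stacked_3x3)   # 625 450
print(single_5x5 / stacked_3x3)  # 1.388..., i.e. 25/18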
As it happens, this is also the same reduction ratio for the total number of parameters (25 vs. 9 + 9 = 18).
I think the key is that the original diagram (from the paper and from your question) does a really bad job of indicating that you would first pay the standard 3x3 convolution cost to create the intermediate set of pixels across the whole original 5x5 input, including with padding. And then you would run the second 3x3 convolution across that intermediate result (this is what they mean by re-using activations).
The picture makes it seem like individually, for every final output pixel, you would slide the original 3x3 convolution around all the right spots for the entire 3x3 intermediate layer, compute the 9-ops weighted sum each time (81 ops overall), and then compute the final 9-ops weighted sum to get a single pixel of output. Then go back to the original, bump the convolutions over by 1 spot, and repeat. But this is not correct, and would not "re-use" the intermediate convolution layer, rather it would separately re-compute it for each unit of the final output layer.
Overall though, I agree this is super non-trivial and hard to think about. The paper really glosses over it and assumes way too much context is already in the mind of the reader.
So first of all, the author states this:
This way, we end up with a net (9+9)/25 × reduction of computation, resulting in a relative gain of 28% by this factorization.
And he is right: for one 5x5 filter, you have to use 25 (5*5) individual weights. For two 3x3 filters, you have to use 9 + 9 (3*3 + 3*3) individual weights. So using two 3x3 filters requires fewer parameters. However, you are right, this doesn't automatically mean that it requires less computation: at first glance, using two 3x3 filters requires many more operations.
Let's compare the number of operations for the two options for a given n*n input. The walkthrough:
1. Calculate the output dimension of the 5x5 filter on the given input ((n - filter size + 1)^2), and its corresponding number of operations
2. Calculate the output dimension of the first 3x3 filter (same formula as above), and its corresponding number of operations
3. Calculate the output dimension of the second 3x3 filter, and its corresponding number of operations
Let's start with a 5x5 input:
1. (5 - 5 + 1)^2 = 1x1. So 1*1*25 operations = 25 operations
2. (5 - 3 + 1)^2 = 3x3. So 3*3*9 operations = 81 operations
3. (3 - 3 + 1)^2 = 1x1. So 1*1*9 operations = 9 operations
So 25 vs 90 operations. Using a single 5x5 filter is best for a 5x5 input.
Next, the 6x6 input:
1. (6 - 5 + 1)^2 = 2x2. So 2*2*25 operations = 100 operations
2. (6 - 3 + 1)^2 = 4x4. So 4*4*9 operations = 144 operations
3. (4 - 3 + 1)^2 = 2x2. So 2*2*9 operations = 36 operations
So 100 vs 180 operations. Using a single 5x5 filter is best for a 6x6 input.
Let's jump one ahead, 8x8 input:
1. (8 - 5 + 1)^2 = 4x4. So 4*4*25 operations = 400 operations
2. (8 - 3 + 1)^2 = 6x6. So 6*6*9 operations = 324 operations
3. (6 - 3 + 1)^2 = 4x4. So 4*4*9 operations = 144 operations
So 400 vs 468 operations. Using a single 5x5 filter is best for an 8x8 input.
Notice the pattern? Given an input size of n*n, the operations for a 5x5 filter have the following formula:
(n - 4)*(n - 4) * 25
And for 3x3 filter:
(n - 2)*(n - 2) * 9 + (n - 4) * (n - 4) * 9
So let's plot these (a small sketch that reproduces the plot is given after the conclusion below):
They intersect! As you can read from the graph, the two curves meet at n = 10, and from n = 11 onwards two 3x3 filters need fewer operations than a single 5x5 filter.
Conclusion: it seems like it is effective to use two 3x3 filters after n = 10. Additionally, regardless of n, fewer parameters need to be tuned for two 3x3 filters in comparison with a single 5x5 filter.
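A small sketch of mine (not from the original answer) that reproduces the comparison and the crossover point:

import numpy as np
import matplotlib.pyplot as plt

n = np.arange(5, 30)
ops_5x5 = (n - 4) ** 2 * 25                    # single 5x5 filter
ops_3x3 = (n - 2) ** 2 * 9 + (n - 4) ** 2 * 9  # two stacked 3x3 filters

plt.plot(n, ops_5x5, label="one 5x5 filter")
plt.plot(n, ops_3x3, label="two 3x3 filters")
plt.xlabel("input size n")
plt.ylabel("multiply-add operations")
plt.legend()
plt.show()

# The curves meet at n = 10: 25*36 = 900 and 9*64 + 9*36 = 900.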
The article is kind of weird though; it makes using two 3x3 filters instead of one 5x5 filter feel 'obvious' for some reason:
This setup clearly reduces the parameter count by sharing the weights between adjacent tiles.
it seems natural to exploit translation invariance again and replace the fully connected component by a two layer convolutional architecture
If we would naively slide
In the diagram (architecture) below, how was the (fully-connected) dense layer of 4096 units derived from last max-pool layer (on the right) of dimensions 256x13x13? Instead of 4096, shouldn't it be 256*13*13=43264 ?
If I'm correct, you're asking why the 4096x1x1 layer is much smaller.
That's because it's a fully connected layer. Every neuron from the last max-pooling layer (= 256*13*13 = 43264 neurons) is connected to every neuron of the fully-connected layer.
This is an example of an ALL to ALL connected neural network:
As you can see, layer2 is bigger than layer3. That doesn't mean they can't connect.
There is no conversion of the last max-pooling layer; all the neurons in the max-pooling layer are simply connected to all 4096 neurons in the next layer.
The 'dense' operation just means using the weights of all these connections (= 4096 * 43264 connections) and adding the biases of the neurons to calculate the next output.
It's connected the same way as in an MLP.
But why 4096? There is no reasoning. It's just a choice. It could have been 8000, it could have been 20, it just depends on what works best for the network.
You are right in that the last convolutional layer has 256 x 13 x 13 = 43264 neurons. However, there is a max-pooling layer with pool_size = 3 and stride = 2. This will produce an output of size 256 x 6 x 6. You connect this to a fully-connected layer. In order to do that, you first have to flatten the output, which will take the shape 256 x 6 x 6 = 9216 x 1. To map 9216 neurons to 4096 neurons, we introduce a 9216 x 4096 weight matrix as the weight of the dense/fully-connected layer. Therefore, w^T * x = [9216 x 4096]^T * [9216 x 1] = [4096 x 1]. In short, each of the 9216 neurons will be connected to all 4096 neurons. That is why the layer is called a dense or a fully-connected layer.
As others have said above, there is no hard rule about why this should be 4096. The dense layer just has to have enough neurons to capture the variability of the entire dataset. The dataset under consideration, ImageNet-1K, is quite difficult and has 1000 categories. So 4096 neurons do not seem like too many to start with.
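A minimal numpy sketch of this mapping (my own illustration; the shapes follow AlexNet, the weight values are random):

import numpy as np

pool_out = np.random.randn(256, 6, 6)  # output of the last max-pooling layer
x = pool_out.reshape(-1)               # flatten: 256*6*6 = 9216 values

W = np.random.randn(9216, 4096)        # weight matrix of the dense layer
b = np.random.randn(4096)              # one bias per output neuron

out = W.T @ x + b                      # w^T * x + b
print(out.shape)                       # (4096,)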
No, 4096 is the dimensionality of the output of that layer, while the dimensionality of the input is 13x13x256. Both don't have to be equal as you see in the diagram.
I will show it with an image; look at the diagram of AlexNet below.
The 256 * 13 * 13 layer goes through a max-pooling operator, giving 256 * 6 * 6 = 9216 values. These are then flattened and connected to the 4096-unit fully connected layer, so that weight matrix has 9216 * 4096 parameters. You can see all the parameters computed in the spreadsheet below.
cited:
https://www.learnopencv.com/understanding-alexnet/
https://medium.com/@smallfishbigsea/a-walk-through-of-alexnet-6cbd137a5637
The output size of pooling layer is
output = (input size - window size) / (stride + 1)
In the above case the input size is 13. Most implementations of pooling add an extra layer of padding in order to keep the boundary pixels in the calculation, so the input size becomes 14.
The most common window size and stride are W = 2 and S = 2, so put them into the formula:
output = (14 - 2) / (2 + 1)
output = 12 / 3
output = 4
now there will be 256 feature maps produced of size 4x4, flatten that out and you get
flatten = 4 x 4 x 256
flatten = 4096
Hope this answers your question.
I believe you want to know how the transition from a convolutional layer to a fully-connected (dense) layer comes about. You have to realize that another way of viewing a convolutional layer is as a dense layer, but with sparse connections. This is explained in Goodfellow's book, Deep Learning, chapter 9.
Something similar applies to the output of a pooling operation: you just end up with something that resembles the output of a convolutional layer, but summarized. All the weights of all the convolutional kernels can then be connected to a fully-connected layer. This typically results in a first fully-connected layer that has many neurons, so you can use a second (or third) layer to do the actual classification/regression.
As to the choice of the number of neurons in a dense layer that comes after a convolutional layer, there is no mathematical rule behind it, like the one with convolutional layers. Since the layer is fully connected, you are able to choose any size, just like in your typical multi-layer perceptron.
I'm using Lasagne to create a CNN for the MNIST dataset. I'm following closely to this example: Convolutional Neural Networks and Feature Extraction with Python.
The CNN architecture I have at the moment, which doesn't include any dropout layers, is:
import lasagne
from lasagne import layers
from lasagne.updates import momentum
from nolearn.lasagne import NeuralNet

net1 = NeuralNet(
layers=[('input', layers.InputLayer), # Input Layer
('conv2d1', layers.Conv2DLayer), # Convolutional Layer
('maxpool1', layers.MaxPool2DLayer), # 2D Max Pooling Layer
('conv2d2', layers.Conv2DLayer), # Convolutional Layer
('maxpool2', layers.MaxPool2DLayer), # 2D Max Pooling Layer
('dense', layers.DenseLayer), # Fully connected layer
('output', layers.DenseLayer), # Output Layer
],
# input layer
input_shape=(None, 1, 28, 28),
# layer conv2d1
conv2d1_num_filters=32,
conv2d1_filter_size=(5, 5),
conv2d1_nonlinearity=lasagne.nonlinearities.rectify,
# layer maxpool1
maxpool1_pool_size=(2, 2),
# layer conv2d2
conv2d2_num_filters=32,
conv2d2_filter_size=(3, 3),
conv2d2_nonlinearity=lasagne.nonlinearities.rectify,
# layer maxpool2
maxpool2_pool_size=(2, 2),
# Fully Connected Layer
dense_num_units=256,
dense_nonlinearity=lasagne.nonlinearities.rectify,
# output Layer
output_nonlinearity=lasagne.nonlinearities.softmax,
output_num_units=10,
# optimization method params
update=momentum,
update_learning_rate=0.01,
update_momentum=0.9,
max_epochs=10,
verbose=1,
)
This outputs the following Layer Information:
# name size
--- -------- --------
0 input 1x28x28
1 conv2d1 32x24x24
2 maxpool1 32x12x12
3 conv2d2 32x10x10
4 maxpool2 32x5x5
5 dense 256
6 output 10
and outputs the number of learnable parameters as 217,706
I'm wondering how this number is calculated. I've read a number of resources, including this StackOverflow question, but none clearly generalizes the calculation.
If possible, can the calculation of the learnable parameters per layer be generalised?
For example, convolutional layer: number of filters x filter width x filter height.
Let's first look at how the number of learnable parameters is calculated for each individual type of layer you have, and then calculate the number of parameters in your example.
Input layer: All the input layer does is read the input image, so there are no parameters you could learn here.
Convolutional layers: Consider a convolutional layer which takes l feature maps at the input, and has k feature maps as output. The filter size is n x m. For example, this will look like this:
Here, the input has l=32 feature maps, the output has k=64 feature maps, and the filter size is n=3 x m=3. It is important to understand that we don't simply have a 3x3 filter, but actually a 3x3x32 filter, as our input has 32 dimensions. And we learn 64 different 3x3x32 filters.
Thus, the total number of weights is n*m*k*l.
Then, there is also a bias term for each feature map, so we have a total number of parameters of (n*m*l+1)*k.
Pooling layers: A pooling layer does, for example, the following: "replace a 2x2 neighborhood by its maximum value". So there is no parameter you could learn in a pooling layer.
Fully-connected layers: In a fully-connected layer, all input units have a separate weight to each output unit. For n inputs and m outputs, the number of weights is n*m. Additionally, you have a bias for each output node, so you are at (n+1)*m parameters.
Output layer: The output layer is a normal fully-connected layer, so (n+1)*m parameters, where n is the number of inputs and m is the number of outputs.
The final difficulty is the first fully-connected layer: we do not know the dimensionality of the input to that layer, as it is a convolutional layer. To calculate it, we have to start with the size of the input image, and calculate the size of each convolutional layer. In your case, Lasagne already calculates this for you and reports the sizes - which makes it easy for us. If you have to calculate the size of each layer yourself, it's a bit more complicated:
In the simplest case (like your example), the size of the output of a convolutional layer is input_size - (filter_size - 1), in your case: 28 - 4 = 24. This is due to the nature of the convolution: we use e.g. a 5x5 neighborhood to calculate a point - but the two outermost rows and columns don't have a 5x5 neighborhood, so we can't calculate any output for those points. This is why our output is 2*2=4 rows/columns smaller than the input.
If one doesn't want the output to be smaller than the input, one can zero-pad the image (with the pad parameter of the convolutional layer in Lasagne). E.g. if you add 2 rows/cols of zeros around the image, the output size will be (28+4)-4=28. So in case of padding, the output size is input_size + 2*padding - (filter_size -1).
If you explicitly want to downsample your image during the convolution, you can define a stride, e.g. stride=2, which means that you move the filter in steps of 2 pixels. Then, the expression becomes ((input_size + 2*padding - filter_size)/stride) +1.
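These three cases can be folded into one small helper function (a sketch of my own, not part of Lasagne):

def conv_output_size(input_size, filter_size, padding=0, stride=1):
    # Spatial output size of a convolution (or pooling) along one dimension.
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(28, 5))             # 24  (conv2d1: no padding, stride 1)
print(conv_output_size(28, 5, padding=2))  # 28  (same filter with 2 pixels of zero-padding)
print(conv_output_size(24, 2, stride=2))   # 12  (maxpool1: 2x2 pooling with stride 2)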
In your case, the full calculations are:
# name size parameters
--- -------- ------------------------- ------------------------
0 input 1x28x28 0
1 conv2d1 (28-(5-1))=24 -> 32x24x24 (5*5*1+1)*32 = 832
2 maxpool1 32x12x12 0
3 conv2d2 (12-(3-1))=10 -> 32x10x10 (3*3*32+1)*32 = 9'248
4 maxpool2 32x5x5 0
5 dense 256 (32*5*5+1)*256 = 205'056
6 output 10 (256+1)*10 = 2'570
So in your network, you have a total of 832 + 9'248 + 205'056 + 2'570 = 217'706 learnable parameters, which is exactly what Lasagne reports.
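A short Python sketch (mine, not Lasagne code) that reproduces this total from the formulas above:

def conv_params(filter_h, filter_w, in_maps, out_maps):
    return (filter_h * filter_w * in_maps + 1) * out_maps  # weights plus one bias per output map

def dense_params(n_in, n_out):
    return (n_in + 1) * n_out                              # weights plus one bias per output unit

total = (conv_params(5, 5, 1, 32)         # conv2d1:  832
         + conv_params(3, 3, 32, 32)      # conv2d2:  9248
         + dense_params(32 * 5 * 5, 256)  # dense:    205056
         + dense_params(256, 10))         # output:   2570
print(total)                              # 217706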
Building on top of @hbaderts's excellent reply, I came up with a formula for an I-C-P-C-P-H-O network (since I was working on a similar problem); I am sharing it in the figure below in case it is helpful.
Also, (1) a convolution layer with 2x2 stride and (2) a convolution layer with 1x1 stride followed by (max/avg) pooling with 2x2 stride each contribute the same number of parameters with 'same' padding, as can be seen below:
The output size of a convolutional layer is calculated as ((n + 2p - k) / s) + 1,
where:
n is the input size, p is the padding, k is the kernel (filter) size, and s is the stride.
In the case above, n = 28, p = 0, k = 5, s = 1, which gives ((28 + 0 - 5) / 1) + 1 = 24.
I encountered this problem while researching image reconstruction for a two-slab geometry. Suppose I have an array of 15 by 15 sources with spacing 1 on one plane, and an array of 57 by 57 detectors with spacing 0.5 on the other, parallel plane. The total measurement data is hence a 4-d array of size 57 by 57 by 15 by 15. The first thing I do is perform a 4-d (or, more accurately, a double 2-d) Fourier transform of the data with respect to the detector lattice and the source lattice respectively, and I want to use the built-in function fft2 in MATLAB. My code is as follows:
Ns = 225; Nx = 57; Ny = 57;   % 15x15 sources, 57x57 detectors

% Stage 1: 2-D FFT over the detector lattice, for every source position
for s = 1:Ns
    data(:,:,s) = fftshift(fft2(data(:,:,s))); % fft2 assumes exp(-i*omega*t)
end

% Rearrange the 225 source positions into a 15x15 source lattice
data = reshape(data(:,:,1:Ns),[Nx,Ny,sqrt(Ns),sqrt(Ns)]);

% Stage 2: 2-D FFT over the source lattice, for every detector frequency component
for i = 1:Nx
    for j = 1:Ny
        data(i,j,:,:) = fftshift(fft2(squeeze(data(i,j,:,:))));
    end
end
val = data;
In the above code, data is the measurement and is originally of size 57 by 57 by 225, with Nx = Ny = 57 and Ns = 225. Can anyone point out whether there is something going wrong in my implementation of the double 2-d FFT above, or at least how I could verify it? My second question is about the frequency domain. According to the MATLAB documentation, the frequency lattice with respect to the detector plane should be (-28:28)*2*pi/57/0.5 (for both components), while the frequency lattice with respect to the source lattice should be (-7:7)*2*pi/15/1 (for both components as well). Is that true? Can one say that the element at position (29,29,8,8) of val represents exactly the zero frequency, both for the detector frequency and for the source frequency? Thanks in advance!
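One way to sanity-check the double 2-D FFT itself (independent of the frequency-lattice question) is to use the separability of the DFT: applying fft2 over the detector axes and then over the source axes must agree with a single 4-D FFT over all four axes. A small numpy sketch of that check, with random data standing in for the measurements:

import numpy as np

n_det, n_src = 57, 15  # 57x57 detector lattice, 15x15 source lattice
data = (np.random.randn(n_det, n_det, n_src, n_src)
        + 1j * np.random.randn(n_det, n_det, n_src, n_src))

# Double 2-D FFT: first over the detector lattice (axes 0, 1),
# then over the source lattice (axes 2, 3), with fftshift on each pair.
step1 = np.fft.fftshift(np.fft.fft2(data, axes=(0, 1)), axes=(0, 1))
val = np.fft.fftshift(np.fft.fft2(step1, axes=(2, 3)), axes=(2, 3))

# Because the DFT is separable, this must match one 4-D FFT over all axes.
ref = np.fft.fftshift(np.fft.fftn(data, axes=(0, 1, 2, 3)), axes=(0, 1, 2, 3))
print(np.allclose(val, ref))  # True

# After fftshift, the zero-frequency bin of a length-57 axis sits at index 28
# (0-based), i.e. position 29 in MATLAB's 1-based indexing; for a length-15
# axis it sits at index 7, i.e. position 8 in MATLAB.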
The above image is from a pdf by Yann LeCun, titled "Hierarchical Models Of Perception and Reasoning"
I am not able to understand how layer 2 consists of 14x14 feature maps.
How can a 75x75 matrix with 10x10 pooling and 5x5 subsampling give a 14x14 matrix?
If you refer to this other paper by LeCun et al., an identical network is used with a larger input (a 143x143 grayscale image):
The first stage has 64 filters of size 9x9, followed by a subsampling layer with 5x5 stride, and 10x10 averaging window. [...]
This gives the right dimension:
output size = (input size - window size) / step + 1
= (75-10) / 5 + 1
= 14
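The same formula as a quick one-liner check (a small sketch of mine, not from the paper):

def subsample_output_size(input_size, window, stride):
    return (input_size - window) // stride + 1

print(subsample_output_size(75, 10, 5))  # 14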