Artificial Neural Network with output matrix - neural-network

I'm trying to understand neural nets and I have the following problem.
I have an image as the input, let's say 3x100x50, and I want the output to be a 100x50 matrix.
I want to add a convolutional layer between input and output. For example:
Kernel_size: 10
stride: 5
zero-padding: 0
Filters: 4
I think that if I apply this convolutional layer to my input, I get 200 matrices of size 4x10x10, connected to the input - right?
My question is: how do I get one matrix of size 100x50 so that it fits my output?

Related

Dimensions of inputs to a fully connected layer from convolutional layer in a CNN

The question is about the mathematical details of convolutional neural networks. Assume that the architecture of the net (whose objective is image classification) is as follows:
Input image 32x32
First hidden layer 3x28x28 (formed by convolving with 3 filters of size 5x5, stride 1 and no padding), followed by activation
Pooling layer (pooling over a 2x2 region) producing an output of 3x14x14
Second hidden layer 6x10x10 (formed by convolving with 6 filters of size 5x5, stride 1 and no padding), followed by activation
Pooling layer (pooling over a 2x2 region) producing an output of 6x5x5
Fully connected layer (FCN-1) with 100 neurons
Fully connected layer (FCN-2) with 10 neurons
From my readings thus far, I have understood that the 6x5x5 output maps are connected to FCN-1. I have two questions, both of which relate to the way the output of one layer is fed to another.
The output of the second pooling layer is 6x5x5. How is this fed to FCN-1? What I mean is that each neuron in FCN-1 can be seen as a node that takes a scalar as input (or a 1x1 matrix). So how do we feed it an input of 6x5x5? I initially thought we'd flatten the 6x5x5 tensor into a 150x1 array and feed it to the neuron as if we had 150 training points. But doesn't flattening out the feature maps defeat the point of the spatial architecture of images?
From the first pooling layer we get 3 feature maps of size 14x14. How are the feature maps in the second layer generated? Let's say I look at the same region (a 5x5 area starting from the top left) across the 3 feature maps I get from the first layer. Are these three 5x5 patches used as separate training examples to produce the corresponding region in the next set of feature maps? If so, what if the three feature maps are instead the RGB values of an input image? Would we still use them as separate training examples?
Generally, what some CNNs (like VGG16 and VGG19) do is flatten the 3D tensor output of the max-pooling layer, so in your example the input to the FC layer would become (None, 150). Other CNNs (like ResNet50) instead use global max pooling to get a 6x1x1 output tensor, which is then flattened (becoming (None, 6)) and fed into the FC layers.
This link has an image of a popular CNN architecture called VGG19.
To answer your concern that flattening defeats the spatial arrangement: when you flatten the image, say a pixel is at location Xij (i.e. ith row, jth column, which maps to flat index n*i + j, where n is the width of the image), then based on the matrix representation its upper neighbour is Xi-1,j (flat index n*(i-1) + j), and so on for the other neighbours. Since pixels are correlated with their neighbouring pixels, the FC layer will automatically adjust its weights to reflect that information.
Hence you can consider the conv -> activation -> pooling group as feature-extraction layers whose output tensors (analogous to the dimensions/features of a vector) are fed into a standard ANN at the end of the network.
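As a minimal numpy sketch of the two options (flattening vs. global max pooling), assuming the 6x5x5 pooled output from the question and random values as placeholders:
import numpy as np
pooled = np.random.rand(6, 5, 5)        # hypothetical output of the second pooling layer
flat = pooled.reshape(-1)               # VGG-style flatten: shape (150,), fed to the FC layer
gmp = pooled.max(axis=(1, 2))           # ResNet-style global max pooling: one value per map, shape (6,)
print(flat.shape, gmp.shape)            # (150,) (6,)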

Understanding the dimensions of a fully-connected layer that follows a max-pooling layer [closed]

In the diagram (architecture) below, how was the (fully-connected) dense layer of 4096 units derived from the last max-pool layer (on the right) of dimensions 256x13x13? Instead of 4096, shouldn't it be 256*13*13 = 43264?
If I'm correct, you're asking why the 4096x1x1 layer is much smaller.
That's because it's a fully connected layer. Every neuron from the last max-pooling layer (256*13*13 = 43264 neurons) is connected to every neuron of the fully-connected layer.
This is an example of an all-to-all connected neural network:
As you can see, layer2 is bigger than layer3. That doesn't mean they can't connect.
There is no conversion of the last max-pooling layer: all the neurons in the max-pooling layer are simply connected to all 4096 neurons in the next layer.
The 'dense' operation just means learning the weights and biases of all these connections (= 4096 * 43264 connections) and adding each neuron's bias to calculate the next output.
It's connected the same way as an MLP.
But why 4096? There is no reasoning. It's just a choice. It could have been 8000, it could have been 20, it just depends on what works best for the network.
You are right in that the last convolutional layer has 256 x 13 x 13 = 43264 neurons. However, there is a max-pooling layer with pool_size = 3 and stride = 2. This will produce an output of size 256 x 6 x 6. You connect this to a fully-connected layer. In order to do that, you first have to flatten the output, which will take the shape 256 x 6 x 6 = 9216 x 1. To map 9216 neurons to 4096 neurons, we introduce a 9216 x 4096 weight matrix as the weight of the dense/fully-connected layer. Therefore, w^T * x = [9216 x 4096]^T * [9216 x 1] = [4096 x 1]. In short, each of the 9216 neurons will be connected to all 4096 neurons. That is why the layer is called a dense or fully-connected layer.
As others have said above, there is no hard rule about why this should be 4096. The dense layer just has to have enough neurons to capture the variability of the entire dataset. The dataset under consideration, ImageNet-1K, is quite difficult and has 1000 categories. So 4096 neurons to start with does not seem like too many.
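A short numpy sketch of the shapes involved (random values stand in for the learned weights):
import numpy as np
pooled = np.random.rand(256, 6, 6)      # output of the last max-pooling layer
x = pooled.reshape(-1)                  # flattened: shape (9216,)
W = np.random.rand(9216, 4096)          # dense layer weight matrix
b = np.random.rand(4096)                # one bias per output neuron
out = W.T @ x + b                       # w^T * x + b, shape (4096,)
print(out.shape)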
No, 4096 is the dimensionality of the output of that layer, while the dimensionality of its input is 13x13x256. The two don't have to be equal, as you can see in the diagram.
I will show it with an image; look at the image of the AlexNet network below.
The 256 * 13 * 13 layer goes through a max-pooling operation, giving 256 * 6 * 6 = 9216 values. These are then flattened and connected to the 4096-unit fully connected layer, so the parameters number 9216 * 4096. You can see all the parameters computed in the spreadsheet below.
Cited:
https://www.learnopencv.com/understanding-alexnet/
https://medium.com/#smallfishbigsea/a-walk-through-of-alexnet-6cbd137a5637
The output size of a pooling layer is
output = (input_size - window_size) / stride + 1
In the case above the input size is 13. This pooling layer in AlexNet uses a window size W = 3 and a stride S = 2, so put them in the formula:
output = (13 - 3) / 2 + 1
output = 5 + 1
output = 6
Now there will be 256 feature maps produced of size 6x6; flatten that out and you get
flatten = 6 x 6 x 256
flatten = 9216
which the dense layer then maps to 4096 units.
Hope this answers your question.
I believe you want to know how the transition from a convolutional layer to a fully-connected, or dense, layer comes to be. You have to realize that another way of viewing a convolutional layer is as a dense layer, but with sparse connections. This is explained in Goodfellow's book, Deep Learning, chapter 9.
Something similar applies to the output of a pooling operation: you just end up with something that resembles the output of a convolutional layer, but summarized. All the weights of all the convolutional kernels can then be connected to a fully-connected layer. This typically results in a first fully-connected layer that has many neurons, so you can use a second (or third) layer that does the actual classification/regression.
As for the choice of the number of neurons in a dense layer that comes after a convolutional layer, there is no mathematical rule behind it, unlike with convolutional layers. Since the layer is fully connected, you are able to choose any size, just like in your typical multi-layer perceptron.
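To make the 'dense layer with sparse connections' view concrete, here is a tiny 1-D numpy sketch (an illustration, not taken from the book): the same 3-tap filter applied at every position can be written as one weight matrix whose rows are shifted copies of the filter, with zeros everywhere else.
import numpy as np
x = np.array([1., 2., 3., 4., 5.])
k = np.array([0.2, 0.5, 0.3])
W = np.array([[0.2, 0.5, 0.3, 0.0, 0.0],      # each row reuses the same filter weights,
              [0.0, 0.2, 0.5, 0.3, 0.0],      # shifted by one position
              [0.0, 0.0, 0.2, 0.5, 0.3]])
print(W @ x)                                  # the 'sparse dense layer' view
print(np.correlate(x, k, mode='valid'))       # the usual CNN-style (cross-)correlation: same result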

How to calculate the number of parameters for convolutional neural network?

I'm using Lasagne to create a CNN for the MNIST dataset. I'm closely following this example: Convolutional Neural Networks and Feature Extraction with Python.
The CNN architecture I have at the moment, which doesn't include any dropout layers, is:
# imports assumed from the linked nolearn/Lasagne example
import lasagne
from lasagne import layers
from lasagne.updates import momentum
from nolearn.lasagne import NeuralNet

NeuralNet(
    layers=[('input', layers.InputLayer),        # Input Layer
            ('conv2d1', layers.Conv2DLayer),     # Convolutional Layer
            ('maxpool1', layers.MaxPool2DLayer), # 2D Max Pooling Layer
            ('conv2d2', layers.Conv2DLayer),     # Convolutional Layer
            ('maxpool2', layers.MaxPool2DLayer), # 2D Max Pooling Layer
            ('dense', layers.DenseLayer),        # Fully connected layer
            ('output', layers.DenseLayer),       # Output Layer
            ],
    # input layer
    input_shape=(None, 1, 28, 28),
    # layer conv2d1
    conv2d1_num_filters=32,
    conv2d1_filter_size=(5, 5),
    conv2d1_nonlinearity=lasagne.nonlinearities.rectify,
    # layer maxpool1
    maxpool1_pool_size=(2, 2),
    # layer conv2d2
    conv2d2_num_filters=32,
    conv2d2_filter_size=(3, 3),
    conv2d2_nonlinearity=lasagne.nonlinearities.rectify,
    # layer maxpool2
    maxpool2_pool_size=(2, 2),
    # Fully Connected Layer
    dense_num_units=256,
    dense_nonlinearity=lasagne.nonlinearities.rectify,
    # output Layer
    output_nonlinearity=lasagne.nonlinearities.softmax,
    output_num_units=10,
    # optimization method params
    update=momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    max_epochs=10,
    verbose=1,
)
This outputs the following Layer Information:
#  name      size
-- --------  --------
0  input     1x28x28
1  conv2d1   32x24x24
2  maxpool1  32x12x12
3  conv2d2   32x10x10
4  maxpool2  32x5x5
5  dense     256
6  output    10
and outputs the number of learnable parameters as 217,706
I'm wondering how this number is calculated. I've read a number of resources, including this StackOverflow question, but none clearly generalizes the calculation.
If possible, can the calculation of the learnable parameters per layer be generalised?
For example, convolutional layer: number of filters x filter width x filter height.
Let's first look at how the number of learnable parameters is calculated for each individual type of layer you have, and then calculate the number of parameters in your example.
Input layer: All the input layer does is read the input image, so there are no parameters you could learn here.
Convolutional layers: Consider a convolutional layer which takes l feature maps at the input, and has k feature maps as output. The filter size is n x m. For example, this will look like this:
Here, the input has l=32 feature maps as input, k=64 feature maps as output, and the filter size is n=3 x m=3. It is important to understand that we don't simply have a 3x3 filter, but actually a 3x3x32 filter, as our input has 32 dimensions. And we learn 64 different 3x3x32 filters.
Thus, the total number of weights is n*m*k*l.
Then, there is also a bias term for each feature map, so we have a total number of parameters of (n*m*l+1)*k.
Pooling layers: A pooling layer does, for example, the following: "replace a 2x2 neighborhood by its maximum value". So there is no parameter you could learn in a pooling layer.
Fully-connected layers: In a fully-connected layer, all input units have a separate weight to each output unit. For n inputs and m outputs, the number of weights is n*m. Additionally, you have a bias for each output node, so you are at (n+1)*m parameters.
Output layer: The output layer is a normal fully-connected layer, so (n+1)*m parameters, where n is the number of inputs and m is the number of outputs.
The final difficulty is the first fully-connected layer: we do not know the dimensionality of the input to that layer, as it is a convolutional layer. To calculate it, we have to start with the size of the input image, and calculate the size of each convolutional layer. In your case, Lasagne already calculates this for you and reports the sizes - which makes it easy for us. If you have to calculate the size of each layer yourself, it's a bit more complicated:
In the simplest case (like your example), the size of the output of a convolutional layer is input_size - (filter_size - 1), in your case: 28 - 4 = 24. This is due to the nature of the convolution: we use e.g. a 5x5 neighborhood to calculate a point - but the two outermost rows and columns don't have a 5x5 neighborhood, so we can't calculate any output for those points. This is why our output is 2*2=4 rows/columns smaller than the input.
If one doesn't want the output to be smaller than the input, one can zero-pad the image (with the pad parameter of the convolutional layer in Lasagne). E.g. if you add 2 rows/cols of zeros around the image, the output size will be (28+4)-4=28. So in case of padding, the output size is input_size + 2*padding - (filter_size -1).
If you explicitly want to downsample your image during the convolution, you can define a stride, e.g. stride=2, which means that you move the filter in steps of 2 pixels. Then, the expression becomes ((input_size + 2*padding - filter_size)/stride) +1.
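For reference, a small helper implementing the output-size formula just described (a sketch; the defaults match the example network, which uses no padding and stride 1):
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    # ((input_size + 2*padding - filter_size) / stride) + 1
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(28, 5))                        # 24, as in conv2d1
print(conv_output_size(28, 5, padding=2))             # 28, the zero-padded case
print(conv_output_size(28, 5, padding=2, stride=2))   # 14, when downsampling with stride 2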
In your case, the full calculations are:
#  name      size                       parameters
-- --------  -------------------------  ------------------------
0  input     1x28x28                    0
1  conv2d1   (28-(5-1))=24 -> 32x24x24  (5*5*1+1)*32   =     832
2  maxpool1  32x12x12                   0
3  conv2d2   (12-(3-1))=10 -> 32x10x10  (3*3*32+1)*32  =   9,248
4  maxpool2  32x5x5                     0
5  dense     256                        (32*5*5+1)*256 = 205,056
6  output    10                         (256+1)*10     =   2,570
So in your network, you have a total of 832 + 9,248 + 205,056 + 2,570 = 217,706 learnable parameters, which is exactly what Lasagne reports.
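The same count can be reproduced in a few lines of Python using the per-layer formulas above:
conv1  = (5 * 5 * 1 + 1) * 32            # 832
conv2  = (3 * 3 * 32 + 1) * 32           # 9,248
dense  = (32 * 5 * 5 + 1) * 256          # 205,056
output = (256 + 1) * 10                  # 2,570
print(conv1 + conv2 + dense + output)    # 217706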
Building on top of @hbaderts's excellent reply, I just came up with some formulas for an I-C-P-C-P-H-O network (since I was working on a similar problem) and am sharing them in the figure below; it may be helpful.
Also, (1) a convolution layer with 2x2 stride and (2) a convolution layer with 1x1 stride followed by (max/avg) pooling with 2x2 stride each contribute the same number of parameters with 'same' padding, as can be seen below:
A convolutional layer's output size is calculated as ((n + 2p - k) / s) + 1
Here,
n is the input size, p is the padding, k is the kernel (filter) size, and s is the stride.
In the case above,
n = 28, p = 0, k = 5, s = 1, which gives ((28 + 0 - 5) / 1) + 1 = 24.

For what reason Convolution 1x1 is used in deep neural networks?

I'm looking at the InceptionV3 (GoogLeNet) architecture and cannot understand why we need conv1x1 layers.
I know how convolution works, but I only see a benefit when the patch size is greater than 1.
You can think of a 1x1xD convolution as a dimensionality reduction technique when it's placed somewhere in a network.
If you have an input volume of 100x100x512 and you convolve it with a set of D filters each one with size 1x1x512 you reduce the number of features from 512 to D.
The output volume is, therefore, 100x100xD.
As you can see, this (1x1x512)xD convolution is mathematically equivalent to a fully connected layer. The main difference is that whilst an FC layer requires the input to have a fixed size, the convolutional layer can accept as input any volume with spatial extent greater than or equal to 100x100.
A 1x1xD convolution can substitute for any fully connected layer because of this equivalence.
In addition, 1x1xD convolutions not only reduce the number of features passed to the next layer, but also introduce new parameters and a new non-linearity into the network that can help increase model accuracy.
When the 1x1xD convolution is placed at the end of a classification network, it acts exactly like an FC layer, but instead of thinking about it as a dimensionality reduction technique, it's more intuitive to think about it as a layer that outputs a tensor with shape WxHxnum_classes.
The spatial extent of the output tensor (identified by W and H) is dynamic and is determined by the locations of the input image that the network analyzed.
If the network has been defined with an input of 200x200x3 and we feed it an image of that size, the output will be a map with W = H = 1 and depth = num_classes.
But if the input image has a spatial extent greater than 200x200, the convolutional network will analyze different locations of the input image (just like a standard convolution does) and will produce a tensor with W > 1 and H > 1.
This is not possible with an FC layer, which constrains the network to accept fixed-size input and produce fixed-size output.
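A minimal numpy sketch of this view, assuming the 100x100x512 input volume from above and D = 64 filters (random placeholder weights): a 1x1 convolution is just the same C_in -> D linear map applied at every spatial location.
import numpy as np
H, W, C_in, D = 100, 100, 512, 64
x = np.random.rand(H, W, C_in)
filters = np.random.rand(C_in, D)            # each 1x1xC_in filter is a length-C_in vector
y = np.einsum('hwc,cd->hwd', x, filters)     # a per-pixel fully connected layer
print(y.shape)                               # (100, 100, 64)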
A 1x1 convolution simply maps an input pixel to an output pixel, without looking at anything around it. It is often used to reduce the number of depth channels, since it is often very slow to multiply volumes with extremely large depths.
input (256 depth) -> 1x1 convolution (64 depth) -> 4x4 convolution (256 depth)
input (256 depth) -> 4x4 convolution (256 depth)
The bottom one is about ~3.7x slower.
Theoretically the neural network can 'choose' which input 'colors' to look at using this, instead of brute force multiplying everything.
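A rough multiply count per output pixel (ignoring biases and boundary effects) backs up the ~3.7x figure:
direct     = 4 * 4 * 256 * 256                     # 4x4 conv straight from 256 to 256 depth
bottleneck = 1 * 1 * 256 * 64 + 4 * 4 * 64 * 256   # 1x1 down to 64 depth, then 4x4 back to 256
print(direct / bottleneck)                         # ~3.76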

Matlab Neural Network Structure

I'm a complete newbie to neural networks. I generated a NN in Matlab. I now need to know the exact structure of this NN, because I need to implement it in Java (static connections and weights, no learning). Can you explain how to connect the neurons and what math operations are performed in each element?
The NN parameters are as follows (taken from Matlab):
iw{1,1} - Weight to layer 1 from intput 1
[2.8574 -1.9207;
1.7582 -1.2549;
-4.5925 0.23236;
12.0861 12.3701;
2.503 -1.9321;
-2.1422 2.6928]
lw{2,1} - Weight to layer 2 from layer 1
[-0.51977 5.3993 3.4349 5.2863 3.1976 -0.67102]
b{1} - Bias to layer 1
[-3.2811;
-6.956;
-3.0943;
11.1103;
0.14842;
-3.3705]
b{2} - Bias to layer 2
[1.4657]
Transfer function TANSIG
Greatly appreciate your help.
You have a NN that has 2 inputs, then a hidden layer of 6 neurons and an output layer of 1 neuron.
Each neuron in each layer will take all the outputs from the previous layer, multiply each of them by a number, and offset the result by another number.
The numbers you show are exactly those numbers.
For example, neuron 1 of the hidden layer will output hidden1 = 2.8574*in1 - 1.9207*in2 - 3.2811. Then take whatever transfer function you are using (TANSIG here, i.e. tanh) and apply hidden1 = tansig(hidden1).
As another example, the output will be out = -0.51977*hidden1 + 5.3993*hidden2 + ... - 0.67102*hidden6 + 1.4657.
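As a sketch of the whole forward pass in numpy, assuming a linear (purelin) output layer as is usual for a Matlab fitting network and ignoring any mapminmax input/output normalization the Matlab net may include:
import numpy as np
iw = np.array([[ 2.8574, -1.9207],
               [ 1.7582, -1.2549],
               [-4.5925,  0.23236],
               [12.0861, 12.3701],
               [ 2.503,  -1.9321],
               [-2.1422,  2.6928]])
b1 = np.array([-3.2811, -6.956, -3.0943, 11.1103, 0.14842, -3.3705])
lw = np.array([-0.51977, 5.3993, 3.4349, 5.2863, 3.1976, -0.67102])
b2 = 1.4657

def predict(x):                      # x = [in1, in2]
    hidden = np.tanh(iw @ x + b1)    # TANSIG is tanh
    return lw @ hidden + b2          # linear output layer

print(predict(np.array([0.5, -0.5])))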