Keras input explanation: input_shape, units, batch_size, dim, etc - neural-network

For any Keras layer (Layer class), can someone explain how to understand the difference between input_shape, units, dim, etc.?
For example the doc says units specify the output shape of a layer.
In the image of the neural net below hidden layer1 has 4 units. Does this directly translate to the units attribute of the Layer object? Or does units in Keras equal the shape of every weight in the hidden layer times the number of units?
In short how does one understand/visualize the attributes of the model - in particular the layers - with the image below?

Units:
The number of "neurons", or "cells", or whatever the layer has inside it.
It's a property of each layer, and yes, it's related to the output shape (as we will see later). In your picture, except for the input layer, which is conceptually different from other layers, you have:
Hidden layer 1: 4 units (4 neurons)
Hidden layer 2: 4 units
Last layer: 1 unit
Shapes
Shapes are consequences of the model's configuration. Shapes are tuples representing how many elements an array or tensor has in each dimension.
Ex: a shape (30,4,10) means an array or tensor with 3 dimensions, containing 30 elements in the first dimension, 4 in the second and 10 in the third, totaling 30*4*10 = 1200 elements or numbers.
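As a quick check of the arithmetic above, here is a minimal sketch in plain Python (no Keras needed). The helper name num_elements is just an illustration, not a Keras function.

```python
import math

def num_elements(shape):
    """Total number of scalars in an array or tensor with the given shape."""
    return math.prod(shape)

shape = (30, 4, 10)
print(len(shape))           # number of dimensions: 3
print(num_elements(shape))  # 30 * 4 * 10 = 1200
```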
The input shape
What flows between layers are tensors. Tensors can be seen as matrices, with shapes.
In Keras, the input layer itself is not a layer, but a tensor. It's the starting tensor you send to the first hidden layer. This tensor must have the same shape as your training data.
Example: if you have 30 images of 50x50 pixels in RGB (3 channels), the shape of your input data is (30,50,50,3). Then your input layer tensor must have this shape (see details in the "Shapes in Keras" section).
Each type of layer requires the input with a certain number of dimensions:
Dense layers require inputs as (batch_size, input_size)
or (batch_size, optional,...,optional, input_size)
2D convolutional layers need inputs as:
if using channels_last: (batch_size, imageside1, imageside2, channels)
if using channels_first: (batch_size, channels, imageside1, imageside2)
1D convolutions and recurrent layers use (batch_size, sequence_length, features)
Details on how to prepare data for recurrent layers
Now, the input shape is the only one you must define, because your model cannot know it. Only you know that, based on your training data.
All the other shapes are calculated automatically based on the units and particularities of each layer.
Relation between shapes and units - The output shape
Given the input shape, all other shapes are results of layers calculations.
The "units" of each layer will define the output shape (the shape of the tensor that is produced by the layer and that will be the input of the next layer).
Each type of layer works in a particular way. Dense layers have output shape based on "units", convolutional layers have output shape based on "filters". But it's always based on some layer property. (See the documentation for what each layer outputs)
Let's show what happens with "Dense" layers, which is the type shown in your graph.
A dense layer has an output shape of (batch_size,units). So, yes, units, the property of the layer, also defines the output shape.
Hidden layer 1: 4 units, output shape: (batch_size,4).
Hidden layer 2: 4 units, output shape: (batch_size,4).
Last layer: 1 unit, output shape: (batch_size,1).
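The rule for Dense layers can be sketched without Keras: only the last axis of the input is replaced by the number of units. The function dense_output_shape below is a hypothetical helper written for this illustration, and None stands for the unknown batch size.

```python
def dense_output_shape(input_shape, units):
    """A Dense layer replaces only the last axis of its input with `units`."""
    return tuple(input_shape[:-1]) + (units,)

shape = (None, 3)  # input tensor: batch size unknown, 3 features
shapes = []
for units in (4, 4, 1):  # hidden layer 1, hidden layer 2, last layer
    shape = dense_output_shape(shape, units)
    shapes.append(shape)

print(shapes)  # [(None, 4), (None, 4), (None, 1)]
```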
Weights
The shapes of the weights are determined entirely automatically from the input and the output shapes (the values themselves are learned during training). Again, each type of layer works in a certain way. But the weights will be a matrix capable of transforming the input shape into the output shape by some mathematical operation.
In a dense layer, weights multiply all inputs. Keras stores them as a matrix with one row per input and one column per unit, that is, shape (input_size, units), but this is often not important for basic work.
In the image, if each arrow had a multiplication number on it, all numbers together would form the weight matrix.
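To make the "numbers on the arrows" idea concrete, here is a small sketch of what a dense layer computes for one sample, written with plain lists. It assumes the (inputs, units) layout that Keras uses for the Dense kernel. The weight values are made up for illustration, and dense_forward is a hypothetical helper, not a Keras call.

```python
def dense_forward(x, kernel, bias):
    """Compute x @ kernel + bias for one sample, using plain lists.

    x:      list of n_inputs numbers
    kernel: n_inputs rows, each a list of n_units numbers
    bias:   list of n_units numbers
    """
    n_units = len(bias)
    return [
        sum(x[i] * kernel[i][j] for i in range(len(x))) + bias[j]
        for j in range(n_units)
    ]

# Made-up weights for "hidden layer 1": 3 inputs -> 4 units.
kernel = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
bias = [0, 0, 0, 10]

out = dense_forward([2, 3, 5], kernel, bias)
n_params = len(kernel) * len(kernel[0]) + len(bias)

print(out)       # [2, 3, 5, 20]
print(n_params)  # 3*4 weights + 4 biases = 16
```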
Shapes in Keras
Earlier, I gave an example of 30 images, 50x50 pixels and 3 channels, having an input shape of (30,50,50,3).
Since the input shape is the only one you need to define, Keras will demand it in the first layer.
But in this definition, Keras ignores the first dimension, which is the batch size. Your model should be able to deal with any batch size, so you define only the other dimensions:
input_shape = (50,50,3)
#regardless of how many images I have, each image has this shape
Optionally, or when it's required by certain kinds of models, you can pass the shape containing the batch size via batch_input_shape=(30,50,50,3) or batch_shape=(30,50,50,3). This limits your training possibilities to this unique batch size, so it should be used only when really required.
Either way you choose, tensors in the model will have the batch dimension.
So, even if you used input_shape=(50,50,3), when keras sends you messages, or when you print the model summary, it will show (None,50,50,3).
The first dimension is the batch size. It's None because it can vary depending on how many examples you give for training. (If you defined the batch size explicitly, then the number you defined will appear instead of None.)
Also, in advanced works, when you actually operate directly on the tensors (inside Lambda layers or in the loss function, for instance), the batch size dimension will be there.
So, when defining the input shape, you ignore the batch size: input_shape=(50,50,3)
When doing operations directly on tensors, the shape will be again (30,50,50,3)
When keras sends you a message, the shape will be (None,50,50,3) or (30,50,50,3), depending on what type of message it sends you.
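The three cases above differ only in what sits in the first position. A tiny sketch, where batch_shape is a hypothetical helper (not the Keras argument of the same name):

```python
def batch_shape(input_shape, batch_size=None):
    """Prepend the batch dimension to a per-sample shape."""
    return (batch_size,) + tuple(input_shape)

input_shape = (50, 50, 3)            # what you pass when defining the layer
print(batch_shape(input_shape))      # (None, 50, 50, 3)
print(batch_shape(input_shape, 30))  # (30, 50, 50, 3)
```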
Dim
And in the end, what is dim?
If your input shape has only one dimension, you don't need to give it as a tuple, you give input_dim as a scalar number.
So, in your model, where your input layer has 3 elements, you can use any of these two:
input_shape=(3,) -- The comma is necessary when you have only one dimension
input_dim = 3
But when dealing directly with the tensors, often dim will refer to how many dimensions a tensor has. For instance a tensor with shape (25,10909) has 2 dimensions.
Defining your image in Keras
Keras has two ways of doing it: Sequential models, or the functional API Model. I don't like using the Sequential model, because later you will have to forget it anyway when you want models with branches.
PS: here I ignored other aspects, such as activation functions.
With the Sequential model:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
#start from the first hidden layer, since the input is not actually a layer
#but inform the shape of the input, with 3 elements.
model.add(Dense(units=4,input_shape=(3,))) #hidden layer 1 with input
#further layers:
model.add(Dense(units=4)) #hidden layer 2
model.add(Dense(units=1)) #output layer
With the functional API Model:
from keras.models import Model
from keras.layers import Input, Dense
#Start defining the input tensor:
inpTensor = Input((3,))
#create the layers and pass them the input tensor to get the output tensor:
hidden1Out = Dense(units=4)(inpTensor)
hidden2Out = Dense(units=4)(hidden1Out)
finalOut = Dense(units=1)(hidden2Out)
#define the model's start and end points
model = Model(inpTensor,finalOut)
Shapes of the tensors
Remember you ignore batch sizes when defining layers:
inpTensor: (None,3)
hidden1Out: (None,4)
hidden2Out: (None,4)
finalOut: (None,1)

Input Dimension Clarified:
Not a direct answer, but I just realized that the term "Input Dimension" could be confusing, so be wary:
The word "dimension" alone can refer to:
a) The dimension of the input data (or stream), such as the number N of sensor axes carrying the time series signal, or the RGB color channels (3): suggested term = "Input Stream Dimension"
b) The total number / length of input features (or input layer), e.g. 28 x 28 = 784 for an MNIST grayscale image, or 3000 for FFT-transformed spectrum values: suggested term = "Input Layer / Input Feature Dimension"
c) The dimensionality (number of dimensions) of the input, typically 3D as expected by a Keras LSTM: (number of rows of samples, number of sensors, number of values...), so 3 is the answer: suggested term = "N Dimensionality of Input"
d) The SPECIFIC input shape, e.g. (30,50,50,3) for the image data above, or (30, 2500, 3) if the two spatial axes are unwrapped into one
Keras:    
In Keras, input_dim refers to the Dimension of Input Layer / Number of Input Features
    model = Sequential()
    model.add(Dense(32, input_dim=784))  #or 3 in the current posted example above
    model.add(Activation('relu'))
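In a Keras LSTM, the legacy input_dim argument refers to the number of features per time step; the total number of time steps was given separately by input_length.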
The term has been very confusing, we live in a very confusing world!!
I find that one of the challenges in machine learning is dealing with different languages, dialects and terminologies (as if you had 5-8 highly different versions of English, and needed a very high proficiency to converse with the different speakers). Probably this is the same in programming languages too.

Added this answer to elaborate on the input shape at the first layer.
I created two variations of the same layers:
Case 1:
model =Sequential()
model.add(Dense(15, input_shape=(5,3),activation="relu", kernel_initializer="he_uniform", kernel_regularizer=None,kernel_constraint="MaxNorm"))
model.add(Dense(32,activation="relu"))
model.add(Dense(8))
Case 2:
model1=Sequential()
model1.add(Dense(15,input_shape=(15,),kernel_initializer="he_uniform",kernel_constraint="MaxNorm",kernel_regularizer=None,activation="relu"))
model1.add(Dense(32,activation="relu"))
model1.add(Dense(8))
plot_model(model1,show_shapes=True)
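Now if we plot these and take the summary:
Case 1: (image: Case 1 model summary, https://i.stack.imgur.com/WXh9z.png)
Case 2: (image: Case 2 model summary)
Now if you look closely, in the first case the input is two-dimensional, (5,3). The Dense layer acts on the last axis only, so the first layer produces 15 outputs (one per unit) for each of the 5 rows, giving an output shape of (None,5,15).
Case two is simpler: the input is one-dimensional, (15,), so there is no such complexity and each unit produces one output after activation, giving (None,15).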
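The summary images did not survive here, so this is a rough sketch of what one would expect them to report, derived only from the Dense rules (output keeps all axes but the last; parameters = inputs * units + units). dense_stack_summary is a hypothetical helper, not part of Keras.

```python
def dense_stack_summary(input_shape, units_per_layer):
    """Return (output_shape, param_count) per Dense layer; batch axis is None."""
    shape = (None,) + tuple(input_shape)
    rows = []
    for units in units_per_layer:
        n_in = shape[-1]                 # Dense only looks at the last axis
        params = n_in * units + units    # kernel + bias
        shape = shape[:-1] + (units,)
        rows.append((shape, params))
    return rows

case1 = dense_stack_summary((5, 3), [15, 32, 8])
case2 = dense_stack_summary((15,), [15, 32, 8])

print(case1)  # [((None, 5, 15), 60), ((None, 5, 32), 512), ((None, 5, 8), 264)]
print(case2)  # [((None, 15), 240), ((None, 32), 512), ((None, 8), 264)]
```

Note that the first layer in Case 1 has fewer parameters (60 vs 240) even though its output tensor is larger, because the same 3-to-15 kernel is reused for each of the 5 rows.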

Related

Is a fully connected layer equivalent to Flatten + Dense in Tensorflow?

A fully-connected layer, also known as a dense layer, refers to the layer whose inside neurons connect to every neuron in the preceding layer (see Wikipedia).
In the MATLAB Deep Learning Toolkit, when defining a fullyConnectedLayer(n), the output will always be (borrowing the terminology from Tensorflow) a "tensor" of shape 1×1×n.
However, defining a dense layer in Keras via tf.keras.layers.Dense(n) will not necessarily result in a rank 1 tensor; the output rank depends on the input, as explained in the Keras documentation:
For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units).
Am I correct in assuming that what MATLAB does in fullyConnectedLayer(n) is equivalent to cascading a Flatten() layer and a Dense(n) layer in Tensorflow? By equivalent I mean that exactly the same operation is performed.
It would appear that this is the case based on the number of weights that MATLAB requires for a fullyConnectedLayer. The weights in fact are n×M where M is the dimension of the input (see MATLAB Documentation: "At training time, Weights is an OutputSize-by-InputSize matrix"). In fact snooping around the internals of this MATLAB function, it seems to me that the InputSize is precisely the size of the input if it were "flattened", i.e. M = a*b*c if the input tensor has shape (a,b,c) (and of course I experimentally verified this by multiplying).
The layer I'm trying to build is towards the final stages of a categorical classifier, so I need the final output of the Keras model to be of shape (None, n) where n is the number of labels in the training data.
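The difference in kernel and output shapes between the two options can be sketched in plain Python. This only compares shapes following the Keras documentation quoted above. It does not show that MATLAB orders the flattened elements the same way, and both helper names are made up for the illustration.

```python
import math

def dense_on_last_axis(input_shape, n):
    """Dense(n) applied directly: kernel (d_last, n), other axes are kept."""
    kernel_shape = (input_shape[-1], n)
    output_shape = tuple(input_shape[:-1]) + (n,)
    return kernel_shape, output_shape

def flatten_then_dense(input_shape, n):
    """Flatten() then Dense(n): kernel (a*b*c, n), output (n,)."""
    m = math.prod(input_shape)
    return (m, n), (n,)

sample_shape = (4, 5, 6)  # hypothetical (a, b, c), batch axis omitted
direct = dense_on_last_axis(sample_shape, 10)
flattened = flatten_then_dense(sample_shape, 10)

print(direct)     # ((6, 10), (4, 5, 10))
print(flattened)  # ((120, 10), (10,))
```

Only the flattened variant has n x M weights with M = a*b*c, which matches the weight count described for fullyConnectedLayer, and it yields the (None, n) output once the batch axis is added back.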

Understanding 3D convolution and when to use it?

I am new to convolutional neural networks, and I am learning 3D convolution.
What I could understand is that 2D convolution gives us relationships between low level features in the X-Y dimension, while the 3D convolution helps detect low level features and relationships between them in all the 3 dimensions.
Consider a CNN employing 2D conv layers to recognize hand written digits. If a digit, say 5, was written in different colors:
Would a strictly 2D CNN perform poorly (since they belong to different channels in the z dimension)?
Also, are there practical well-known neural nets that employ 3D convolution?
The problem is that the 2D aspects of an image have locality. In a sense, things that are nearby are expected to be related in some fundamental way. E.g. a pixel near a hair pixel is expected to be a hair pixel, a priori. However, the different channels have no such relationship. When you only have 3 channels, a 3D convolution is equivalent to being fully connected in z. When you have 27 channels (e.g. in the middle of the net), why would any 3 channels be considered "close" to each other?
This answer explains the difference nicely.
Doing a "fully-connected" relationship over the channels is what most libraries do by default. Note this line in particular: "...a filter / kernel tensor of shape [filter_height, filter_width, in_channels, out_channels]". For an input vector of size in_channels, a matrix of size [in_channels, out_channels] is fully-connected. So, the filter can be thought of as a fully-connected layer on a "patch" of image size [filter_height, filter_width].
To illustrate, on a single channel, a regular plain old image filter takes a patch of image and maps that patch to a single pixel in a new image. Like so: (image credit)
On the other hand, suppose that we have multiple channels. Instead of performing a linear mapping from a 3x3 patch to a 1x1 pixel, we perform a linear mapping from a 3x3xin_channels patch to a 1x1xout_channels set of pixels. How do we do this? Well, a linear mapping is just a matrix. Note that a 3x3xin_channels patch can be written as a vector with 3*3*in_channels entries. A 1x1xout_channels set of pixels can be written as a vector with out_channels entries. A linear mapping between the two is given by a matrix with 3*3*in_channels rows and out_channels columns. The entries of that matrix are the parameters of that layer of the network. The layer works by simply multiplying the in vector by the matrix of weights to get the out vector. This is repeated over all patches of an image. (Actually, instead of doing this in a loop over all patches, you can achieve an equivalent thing with some fanciness which is what libraries do in practice, but it gives the same result)
To illustrate, the mapping takes this 3x3xin_channels column:
To this 1x1xout_channels stack of pixels:
Now, what you are proposing is to do something with the following bit:
There is no mathematical reason why you can't do something with that 3x3x3 patch containing only 3 channels of your whole set of in_channels. However, whatever 3 channels you choose is totally arbitrary, and they have no intrinsic relationship to one another that would suggest that treating them as being "nearby" would help.
To reiterate, in an image, the pixels that are near each other are expected to be "similar" or "related" in some sense. This is why a convolution works at all. If you jumbled up the pixels and then did a convolution, it would be worthless. On that note, all of the channels are just a jumble. There is no "nearby relatedness" property along the channels. E.g. the "red" channel isn't near the "green" channel OR the "blue" channel, because "nearness" doesn't make any sense between the channels. Since "nearness" isn't a property of the channel dimension, then doing a convolution in that dimension probably isn't going to be useful.
On the other hand, we can simply take the input of ALL of the in_channels to generate the output from ALL of the out_channels simultaneously, and let them influence each other in a linear sort of way. Note that the linear transformation described involves a sort of cross-pollination of the parameters. For example, for a layer at the top of the network, taking in a 3x3 patch of r,g,b channels labeled r_1_1-r_3_3 etc., a single pixel in a single channel of the output from that patch would look like:
A*r_1_1 + B*r_1_2 + ... C*r_3_3 + D*b_1_1 + E*b_1_2 + ... F*b_3_3 + G*g_1_1 + ...
Where the capital letters are entries of the weight matrix.
So your question, "Would a strictly 2D CNN perform poorly?", is based on an assumption that the convolutional layer doesn't include any "mixing" between the various channels. This is not the case. The in_channels are ALL combined in a linear mapping to obtain the out_channels.
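The "linear mapping on a patch" can be sketched for a single output pixel in plain Python. The weights below are made up, and conv_pixel is a hypothetical helper. Real libraries do this for all patches at once and also add a bias.

```python
def conv_pixel(patch, weights):
    """Map one flattened patch to one output pixel with out_channels values.

    patch:   list of k*k*in_channels numbers
    weights: k*k*in_channels rows, each with out_channels numbers
    """
    out_channels = len(weights[0])
    return [
        sum(p * row[c] for p, row in zip(patch, weights))
        for c in range(out_channels)
    ]

k, in_channels, out_channels = 3, 3, 2
patch_len = k * k * in_channels  # 27 entries: the r, g and b blocks of a 3x3 patch

# Made-up weights: output channel 0 sums every entry of the patch,
# output channel 1 sums only the first 9 entries (one color block).
weights = [[1, 1 if i < 9 else 0] for i in range(patch_len)]
patch = [1] * patch_len

pixel = conv_pixel(patch, weights)
n_weights = patch_len * out_channels

print(pixel)      # [27, 9]
print(n_weights)  # 54 weights, biases not counted
```

Output channel 0 depends on all three input channels at once, which is the channel "mixing" described above.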

Dimensions of inputs to a fully connected layer from convolutional layer in a CNN

The question is on the mathematical details of the convolutional neural networks. Assume that the architecture of the net (objective of which is image classification) is as such
Input image 32x32
First hidden layer 3x28x28 (formed by convolving with 3 filters of size 5x5, stride length = 1 and no padding), followed by activation
Pooling layer (pooling over a 2x2 region) producing an output of 3x14x14
Second hidden layer 6x10x10 (formed by convolving with 6 filters of size 5x5, stride length = 1 and no padding), followed by activation
Pooling layer (pooling over a 2x2 region) producing an output of 6x5x5
Fully connected layer (FCN) -1 with 100 neurons
Fully connected layer (FCN) -2 with 10 neurons
From my readings thus far, I have understood that each of the 6x5x5 matrices are connected to the FCN-1. I have two questions, both of which are related to the way output from one layer is fed to another.
The output of the second pooling layer is 6x5x5. How are these fed to the FCN-1? What I mean is that each neuron in the FCN-1 can be seen as node that takes a scalar as input (or a 1x1 matrix). So how do we feed it an input of 6x5x5? I initially thought we’d flatten out the 6x5x5 matrices and convert it into a 150x1 array and then feed it to the neuron as if we have 150 training points. But doesn’t flattening out the feature map defeat the argument of spatial architecture of images?
From the first pooling layer we get 3 feature maps of size 14x14. How are the feature maps in the second layer generated? Let's say I look at the same region (a 5x5 area starting from the top left of the feature maps) across the 3 feature maps I get from the first convolutional layer. Are these three 5x5 patches used as separate training examples to produce the corresponding region in the next set of feature maps? If so, then what if the three feature maps are instead RGB values of an input image? Would we still use them as separate training examples?
Generally, what some CNNs (like VGG16, VGG19) do is flatten the 3D tensor output from the max-pooling layer, so in your example the input to the FC layer would become (None,150). Other CNNs (like ResNet50) use global average pooling to get 6x1x1 (the dimensions of the output tensor), which is then flattened (becoming (None,6)) and fed into the FC layers.
This link has an image of a popular CNN architecture called VGG19.
To answer your query about flattening defeating the spatial arrangement: when you flatten the image, let's say a pixel location is Xij (i.e. ith row, jth column, at flat index n*i+j, where n is the width of the image). Then, based on the matrix representation, its upper neighbor is Xi-1,j (at flat index n*(i-1)+j), and so on for the other neighbors. Since there exists a correlation between pixels and their neighboring pixels, the FC layer can adjust its weights during training to reflect that information.
Hence you can consider the conv -> activation -> pooling layer groups as feature extraction layers, whose output tensors (analogous to the dimensions/features of a vector) are fed into a standard ANN at the end of the network.
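The index bookkeeping in the flattening argument can be checked with a few lines of Python. This is a sketch assuming row-major flattening of a single 5x5 feature map. flat_index is a hypothetical helper.

```python
def flat_index(i, j, n):
    """Row-major position of pixel (i, j) in a flattened image of width n."""
    return n * i + j

n = 5          # width of each 5x5 feature map
i, j = 2, 3
center = flat_index(i, j, n)
up_offset = center - flat_index(i - 1, j, n)
right_offset = flat_index(i, j + 1, n) - center
fc_inputs = 6 * 5 * 5

print(center)        # 13
print(up_offset)     # 5: the upper neighbor is always n entries back
print(right_offset)  # 1: the right neighbor is adjacent
print(fc_inputs)     # 150 inputs to the first FC layer
```

The neighbor offsets are fixed, so the spatial relationships are still encoded in the flattened vector, just at constant index distances.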

Calculating size of output of a Conv layer in CNN model

In convolutional Neural Networks, How to know the output of a specific conv layer? (I am using keras to build a CNN model)
For example, if I am using a one-dimensional conv layer, where number_of_filters=20, kernel_size=10, and input_shape=(500,1)
cnn.add(Conv1D(20,kernel_size=10,strides=1, padding="same",activation="sigmoid",input_shape=(Dimension_of_input,1)))
and if I am using a two-dimensional conv layer, where number_of_filters=64, kernel_size=(5,100), input_shape=(5,720,1) (height,width,channel)
Conv2D(64, (5, 100),
padding="same",
activation="sigmoid",
data_format="channels_last",
input_shape=(5,720,1))
what is the number of outputs of the above two conv layers? Is there any equation that can be used to know the number of outputs of a conv layer in a convolutional neural network?
Yes, there are equations for it; you can find them on the CS231N course website. But as this is a programming site: Keras provides an easy way to get this information programmatically, by using the summary function of a Model.
model = Sequential()
# fill model with layers
model.summary()
This will print in terminal/console all the layer information, such as input shapes, output shapes, and number of parameters for each layer.
Actually, the model.summary() function might not be what you are looking for if you want to do more than just look at the model.
If you want to access the layers of your Keras model, you can do this by using model.layers, which returns all of the layers (the assignment stores them as a list). If you then want to look at a specific layer, you can simply index the list:
list_of_layers = model.layers
list_of_layers[5] # gives you the 6th layer
What you are still working with are just objects so you probably want to get specific values. You just have to specify attribute you want to look at then:
list_of_layers[-1].output_shape # returns output_shape of last layer
Gives you back the output_shape tuple of the last layer in the model.
You can even skip the whole list assignment if you already know that you only want to look at the output_shape of a certain layer and just do:
model.layers[-1].output_shape # equivalent to the above method without storing in a list
This might be useful if you want to use these values while building the model to guide the execution in a certain way (adding a pooling layer or doing the padding etc.).
When I first worked with CNNs in TensorFlow, I found it very difficult to deal with dimensions. Below is the general scenario for calculating them.
Consider an image of dimension (n x n) and a filter of dimension (f x f), with no padding and no strides applied:
after convolution the dimensions are (n-f+1, n-f+1)
With the same image and filter and a padding of p:
the output dims are (n+2p-f+1, n+2p-f+1)
If we are using padding='SAME', it means output dims = input dims (at stride 1). In this case the equation looks like: n+2p-f+1 = n
so from here p = (f-1)/2
If we are using 'VALID' padding, it means no padding and p = 0
In computer vision f is usually odd; if f is even, it means we have asymmetric padding.
For the case when we are using stride = s:
the output dims are (floor((n+2p-f)/s) + 1, floor((n+2p-f)/s) + 1)
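The formulas above translate directly into a small helper. This is a sketch for one spatial axis with symmetric padding. The function names are made up for illustration and are not TensorFlow or Keras calls.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length along one axis: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

def same_padding(f):
    """Per-side padding that keeps the size unchanged at stride 1 (odd f)."""
    return (f - 1) // 2

print(conv_output_size(28, 5))                     # 24: no padding, stride 1
print(conv_output_size(28, 5, p=same_padding(5)))  # 28: 'same' padding
print(conv_output_size(28, 5, p=2, s=2))           # 14: padding 2, stride 2
```

For the two layers in the question, padding="same" with stride 1 keeps the length axes unchanged, so one would expect output shapes of (500, 20) and (5, 720, 64) per sample, with the last axis equal to the number of filters.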

How to calculate the number of parameters for convolutional neural network?

I'm using Lasagne to create a CNN for the MNIST dataset. I'm following closely to this example: Convolutional Neural Networks and Feature Extraction with Python.
The CNN architecture I have at the moment, which doesn't include any dropout layers, is:
NeuralNet(
    layers=[('input', layers.InputLayer),        # Input Layer
            ('conv2d1', layers.Conv2DLayer),     # Convolutional Layer
            ('maxpool1', layers.MaxPool2DLayer), # 2D Max Pooling Layer
            ('conv2d2', layers.Conv2DLayer),     # Convolutional Layer
            ('maxpool2', layers.MaxPool2DLayer), # 2D Max Pooling Layer
            ('dense', layers.DenseLayer),        # Fully connected layer
            ('output', layers.DenseLayer),       # Output Layer
            ],
    # input layer
    input_shape=(None, 1, 28, 28),
    # layer conv2d1
    conv2d1_num_filters=32,
    conv2d1_filter_size=(5, 5),
    conv2d1_nonlinearity=lasagne.nonlinearities.rectify,
    # layer maxpool1
    maxpool1_pool_size=(2, 2),
    # layer conv2d2
    conv2d2_num_filters=32,
    conv2d2_filter_size=(3, 3),
    conv2d2_nonlinearity=lasagne.nonlinearities.rectify,
    # layer maxpool2
    maxpool2_pool_size=(2, 2),
    # Fully Connected Layer
    dense_num_units=256,
    dense_nonlinearity=lasagne.nonlinearities.rectify,
    # output Layer
    output_nonlinearity=lasagne.nonlinearities.softmax,
    output_num_units=10,
    # optimization method params
    update=momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    max_epochs=10,
    verbose=1,
)
This outputs the following Layer Information:
#  name      size
-- --------  --------
0  input     1x28x28
1  conv2d1   32x24x24
2  maxpool1  32x12x12
3  conv2d2   32x10x10
4  maxpool2  32x5x5
5  dense     256
6  output    10
and outputs the number of learnable parameters as 217,706
I'm wondering how this number is calculated? I've read a number of resources, including this StackOverflow's question, but none clearly generalizes the calculation.
If possible, can the calculation of the learnable parameters per layer be generalised?
For example, convolutional layer: number of filters x filter width x filter height.
Let's first look at how the number of learnable parameters is calculated for each individual type of layer you have, and then calculate the number of parameters in your example.
Input layer: All the input layer does is read the input image, so there are no parameters you could learn here.
Convolutional layers: Consider a convolutional layer which takes l feature maps at the input, and has k feature maps as output. The filter size is n x m. For example, this will look like this:
Here, the input has l=32 feature maps, the output has k=64 feature maps, and the filter size is n=3 x m=3. It is important to understand that we don't simply have a 3x3 filter, but actually a 3x3x32 filter, as our input has 32 feature maps (channels). And we learn 64 different 3x3x32 filters.
Thus, the total number of weights is n*m*k*l.
Then, there is also a bias term for each feature map, so we have a total number of parameters of (n*m*l+1)*k.
Pooling layers: The pooling layers e.g. do the following: "replace a 2x2 neighborhood by its maximum value". So there is no parameter you could learn in a pooling layer.
Fully-connected layers: In a fully-connected layer, all input units have a separate weight to each output unit. For n inputs and m outputs, the number of weights is n*m. Additionally, you have a bias for each output node, so you are at (n+1)*m parameters.
Output layer: The output layer is a normal fully-connected layer, so (n+1)*m parameters, where n is the number of inputs and m is the number of outputs.
The final difficulty is the first fully-connected layer: we do not know the dimensionality of the input to that layer, as it is a convolutional layer. To calculate it, we have to start with the size of the input image, and calculate the size of each convolutional layer. In your case, Lasagne already calculates this for you and reports the sizes - which makes it easy for us. If you have to calculate the size of each layer yourself, it's a bit more complicated:
In the simplest case (like your example), the size of the output of a convolutional layer is input_size - (filter_size - 1), in your case: 28 - 4 = 24. This is due to the nature of the convolution: we use e.g. a 5x5 neighborhood to calculate a point - but the two outermost rows and columns don't have a 5x5 neighborhood, so we can't calculate any output for those points. This is why our output is 2*2=4 rows/columns smaller than the input.
If one doesn't want the output to be smaller than the input, one can zero-pad the image (with the pad parameter of the convolutional layer in Lasagne). E.g. if you add 2 rows/cols of zeros around the image, the output size will be (28+4)-4=28. So in case of padding, the output size is input_size + 2*padding - (filter_size -1).
If you explicitly want to downsample your image during the convolution, you can define a stride, e.g. stride=2, which means that you move the filter in steps of 2 pixels. Then, the expression becomes ((input_size + 2*padding - filter_size)/stride) +1.
In your case, the full calculations are:
#  name      size                         parameters
-- --------  ---------------------------  ------------------------
0  input     1x28x28                      0
1  conv2d1   (28-(5-1))=24 -> 32x24x24    (5*5*1+1)*32   =     832
2  maxpool1  32x12x12                     0
3  conv2d2   (12-(3-1))=10 -> 32x10x10    (3*3*32+1)*32  =   9'248
4  maxpool2  32x5x5                       0
5  dense     256                          (32*5*5+1)*256 = 205'056
6  output    10                           (256+1)*10     =   2'570
So in your network, you have a total of 832 + 9'248 + 205'056 + 2'570 = 217'706 learnable parameters, which is exactly what Lasagne reports.
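The same bookkeeping can be reproduced with a few lines of Python, using the per-layer formulas given above: (n*m*l+1)*k for convolutional layers and (n+1)*m for fully-connected ones. The helper names are made up for this sketch and are not part of Lasagne.

```python
def conv_params(n, m, l, k):
    """(n*m*l + 1) * k: k filters of size n x m over l input maps, plus biases."""
    return (n * m * l + 1) * k

def dense_params(n_in, n_out):
    """(n_in + 1) * n_out: one weight per input-output pair, plus biases."""
    return (n_in + 1) * n_out

size = 28
size = size - (5 - 1)   # conv2d1:  24
size = size // 2        # maxpool1: 12
size = size - (3 - 1)   # conv2d2:  10
size = size // 2        # maxpool2: 5

params = [
    conv_params(5, 5, 1, 32),             # conv2d1
    conv_params(3, 3, 32, 32),            # conv2d2
    dense_params(32 * size * size, 256),  # dense
    dense_params(256, 10),                # output
]

print(size)         # 5
print(params)       # [832, 9248, 205056, 2570]
print(sum(params))  # 217706
```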
Building on top of @hbaderts's excellent reply, I just came up with some formulas for an I-C-P-C-P-H-O network (since I was working on a similar problem); sharing them in the figure below in case they are helpful.
Also, (1) a convolution layer with 2x2 stride and (2) a convolution layer with 1x1 stride + (max/avg) pooling with 2x2 stride each contribute the same number of parameters with 'same' padding, as can be seen below:
The size of a convolutional layer's output is calculated as ((n+2p-k)/s)+1
Here,
n is the input size, p is the padding, k is the kernel (filter) size, and s is the stride.
In the above case,
n=28, p=0, k=5, s=1, which gives ((28+0-5)/1)+1 = 24