Sample sizes in neural networks - neural-network

I've seen that the general rule of thumb for sample sizes in neural networks is 10 times the 'number of weights'. So, for example, if we have an NxD input, one hidden layer of size M, and an output layer of size K, is the 'number of weights' 2 because of the two W matrices? Or is it DM + MK? Thank you in advance for your help.

The actual number of weights should be DM + MK + M + K, where DM is the number of weights for all units in the hidden layer and M is the number of bias terms in the hidden layer; in the same way, MK and K are the numbers of weights and bias terms in the output layer, respectively.
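The count above can be checked with a few lines of Python. This is only a minimal sketch; the function name mlp_weight_count and the example sizes D=3, M=5, K=2 are made up for illustration.

```python
def mlp_weight_count(d, m, k, include_bias=True):
    """Count the parameters of a D -> M -> K network with one hidden layer."""
    weights = d * m + m * k                  # the two weight matrices
    biases = (m + k) if include_bias else 0  # one bias per hidden and output unit
    return weights + biases


# Hypothetical sizes: D=3 inputs, M=5 hidden units, K=2 outputs.
print(mlp_weight_count(3, 5, 2))                      # DM + MK + M + K
print(mlp_weight_count(3, 5, 2, include_bias=False))  # DM + MK only
```

Under the 10x rule of thumb from the question, the first figure would be multiplied by 10 to get a rough sample size.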

Related

Why modifying the weights of a recurrent neural network in MATLAB does not cause the output to change when predicting on same data?

I consider the following recurrent neural network (RNN):
[Figure: RNN under consideration]
where x is the input (a vector of reals), h the hidden state vector and y is the output vector. I trained the network on Matlab using some data x and obtained W, V, and U.
However, in MATLAB after changing matrix W to W', and keeping U,V the same, the output (y) of the RNN that uses W is the same as the output (y') of the RNN that uses W' when both predict on the same data x. Those two outputs should be different just by looking at the above equation, but I don't seem to be able to do that in MATLAB (when I modify V or U, the outputs do change). How could I fix the code so that the outputs (y) and (y') are different as they should be?
The relevant code is shown below:
[x,t] = simplefit_dataset; % x: input data ; t: targets
net = newelm(x,t,5); % Recurrent neural net with 1 hidden layer (5 nodes) and 1 output layer (1 node)
net.layers{1}.transferFcn = 'tansig'; % 'tansig': equivalent to tanh and also is the activation function used for hidden layer
net.biasConnect = [0;0]; % biases set to zero for easier experimenting
net.derivFcn ='defaultderiv'; % defaultderiv: tells Matlab to pick whatever derivative scheme works best for this net
view(net) % displays the network topology
net = train(net,x,t); % trains the network
W = net.LW{1,1}; U = net.IW{1,1}; V = net.LW{2,1}; % network matrices
Y = net(x); % Y: output when predicting on data x using W
net.LW{1,1} = rand(5,5); % This is the modified matrix W, W'
Y_prime = net(x) % Y_prime: output when predicting on data x using W'
max(abs(Y-Y_prime )); % The difference between the two outputs is 0 when it probably shouldn't be.
This is the recursion in your first layer: (from the docs)
The weight matrix for the weight going to the ith layer from the jth
layer (or a null matrix [ ]) is located at net.LW{i,j} if
net.layerConnect(i,j) is 1 (or 0).
So net.LW{1,1} holds the weights to the first layer from the first layer (i.e. the recursion), whereas net.LW{2,1} stores the weights to the second layer from the first layer. Now, what does it mean when one can change the weights of the recursion randomly without any effect? (In fact, you can set them to zero with net.LW{1,1} = zeros(size(W)); without any effect.) Note that this is essentially the same as dropping the recursion and creating a simple feed-forward network.
Hypothesis: The recursion has no effect.
You will note that if you change the weights to the second layer (1 neuron) from the first layer (5 neurons) net.LW{2,1} = zeros(size(V));, it will affect your prediction (the same is of course true if you change the input weights net.IW).
Why does the recursion have no effect?
Well, that beats me. I have no idea where this special glitch is or what the theory is behind the newelm network.
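One thing that may be worth checking is how the state is carried between samples. The sketch below is plain Python, not MATLAB, and it assumes the usual Elman recursion h_t = tanh(U x_t + W h_{t-1}), y_t = V h_t (the original figure with the equation is not available here, so this form is an assumption). The matrices and inputs are made up. It only illustrates that W cannot influence the output when every sample is evaluated as a one-step sequence starting from a zero hidden state; whether that is what happens inside newelm is not confirmed here.

```python
import math


def elman_forward(xs, U, W, V):
    """Run h_t = tanh(U*x_t + W*h_{t-1}), y_t = V.h_t over scalar inputs xs."""
    n = len(W)
    h = [0.0] * n  # assumed zero initial hidden state
    ys = []
    for x in xs:
        h = [math.tanh(U[i] * x + sum(W[i][j] * h[j] for j in range(n)))
             for i in range(n)]
        ys.append(sum(V[i] * h[i] for i in range(n)))
    return ys


# Hypothetical weights: 1 input, 2 hidden units, 1 output.
U = [0.5, -0.3]
V = [1.0, 2.0]
W_a = [[0.1, 0.2], [0.3, 0.4]]
W_b = [[0.9, -0.7], [0.5, 0.6]]
data = [1.0, 2.0, 3.0]

# Case 1: each sample is its own one-step sequence (state reset every time).
independent_a = [elman_forward([x], U, W_a, V)[0] for x in data]
independent_b = [elman_forward([x], U, W_b, V)[0] for x in data]

# Case 2: the samples form one sequence, so the hidden state carries over.
sequence_a = elman_forward(data, U, W_a, V)
sequence_b = elman_forward(data, U, W_b, V)

print(independent_a == independent_b)  # W only ever multiplies a zero state
print(sequence_a == sequence_b)        # W matters from the second step on
```

In this toy model the outputs differ only when the inputs are processed as a sequence, which matches the observation that zeroing the recurrent weights behaves like a feed-forward network.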

How to calculate the number of parameters for convolutional neural network?

I'm using Lasagne to create a CNN for the MNIST dataset. I'm following closely to this example: Convolutional Neural Networks and Feature Extraction with Python.
The CNN architecture I have at the moment, which doesn't include any dropout layers, is:
NeuralNet(
    layers=[('input', layers.InputLayer),        # Input Layer
            ('conv2d1', layers.Conv2DLayer),     # Convolutional Layer
            ('maxpool1', layers.MaxPool2DLayer), # 2D Max Pooling Layer
            ('conv2d2', layers.Conv2DLayer),     # Convolutional Layer
            ('maxpool2', layers.MaxPool2DLayer), # 2D Max Pooling Layer
            ('dense', layers.DenseLayer),        # Fully connected layer
            ('output', layers.DenseLayer),       # Output Layer
            ],
    # input layer
    input_shape=(None, 1, 28, 28),
    # layer conv2d1
    conv2d1_num_filters=32,
    conv2d1_filter_size=(5, 5),
    conv2d1_nonlinearity=lasagne.nonlinearities.rectify,
    # layer maxpool1
    maxpool1_pool_size=(2, 2),
    # layer conv2d2
    conv2d2_num_filters=32,
    conv2d2_filter_size=(3, 3),
    conv2d2_nonlinearity=lasagne.nonlinearities.rectify,
    # layer maxpool2
    maxpool2_pool_size=(2, 2),
    # Fully Connected Layer
    dense_num_units=256,
    dense_nonlinearity=lasagne.nonlinearities.rectify,
    # output Layer
    output_nonlinearity=lasagne.nonlinearities.softmax,
    output_num_units=10,
    # optimization method params
    update=momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,
    max_epochs=10,
    verbose=1,
)
This outputs the following Layer Information:
  #  name      size
---  --------  --------
  0  input     1x28x28
  1  conv2d1   32x24x24
  2  maxpool1  32x12x12
  3  conv2d2   32x10x10
  4  maxpool2  32x5x5
  5  dense     256
  6  output    10
and outputs the number of learnable parameters as 217,706.
I'm wondering how this number is calculated. I've read a number of resources, including this Stack Overflow question, but none clearly generalizes the calculation.
If possible, can the calculation of the learnable parameters per layer be generalised?
For example, convolutional layer: number of filters x filter width x filter height.
Let's first look at how the number of learnable parameters is calculated for each individual type of layer you have, and then calculate the number of parameters in your example.
Input layer: All the input layer does is read the input image, so there are no parameters you could learn here.
Convolutional layers: Consider a convolutional layer which takes l feature maps at the input and has k feature maps as output. The filter size is n x m.
For example, suppose the input has l=32 feature maps, the output has k=64 feature maps, and the filter size is n=3 x m=3. It is important to understand that we don't simply have a 3x3 filter, but actually a 3x3x32 filter, as our input has 32 feature maps. And we learn 64 different 3x3x32 filters.
Thus, the total number of weights is n*m*k*l.
Then, there is also a bias term for each feature map, so we have a total number of parameters of (n*m*l+1)*k.
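As a quick check of the formula (n*m*l+1)*k, here is a minimal Python sketch. The function name conv_params is made up for illustration; the numbers are the l=32, k=64, 3x3 example from this answer.

```python
def conv_params(n, m, l, k, bias=True):
    """Parameters of a conv layer: k filters of size n x m over l input feature maps."""
    per_filter = n * m * l + (1 if bias else 0)  # weights of one filter plus its bias
    return per_filter * k


print(conv_params(3, 3, 32, 64))              # weights and biases
print(conv_params(3, 3, 32, 64, bias=False))  # weights only: n*m*k*l
```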
Pooling layers: A pooling layer does, for example, the following: "replace a 2x2 neighborhood by its maximum value". So there is no parameter you could learn in a pooling layer.
Fully-connected layers: In a fully-connected layer, all input units have a separate weight to each output unit. For n inputs and m outputs, the number of weights is n*m. Additionally, you have a bias for each output node, so you are at (n+1)*m parameters.
Output layer: The output layer is a normal fully-connected layer, so (n+1)*m parameters, where n is the number of inputs and m is the number of outputs.
The final difficulty is the first fully-connected layer: we do not know the dimensionality of the input to that layer, as it is a convolutional layer. To calculate it, we have to start with the size of the input image, and calculate the size of each convolutional layer. In your case, Lasagne already calculates this for you and reports the sizes - which makes it easy for us. If you have to calculate the size of each layer yourself, it's a bit more complicated:
In the simplest case (like your example), the size of the output of a convolutional layer is input_size - (filter_size - 1), in your case: 28 - 4 = 24. This is due to the nature of the convolution: we use e.g. a 5x5 neighborhood to calculate a point - but the two outermost rows and columns don't have a 5x5 neighborhood, so we can't calculate any output for those points. This is why our output is 2*2=4 rows/columns smaller than the input.
If one doesn't want the output to be smaller than the input, one can zero-pad the image (with the pad parameter of the convolutional layer in Lasagne). E.g. if you add 2 rows/cols of zeros around the image, the output size will be (28+4)-4=28. So in case of padding, the output size is input_size + 2*padding - (filter_size -1).
If you explicitly want to downsample your image during the convolution, you can define a stride, e.g. stride=2, which means that you move the filter in steps of 2 pixels. Then, the expression becomes ((input_size + 2*padding - filter_size)/stride) +1.
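The three cases above (no padding, padding, stride) can be folded into one helper. This is a sketch only: the name conv_output_size is hypothetical, and rounding down with integer division when the stride does not divide evenly is an assumption, since the expression above does not say how to round.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output width/height of a convolution along one dimension."""
    # Floor division is an assumption for strides that do not divide evenly.
    return (input_size + 2 * padding - filter_size) // stride + 1


print(conv_output_size(28, 5))                       # no padding, stride 1
print(conv_output_size(28, 5, padding=2))            # padded so the size is preserved
print(conv_output_size(28, 5, padding=2, stride=2))  # padded and downsampled
```

With padding=0 and stride=1 the expression reduces to input_size - (filter_size - 1), the simplest case described above.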
In your case, the full calculations are:
  #  name      size                        parameters
---  --------  --------------------------  ------------------------
  0  input     1x28x28                     0
  1  conv2d1   (28-(5-1))=24 -> 32x24x24   (5*5*1+1)*32   =     832
  2  maxpool1  32x12x12                    0
  3  conv2d2   (12-(3-1))=10 -> 32x10x10   (3*3*32+1)*32  =   9'248
  4  maxpool2  32x5x5                      0
  5  dense     256                         (32*5*5+1)*256 = 205'056
  6  output    10                          (256+1)*10     =   2'570
So in your network, you have a total of 832 + 9'248 + 205'056 + 2'570 = 217'706 learnable parameters, which is exactly what Lasagne reports.
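The whole tally can be reproduced in a few lines of Python. This is a sketch with hypothetical helper names (conv_params, dense_params); it does not use Lasagne and simply walks through the layer sizes from the question.

```python
def conv_params(n, m, l, k):
    """(n*m*l + 1) * k: k filters of size n x m over l input maps, plus biases."""
    return (n * m * l + 1) * k


def dense_params(n_in, n_out):
    """(n_in + 1) * n_out: one weight per input-output pair, plus biases."""
    return (n_in + 1) * n_out


channels, size = 1, 28  # input: 1x28x28
per_layer = {}

per_layer["conv2d1"] = conv_params(5, 5, channels, 32)
channels, size = 32, size - (5 - 1)  # 32x24x24
size //= 2                           # maxpool1 (2x2): 32x12x12, no parameters

per_layer["conv2d2"] = conv_params(3, 3, channels, 32)
channels, size = 32, size - (3 - 1)  # 32x10x10
size //= 2                           # maxpool2 (2x2): 32x5x5, no parameters

per_layer["dense"] = dense_params(channels * size * size, 256)
per_layer["output"] = dense_params(256, 10)

total = sum(per_layer.values())
print(per_layer)
print(total)
```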
Building on @hbaderts's excellent reply, I came up with a formula for an I-C-P-C-P-H-O network (I was working on a similar problem) and am sharing it in the figure below in case it is helpful.
Also, (1) a convolution layer with 2x2 stride and (2) a convolution layer with 1x1 stride followed by (max/avg) pooling with 2x2 stride each contribute the same number of parameters with 'same' padding, as can be seen below:
The output size of a convolutional layer is calculated as ((n + 2p - k) / s) + 1, where n is the input size, p is the padding, k is the kernel (filter) size, and s is the stride. In the case above, n=28, p=0, k=5, and s=1.

Multilayer Perceptron with linear activation function

From Wikipedia:
If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then it is easily proved with linear algebra that any number of layers can be reduced to the standard two-layer input-output model (see perceptron).
I have seen a Multilayer Perceptron replaced with a Single Layer Perceptron, and what I understood is that this works because a combination of linear functions can itself be expressed as a linear function, and that this is the only reason. Am I right?
So what does the reduction process look like? I.e. if we had a 3x5x2 MLP, what would the SLP look like? Is the size of the input layer based on the number of parameters used to express the linear function, as in the linked answer?
f(x) = a x + b
g(z) = c z + d
g(f(x)) = c (a x + b) + d = ac x + cb + d = (ac) x + (cb + d)
So would it be 4 inputs? (a, b, c, d, since it is a combination of two linear functions with different parameters)
Thanks in advance!
The size will be 3x2, and the hidden layer will simply disappear, with all the weights of the hidden layer's linear functions collapsed into the weights of the input layer. In your example there are 3 times 5 (input to hidden), i.e. 15 functions, plus 5 times 2 (hidden to output), i.e. 10 functions, so 25 different linear functions in total. They are different because the weights in each case are different. So f(x) and g(z) as you describe them are not a correct depiction.
The hidden layer can be collapsed by taking an input neuron and an output neuron and forming the linear combination of all the intermediate functions on the nodes that connect those two neurons through the hidden layer. In the end you will be left with 6 unique functions which describe your 3x2 mapping.
For your own understanding, try doing this on paper with a simple 2x2x1 MLP, with different weights on each node.
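The collapse can also be checked numerically. The sketch below uses made-up integer weights for a 3x5x2 network with identity (linear) activations; the names W1, b1, W2, b2 and the helpers matmul and affine are assumptions for illustration. The collapsed network has the 2x3 weight matrix W = W2 W1 and, if biases are used, the bias b = W2 b1 + b2.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (p x q) and b (q x r)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def affine(w, bias, x):
    """Compute w x + bias for a matrix w and vectors bias, x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, bias)]


# Hypothetical weights: 3 inputs -> 5 hidden -> 2 outputs, linear activations.
W1 = [[1, 2, 3], [0, -1, 0], [2, 0, 1], [1, 1, -1], [3, 0, 2]]  # 5x3
b1 = [1, 0, -1, 2, 0]
W2 = [[1, 0, 2, -1, 1], [0, 3, 1, 1, -2]]                       # 2x5
b2 = [4, -3]

# Collapsed single layer: 2x3 weights (the 6 input-output connections).
W = matmul(W2, W1)
b = affine(W2, b2, b1)  # W2 b1 + b2

x = [1, -2, 3]
two_layer = affine(W2, b2, affine(W1, b1, x))
collapsed = affine(W, b, x)
print(two_layer, collapsed)
```

Integer weights are used so that the two results can be compared exactly; with floating-point weights they would agree only up to rounding.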

Matlab Neural Network Structure

I'm a complete newbie in neural networks. I generated an NN in Matlab. Now I need to know the exact structure of this NN, because I need to implement it in Java (static connections and weights, no learning). Can you explain how to connect the neurons and what math operations to perform in each element?
The NN parameters are as follows (taken from Matlab):
iw{1,1} - Weight to layer 1 from input 1
[2.8574 -1.9207;
1.7582 -1.2549;
-4.5925 0.23236;
12.0861 12.3701;
2.503 -1.9321;
-2.1422 2.6928]
lw{2,1} - Weight to layer
[-0.51977 5.3993 3.4349 5.2863 3.1976 -0.67102]
b{1} - Bias to layer 1
[-3.2811;
-6.956;
-3.0943;
11.1103;
0.14842;
-3.3705]
b{2} - Bias to layer 2
[1.4657]
Transfer function TANSIG
Greatly appreciate your help.
You have an NN that has 2 inputs, a hidden layer of 6 neurons, and an output layer of 1 neuron.
Each neuron in each layer takes all the outputs from the previous layer, multiplies each one by a weight, sums the products, and offsets the result by a bias.
The numbers you show are those weights and biases.
For example, neuron 1 of the hidden layer will output hidden1=2.8574*in1-1.9207*in2-3.2811. Then take whatever sigma function you are using and apply it: hidden1=sigma(hidden1).
As another example, the output will be out=-hidden1*0.51977+hidden2*5.3993+...-hidden6*0.67102+1.4657
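Put together, the forward pass might look like the sketch below. It is written in Python rather than Java to keep it short, and it rests on several assumptions: tansig is taken to be tanh, the output neuron is treated as linear (as in the formula above; if the output layer also uses TANSIG, wrap the result in math.tanh), and any input or output scaling that the MATLAB network object may apply is ignored. The function names are made up.

```python
import math

# Weights and biases copied from the question.
IW = [[2.8574, -1.9207],
      [1.7582, -1.2549],
      [-4.5925, 0.23236],
      [12.0861, 12.3701],
      [2.503, -1.9321],
      [-2.1422, 2.6928]]
LW = [-0.51977, 5.3993, 3.4349, 5.2863, 3.1976, -0.67102]
B1 = [-3.2811, -6.956, -3.0943, 11.1103, 0.14842, -3.3705]
B2 = 1.4657


def hidden_layer(in1, in2):
    """Hidden activations: tanh(weighted sum of inputs + bias) per neuron."""
    return [math.tanh(w[0] * in1 + w[1] * in2 + b) for w, b in zip(IW, B1)]


def forward(in1, in2):
    """Network output, assuming a linear output neuron."""
    hidden = hidden_layer(in1, in2)
    return sum(v * h for v, h in zip(LW, hidden)) + B2


print(forward(0.0, 0.0))
```

Comparing a few outputs against MATLAB's own predictions for the same inputs would show whether the assumptions about the output transfer function and the scaling hold.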

Matlab Multilayer Perceptron Question

I need to classify a dataset using Matlab MLP and show classification.
The dataset looks like this: [figure: dataset]
What I have done so far is:
I have created a neural network containing a hidden layer (two neurons; maybe someone could give me some suggestions on how many neurons are suitable for my example) and an output layer (one neuron).
I have used several different learning methods, such as Delta-bar-Delta and backpropagation (both with and without momentum), and Levenberg-Marquardt.
This is the code I used in Matlab (Levenberg-Marquardt example):
net = newff(minmax(Input),[2 1],{'logsig' 'logsig'},'trainlm');
net.trainParam.epochs = 10000;
net.trainParam.goal = 0;
net.trainParam.lr = 0.1;
[net tr outputs] = train(net,Input,Target);
The following shows the hidden-neuron classification boundaries generated by Matlab on the data. I am a little bit confused, because the network should produce a nonlinear result, but the two boundary lines below appear to be linear.
[figure: hidden-neuron classification boundaries]
The code for generating above plot is:
figure(1)
plotpv(Input,Target);
hold on
plotpc(net.IW{1},net.b{1});
hold off
I also need to plot the output function of the output neuron, but I am stuck on this step. Can anyone give me some suggestions?
Thanks in advance.
Regarding the number of neurons in the hidden layer, for such a small example two are more than enough. The only way to know the optimum for sure is to test with different numbers. In this FAQ you can find a rule of thumb that may be useful: http://www.faqs.org/faqs/ai-faq/neural-nets/
For the output function, it is often useful to divide it in two steps:
First, given the input vector x, the output of the neurons in the hidden layer is y = f(x) = x^T w + b where w is the weight matrix from the input neurons to the hidden layer and b is the bias vector.
Second, you will have to apply the activation function g of the network to the resulting vector of the previous step z = g(y)
Finally, the output is the dot product h(z) = z . v + n, where v is the weight vector from the hidden layer to the output neuron and n is the bias. In the case of more than one output neuron, you will repeat this for each one.
I've never used the matlab mlp functions, so I don't know how to get the weights in this case, but I'm sure the network stores them somewhere. Edit: Searching the documentation I found the properties:
net.IW  numLayers-by-numInputs cell array of input weight values
net.LW  numLayers-by-numLayers cell array of layer weight values
net.b   numLayers-by-1 cell array of bias values
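The output function described in this answer can be sketched in Python to get values for plotting. The weights below are made up, not taken from a trained network. Following the newff call in the question, logsig is assumed for both layers, so it is also applied to the output neuron (the steps above stop at the dot product). Note that w is stored here as inputs-by-hidden to match y = x^T w + b, which would be the transpose of net.IW{1}.

```python
import math


def logsig(a):
    """Logistic sigmoid, the 'logsig' transfer function."""
    return 1.0 / (1.0 + math.exp(-a))


def mlp_output(x, w, b, v, n):
    """Hidden layer y = x^T w + b, z = logsig(y); output logsig(z . v + n)."""
    y = [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
         for j in range(len(b))]
    z = [logsig(yj) for yj in y]
    return logsig(sum(zj * vj for zj, vj in zip(z, v)) + n)


# Hypothetical weights for 2 inputs, 2 hidden neurons, 1 output neuron.
w = [[4.0, -3.0], [2.0, 5.0]]  # inputs-by-hidden
b = [-1.0, 0.5]
v = [3.0, -2.0]
n = 0.25

# Evaluate the output function on a coarse grid; these values can then be
# handed to any surface or contour plotting routine.
grid = [i / 4.0 for i in range(5)]
surface = [[mlp_output([x1, x2], w, b, v, n) for x2 in grid] for x1 in grid]
for row in surface:
    print(" ".join("%.3f" % val for val in row))
```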