How to set constraints on weights based on other weights in the same layer [PyTorch]

As a simple example, take a neural network with 4 input nodes and 1 output node. I would like to take the 4 corresponding weights and set the constraint W1 < W2 < W3 < W4. Is this possible in PyTorch?


Why does modifying the weights of a recurrent neural network in MATLAB not cause the output to change when predicting on the same data?

I consider the following recurrent neural network (RNN):
RNN under consideration
where x is the input (a vector of reals), h the hidden state vector and y is the output vector. I trained the network on Matlab using some data x and obtained W, V, and U.
However, in MATLAB after changing matrix W to W', and keeping U,V the same, the output (y) of the RNN that uses W is the same as the output (y') of the RNN that uses W' when both predict on the same data x. Those two outputs should be different just by looking at the above equation, but I don't seem to be able to do that in MATLAB (when I modify V or U, the outputs do change). How could I fix the code so that the outputs (y) and (y') are different as they should be?
The relevant code is shown below:
[x,t] = simplefit_dataset; % x: input data ; t: targets
net = newelm(x,t,5); % Recurrent neural net with 1 hidden layer (5 nodes) and 1 output layer (1 node)
net.layers{1}.transferFcn = 'tansig'; % 'tansig': equivalent to tanh and also is the activation function used for hidden layer
net.biasConnect = [0;0]; % biases set to zero for easier experimenting
net.derivFcn ='defaultderiv'; % defaultderiv: tells Matlab to pick whatever derivative scheme works best for this net
view(net) % displays the network topology
net = train(net,x,t); % trains the network
W = net.LW{1,1}; U = net.IW{1,1}; V = net.LW{2,1}; % network matrices
Y = net(x); % Y: output when predicting on data x using W
net.LW{1,1} = rand(5,5); % This is the modified matrix W, W'
Y_prime = net(x); % Y_prime: output when predicting on data x using W'
max(abs(Y-Y_prime)) % The difference between the two outputs is 0 when it probably shouldn't be.
Edit: minor corrections.
This is the recursion in your first layer: (from the docs)
The weight matrix for the weight going to the ith layer from the jth
layer (or a null matrix [ ]) is located at net.LW{i,j} if
net.layerConnect(i,j) is 1 (or 0).
So net.LW{1,1} holds the weights to the first layer from the first layer (i.e. the recursion), whereas net.LW{2,1} stores the weights to the second layer from the first layer. Now, what does it mean when one can change the weights of the recursion randomly without any effect? (In fact, you can even set them to zero, net.LW{1,1} = zeros(size(W));, without an effect.) Note that this is essentially the same as dropping the recursion and creating a simple feed-forward network.
Hypothesis: The recursion has no effect.
You will note that if you change the weights to the second layer (1 neuron) from the first layer (5 neurons) net.LW{2,1} = zeros(size(V));, it will affect your prediction (the same is of course true if you change the input weights net.IW).
Why does the recursion have no effect?
Well, that beats me. I have no idea where this special glitch is or what the theory is behind the newelm network.
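One situation in which recurrent weights provably drop out is sketched below. It is offered as an assumption to check, not as a confirmed diagnosis of newelm. The sketch assumes the usual Elman recursion h(t) = tanh(U x(t) + W h(t-1)), y(t) = V h(t) (the figure is not reproduced here, so this form is itself an assumption), and the function names are hypothetical. If every sample is evaluated as its own one-step sequence starting from a zero hidden state, the term W h(t-1) is zero whatever W is.

```python
import numpy as np

def run_sequence(U, W, V, xs):
    """Treat xs as one sequence: the hidden state carries over between steps."""
    h = np.zeros(W.shape[0])
    ys = []
    for x in xs:
        h = np.tanh(U @ np.atleast_1d(x) + W @ h)
        ys.append(V @ h)
    return np.array(ys)

def run_independent(U, W, V, xs):
    """Treat every sample as its own one-step sequence with h(t-1) = 0."""
    ys = []
    for x in xs:
        h_prev = np.zeros(W.shape[0])          # hidden state is reset each time
        h = np.tanh(U @ np.atleast_1d(x) + W @ h_prev)
        ys.append(V @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 1))      # input -> hidden
V = rng.normal(size=(1, 5))      # hidden -> output
W = rng.normal(size=(5, 5))      # hidden -> hidden (the recursion)
W_prime = rng.random((5, 5))     # modified recursion weights
xs = np.linspace(0.0, 1.0, 10)

diff_independent = np.max(np.abs(run_independent(U, W, V, xs)
                                 - run_independent(U, W_prime, V, xs)))
diff_sequence = np.max(np.abs(run_sequence(U, W, V, xs)
                              - run_sequence(U, W_prime, V, xs)))
print(diff_independent, diff_sequence)
```

In the independent case the two outputs coincide, while in the sequence case they differ. Whether the MATLAB call above evaluates the columns of x as independent samples rather than as one time series would have to be verified separately.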

Neural network: weights and biases convergence

I've been reading up on a few topics regarding machine learning, neural networks and deep learning, one of which is this (in my opinion) excellent online book:
For the most part I've come to understand the workings of a neural network but there is one question which still bugs me (which is based on the example on the website):
I consider a three layer neural network with an input layer, hidden layer and output layer. Say these layers have 2, 3 and 1 neurons (although the amount doesn't really matter).
Now an input is given: x1 and x2. Because the network is [2, 3, 1], the weights are randomly generated the first time as a list containing a 2x3 and a 3x1 matrix. The biases are a list of a 3x1 and a 1x1 matrix.
Now the part I don't get:
The formula calculated in the hidden layer:
weights x input - biases = 0
On every iteration the weights and biases are changed slightly, based on the derivative, in order to find a global optimum. If this is the case, why don't the biases and weights for every neuron converge to the same weights and biases?
I think I found the answer by doing some tests as well as finding some information on the internet. The answer lies in having random initial weights and biases. If all "neurons" started out equal, they would all come to the same result, since the weights, biases and inputs are equal. Having random weights allows for different answers:
x1 = 1
x2 = 2
x3 = 3
w1 = [0, 0, 1], giving w dot x = 3
w2 = [3, 0, 0], giving w dot x = 3
If anyone can confirm, please do so.
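The symmetry argument can be checked numerically. The sketch below is a minimal illustration, not the book's code: it assumes sigmoid activations, a squared-error cost, a single training example, and weight matrices stored as (outputs x inputs), all of which are my own choices. With identical hidden neurons the gradient is the same for each of them, so a gradient step keeps them identical. With random weights the gradients differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_gradients(W1, b1, W2, b2, x, target):
    """Gradients dC/dW1 and dC/db1 for a [2, 3, 1] net with C = 0.5*(y - target)^2."""
    a1 = sigmoid(W1 @ x + b1)                    # hidden activations, shape (3,)
    y = sigmoid(W2 @ a1 + b2)                    # network output, shape (1,)
    delta2 = (y - target) * y * (1 - y)          # output-layer error, shape (1,)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # hidden-layer error, shape (3,)
    return np.outer(delta1, x), delta1

x = np.array([1.0, 2.0])
target = np.array([1.0])
b1 = np.zeros(3)
b2 = np.zeros(1)

# Identical start: every hidden neuron has the same weights and bias.
W1_same = np.full((3, 2), 0.5)
W2_same = np.full((1, 3), 0.5)
gW_same, gb_same = hidden_gradients(W1_same, b1, W2_same, b2, x, target)

# Random start: the hidden neurons differ from one another.
rng = np.random.default_rng(0)
W1_rand = rng.normal(size=(3, 2))
W2_rand = rng.normal(size=(1, 3))
gW_rand, gb_rand = hidden_gradients(W1_rand, b1, W2_rand, b2, x, target)

print(gW_same)   # one row per hidden neuron; all rows are equal
print(gW_rand)   # rows differ from neuron to neuron
```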

How to calculate the number of parameters for convolutional neural network?

I'm using Lasagne to create a CNN for the MNIST dataset. I'm closely following this example: Convolutional Neural Networks and Feature Extraction with Python.
The CNN architecture I have at the moment, which doesn't include any dropout layers, is:
layers=[('input', layers.InputLayer), # Input Layer
('conv2d1', layers.Conv2DLayer), # Convolutional Layer
('maxpool1', layers.MaxPool2DLayer), # 2D Max Pooling Layer
('conv2d2', layers.Conv2DLayer), # Convolutional Layer
('maxpool2', layers.MaxPool2DLayer), # 2D Max Pooling Layer
('dense', layers.DenseLayer), # Fully connected layer
('output', layers.DenseLayer), # Output Layer
],
# input layer
input_shape=(None, 1, 28, 28),
# layer conv2d1
conv2d1_filter_size=(5, 5),
# layer maxpool1
maxpool1_pool_size=(2, 2),
# layer conv2d2
conv2d2_filter_size=(3, 3),
# layer maxpool2
maxpool2_pool_size=(2, 2),
# Fully Connected Layer
# output Layer
# optimization method params
update= momentum,
This outputs the following Layer Information:
#  name      size
-  --------  --------
0  input     1x28x28
1  conv2d1   32x24x24
2  maxpool1  32x12x12
3  conv2d2   32x10x10
4  maxpool2  32x5x5
5  dense     256
6  output    10
and outputs the number of learnable parameters as 217,706
I'm wondering how this number is calculated. I've read a number of resources, including this StackOverflow question, but none clearly generalizes the calculation.
If possible, can the calculation of the learnable parameters per layer be generalised?
For example, convolutional layer: number of filters x filter width x filter height.
Let's first look at how the number of learnable parameters is calculated for each individual type of layer you have, and then calculate the number of parameters in your example.
Input layer: All the input layer does is read the input image, so there are no parameters you could learn here.
Convolutional layers: Consider a convolutional layer which takes l feature maps as input and produces k feature maps as output. The filter size is n x m.
For example, take a layer with l=32 input feature maps, k=64 output feature maps, and a filter size of n=3 x m=3. It is important to understand that we don't simply have a 3x3 filter, but actually a 3x3x32 filter, as our input has 32 feature maps. And we learn 64 different 3x3x32 filters.
Thus, the total number of weights is n*m*k*l.
Then, there is also a bias term for each feature map, so we have a total number of parameters of (n*m*l+1)*k.
Pooling layers: The pooling layers e.g. do the following: "replace a 2x2 neighborhood by its maximum value". So there is no parameter you could learn in a pooling layer.
Fully-connected layers: In a fully-connected layer, all input units have a separate weight to each output unit. For n inputs and m outputs, the number of weights is n*m. Additionally, you have a bias for each output node, so you are at (n+1)*m parameters.
Output layer: The output layer is a normal fully-connected layer, so (n+1)*m parameters, where n is the number of inputs and m is the number of outputs.
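The per-layer formulas above can be written as two small helpers. This is only a sketch. The function names are made up for illustration and are not part of Lasagne.

```python
def conv_params(n, m, l, k):
    """Conv layer: k filters of size n x m x l, plus one bias per output feature map."""
    weights = n * m * l * k
    biases = k
    return weights + biases          # equals (n*m*l + 1) * k

def dense_params(n_in, n_out):
    """Fully-connected layer: one weight per input/output pair, plus one bias per output."""
    return (n_in + 1) * n_out

# The 3x3 example above: 32 input feature maps, 64 output feature maps.
print(conv_params(3, 3, 32, 64))     # 18496
```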
The final difficulty is the first fully-connected layer: we do not know the dimensionality of the input to that layer, as it is a convolutional layer. To calculate it, we have to start with the size of the input image, and calculate the size of each convolutional layer. In your case, Lasagne already calculates this for you and reports the sizes - which makes it easy for us. If you have to calculate the size of each layer yourself, it's a bit more complicated:
In the simplest case (like your example), the size of the output of a convolutional layer is input_size - (filter_size - 1), in your case: 28 - 4 = 24. This is due to the nature of the convolution: we use e.g. a 5x5 neighborhood to calculate a point, but the two outermost rows and columns on each side don't have a full 5x5 neighborhood, so we can't calculate any output for those points. This is why our output is 2*2=4 rows/columns smaller than the input.
If one doesn't want the output to be smaller than the input, one can zero-pad the image (with the pad parameter of the convolutional layer in Lasagne). E.g. if you add 2 rows/cols of zeros around the image, the output size will be (28+4)-4=28. So in case of padding, the output size is input_size + 2*padding - (filter_size -1).
If you explicitly want to downsample your image during the convolution, you can define a stride, e.g. stride=2, which means that you move the filter in steps of 2 pixels. Then, the expression becomes ((input_size + 2*padding - filter_size)/stride) +1.
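The three cases above (no padding, padding, stride) collapse into one helper. This is a sketch with a hypothetical function name. Rounding the division down when the stride does not divide evenly is my assumption, since the formula above does not say how to round.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output width/height of a convolution along one dimension."""
    # Integer division: assumes positions where the filter does not fully fit are dropped.
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(28, 5))              # 24, the plain case from the example
print(conv_output_size(28, 5, padding=2))   # 28, zero-padding keeps the size
print(conv_output_size(12, 3))              # 10, the second convolutional layer
```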
In your case, the full calculations are:
#  name      size                       parameters
-  --------  -------------------------  ------------------------
0  input     1x28x28                    0
1  conv2d1   (28-(5-1))=24 -> 32x24x24  (5*5*1+1)*32   =     832
2  maxpool1  32x12x12                   0
3  conv2d2   (12-(3-1))=10 -> 32x10x10  (3*3*32+1)*32  =   9'248
4  maxpool2  32x5x5                     0
5  dense     256                        (32*5*5+1)*256 = 205'056
6  output    10                         (256+1)*10     =   2'570
So in your network, you have a total of 832 + 9'248 + 205'056 + 2'570 = 217'706 learnable parameters, which is exactly what Lasagne reports.
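The same walk through the network can be automated. The sketch below tracks the feature-map shape layer by layer and accumulates the parameter count. The filter counts and sizes come from the layer table. The tuple layout of `network` is a made-up convention for this illustration, not a Lasagne structure.

```python
# (name, kind, settings) for every layer after the 1x28x28 input.
network = [
    ("conv2d1", "conv", {"filters": 32, "filter_size": 5}),
    ("maxpool1", "pool", {"pool_size": 2}),
    ("conv2d2", "conv", {"filters": 32, "filter_size": 3}),
    ("maxpool2", "pool", {"pool_size": 2}),
    ("dense", "dense", {"units": 256}),
    ("output", "dense", {"units": 10}),
]

channels, size = 1, 28      # input: 1 feature map of 28x28
flat_units = None           # set once the first dense layer has been reached
per_layer = {}

for name, kind, cfg in network:
    if kind == "conv":
        k, f = cfg["filters"], cfg["filter_size"]
        per_layer[name] = (f * f * channels + 1) * k
        channels, size = k, size - (f - 1)       # no padding, stride 1
    elif kind == "pool":
        per_layer[name] = 0                      # nothing to learn
        size //= cfg["pool_size"]
    else:  # dense
        n_in = flat_units if flat_units is not None else channels * size * size
        per_layer[name] = (n_in + 1) * cfg["units"]
        flat_units = cfg["units"]

total = sum(per_layer.values())
print(per_layer)
print(total)                # 217706
```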
Building on @hbaderts's excellent reply, I came up with a formula for an I-C-P-C-P-H-O network (since I was working on a similar problem); sharing it in the figure below in case it is helpful.
Also, (1) a convolution layer with 2x2 stride and (2) a convolution layer with 1x1 stride followed by (max/avg) pooling with 2x2 stride each contribute the same number of parameters with 'same' padding, as can be seen below:
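The output size of a convolutional layer is calculated as ((n + 2p - k)/s) + 1,
where n is the input size, p is the padding, k is the kernel (filter) size and s is the stride.
In the case above, n=28, p=0, k=5 and s=1,
which gives ((28 + 0 - 5)/1) + 1 = 24.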

Gradient checking in backpropagation

I'm trying to implement gradient checking for a simple feedforward neural network with 2 unit input layer, 2 unit hidden layer and 1 unit output layer. What I do is the following:
Take each weight w of the network weights between all layers and perform forward propagation using w + EPSILON and then w - EPSILON.
Compute the numerical gradient using the results of the two feedforward propagations.
What I don't understand is how exactly to perform the backpropagation. Normally, I compare the output of the network to the target data (in case of classification) and then backpropagate the error derivative across the network. However, I think in this case some other value has to be backpropagated, since the results of the numerical gradient computation do not depend on the target data (but only on the input), while the error backpropagation does depend on the target data. So, what is the value that should be used in the backpropagation part of gradient check?
Backpropagation is performed after computing the gradients analytically and then using those formulas while training. A neural network is essentially a multivariate function, where the coefficients or the parameters of the function need to be found or trained.
The definition of a gradient with respect to a specific variable is the rate of change of the function value. Therefore, as you mentioned, and from the definition of the first derivative we can approximate the gradient of a function, including a neural network.
To check if your analytical gradient for your neural network is correct or not, it is good to check it using the numerical method.
For each weight layer w_l from all layers W = [w_0, w_1, ..., w_l, ..., w_k]
    For i in 0 to number of rows in w_l
        For j in 0 to number of columns in w_l
            w_l_minus = w_l;                          # Copy all the weights
            w_l_minus[i,j] = w_l_minus[i,j] - eps;    # Change only this parameter
            w_l_plus = w_l;                           # Copy all the weights
            w_l_plus[i,j] = w_l_plus[i,j] + eps;      # Change only this parameter
            cost_minus = cost of neural net by replacing w_l by w_l_minus
            cost_plus = cost of neural net by replacing w_l by w_l_plus
            w_l_grad[i,j] = (cost_plus - cost_minus)/(2*eps)
This process changes only one parameter at a time and computes the numerical gradient. In this case I have used the central difference (f(x+h) - f(x-h))/(2h), which seems to work better for me.
Note that you mentioned: "since in the results of the numerical gradient computation are not dependent of the target data"; this is not true. When you compute cost_minus and cost_plus above, the cost is being computed on the basis of
The weights
The target classes
Therefore, the process of backpropagation should be independent of the gradient checking. Compute the numerical gradients before backpropagation update. Compute the gradients using backpropagation in one epoch (using something similar to above). Then compare each gradient component of the vectors/matrices and check if they are close enough.
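A runnable version of this comparison for a 2-2-1 network might look as follows. It is a sketch under assumptions that are mine and not taken from the question: sigmoid activations, no biases, a squared-error cost and a single training example. Note that the target enters both the numerical and the backpropagated gradient through the cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(weights, x, target):
    """Squared-error cost of the 2-2-1 network; depends on weights AND target."""
    W1, W2 = weights
    y = sigmoid(W2 @ sigmoid(W1 @ x))
    return 0.5 * np.sum((y - target) ** 2)

def backprop_gradients(weights, x, target):
    """Analytical gradients of the cost with respect to W1 and W2."""
    W1, W2 = weights
    a1 = sigmoid(W1 @ x)
    y = sigmoid(W2 @ a1)
    delta2 = (y - target) * y * (1 - y)
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)
    return [np.outer(delta1, x), np.outer(delta2, a1)]

def numerical_gradients(weights, x, target, eps=1e-5):
    """Central-difference gradients, perturbing one weight at a time."""
    grads = []
    for l, w_l in enumerate(weights):
        grad = np.zeros_like(w_l)
        for i in range(w_l.shape[0]):
            for j in range(w_l.shape[1]):
                w_plus = [w.copy() for w in weights]    # real copies, not aliases
                w_minus = [w.copy() for w in weights]
                w_plus[l][i, j] += eps
                w_minus[l][i, j] -= eps
                grad[i, j] = (cost(w_plus, x, target)
                              - cost(w_minus, x, target)) / (2 * eps)
        grads.append(grad)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 2)), rng.normal(size=(1, 2))]
x = np.array([0.5, -1.0])
target = np.array([1.0])

analytic = backprop_gradients(weights, x, target)
numeric = numerical_gradients(weights, x, target)
max_diff = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(max_diff)   # should be tiny if the backpropagation formulas are right
```

The explicit `.copy()` calls matter in NumPy: a plain assignment would alias the original array, so the perturbation would leak into the unperturbed weights.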
Whether you want to do some classification or have your network calculate a certain numerical function, you always have some target data. For example, let's say you wanted to train a network to calculate the function f(a, b) = a + b. In that case, this is the input and target data you want to train your network on:
a b Target
1 1 2
3 4 7
21 0 21
5 2 7
Just as with "normal" classification problems, the more input-target pairs, the better.

clusterdata Matlab function

I am using the Matlab clusterdata function to classify my data into 2 groups: noise and non-noise. The function works well, except that sometimes it labels all noise data as group 1 and all non-noise data as group 2, and sometimes the other way around.
How can I control this, i.e. always label the noise data as group 1?
Having control over the label names an unsupervised learning algorithm uses can generally be a problem. I suggest evaluating some of the features of the data after doing the clustering to see if the labels are as you want them.
If all your data is in an N x d matrix X, with a label vector Y (N x 1) taking values -1 and 1, you could evaluate the variance of each of the clusters. I suspect the noise data would exhibit higher variance, which could be used to see if the labels should be switched.
In the code below, 1 should be the non-noise, and -1 should be noise (this choice of labels (groups) makes it easier to flip the labels around).
%#Variance summed over all dimensions
varL1 = sum(var(X(Y==1,:)));
varL2= sum(var(X(Y==-1,:)));
%#Flip labels if L1 has higher variance than L2
if varL1 > varL2
    Y = Y * (-1);
end
If this works, you could afterwards change noise cluster to be group 1 and non-noise to group 2 by
Y(Y==1) = 2; %#NB: The order in which these statements are evaluated is important.
Y(Y==-1) = 1;