Architecture of the neural network - neural-network

I was reading a paper and the authors described their network as follows:
"To train the corresponding deep network, a fully connected network with one
hidden layer is used. The network has nine binary input nodes. The hidden layer contains one sigmoid node, and in the output layer there is one inner product
function. Thus, the network has 10 variables."
The network is used to predict a continuous number (y). My problem is, I do not understand the structure of the network after the sigmoid node. What does the output layer do? What is the inner product used for?

Usually, the pre-activation of a neuron is an inner product (the dot product of a weight vector and an input vector) plus an additive bias. A single neuron can be described as
z = b + w1*x1 + w2*x2 + ... + wn*xn
  = b + w'*x
h = activation(z)
where b is the neuron's bias, and each h is the output of one layer and the input to the following layer. For the output layer, y = h. A layer may consist of multiple neurons or, as in your example, only a single neuron.
In the described case, it seems like no bias is used. I understand it as follows:
For each input neuron x1 to x9, a single weight is used, nothing fancy here. Since there are nine inputs, this makes 9 weights, resulting in something like:
hidden_out = sigmoid(w1*x1 + w2*x2 + ... + w9*x9)
In order to connect the hidden layer to the output, the same rule applies: The output layer's input is weighted and then summed over all inputs. Since there is only one input, only one weight is to be "summed", such that
output = w10*hidden_out
Keep in mind that the sigmoid function squashes its input into the range (0, 1), so multiplying it by a weight rescales it to your required output range.
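To make that reading concrete, here is a minimal sketch of the forward pass as I understand it: nine weights into one sigmoid node, one output weight, no biases. The weight values and the input vector are made up for illustration; they are not from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w_hidden, w_out):
    """9 binary inputs -> 1 sigmoid hidden node -> 1 linear output (no biases)."""
    z = sum(w * xi for w, xi in zip(w_hidden, x))  # inner product over the 9 inputs
    hidden_out = sigmoid(z)                        # value in (0, 1)
    return w_out * hidden_out                      # "inner product" with a single term

# Made-up parameters: 9 hidden weights + 1 output weight = 10 variables
w_hidden = [0.5, -0.25, 0.1, 0.0, 0.3, -0.4, 0.2, 0.05, -0.1]
w_out = 20.0
x = [1, 0, 1, 1, 0, 0, 1, 0, 1]

y = predict(x, w_hidden, w_out)
n_params = len(w_hidden) + 1
print(n_params, y)
```

With a positive w_out the prediction always lies between 0 and w_out, which is the rescaling mentioned above.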

Related

Multilayer Perceptron with linear activation function

From Wikipedia:
If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then it is easily proved with linear algebra that any number of layers can be reduced to the standard two-layer input-output model (see perceptron).
I have seen a multilayer perceptron replaced with a single-layer perceptron, and my understanding is that this works only because a composition of linear functions can itself be expressed as a linear function. Am I right?
So what does the reduction process look like? That is, if we had a 3x5x2 MLP, what would the SLP look like? Is the size of the input layer based on the number of parameters used to express the linear function, as in the linked answer?
f(x) = a x + b
g(z) = c z + d
g(f(x)) = c (a x + b) + d = ac x + cb + d = (ac) x + (cb + d)
So would it have 4 inputs (a, b, c, d, since it is a combination of two linear functions with different parameters)?
Thanks in advance!
The size will be 3x2: the hidden layer simply disappears, with all the weights of its linear functions collapsed into the weights of a single input-to-output layer. In your example there are 3 times 5 (input to hidden), i.e. 15, functions plus 5 times 2 (hidden to output), i.e. 10, functions, so 25 different linear functions in total. They are different because the weights differ in each case. So f(x) and g(z) as you describe them are not a correct depiction.
The hidden layer can be collapsed by taking an input neuron and an output neuron and forming the linear combination of all the intermediate functions on the paths that connect those two neurons through the hidden layer. In the end you are left with 6 unique functions that describe your 3x2 mapping.
For your own understanding, try doing this on paper with a simple 2x2x1 MLP, with different weights on each node.
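The same collapse can be checked numerically. The sketch below assumes identity activations and made-up weights (W1, b1, W2, b2 are illustrative names, not from the question). It builds the collapsed 3x2 layer as W = W2 W1 and b = W2 b1 + b2 and compares it with the two-layer network.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in A]

def vec_add(u, v):
    return [p + q for p, q in zip(u, v)]

# Made-up weights for a 3x5x2 network with linear (identity) activations
W1 = [[0.1 * (i + j) for j in range(3)] for i in range(5)]  # 5x3: input -> hidden
b1 = [0.5, -0.5, 0.0, 1.0, -1.0]
W2 = [[0.2 * (i - j) for j in range(5)] for i in range(2)]  # 2x5: hidden -> output
b2 = [0.3, -0.3]

# Collapsed single layer: 2x3 weight matrix (6 entries) plus 2 biases
W = matmul(W2, W1)
b = vec_add(matvec(W2, b1), b2)

x = [1.0, 2.0, -3.0]
y_two_layer = vec_add(matvec(W2, vec_add(matvec(W1, x), b1)), b2)
y_collapsed = vec_add(matvec(W, x), b)
print(y_two_layer, y_collapsed)
```

Each of the 6 entries of W sums the 5 paths from one input to one output through the hidden layer, which matches the "6 unique functions" above.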

How can a well trained ANN have a single set of weights that can represent multiple classes?

In multinomial classification, I'm using the softmax activation function for all non-linear units, and the ANN has k output nodes for k classes. Each of the k output nodes in the output layer is connected to all the weights in the preceding layer, kind of like the one shown below.
So, if the first output node intends to pull the weights in its favor, it will change all the weights that precede this layer, and the other output nodes will also pull, usually in a direction that contradicts the first one. It seems more like a tug of war over a single set of weights. So, do we need a separate set of weights (including weights for every node of every layer) for each of the output classes, or is there a different form of architecture? Please correct me if I'm wrong.
Each node has its own set of weights. Implementations and formulas usually use matrix multiplications, which can make you forget that, conceptually, each node has its own set of weights, but it does.
Each node returns a single value that gets sent to every node in the next layer. So a node on layer h receives num(h - 1) inputs, where num(h - 1) is the number of nodes in layer h - 1. Let these inputs be x1, x2, ..., xk. Then the neuron returns:
x1*w1 + x2*w2 + ... + xk*wk
Or a function of this. So each neuron maintains its own set of weights.
Let's consider the network in your image. Assume that we have some training instance for which the topmost neuron should output 1 and the others 0.
So our target is:
y = [1 0 0 0]
And our actual output is (ignoring the softmax for simplicity):
y^ = [0.88 0.12 0.04 0.5]
So it's already doing pretty well, but we must still do backpropagation to make it even better.
Now, our output delta is:
y^ - y = [-0.12 0.12 0.04 0.5]
You will update the weights of the topmost neuron using the delta -0.12, of the second neuron using 0.12 and so on.
Notice that each output neuron's weights get updated using these values: these weights will all increase or decrease in order to approach the correct values (0 or 1).
Now, notice that each output neuron's output depends on the outputs of hidden neurons. So you must also update those. Those will get updated using each output neuron's delta (see page 7 here for the update formulas). This is like applying the chain rule when taking derivatives.
You're right that, for a given hidden neuron, there is a "tug of war" going on, with each output neuron's errors pulling their own way. But this is normal, because the hidden layer must learn to satisfy all output neurons. This is a reason for initializing the weights randomly and for using multiple hidden neurons.
It is the output layer that adapts to give the final answers, which it can do since the weights of the output nodes are independent of each other. The hidden layer has to be influenced by all output nodes, and it must learn to accommodate them all.
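Here is a small sketch of those two kinds of update, using the target and output from above. The output weights, hidden activations and learning rate are made up, and activation derivatives are left out to match the simplification above (softmax ignored), so treat it as an illustration rather than a full backpropagation implementation.

```python
y = [1, 0, 0, 0]                  # target from the example above
y_hat = [0.88, 0.12, 0.04, 0.5]   # actual output from the example above
delta_out = [p - t for p, t in zip(y_hat, y)]

# Hypothetical values: W_out[k][j] connects hidden neuron j to output neuron k
W_out = [
    [0.4, -0.2, 0.1],
    [-0.3, 0.5, 0.2],
    [0.1, 0.1, -0.4],
    [0.2, -0.1, 0.3],
]
hidden_out = [0.6, 0.3, 0.9]
lr = 0.1

# Each output neuron k owns row k of W_out and updates it with its own delta only.
W_out_new = [
    [w - lr * delta_out[k] * h for w, h in zip(W_out[k], hidden_out)]
    for k in range(len(W_out))
]

# Each hidden neuron j receives the sum of all output deltas, weighted by its
# outgoing connections. This sum is the "tug of war".
delta_hidden = [
    sum(delta_out[k] * W_out[k][j] for k in range(len(delta_out)))
    for j in range(len(hidden_out))
]
print(delta_out, delta_hidden)
```

The rows of W_out never interfere with each other; only the hidden deltas mix the errors of all output neurons.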

Gradient checking in backpropagation

I'm trying to implement gradient checking for a simple feedforward neural network with 2 unit input layer, 2 unit hidden layer and 1 unit output layer. What I do is the following:
Take each weight w of the network weights between all layers and perform forward propagation using w + EPSILON and then w - EPSILON.
Compute the numerical gradient using the results of the two feedforward propagations.
What I don't understand is how exactly to perform the backpropagation. Normally, I compare the output of the network to the target data (in the case of classification) and then backpropagate the error derivative across the network. However, I think in this case some other value has to be backpropagated, since the results of the numerical gradient computation do not depend on the target data (only on the input), while the error backpropagation does depend on the target data. So, what value should be used in the backpropagation part of the gradient check?
In backpropagation, the gradients are derived analytically, and those formulas are then used during training. A neural network is essentially a multivariate function whose coefficients (parameters) need to be found, i.e. trained.
The gradient with respect to a specific variable is the rate of change of the function value with respect to that variable. Therefore, as you mentioned, we can use the definition of the first derivative to approximate the gradient of any function, including a neural network.
To check whether the analytical gradient for your neural network is correct, it is good to compare it against the numerical one.
For each weight layer w_l from all layers W = [w_0, w_1, ..., w_l, ..., w_k]
    For i in 0 to number of rows in w_l
        For j in 0 to number of columns in w_l
            w_l_minus = w_l;                          # Copy all the weights
            w_l_minus[i,j] = w_l_minus[i,j] - eps;    # Change only this parameter
            w_l_plus = w_l;                           # Copy all the weights
            w_l_plus[i,j] = w_l_plus[i,j] + eps;      # Change only this parameter
            cost_minus = cost of neural net by replacing w_l by w_l_minus
            cost_plus = cost of neural net by replacing w_l by w_l_plus
            w_l_grad[i,j] = (cost_plus - cost_minus)/(2*eps)
This process changes only one parameter at a time and computes the numerical gradient. Here I have used the central difference (f(x+h) - f(x-h))/(2h), which seems to work better for me than a one-sided difference; its error shrinks with h^2 rather than h.
Note that you mentioned that the results of the numerical gradient computation do not depend on the target data. This is not true: when you compute cost_minus and cost_plus above, the cost is computed on the basis of
The weights
The target classes
Therefore, the process of backpropagation should be independent of the gradient checking. Compute the numerical gradients before backpropagation update. Compute the gradients using backpropagation in one epoch (using something similar to above). Then compare each gradient component of the vectors/matrices and check if they are close enough.
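As a runnable sketch of that comparison for a 2-2-1 network: the code below assumes sigmoid units, a squared-error cost, no biases, and made-up weights, none of which come from the question. The six weights are kept in one flat list theta so that the numerical loop stays short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta, x):
    W1 = [theta[0:2], theta[2:4]]  # hidden weights, one row per hidden unit
    W2 = theta[4:6]                # output weights
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, out

def cost(theta, x, t):
    _, out = forward(theta, x)
    return 0.5 * (out - t) ** 2    # depends on the weights AND the target

def backprop_grad(theta, x, t):
    W2 = theta[4:6]
    h, out = forward(theta, x)
    d_out = (out - t) * out * (1.0 - out)
    d_hid = [d_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(2)]
    g_W1 = [d_hid[j] * xi for j in range(2) for xi in x]
    g_W2 = [d_out * hj for hj in h]
    return g_W1 + g_W2

def numerical_grad(theta, x, t, eps=1e-5):
    grad = []
    for i in range(len(theta)):
        plus, minus = theta[:], theta[:]  # copy all the weights
        plus[i] += eps                    # change only this parameter
        minus[i] -= eps
        grad.append((cost(plus, x, t) - cost(minus, x, t)) / (2 * eps))
    return grad

theta = [0.1, -0.2, 0.4, 0.3, 0.5, -0.6]  # made-up weights
x, t = [1.0, 0.5], 1.0
analytic = backprop_grad(theta, x, t)
numeric = numerical_grad(theta, x, t)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

If max_diff is not tiny, the analytic formulas in backprop_grad are the first place to look. Changing t changes both gradients, since the target enters through cost.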
Whether you want to do some classification or have your network calculate a certain numerical function, you always have some target data. For example, let's say you wanted to train a network to calculate the function f(a, b) = a + b. In that case, this is the input and target data you want to train your network on:
a    b    Target
1    1    2
3    4    7
21   0    21
5    2    7
...
Just as with "normal" classification problems, the more input-target pairs, the better.

Neural Network layer design

I am kind of new to neural networks. This is a piece of code I've tried in MATLAB:
P= 0 + (rand(1) * 10);
T = (P-1)/(P+1);
net = newelm(P,T,5);
net = train(net,P,T);
Y = sim(net,P);
Now when I type net.b{1} and net.LW{1} in the MATLAB command window, I get the bias weights and layer weights, but I also find that these weight values keep changing according to the input values.
So can I have predefined weight values that don't change, for a particular function (and for any input value), so that I can design a neural network for that function using these weights? Here, for example, T is related to P by a particular equation.
If one of your inputs has a known relation to the output variable, take it out of the network instead of creating a complex workaround like fixing network weights. (It will be complex because of the variable interactions and nonlinear transformations inside the network.)
E.g.
Y = a*X1 + 3.6*X2 # relationship between Y and X2 is known
Then use neural network on this relation:
Y - 3.6*X2 = a*X1
^^^^^^^^^^   ^^^^
 [target]   [input]
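A minimal sketch of that idea, assuming made-up data generated with a = 2.0. A one-parameter least-squares fit stands in for the neural network here, just to show that the model only has to learn the unknown part once the known term is subtracted.

```python
KNOWN_COEF = 3.6  # the known relationship between Y and X2 from the example above

def residual_targets(X2, Y):
    """Remove the known contribution of X2, leaving only a*X1 to be learned."""
    return [y - KNOWN_COEF * x2 for x2, y in zip(X2, Y)]

# Made-up data, generated with a = 2.0
X1 = [1.0, 2.0, 3.0]
X2 = [0.5, 1.5, -1.0]
Y = [2.0 * x1 + 3.6 * x2 for x1, x2 in zip(X1, X2)]

targets = residual_targets(X2, Y)  # new target: Y - 3.6*X2

# Stand-in for the network: closed-form least-squares estimate of a
a_hat = sum(x * t for x, t in zip(X1, targets)) / sum(x * x for x in X1)
print(a_hat)
```

At prediction time the known term is added back: Y = model(X1) + 3.6*X2.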

Matlab Multilayer Perceptron Question

I need to classify a dataset using a MATLAB MLP and show the classification.
The dataset looks like this:
[image of the dataset]
What I have done so far is:
I have created a neural network containing a hidden layer (two neurons; maybe someone could give me some suggestions on how many neurons are suitable for my example) and an output layer (one neuron).
I have used several different learning methods, such as Delta-bar-Delta, backpropagation (both with and without momentum) and Levenberg-Marquardt.
This is the code I used in MATLAB (Levenberg-Marquardt example):
net = newff(minmax(Input),[2 1],{'logsig' 'logsig'},'trainlm');
net.trainParam.epochs = 10000;
net.trainParam.goal = 0;
net.trainParam.lr = 0.1;
[net tr outputs] = train(net,Input,Target);
The following shows the hidden-neuron classification boundaries generated by MATLAB on the data. I am a little bit confused, because the network should produce a nonlinear result, but the two boundary lines below appear to be linear.
[image of the classification boundaries]
The code for generating above plot is:
figure(1)
plotpv(Input,Target);
hold on
plotpc(net.IW{1},net.b{1});
hold off
I also need to plot the output function of the output neuron, but I am stuck on this step. Can anyone give me some suggestions?
Thanks in advance.
Regarding the number of neurons in the hidden layer, for such a small example two are more than enough. The only way to know the optimum for sure is to test with different numbers. In this FAQ you can find a rule of thumb that may be useful: http://www.faqs.org/faqs/ai-faq/neural-nets/
For the output function, it is often useful to divide it into steps:
First, given the input vector x, compute the weighted input of the neurons in the hidden layer, y = f(x) = x^T w + b, where w is the weight matrix from the input neurons to the hidden layer and b is the bias vector.
Second, apply the activation function g of the network to the resulting vector of the previous step: z = g(y).
Finally, the output is the dot product h(z) = z . v + n, where v is the weight vector from the hidden layer to the output neuron and n is its bias. If the output neuron has its own activation function (logsig in your code), apply it to h(z) as well. With more than one output neuron, repeat this for each one.
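The steps above can be sketched in a few lines of Python. This is an illustration with made-up weights, not a port of the MATLAB toolbox: w[i][j] is assumed to connect input i to hidden neuron j so that it matches x^T w, and the optional output activation is my addition to cover a logsig output neuron.

```python
import math

def logsig(a):
    return 1.0 / (1.0 + math.exp(-a))

def mlp_output(x, w, b, v, n, output_activation=None):
    # Step 1: y = x^T w + b, one value per hidden neuron
    y = [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    # Step 2: z = g(y), here with g = logsig
    z = [logsig(yj) for yj in y]
    # Step 3: h = z . v + n
    h = sum(zj * vj for zj, vj in zip(z, v)) + n
    return output_activation(h) if output_activation else h

# Made-up weights for a network with 2 inputs, 2 hidden neurons and 1 output
w = [[0.5, -1.0], [0.25, 0.75]]
b = [0.1, -0.2]
v = [1.5, -2.0]
n = 0.3

x = [1.0, 2.0]
linear_out = mlp_output(x, w, b, v, n)
logsig_out = mlp_output(x, w, b, v, n, output_activation=logsig)
print(linear_out, logsig_out)
```

Evaluating mlp_output over a grid of input points gives the values needed to plot the output function of the output neuron.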
I've never used the MATLAB MLP functions, so I don't know how to get the weights in this case, but I'm sure the network stores them somewhere. Edit: searching the documentation, I found these properties:
net.IW    numLayers-by-numInputs cell array of input weight values
net.LW    numLayers-by-numLayers cell array of layer weight values
net.b     numLayers-by-1 cell array of bias values