Multilayer Perceptron with linear activation function

From Wikipedia:
If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then it is easily proved with linear algebra that any number of layers can be reduced to the standard two-layer input-output model (see perceptron).
I have seen a Multilayer Perceptron replaced with a Single Layer Perceptron, and what I understood is that this is because a combination of linear functions can be expressed as a single linear function, and that this is the only reason. Am I right?
So what does the reduction process look like? I.e. if we had a 3x5x2 MLP, what would the SLP look like? Is the size of the input layer based on the number of parameters used to express the linear function, as in the answer of the link above?:
f(x) = a x + b
g(z) = c z + d
g(f(x)) = c (a x + b) + d = ac x + cb + d = (ac) x + (cb + d)
so it would be 4 inputs? (a, b, c, d, since it is a combination of two linear functions with different parameters)
Thanks in advance!

The size will be 3x2, and the hidden layer will simply disappear, with all the weights of the hidden layer's linear functions collapsed into the weights of a single input-to-output layer. In your example there are 3 × 5 = 15 weights (input to hidden) plus 5 × 2 = 10 weights (hidden to output), i.e. 25 different linear functions in total. They are different because the weights in each case are different, so the scalar f(x) and g(z) as you described them are not a complete depiction.
The hidden layer can be collapsed by taking an input neuron and an output neuron and forming the linear combination of all the intermediate functions on the paths that connect those two neurons through the hidden layer. In the end you will be left with 3 × 2 = 6 unique functions that describe your 3x2 mapping.
For your own understanding, try doing this on paper with a simple 2x2x1 MLP with different weights on each node.
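For concreteness, here is a minimal numeric check (plain Matlab/Octave, with random weights standing in for a trained network) that a 3x5x2 linear MLP collapses to a single 3-input, 2-output linear layer:
W1 = randn(5,3); b1 = randn(5,1);   % input -> hidden weights and biases
W2 = randn(2,5); b2 = randn(2,1);   % hidden -> output weights and biases
x  = randn(3,1);                    % an arbitrary input
y_mlp = W2*(W1*x + b1) + b2;        % the two linear layers applied in sequence
Weq = W2*W1;                        % collapsed 2x3 weight matrix
beq = W2*b1 + b2;                   % collapsed bias
y_slp = Weq*x + beq;                % the equivalent single layer
norm(y_mlp - y_slp)                 % ~0 up to floating-point error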

Related

Assuming the order Conv2d->ReLU->BN, should the Conv2d layer have a bias parameter?

Should we include the bias parameter in Conv2d if we are going for Conv2d followed by ReLU followed by batch norm (BN)?
There is no need if we go for Conv2d followed by BN followed by ReLU, since the shift parameter of BN takes care of the bias's job.
Yes, if the order is Conv2d -> ReLU -> BatchNorm, then having a bias parameter in the convolution can help. To show that, let's assume there is a bias in the convolution layer and compare what happens with both of the orders you mention in the question. The idea is to see whether the bias is useful in each case.
Let's consider a single pixel from one of the convolution's output channels, and assume that x_1, ..., x_k are the corresponding inputs (in vectorised form) from the batch (batch size == k). We can write the convolution as
Wx + b   # with W the convolution weights and b the bias
As you said in the question, when the order is Conv2d -> BN -> ReLU, the bias is not useful, because all it does to the distribution of the Wx_i is shift it by b, and this is cancelled out by the immediately following BN layer:
(Wx_i - mu)/sigma becomes (Wx_i + b - (mu + b))/sigma, i.e. there is no change.
However, if you use the other order, i.e
BN(ReLU(Wx+b))
then ReLU will map some of the Wx_i + b to 0. As a consequence, if m of the k inputs stay positive and s ranges over those, the mean will look like
(1/k)(0 + ... + 0 + SUM_s (Wx_s + b)) = some_term + (m/k)*b
and the std will look like
const * ((0 - some_term - (m/k)*b)^2 + ... + (Wx_i + b - some_term - (m/k)*b)^2 + ...)
and, as you can see from expanding those terms that depend on a non-zero Wx_i + b:
(Wx_i + b - some_term - (m/k)*b)^2 = some_other_term + some_factor * b * Wx_i
which means that the result depends on b in a multiplicative manner. As a consequence, its absence can't simply be compensated by the shift parameter of the BN layer (denoted beta in most implementations and papers). That is why having a bias term with this order is not useless.
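A quick numeric illustration of the cancellation argument (plain Matlab/Octave, with random data standing in for real activations; bn here is batch normalisation without the learned scale and shift):
W = randn(1,4); X = randn(4,1000); b = 1.5;   % 1000-sample batch, arbitrary bias
relu = @(z) max(z, 0);
bn   = @(z) (z - mean(z)) ./ std(z);          % normalise over the batch
% Conv -> BN: the bias is cancelled exactly
norm(bn(W*X + b) - bn(W*X))                   % ~0
% Conv -> ReLU -> BN: the bias changes the normalised activations,
% and not by a constant, so no shift beta could compensate
norm(bn(relu(W*X + b)) - bn(relu(W*X)))       % clearly nonzero
std(bn(relu(W*X + b)) - bn(relu(W*X)))        % nonzero -> not a constant shift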

Architecture of the neural network

I was reading a paper and the authors described their network as follows:
"To train the corresponding deep network, a fully connected network with one
hidden layer is used. The network has nine binary input nodes. The hidden layer contains one sigmoid node, and in the output layer there is one inner product
function. Thus, the network has 10 variables."
The network is used to predict a continuous number (y). My problem is, I do not understand the structure of the network after the sigmoid node. What does the output layer do? What is the inner product used for?
Usually, the pre-activation function per neuron is a combination of an inner product (the dot product of two vectors) and one addition to introduce a bias. A single neuron can be described as
z = b + w1*x1 + w2*x2 + ... + wn*xn
  = b + w'*x
h = activation(z)
where b is an additive term (the neuron's bias) and each h is the output of one layer and serves as the input of the following layer. In the case of the output layer, y = h. A layer may also consist of multiple neurons or, as in your example, of a single neuron.
In the described case, it seems like no bias is used. I understand it as follows:
For each input neuron x1 to x9, a single weight is used, nothing fancy here. Since there are nine inputs, this makes 9 weights, resulting in something like:
hidden_out = sigmoid(w1*x1 + w2*x2 + ... + w9*x9)
In order to connect the hidden layer to the output, the same rule applies: The output layer's input is weighted and then summed over all inputs. Since there is only one input, only one weight is to be "summed", such that
output = w10*hidden_out
Keep in mind that the sigmoid function squashes its input onto an output range of 0..1, so multiplying it by a weight rescales it to the required output range.
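Put together, the whole network is just a few lines. Here is a Matlab/Octave sketch with hypothetical random weights (the paper's actual weights would of course be learned):
sigm = @(z) 1./(1 + exp(-z));     % logistic sigmoid
w   = randn(9,1);                 % w1..w9: input -> hidden weights
w10 = randn;                      % hidden -> output weight
x   = double(rand(9,1) > 0.5);    % nine binary inputs
hidden_out = sigm(w' * x);        % the single sigmoid hidden node, in (0,1)
output = w10 * hidden_out         % "inner product" output node (scalar case): continuous prediction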

How does the linear transfer function in perceptrons (artificial neural network) work?

I know how the step transfer function works, but how does the linear transfer function work? What equation do you use?
Please relate your answer to an AND gate with two inputs and a bias.
First of all, in general you want to apply a linear transfer function only in the output layer of an MLP and "never" in the hidden layers, where non-linear transfer functions (logistic function, step, etc.) are typically used.
A linear transfer function (of the form f(x) = x, called pure linear or purelin in the literature) is typically used for function approximation / regression tasks. This is intuitive, because the step and logistic functions give binary results whereas the linear function gives continuous results.
Non-linear transfer functions are used for classification tasks.
The non-linear transfer function (a.k.a. activation function) is the most important factor: it is what gives a simple fully connected multilayer neural network its nonlinear approximation capability.
Nevertheless, a 'linear' activation function is of course one of the many alternatives you might want to adopt. The problem is that a pure linear transfer (f(x) = x) in the hidden layers doesn't buy us anything: training a network whose hidden units use a pure linear activation is 'in vain'.
We can see why as follows:
Assume f(x) = x is our activation function, and we try to train a single-hidden-layer network with 2 input units (x1, x2), 3 hidden units (a1, a2, a3) and 1 output unit (y).
The network then computes the function:
# hidden units
a1 = f(w11*x1+w12*x2+b1) = w11*x1+w12*x2+b1
a2 = f(w21*x1+w22*x2+b2) = w21*x1+w22*x2+b2
a3 = f(w31*x1+w32*x2+b3) = w31*x1+w32*x2+b3
# output unit
y = c1*a1+c2*a2+c3*a3+b4
If we combine all these equations, it turns out that:
y = c1(w11*x1+w12*x2+b1) + c2(w21*x1+w22*x2+b2) + c3(w31*x1+w32*x2+b3) + b4
= (c1*w11+c2*w21+c3*w31)*x1 + (c1*w12+c2*w22+c3*w32)*x2 + (c1*b1+c2*b2+c3*b3+b4)
= A1*x1+A2*x2+C
As shown above, a linear activation degenerates the network into a single linear input-output mapping, regardless of the structure of the network. All that happens during training is that A1, A2 and C are factorized into the various individual weights.
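A quick numeric check of this factorisation (Matlab/Octave, with arbitrary made-up weights; any values would do):
w11=0.3; w12=-1.2; b1=0.5;  w21=2.0; w22=0.7; b2=-0.4;  w31=-0.9; w32=1.1; b3=0.2;
c1=1.5; c2=-0.6; c3=0.8; b4=0.1;
x1=0.4; x2=-1.3;
y_net = c1*(w11*x1+w12*x2+b1) + c2*(w21*x1+w22*x2+b2) + c3*(w31*x1+w32*x2+b3) + b4;
A1 = c1*w11+c2*w21+c3*w31;  A2 = c1*w12+c2*w22+c3*w32;  C = c1*b1+c2*b2+c3*b3+b4;
y_lin = A1*x1 + A2*x2 + C;
abs(y_net - y_lin)   % ~0: the network and the single linear map agree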
Even ReLU, the very popular quasi-linear activation function used in deep neural networks, is rectified rather than purely linear. In other words, no pure linear activation is used in hidden layers unless you merely want to factorize coefficients.

local inverse of a neural network

I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it but let's forget about those first. The goal of the neural network is to learn an N-dimensional variable Y*, given N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to get the inverse of the neural network for the output Y*? That is, how do I get the value X* that would yield Y* when put in the neural network? (or something close to it)
A major part of the problem is that N is very large, typically on the order of 10,000 or 100,000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers, that might already be helpful. Thank you.
If you can choose the neural network such that the number of nodes in each layer is the same, and the weight matrix is non-singular, and the transfer function is invertible (e.g. leaky relu), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplication, addition of a bias, and a transfer function. To invert it, you just apply the inverse of each operation in reverse order: take the output, apply the inverse transfer function, subtract the bias, multiply by the inverse of the last weight matrix; then apply the inverse transfer function again, subtract the next bias, multiply by the inverse of the second-to-last weight matrix, and so on and so forth.
This is a task that can perhaps be solved with autoencoders. You might also be interested in generative models like Restricted Boltzmann Machines (RBMs), which can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, the h of the first layer becomes the v of the second layer, and so on.
zenna is right.
If you are using bijective (invertible) activation functions, you can invert layer by layer: subtract the bias and take the pseudoinverse (if you have the same number of neurons in every layer, this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), and det(W_i) ≠ 0.
An example:
Y = tanh( W2*tanh( W1*X + b1 ) + b2 )
X = W1p*( tanh^-1( W2p*(tanh^-1(Y) - b2) ) -b1 ), where W2p and W1p represent the pseudoinverse matrices of W2 and W1 respectively.
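A concrete sketch of this inversion (Matlab/Octave, square random weight matrices so the pseudoinverse is the exact inverse):
W1 = randn(3); b1 = randn(3,1);
W2 = randn(3); b2 = randn(3,1);
X  = randn(3,1);
Y  = tanh(W2*tanh(W1*X + b1) + b2);                       % forward pass
% invert layer by layer: undo tanh, subtract the bias, apply the pseudoinverse
Xrec = pinv(W1) * (atanh(pinv(W2)*(atanh(Y) - b2)) - b1);
norm(X - Xrec)                                            % ~0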
The following paper is a case study in inverting a function learned by a neural network. It is a case study from industry and looks like a good starting point for understanding how to set up the problem.
An alternative way of approaching the task of finding the x that yields a desired y is to start with a random x (or a supplied seed) and repeatedly adjust it by gradient descent until it yields a y close to the desired one. The algorithm is similar to backpropagation, the difference being that instead of computing derivatives with respect to the weights and biases, you compute derivatives with respect to x (and no mini-batching is needed). This approach has the advantage that it allows a seed input (the starting x, if not randomly selected). I also have a hypothesis that the final x will retain some similarity to the initial x (the seed), which would imply that this algorithm has the ability to transpose an input, depending on the context of the neural network application.
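A minimal sketch of that idea (Matlab/Octave, one tanh layer, gradient derived by hand; the learning rate and iteration count are made-up values and may need tuning):
W = randn(4); b = randn(4,1);
f = @(x) tanh(W*x + b);
y_target = f(randn(4,1));            % a target output known to be reachable
x = randn(4,1);                      % random seed (or user-supplied)
lr = 0.05;
for it = 1:5000
    y = f(x);
    % dL/dx for L = ||y - y_target||^2, using tanh'(z) = 1 - tanh(z)^2
    g = 2 * W' * ((y - y_target) .* (1 - y.^2));
    x = x - lr * g;
end
norm(f(x) - y_target)                % should be small after convergence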

Matlab Multilayer Perceptron Question

I need to classify a dataset using a Matlab MLP and show the classification.
The dataset looks like this: [dataset image]
What I have done so far is:
I have created a neural network containing one hidden layer (two neurons; maybe someone could give me some suggestions on how many neurons are suitable for my example) and one output layer (one neuron).
I have used several different learning methods, such as Delta-bar-Delta and backpropagation (both with and without momentum), as well as Levenberg-Marquardt.
This is the code I used in Matlab (Levenberg-Marquardt example):
net = newff(minmax(Input),[2 1],{'logsig' 'logsig'},'trainlm');  % 2 hidden neurons, 1 output, logsig activations, Levenberg-Marquardt
net.trainParam.epochs = 10000;   % maximum number of training epochs
net.trainParam.goal = 0;         % performance goal
net.trainParam.lr = 0.1;         % learning rate
[net tr outputs] = train(net,Input,Target);
The following image shows the hidden-neuron classification boundaries Matlab generated on the data. I am a little bit confused, because the network should produce a nonlinear result, but in the plot below the two boundary lines appear linear.
[image: hidden-neuron decision boundaries plotted over the data]
The code used to generate the above plot is:
figure(1)
plotpv(Input,Target);          % plot the input vectors with their target classes
hold on
plotpc(net.IW{1},net.b{1});    % plot the hidden-layer decision boundaries
hold off
I also need to plot the output function of the output neuron, but I am stuck at this step. Can anyone give me some suggestions?
Thanks in advance.
Regarding the number of neurons in the hidden layer: for such a small example, two are more than enough. The only way to know the optimum for sure is to test with different numbers. In this FAQ you can find a rule of thumb that may be useful: http://www.faqs.org/faqs/ai-faq/neural-nets/
For the output function, it is often useful to divide it into three steps:
First, given the input vector x, compute the pre-activations of the neurons in the hidden layer: y = f(x) = Wx + b, where W is the weight matrix from the input neurons to the hidden layer and b is the bias vector.
Second, apply the network's activation function g to the result of the previous step: z = g(y).
Finally, the output is the dot product h(z) = z · v + n, where v is the weight vector from the hidden layer to the output neuron and n its bias. In the case of more than one output neuron, repeat this for each one.
I've never used the Matlab MLP functions, so I don't know how to get the weights in this case, but I'm sure the network stores them somewhere. Edit: searching the documentation, I found these properties:
net.IW numLayers-by-numInputs cell array of input weight values
net.LW numLayers-by-numLayers cell array of layer weight values
net.b numLayers-by-1 cell array of bias values
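Putting the three steps together with those properties, something like the following should work (untested sketch; note that the network in the question also applies logsig at the output layer, and that newer toolbox versions may add input/output preprocessing such as mapminmax, which would have to be applied as well):
% manual forward pass for the two-layer logsig network from the question
hidden = logsig(net.IW{1,1} * Input + repmat(net.b{1}, 1, size(Input,2)));  % steps 1-2
output = logsig(net.LW{2,1} * hidden + repmat(net.b{2}, 1, size(Input,2))); % step 3 + output activation
% should match sim(net, Input), up to any preprocessing the toolbox applies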