How does the linear transfer function in perceptrons (artificial neural network) work? - neural-network

I know how the step transfer function works, but how does the linear transfer function work? What equation do you use?
Please relate the answer to an AND gate with two inputs and a bias.

First of all, in general you want to apply a linear transfer function only in the output layer of an MLP and "never" in the hidden layers, where non-linear transfer functions (logistic, step, etc.) are typically used.
A linear transfer function (of the form f(x) = x, called pure linear or purelin in the literature) is typically used for function approximation / regression tasks. This is intuitive: step and logistic functions give binary results, whereas the linear function gives continuous results.
Non-linear transfer functions are used for classification tasks.
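To relate this to the AND gate from the question, here is a minimal Matlab sketch; the weights w = [1; 1] and bias b = -1.5 are hand-picked for illustration, not trained values. The step transfer function produces the binary AND output, while purelin simply passes the continuous net input through:
% Perceptron for AND with two inputs and a bias (hand-picked weights).
w = [1; 1];                    % input weights
b = -1.5;                      % bias
step = @(n) double(n >= 0);    % step transfer function
purelin = @(n) n;              % pure linear transfer function: f(n) = n

X = [0 0; 0 1; 1 0; 1 1];      % all four input combinations, one per row
n = X*w + b;                   % net input: w1*x1 + w2*x2 + b
[step(n) purelin(n)]           % step: 0,0,0,1 (AND); purelin: -1.5,-0.5,-0.5,0.5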

The non-linear transfer function (a.k.a. activation function) is the most important factor in giving a simple fully connected multilayer neural network its nonlinear approximation capability.
A 'linear' activation function is, of course, one of the many alternatives you might adopt. The problem is that a pure linear transfer (f(x) = x) in the hidden layers makes no sense: training a network whose hidden units are activated by a pure linear function is largely in vain.
We can see why as follows:
Assume f(x) = x is our activation function, and we train a single-hidden-layer network with 2 input units (x1, x2), 3 hidden units (a1, a2, a3) and 1 output unit (y).
Hence, the network tries to approximate the function:
# hidden units
a1 = f(w11*x1+w12*x2+b1) = w11*x1+w12*x2+b1
a2 = f(w21*x1+w22*x2+b2) = w21*x1+w22*x2+b2
a3 = f(w31*x1+w32*x2+b3) = w31*x1+w32*x2+b3
# output unit
y = c1*a1+c2*a2+c3*a3+b4
If we combine all of these equations, we get:
y = c1(w11*x1+w12*x2+b1) + c2(w21*x1+w22*x2+b2) + c3(w31*x1+w32*x2+b3) + b4
= (c1*w11+c2*w21+c3*w31)*x1 + (c1*w12+c2*w22+c3*w32)*x2 + (c1*b1+c2*b2+c3*b3+b4)
= A1*x1+A2*x2+C
As shown above, a linear activation degenerates the network into a single linear input-output map, regardless of the structure of the network. All the training process does is factorize A1, A2 and C into various factors.
Even ReLU, the very popular quasi-linear activation function used in deep neural networks, is rectified, i.e. non-linear at zero. In other words, pure linear activation is not used in hidden layers unless all you want to do is factorize coefficients.
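To make the collapse above concrete, here is a small numerical Matlab sketch; the random weights and input are made up purely for illustration:
rng(0);
W = randn(3,2); b = randn(3,1);   % hidden weights [w11 w12; w21 w22; w31 w32] and biases
c = randn(1,3); b4 = randn;       % output weights c1..c3 and output bias
x = randn(2,1);                   % an arbitrary input [x1; x2]

y_net = c*(W*x + b) + b4;         % forward pass with f(x) = x hidden units
A = c*W;                          % collapsed input weights [A1 A2]
C = c*b + b4;                     % collapsed bias C
y_lin = A*x + C;                  % the equivalent single linear map

abs(y_net - y_lin)                % ~0, up to floating-point error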

Related

Why is a linear function useless in a multi-layer neural network? How does the last layer become a linear function of the input of the first layer?

I was studying activation functions in NNs but could not understand this part properly -
"Each layer is activated by a linear function. That activation in turn goes into the next level as input and the second layer calculates weighted sum on that input and it in turn, fires based on another linear activation function.
No matter how many layers we have, if all are linear in nature, the final activation function of last layer is nothing but just a linear function of the input of first layer! "
This is one of the most interesting concepts that I came across while learning neural networks. Here is how I understood it:
The input Z to one layer can be written as the product of a weight matrix and the vector of outputs of the nodes in the previous layer. Thus Z_l = W_l * A_{l-1}, where Z_l is the input to the l-th layer. Now A_l = F(Z_l), where F is the activation function of layer l. If the activation function is linear, then A_l is simply a constant factor K times Z_l. Hence, absorbing those constant factors into the weights, we can write Z_l as:
Z_l = W_l * W_{l-1} * W_{l-2} * ... * W_1 * X, where X is the input. So the output Y is ultimately a product of a few matrices times the input vector for a particular data instance. We can always collapse that product of weight matrices into a single resultant matrix W, so the output Y is W * X. This is nothing but the linear equation that we come across in linear regression.
Therefore, if all the layers have linear activations, the output is only a linear combination of the input and can be written as a simple linear equation.
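A quick numerical sketch of that matrix collapse (Matlab; three linear layers with made-up sizes and random weights, biases omitted for brevity):
rng(1);
W1 = randn(4,3); W2 = randn(4,4); W3 = randn(2,4);   % three weight matrices
X = randn(3,1);                                      % one input vector

Y_layered = W3*(W2*(W1*X));   % three "linearly activated" layers
W = W3*W2*W1;                 % the resultant single weight matrix
Y_single = W*X;

norm(Y_layered - Y_single)    % ~0: the extra layers buy nothing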
It isn't really useless.
If there are multiple linearly activated layers, the results of the calculations in one layer are sent to the next layer as input. The same thing happens in that next layer: it computes a weighted sum of its input and passes it, through another linear activation function, to the layer after it.
If all layers are linear, it doesn't matter how many layers there actually are. The activation of the final layer will still be a linear function of the input to the first layer.

How to use Deep Neural Networks for regression?

I wrote this script (Matlab) for classification using Softmax. Now I want to use the same script for regression by replacing the Softmax output layer with a Sigmoid or ReLU activation function, but I haven't been able to do that.
X = houseInputs;
T = houseTargets;
% Train an autoencoder with a hidden layer of size 10 and a linear transfer
% function for the decoder. Set the L2 weight regularizer to 0.001, the
% sparsity regularizer to 4 and the sparsity proportion to 0.05.
hiddenSize = 10;
autoenc1 = trainAutoencoder(X,hiddenSize,...
    'L2WeightRegularization',0.001,...
    'SparsityRegularization',4,...
    'SparsityProportion',0.05,...
    'DecoderTransferFunction','purelin');
%%
% Extract the features in the hidden layer.
features1 = encode(autoenc1,X);
% Train a second autoencoder using the features from the first autoencoder.
% Do not scale the data.
hiddenSize = 10;
autoenc2 = trainAutoencoder(features1,hiddenSize,...
    'L2WeightRegularization',0.001,...
    'SparsityRegularization',4,...
    'SparsityProportion',0.05,...
    'DecoderTransferFunction','purelin',...
    'ScaleData',false);
features2 = encode(autoenc2,features1);
%%
% Train a softmax layer on the features of the second autoencoder.
softnet = trainSoftmaxLayer(features2,T,'LossFunction','crossentropy');
% Stack the encoders and the softmax layer to form a deep network.
deepnet = stack(autoenc1,autoenc2,softnet);
% Train the deep network on the house data.
deepnet = train(deepnet,X,T);
% Compute the predictions of the deep network.
y = deepnet(X);
Regression is a different problem from classification. You have to change your loss function to something that fits a regression, e.g. mean squared error, and of course change the number of output neurons to one (you will only output one value in your last layer).
It is possible to use a neural network to perform a regression task, but it might be overkill for many tasks. True regression means mapping one set of continuous inputs to another set of continuous outputs:
f: x -> ŷ
Changing the architecture of a neural network to make it perform a regression task is usually fairly simple. Instead of mapping the continuous input data to a specific class, as the Softmax function does in your case, you have to make the network use only a single output node.
This node just sums the outputs of the previous layer (the last hidden layer) and multiplies the summed activations by 1, i.e. it applies a linear transfer function. During training this output ŷ is compared to the correct ground-truth value y that comes with your dataset. As a loss function you may use the root-mean-squared error (RMSE).
Training such a network results in a model that maps an arbitrary number of independent variables x to a dependent variable ŷ, which is basically a regression task.
To come back to your Matlab implementation, it would be incorrect to change the current Softmax output layer to an activation function such as a Sigmoid or ReLU. Instead you would have to implement a custom RMSE output layer for your network, fed with the sum of activations coming from the last hidden layer.
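If you would rather not implement a custom output layer, a simpler route is to train a function-fitting network, which already uses a linear (purelin) output layer and a mean-squared-error loss. This is only a sketch and assumes the same houseInputs/houseTargets variables and that the toolbox's fitnet function is available:
X = houseInputs;
T = houseTargets;         % continuous targets, one row per output variable
net = fitnet(10);         % one hidden layer with 10 neurons, linear output layer
net.performFcn = 'mse';   % mean squared error loss (fitnet's default)
net = train(net,X,T);
y = net(X);               % continuous predictions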

Confused by the notation (a and z) and usage of the backpropagation equations used in neural network gradient descent training

I'm writing a neural network, but I have trouble training it using backpropagation, so I suspect there is a bug or mathematical mistake somewhere in my code. I've spent hours reading different literature on what the backpropagation equations should look like, but I'm a bit confused by it, since different books say different things, or at least use wildly confusing and contradictory notation. So I was hoping that someone who knows with 100% certainty how it works could clear it up for me.
There are two steps in the backpropagation that confuse me. Let's assume for simplicity that I only have a three-layer feedforward net, so we have connections between input-hidden and hidden-output. I call the weighted sum that reaches a node z, and the same value after it has passed the node's activation function a.
Apparently I'm not allowed to embed an image with the equations that my question concerns, so I have to link it like this: https://i.stack.imgur.com/CvyyK.gif
Now, during backpropagation, when calculating the error in the nodes of the output layer, is it:
[Eq. 1] Delta_output = (output - target) * f'(a_output)
Or is it
[Eq. 2] Delta_output = (output - target) * f'(z_output)
And during the error calculation of the nodes in the hidden layer, same thing, is it:
[Eq. 3] Delta_hidden = f'(a_h) * sum(w_h * Delta_output)
Or is it
[Eq. 4] Delta_hidden = f'(z_h) * sum(w_h * Delta_output)
(where f' denotes the derivative of the activation function)
So the question is basically; when running a node's value through the derivative version of the activation function during backpropagation, should the value be expressed as it was before or after it passed the activation function (z or a)?
Is the first or the second equation in the image correct and similarly is the third or fourth equation in the image correct?
Thanks.
You have to compute the derivatives with the values from before they have passed through the activation function. So the answer is "z".
Some activation functions simplify the computation of the derivative, like tanh:
a = tanh(z)
d/dz tanh(z) = 1.0 - tanh(z) * tanh(z) = 1.0 - a * a
This simplification can lead to the confusion you were talking about, but here is another activation function without the possible confusion:
a = sin(z)
d/dz sin(z) = cos(z)
You can find a list of activation functions and their derivatives on Wikipedia: activation function.
Some networks don't have an activation function on the output nodes, so the derivative is 1.0 and delta_output = output - target, or delta_output = target - output, depending on whether you add or subtract the weight change.
If you are using an activation function on the output nodes, then you'll have to give targets that are in the range of the activation function, like [-1, 1] for tanh(z).
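Here is a tiny Matlab sketch of the point above; the values are hypothetical. The derivative is evaluated at z, even when the 1 - a*a shortcut makes it look as if it were evaluated at a:
z_out = 0.7;                   % net input reaching the output node
a_out = tanh(z_out);           % the same value after the activation function
target = 1.0;

f_prime = 1 - tanh(z_out)^2;   % derivative of tanh, evaluated at z...
shortcut = 1 - a_out^2;        % ...numerically identical to the 1 - a*a shortcut
abs(f_prime - shortcut)        % 0: same number, two notations
delta_output = (a_out - target) * f_prime;   % Eq. 2 from the question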

Gradient checking in backpropagation

I'm trying to implement gradient checking for a simple feedforward neural network with 2 unit input layer, 2 unit hidden layer and 1 unit output layer. What I do is the following:
Take each weight w of the network weights between all layers and perform forward propagation using w + EPSILON and then w - EPSILON.
Compute the numerical gradient using the results of the two feedforward propagations.
What I don't understand is how exactly to perform the backpropagation. Normally, I compare the output of the network to the target data (in the case of classification) and then backpropagate the error derivative across the network. However, I think in this case some other value has to be backpropagated, since the results of the numerical gradient computation do not depend on the target data (but only on the input), while the error backpropagation depends on the target data. So, what is the value that should be used in the backpropagation part of the gradient check?
Backpropagation is performed after computing the gradients analytically; those formulas are then used during training. A neural network is essentially a multivariate function whose coefficients, or parameters, need to be found or trained.
The gradient of a function with respect to a specific variable is the rate of change of the function value. Therefore, as you mentioned, from the definition of the first derivative we can approximate the gradient of any function, including a neural network.
To check whether the analytical gradient of your neural network is correct, it is good to verify it with the numerical method:
For each weight matrix w_l from all layers W = [w_0, w_1, ..., w_l, ..., w_k]:
    For i in 0 to number of rows in w_l:
        For j in 0 to number of columns in w_l:
            w_l_minus = w_l                         # copy all the weights
            w_l_minus[i,j] = w_l_minus[i,j] - eps   # perturb only this parameter
            w_l_plus = w_l                          # copy all the weights
            w_l_plus[i,j] = w_l_plus[i,j] + eps     # perturb only this parameter
            cost_minus = cost of the net with w_l replaced by w_l_minus
            cost_plus  = cost of the net with w_l replaced by w_l_plus
            w_l_grad[i,j] = (cost_plus - cost_minus) / (2*eps)
This process changes only one parameter at a time and computes the numerical gradient. Here I have used the centered difference (f(x+h) - f(x-h)) / (2h), which seems to work better for me.
Note that you mentioned "the results of the numerical gradient computation are not dependent of the target data"; this is not true. When you compute cost_minus and cost_plus above, the cost is computed on the basis of:
The weights
The target classes
Therefore, the process of backpropagation should be independent of the gradient checking. Compute the numerical gradients before the backpropagation update, then compute the gradients using backpropagation in one epoch (using something similar to the above). Finally, compare each component of the gradient vectors/matrices and check that they are close enough.
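For concreteness, here is a self-contained Matlab sketch that compares the numerical gradient with the backpropagated one for a single weight. The 2-2-1 net (tanh hidden units, linear output, squared-error cost, no biases) and all values are made up for illustration:
rng(0);
W1 = randn(2,2); W2 = randn(1,2);       % hidden and output weights (biases omitted)
x = [0.5; -0.3]; t = 0.8;               % one input/target pair
cost = @(W) 0.5*(W2*tanh(W*x) - t)^2;   % squared-error cost as a function of W1

eps_ = 1e-5; i = 1; j = 2;              % check the gradient of W1(1,2)
Wp = W1; Wp(i,j) = Wp(i,j) + eps_;      % perturb only this parameter
Wm = W1; Wm(i,j) = Wm(i,j) - eps_;
num_grad = (cost(Wp) - cost(Wm)) / (2*eps_);   % centered difference

a = tanh(W1*x);                         % analytic gradient via backpropagation
delta_out = W2*a - t;                   % linear output node: derivative is 1
delta_hid = (W2'*delta_out) .* (1 - a.^2);
ana_grad = delta_hid(i)*x(j);

abs(num_grad - ana_grad)                % should be tiny (~1e-10 or less)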
Whether you want to do some classification or have your network calculate a certain numerical function, you always have some target data. For example, let's say you wanted to train a network to calculate the function f(a, b) = a + b. In that case, this is the input and target data you want to train your network on:
a     b     Target
1     1     2
3     4     7
21    0     21
5     2     7
...
Just as with "normal" classification problems, the more input-target pairs, the better.

local inverse of a neural network

I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it, but let's forget about those for now. The goal of the neural network is to learn an N-dimensional variable Y*, given an N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to get the inverse of the neural network for the output Y*? That is, how do I get the value X* that would yield Y* (or something close to it) when put into the neural network?
A major part of the problem is that N is very large, typically on the order of 10,000 or 100,000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers, that might already be helpful. Thank you.
If you can choose the neural network such that the number of nodes in each layer is the same, the weight matrices are non-singular, and the transfer function is invertible (e.g. leaky ReLU), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplication, addition of a bias and the transfer function. To invert it, apply the inverse of each operation in the reverse order: take the output, apply the inverse transfer function, subtract the bias, and multiply by the inverse of the last weight matrix; then apply the inverse transfer function again, subtract the bias, multiply by the inverse of the second-to-last weight matrix, and so on.
This is a task that can perhaps be solved with autoencoders. You might also be interested in generative models like Restricted Boltzmann Machines (RBMs), which can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, the h of the first layer will be the v of the second layer, and so on.
zenna is right.
If you are using bijective (invertible) activation functions, you can invert layer by layer: subtract the bias and take the pseudoinverse (if you have the same number of neurons in every layer, this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), and det(W_i) != 0.
An example:
Y = tanh(W2 * tanh(W1*X + b1) + b2)
X = W1p * (tanh^-1(W2p * (tanh^-1(Y) - b2)) - b1), where W1p and W2p denote the pseudoinverse matrices of W1 and W2 respectively.
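A small numerical Matlab sketch of this layer-by-layer inversion; the square random weight matrices are assumptions (random square matrices are full rank with probability 1, so pinv is the exact inverse here), and atanh plays the role of tanh^-1:
rng(0);
n = 3;
W1 = randn(n); b1 = randn(n,1);
W2 = randn(n); b2 = randn(n,1);
X = 0.1*randn(n,1);

Y = tanh(W2*tanh(W1*X + b1) + b2);   % forward pass

% Invert layer by layer: inverse transfer, subtract bias, pseudoinverse.
X_rec = pinv(W1)*(atanh(pinv(W2)*(atanh(Y) - b2)) - b1);

norm(X - X_rec)                      % ~0, up to floating-point error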
The following paper is a case study in inverting a function learned by a neural network. It is a case study from industry and looks like a good starting point for understanding how to set up the problem.
An alternative way of approaching the task of getting a desired x that yields a desired y is to start with a random x (or an input given as a seed), then repeatedly adjust x through gradient descent until it yields a y that is close to the desired y. The algorithm is similar to backpropagation, except that instead of finding derivatives with respect to the weights and biases, you find derivatives with respect to x; also, mini-batching is not needed. This approach has the advantage that it allows an input seed (a starting x, if not randomly selected). I also have a hypothesis that the final x will have some similarity to the initial x (the seed), which would imply that this algorithm has the ability to transpose, depending on the context of the neural network application.
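Here is a Matlab sketch of that gradient-descent-on-the-input idea; the single-layer tanh network is fixed and made up, and the target is chosen to be reachable so the descent can succeed:
rng(0);
W = randn(3); b = randn(3,1);
net = @(x) tanh(W*x + b);       % the (already trained, here random) network

y_star = net(randn(3,1));       % desired output, guaranteed reachable
x = randn(3,1);                 % random seed input
lr = 0.1;                       % learning rate
for k = 1:5000
    a = net(x);
    % backpropagate the output error to x instead of to W and b:
    grad_x = W' * ((a - y_star) .* (1 - a.^2));
    x = x - lr*grad_x;
end
norm(net(x) - y_star)           % should be close to 0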