Local inverse of a neural network

I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it but let's forget about those first. The goal of the neural network is to learn an N-dimensional variable Y*, given N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to get the inverse of the neural network for the output Y*? That is, how do I get the value X* that would yield Y* when put in the neural network? (or something close to it)
A major part of the problem is that N is very large, typically in the order of 10000 or 100000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers that might already be helpful. Thank you.

If you can choose the neural network such that the number of nodes in each layer is the same, and the weight matrix is non-singular, and the transfer function is invertible (e.g. leaky relu), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplications, bias additions and transfer functions. To invert it, you just need to apply the inverse of each operation in reverse order. I.e. take the output, apply the inverse transfer function, subtract the bias, multiply by the inverse of the last weight matrix, then apply the inverse transfer function again, subtract the second-to-last bias, multiply by the inverse of the second-to-last weight matrix, and so on.
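A minimal sketch of this layer-by-layer inversion, assuming NumPy, square weight matrices and a leaky ReLU with slope `alpha` on the negative side. The names `forward`, `inverse`, `layers` and the random weights are illustrative, not part of the original answer.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_inv(a, alpha=0.1):
    # Leaky ReLU preserves sign, so the sign of the output tells us which branch was used.
    return np.where(a > 0, a, a / alpha)

def forward(x, layers, alpha=0.1):
    for W, b in layers:
        x = leaky_relu(W @ x + b, alpha)
    return x

def inverse(y, layers, alpha=0.1):
    # Undo the layers in reverse order: inverse activation, subtract bias, solve with W.
    for W, b in reversed(layers):
        z = leaky_relu_inv(y, alpha)
        y = np.linalg.solve(W, z - b)  # same as inv(W) @ (z - b), but numerically safer
    return y

rng = np.random.default_rng(0)
n = 4
# Hypothetical well-conditioned weights: identity plus a small random perturbation.
layers = [(np.eye(n) + 0.1 * rng.standard_normal((n, n)), rng.standard_normal(n))
          for _ in range(3)]

x = rng.standard_normal(n)
y = forward(x, layers)
x_rec = inverse(y, layers)
print(np.allclose(x_rec, x))
```

For N around 10000 or more, solving with each dense weight matrix is the expensive step, and a badly conditioned matrix will amplify rounding errors, so treat this as a sketch rather than something that scales as-is.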

This is a task that maybe can be solved with autoencoders. You also might be interested in generative models like Restricted Boltzmann Machines (RBMs) that can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, h of the first layer will be v of the second layer and so on.

zenna is right.
If you are using bijective (invertible) activation functions you can invert layer by layer: apply the inverse activation, subtract the bias and multiply by the pseudoinverse of the weight matrix (if you have the same number of neurons in every layer this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), det(Wi) != 0
An example:
Y = tanh( W2*tanh( W1*X + b1 ) + b2 )
X = W1p*( tanh^-1( W2p*(tanh^-1(Y) - b2) ) -b1 ), where W2p and W1p represent the pseudoinverse matrices of W2 and W1 respectively.
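The two-layer tanh formula can be checked numerically with a short round trip. This is a sketch assuming NumPy; the weights, biases and input are made-up values chosen so that tanh does not saturate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Hypothetical square, well-conditioned weights and small biases.
W1 = np.eye(n) + 0.1 * rng.standard_normal((n, n))
W2 = np.eye(n) + 0.1 * rng.standard_normal((n, n))
b1 = 0.1 * rng.standard_normal(n)
b2 = 0.1 * rng.standard_normal(n)

X = 0.5 * rng.standard_normal(n)
Y = np.tanh(W2 @ np.tanh(W1 @ X + b1) + b2)

# Pseudoinverses; for square non-singular matrices these equal the ordinary inverses.
W1p = np.linalg.pinv(W1)
W2p = np.linalg.pinv(W2)

# X = W1p * (tanh^-1(W2p * (tanh^-1(Y) - b2)) - b1)
X_rec = W1p @ (np.arctanh(W2p @ (np.arctanh(Y) - b2)) - b1)
print(np.allclose(X_rec, X))
```

Note that `np.arctanh` is only defined on (-1, 1) and becomes very sensitive near the ends of that interval, so outputs of a saturated tanh layer are hard to invert accurately in floating point.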

The following paper is a case study in inverting a function learned by a neural network. It is a case study from industry and looks like a good starting point for understanding how to set up the problem.

An alternative way to get the x that yields a desired y is to start with a random x (or a given input as a seed) and then repeatedly adjust x through gradient descent until it yields a y that is close to the desired y. The algorithm is similar to backpropagation, except that instead of computing derivatives with respect to the weights and biases, you compute derivatives with respect to x; mini-batching is not needed. One advantage of this approach is that it accepts a seed (a starting x, if it is not randomly selected). I also have a hypothesis that the final x will bear some similarity to the initial x (the seed), which would imply that this algorithm has the ability to transpose, depending on the context of the neural network application.
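A minimal sketch of gradient descent on the input, assuming NumPy and a single tanh layer with fixed, made-up weights `W` and `b`. The squared-error loss, learning rate and step count are illustrative choices. The gradient with respect to x is written out by hand; with an autodiff framework you would ask for the gradient of the loss with respect to the input instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Hypothetical trained network: one tanh layer with frozen weights.
W = np.eye(n) + 0.1 * rng.standard_normal((n, n))
b = 0.1 * rng.standard_normal(n)

def net(x):
    return np.tanh(W @ x + b)

# Build a reachable target from a hidden input, so a solution is known to exist.
x_hidden = 0.5 * rng.standard_normal(n)
y_target = net(x_hidden)

x = np.zeros(n)   # the seed; could also be random or a meaningful starting input
lr = 0.5
for step in range(5000):
    y = net(x)
    err = y - y_target
    # Chain rule for loss = 0.5 * ||net(x) - y_target||^2:
    # d loss / d x = W^T (err * tanh'(W x + b)), with tanh' = 1 - y^2.
    grad_x = W.T @ (err * (1.0 - y ** 2))
    x -= lr * grad_x   # update the input; the weights stay fixed

final_loss = 0.5 * np.sum((net(x) - y_target) ** 2)
print(final_loss)
```

When the network is not invertible, different seeds may converge to different inputs that all produce (nearly) the same output, which is where the choice of seed starts to matter.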

Related

Why is a linear activation function useless in a multi-layer neural network? How does the last layer become a linear function of the input to the first layer?

I was studying about activation function in NN but could not understand this part properly -
"Each layer is activated by a linear function. That activation in turn goes into the next level as input and the second layer calculates weighted sum on that input and it in turn, fires based on another linear activation function.
No matter how many layers we have, if all are linear in nature, the final activation function of last layer is nothing but just a linear function of the input of first layer! "
This is one of the most interesting concepts that I came across while learning neural networks. Here is how I understood it:
The input Z to a layer can be written as the product of a weight matrix and the vector of outputs of the nodes in the previous layer. Thus Z_l = W_l * A_(l-1), where Z_l is the input to the l-th layer. Now A_l = F(Z_l), where F is the activation function of layer l. If the activation function is linear, then A_l is simply a constant factor K times Z_l. Hence, up to constant factors, we can write Z_l as:
Z_l = W_l * W_(l-1) * W_(l-2) * ... * W_1 * X, where X is the input. So the output Y is just a product of several matrices times the input vector for a particular data instance. That product of weight matrices can always be collapsed into a single matrix W, so Y = W * X. This is the same kind of linear equation that we come across in linear regression.
Therefore, if all the layers have linear activation, the output is only a linear combination of the input and can be written using a single linear equation.
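The collapse can be verified numerically. A small sketch, assuming NumPy, three bias-free layers with made-up shapes, and a linear activation F(z) = K * z with an arbitrary K:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 2.0  # slope of the linear activation F(z) = K * z (illustrative value)

# Three layers mapping 3 -> 5 -> 4 -> 2; the shapes are arbitrary.
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((4, 5))
W3 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

# Layer-by-layer forward pass with linear activations.
a1 = K * (W1 @ x)
a2 = K * (W2 @ a1)
y_deep = K * (W3 @ a2)

# The same map as one matrix: the K factors and the weight matrices multiply together.
W_single = (K ** 3) * (W3 @ W2 @ W1)
y_single = W_single @ x

print(np.allclose(y_deep, y_single))
```

So the three-layer network has exactly the expressive power of one 2x3 matrix, no matter how wide the hidden layers are.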
It isn't really useless.
If there are multiple linearly activated layers, the results of the calculations in the previous layer would be sent to the next layer as input. Same thing happens in the next layer. It would calculate the input and send it based on another linear activation function to the next layer.
If all layers are linear, it doesn't matter how many layers there actually are. The activation of the final layer will still be a linear function of the input to the first layer.
If you want a good read about Activation Functions you can find one here and here.

Choice of cost function in Michael Nielsen's book: Neural Networks and Deep Learning

Here, w denotes the collection of all weights in the network, b all
the biases, n is the total number of training inputs, a is the vector
of outputs from the network when x is input, and the sum is over all
training inputs, x. Of course, the output a depends on x, w and b,
but to keep the notation simple I haven't explicitly indicated this
dependence.
Taken from Michael Nielsen's Neural Networks and Deep Learning
Does anyone know why he divides the sum by 2? I thought he was going to find the average by dividing by n; instead, he divides by 2n.
This is done so that, when the partial derivatives of C(w, b) are computed, the 1/2 cancels the factor of 2 produced by differentiating the quadratic term.
You are correct that we'd normally divide by n; the extra 1/2 is just a trick for computational convenience.
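To see the cancellation, assume the quadratic cost has the usual form below, where y(x) denotes the desired output for input x (a symbol the quoted passage does not define) and w_k is any single weight:

```latex
\begin{align*}
C(w, b) &= \frac{1}{2n} \sum_x \lVert y(x) - a \rVert^2 \\
\frac{\partial C}{\partial w_k}
  &= \frac{1}{2n} \sum_x 2\,\bigl(a - y(x)\bigr)^{\top} \frac{\partial a}{\partial w_k}
   = \frac{1}{n} \sum_x \bigl(a - y(x)\bigr)^{\top} \frac{\partial a}{\partial w_k}
\end{align*}
```

The gradient is then a plain average over the training inputs with no stray constant. Scaling the cost by 1/2 does not move its minimum; it only rescales the gradient, which the learning rate absorbs.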

How to model best an input/output model with neural network?

The theme here is the use of neural network in learning time histories.
Let's consider, for clarity, Y = f(X), where X is a 1xN vector and Y is a 1xN vector.
In most of the models I can find or test online, X is simply the time vector sampled at regular timesteps (X = T).
The prediction task on the time history is therefore performed using the output Y alone: a window Y(i:i+Nsample) is used as the neural network input, and the output is Y(i+Nsample+1). The prediction then proceeds by moving the window one step at a time (i = i+1).
Now my question is the following. In the case where we have a vector X which is a generic function whose values are known, the problem to model with neural network is:
knowing X(i:i+Nsample+1) and Y(i:i+Nsample) we want to predict Y(i:i+Nsample+1)
then we can do i=i+1 and proceed forward.
What are the best ways to design such a system? Is there an example with Keras, or another project, from which I could draw inspiration?
I see several solutions, but I am not convinced by any of them:
a) Set a multidimensional vector (2xNsample) as input [X(i:i+Nsample) ; Y(i-1:i+Nsample-1)] and predict Y(:i+Nsample) (treat the output as a second input)
b) Set up two separate LSTMs for X and Y and then concatenate them in some way

How can I use a neural network to model a quadratic equation?

A lot of the examples I've seen of neural networks modeling mathematical functions use sin / cos / etc. These are nicely bounded between -1 and 1.
What if I wanted to model something that was quadratic? y = ax^2 + bx + c? How can I modify my input data to fit this?
Presumably I'll have only one input (x value) and a bias input. The output will be the y. My training data will have negative numbers as well as positive numbers.
Thank you.
You can feed any real number into a neural network, and it can theoretically output any real number, so long as the last layer of your neural network is linear. If it is not, you could rescale the targets, e.g. multiply them all by a small number so that they fall within the output range of the last layer's activation.
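A rough sketch of the linear-output idea, assuming NumPy, a single tanh hidden layer of 16 units trained by plain full-batch gradient descent, and made-up coefficients for the quadratic. The layer size, learning rate and step count are illustrative, not tuned.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 1.0, -0.5, 0.3                      # hypothetical quadratic coefficients
x = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)  # negative and positive inputs
y = a * x ** 2 + b * x + c                    # unbounded targets, no rescaling

hidden = 16
W1 = rng.standard_normal((1, hidden))
b1 = np.zeros(hidden)
W2 = 0.1 * rng.standard_normal((hidden, 1))
b2 = np.zeros(1)

def forward(inputs):
    h = np.tanh(inputs @ W1 + b1)   # bounded hidden activations
    return h, h @ W2 + b2           # linear output layer: any real value is reachable

def half_mse(pred, target):
    return 0.5 * np.mean((pred - target) ** 2)

_, pred = forward(x)
initial_loss = half_mse(pred, y)

lr = 0.02
for step in range(5000):
    h, pred = forward(x)
    d_out = (pred - y) / len(x)                 # gradient of half-MSE w.r.t. the output
    grad_W2 = h.T @ d_out
    grad_b2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * (1.0 - h ** 2)  # backpropagate through tanh
    grad_W1 = x.T @ d_hidden
    grad_b1 = d_hidden.sum(axis=0)
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2

_, pred = forward(x)
final_loss = half_mse(pred, y)
print(initial_loss, final_loss)
```

With a bounded output activation such as sigmoid, you would instead divide `y` by its largest absolute value before training and multiply the predictions back afterwards.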

Gradient checking in backpropagation

I'm trying to implement gradient checking for a simple feedforward neural network with 2 unit input layer, 2 unit hidden layer and 1 unit output layer. What I do is the following:
Take each weight w of the network weights between all layers and perform forward propagation using w + EPSILON and then w - EPSILON.
Compute the numerical gradient using the results of the two feedforward propagations.
What I don't understand is how exactly to perform the backpropagation. Normally, I compare the output of the network to the target data (in the case of classification) and then backpropagate the error derivative across the network. However, I think in this case some other value has to be backpropagated, since the results of the numerical gradient computation do not depend on the target data (only on the input), while the error backpropagation does depend on the target data. So, what is the value that should be used in the backpropagation part of the gradient check?
Backpropagation computes the gradients analytically, and those formulas are then used during training. A neural network is essentially a multivariate function whose coefficients, or parameters, need to be found by training.
The gradient with respect to a specific variable is the rate of change of the function value with respect to that variable. Therefore, as you mentioned, we can use the definition of the first derivative to approximate the gradient of any function, including a neural network.
To check whether the analytical gradient of your neural network is correct, it is good practice to compare it against this numerical approximation.
For each weight layer w_l from all layers W = [w_0, w_1, ..., w_l, ..., w_k]
    For i in 0 to number of rows in w_l
        For j in 0 to number of columns in w_l
            w_l_minus = w_l;                         # Copy all the weights
            w_l_minus[i,j] = w_l_minus[i,j] - eps;   # Change only this parameter
            w_l_plus = w_l;                          # Copy all the weights
            w_l_plus[i,j] = w_l_plus[i,j] + eps;     # Change only this parameter
            cost_minus = cost of neural net by replacing w_l by w_l_minus
            cost_plus = cost of neural net by replacing w_l by w_l_plus
            w_l_grad[i,j] = (cost_plus - cost_minus)/(2*eps)
This process changes only one parameter at a time and computes the numerical gradient. In this case I have used the central difference (f(x+h) - f(x-h))/(2h), which seems to work better for me than a one-sided difference.
Note that you mentioned: "since in the results of the numerical gradient computation are not dependent of the target data". This is not true. When you compute cost_minus and cost_plus above, the cost is computed on the basis of:
The weights
The target classes
Therefore, the process of backpropagation should be independent of the gradient checking. Compute the numerical gradients before backpropagation update. Compute the gradients using backpropagation in one epoch (using something similar to above). Then compare each gradient component of the vectors/matrices and check if they are close enough.
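The comparison can be sketched end to end for the 2-2-1 network from the question. This assumes NumPy, sigmoid units, a half mean-squared-error cost and no bias terms (dropped for brevity); the function names and the toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(weights, X, T):
    # Forward pass followed by the cost; note that it depends on the targets T.
    W1, W2 = weights
    Y = sigmoid(sigmoid(X @ W1) @ W2)
    return 0.5 * np.mean(np.sum((Y - T) ** 2, axis=1))

def backprop_gradient(weights, X, T):
    # Analytical gradient of the cost above with respect to W1 and W2.
    W1, W2 = weights
    H = sigmoid(X @ W1)
    Y = sigmoid(H @ W2)
    delta2 = (Y - T) * Y * (1.0 - Y) / X.shape[0]
    delta1 = (delta2 @ W2.T) * H * (1.0 - H)
    return [X.T @ delta1, H.T @ delta2]

def numerical_gradient(weights, X, T, eps=1e-5):
    grads = []
    for l, W in enumerate(weights):
        G = np.zeros_like(W)
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                W_plus, W_minus = W.copy(), W.copy()  # real copies, not aliases
                W_plus[i, j] += eps                   # change only this parameter
                W_minus[i, j] -= eps
                cost_plus = cost(weights[:l] + [W_plus] + weights[l + 1:], X, T)
                cost_minus = cost(weights[:l] + [W_minus] + weights[l + 1:], X, T)
                G[i, j] = (cost_plus - cost_minus) / (2 * eps)
        grads.append(G)
    return grads

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))                  # six toy input rows
T = rng.integers(0, 2, size=(6, 1)).astype(float)  # toy binary targets
weights = [rng.standard_normal((2, 2)), rng.standard_normal((2, 1))]

analytic = backprop_gradient(weights, X, T)
numeric = numerical_gradient(weights, X, T)
max_diff = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(max_diff)
```

If `max_diff` is not tiny, the usual suspects are a missing activation derivative or a cost that does not match the one backpropagation was derived from.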
Whether you want to do some classification or have your network calculate a certain numerical function, you always have some target data. For example, let's say you wanted to train a network to calculate the function f(a, b) = a + b. In that case, this is the input and target data you want to train your network on:
a b Target
1 1 2
3 4 7
21 0 21
5 2 7
...
Just as with "normal" classification problems, the more input-target pairs, the better.
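As a small illustration of input-target pairs, the four rows of the table can be fitted with a single linear unit and no bias. This sketch uses NumPy's closed-form least squares (`np.linalg.lstsq`) in place of iterative training, which is a swapped-in shortcut and not what a network library would do.

```python
import numpy as np

# The (a, b) inputs and targets from the table above.
inputs = np.array([[1, 1], [3, 4], [21, 0], [5, 2]], dtype=float)
targets = np.array([2, 7, 21, 7], dtype=float)

# Least-squares weights of a single linear unit: target ~ w_a * a + w_b * b.
weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
print(weights)

# Prediction for an unseen pair.
print(np.array([10.0, 5.0]) @ weights)
```

The targets are what pin the weights to (approximately) 1 and 1; change the target column and the same inputs would yield a different function.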