I want to model a NN that solves the XOR problem. I know that one solution is
x1 xor x2 = (x1 or x2) and not(x1 and x2)
so I have the following partial NN models:
The problem I have is in connecting these partial neural networks; I tried a solution like this:
but I don't get the values of the XOR function. I have seen a solution that uses this same kind of partial networks, but it obtains the values of x1 XNOR x2; it uses:
x1 and x2
not x1 and not x2
and at the end they join both values with the NN that represents the OR.
My question is how to join my partial neural networks into a single neural network with one hidden layer that uses forward propagation. The activation function is the sigmoid.
Any help?
The issue with merging your partial neural networks is that you mix the 'not' operation (at node a2) and the second 'and' operation (between a1 and a2) together. This does not work the way you did it.
You can either do the 'not' operation separately, adding another node after a2, and then do the 'and' operation between the new node and a1.
Alternatively, you can adjust the weight of the second bias (a0) to -10.
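For concreteness, here is a minimal NumPy sketch of the resulting one-hidden-layer sigmoid network. The weight values are illustrative, not the only possible choice: the large magnitudes just saturate the sigmoid towards 0/1 so that each unit behaves like the logic gate named in the comments.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hidden layer: a1 ~ OR(x1, x2), a2 ~ NAND(x1, x2)
W1 = np.array([[ 20.0,  20.0],   # OR unit
               [-20.0, -20.0]])  # NAND unit
b1 = np.array([-10.0, 30.0])

# Output layer: y ~ AND(a1, a2) = XOR(x1, x2)
W2 = np.array([20.0, 20.0])
b2 = -30.0

def forward(x):
    a = sigmoid(W1 @ x + b1)      # hidden activations
    return sigmoid(W2 @ a + b2)   # output

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(float(forward(np.array(x, dtype=float))), 3))
# Prints approximately 0, 1, 1, 0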
MATLAB documents two ways to use LSTM networks for regression:
sequence-to-sequence: The output of the LSTM layer is a sequence, fed into a fully connected layer. lstmLayer(N, 'OutputMode', 'sequence').
sequence-to-one: The output of the LSTM layer is the last element of the sequence, fed into a fully connected layer. lstmLayer(N, 'OutputMode', 'last').
What is the difference between the two in context of time series prediction? When should one be used over the other?
Notes: An example for time-series prediction uses the sequence-to-sequence architecture. If all you need is to predict the next time step, why output a whole sequence? I did not see any examples of sequence-to-one regression.
sequence-to-sequence: The output is the hidden state of the LSTM cell at each time step of the input sequence. We want the state of the LSTM as it consumes each point in the sequence, considering its previous state. For example, when you are differentiating a time series, you want the "gradient" at each point in the sequence:
x4 x3 x2 x1
[LSTM]
(h1)
x4 x3 x2 x1
[LSTM]
(h2) (h1)
x4 x3 x2 x1
[LSTM]
(h3) (h2) (h1)
The LSTM is basically translating the input sequence into an output sequence. The output would be (h4) (h3) (h2) (h1).
sequence-to-one: In this case, it is assumed that all we want is the state of the LSTM after consuming the whole sequence. For example, when you are integrating a time-series, you want the end result after integrating the whole sequence:
x4 x3 x2 x1
[LSTM]
(h4) (h3) (h2) (h1)
So the output would just be (h4).
If you want to predict i events after the given sequence, you can use seq-to-seq (with an output sequence of size i); when you only want to predict the next step of the input sequence, you can use seq-to-one.
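MATLAB aside, the same distinction exists in other frameworks. Here is a minimal Keras sketch (an assumed analogue, not MATLAB code) where return_sequences plays the role of 'OutputMode': True corresponds to 'sequence' and False to 'last'.

import numpy as np
from tensorflow.keras import layers, models

timesteps, features, hidden = 10, 1, 20

# sequence-to-sequence: one hidden state (and one prediction) per time step
seq2seq = models.Sequential([
    layers.LSTM(hidden, return_sequences=True, input_shape=(timesteps, features)),
    layers.TimeDistributed(layers.Dense(1)),
])

# sequence-to-one: only the last hidden state feeds the dense layer
seq2one = models.Sequential([
    layers.LSTM(hidden, return_sequences=False, input_shape=(timesteps, features)),
    layers.Dense(1),
])

x = np.random.randn(4, timesteps, features).astype("float32")
print(seq2seq(x).shape)  # (4, 10, 1): an output for every step
print(seq2one(x).shape)  # (4, 1): one output for the whole sequence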
The theme here is the use of neural networks to learn time histories.
For clarity, let's consider Y = f(X), where X is a 1xN vector and Y is 1xN.
In most of the models I can find or test online, X is simply the time vector with regular timesteps (X = T).
The prediction task on the time history is therefore performed using only the output Y: a window of it, say Y(i:i+Nsample), is the neural network input, and the output is Y(i+Nsample+1). The prediction then proceeds by moving the window one step at a time (i = i+1).
Now my question is the following. In the case where we have a vector X which is a generic function whose values are known, the problem to model with a neural network is:
knowing X(i:i+Nsample+1) and Y(i:i+Nsample), we want to predict Y(i:i+Nsample+1);
then we can do i = i+1 and proceed forward.
What are the best solutions for designing such a system? Is there an example in Keras or another project I could take inspiration from?
I see several possible solutions, but none of them convinces me:
a) Set a multidimensional input (2 x Nsample), [X(i:i+Nsample); Y(i-1:i+Nsample-1)], and predict Y(i:i+Nsample), i.e. treat the known output as a second input (see the sketch after this list).
b) Set up two separate LSTMs for X and Y and then concatenate them in some way.
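For what it's worth, here is a minimal Keras sketch of option (a), under the assumption that you stack X and the lagged Y as two features per time step. The window length Nsample, the layer sizes, and the dummy data are placeholders, not a tuned design.

import numpy as np
from tensorflow.keras import layers, models

Nsample = 50  # placeholder window length

# Option (a): one LSTM over a 2-feature sequence.
# Feature 0 holds X(i : i+Nsample), feature 1 holds Y(i-1 : i+Nsample-1);
# the target is the aligned window Y(i : i+Nsample).
model = models.Sequential([
    layers.LSTM(32, return_sequences=True, input_shape=(Nsample, 2)),
    layers.TimeDistributed(layers.Dense(1)),
])
model.compile(optimizer="adam", loss="mse")

# Dummy data with the assumed shapes: (batch, Nsample, 2) -> (batch, Nsample, 1)
inputs = np.random.randn(100, Nsample, 2).astype("float32")
targets = np.random.randn(100, Nsample, 1).astype("float32")
model.fit(inputs, targets, epochs=1, verbose=0)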
In TensorFlow or Theano, you only tell the library what your neural network looks like and how the feed-forward pass should operate.
For instance, in TensorFlow, you would write:
import tensorflow as tf

# X and y are the training data (NumPy arrays)
graph = tf.Graph()
with graph.as_default():
    _X = tf.constant(X)
    _y = tf.constant(y)
    hidden = 20

    # hidden layer
    w0 = tf.Variable(tf.truncated_normal([X.shape[1], hidden]))
    b0 = tf.Variable(tf.truncated_normal([hidden]))
    h = tf.nn.softmax(tf.matmul(_X, w0) + b0)

    # output layer
    w1 = tf.Variable(tf.truncated_normal([hidden, 1]))
    b1 = tf.Variable(tf.truncated_normal([1]))
    yp = tf.nn.softmax(tf.matmul(h, w1) + b1)

    # L2 loss and one step of gradient descent
    loss = tf.reduce_mean(0.5 * tf.square(yp - _y))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
I am using the L2-norm loss function, C = 0.5*sum((y - yp)^2), and in the backpropagation step presumably the derivative has to be computed, dC = sum(y - yp). See (30) in this book.
My question is: how can TensorFlow (or Theano) know the analytical derivative for backpropagation? Or do they do an approximation? Or do they somehow not use the derivative?
I have done the Udacity deep learning course on TensorFlow, but I am still at odds over how these libraries work.
The differentiation happens in the final line:
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
When you execute the minimize() method, TensorFlow identifies the set of variables on which loss depends, and computes gradients for each of these. The differentiation is implemented in ops/gradients.py, and it uses "reverse accumulation". Essentially it searches backwards from the loss tensor to the variables, applying the chain rule at each operator in the dataflow graph. TensorFlow includes "gradient functions" for most (differentiable) operators, and you can see an example of how these are implemented in ops/math_grad.py. A gradient function can use the original op (including its inputs, outputs, and attributes) and the gradients computed for each of its outputs to produce gradients for each of its inputs.
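If you want to see these gradient tensors yourself, you can build them explicitly with tf.gradients, using the loss and variables from the question's graph above:

# Symbolic gradients of `loss` w.r.t. each variable, built by reverse
# accumulation through the dataflow graph
grads = tf.gradients(loss, [w0, b0, w1, b1])
# minimize() constructs the same gradient tensors internally and pairs
# each one with an update op for its variable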
Page 7 of Ilya Sutskever's PhD thesis has a nice explanation of how this process works in general.
I know how the step transfer function works, but how does the linear transfer function work? What equation do you use?
Please relate your answer to an AND gate with two inputs and a bias.
First of all, in general you want to apply a linear transfer function only in the output layer of an MLP and "never" in the hidden layers, where non-linear transfer functions (logistic, step, etc.) are typically used.
The linear transfer function (f(x) = x, called 'pure linear' or 'purelin' in the literature) is typically used for function approximation / regression tasks. This is intuitive, because the step and logistic functions give binary results whereas the linear function gives continuous results.
Non-linear transfer functions are used for classification tasks.
The non-linear transfer function (a.k.a. activation function) is the most important factor in giving a simple fully connected multilayer neural network its nonlinear approximation capability.
Nevertheless, the 'linear' activation function is, of course, one of the many alternatives you might adopt. The problem is that a pure linear transfer (f(x) = x) in the hidden layers makes no sense: training a network whose hidden units use a pure linear activation is in vain.
We can see why as follows. Assume f(x) = x is our activation function, and we try to train a single-hidden-layer network with 2 input units (x1, x2), 3 hidden units (a1, a2, a3) and 1 output unit (y).
The network then computes the function:
# hidden units
a1 = f(w11*x1+w12*x2+b1) = w11*x1+w12*x2+b1
a2 = f(w21*x1+w22*x2+b2) = w21*x1+w22*x2+b2
a3 = f(w31*x1+w32*x2+b3) = w31*x1+w32*x2+b3
# output unit
y = c1*a1+c2*a2+c3*a3+b4
If we combine all these equations, we get:
y = c1(w11*x1+w12*x2+b1) + c2(w21*x1+w22*x2+b2) + c3(w31*x1+w32*x2+b3) + b4
= (c1*w11+c2*w21+c3*w31)*x1 + (c1*w12+c2*w22+c3*w32)*x2 + (c1*b1+c2*b2+c3*b3+b4)
= A1*x1+A2*x2+C
As shown above, a linear activation degenerates the network into a single linear input-output map, regardless of its structure. All the training process does is factorize A1, A2 and C into various factors.
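A quick NumPy check of this collapse, using the same shapes as above (2 inputs, 3 hidden units, 1 output) with random weights:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)  # hidden layer
c,  b4 = rng.standard_normal(3),      rng.standard_normal()   # output unit
x = rng.standard_normal(2)

# Network with identity (pure linear) hidden activation
y_net = c @ (W1 @ x + b1) + b4

# Equivalent single linear map: A = c @ W1, C = c @ b1 + b4
A = c @ W1
C = c @ b1 + b4
y_lin = A @ x + C

print(np.isclose(y_net, y_lin))  # True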
Even ReLU, the very popular quasi-linear activation function used in deep neural networks, is rectified rather than purely linear. In other words, pure linear activations are not used in hidden layers unless all you want is to factorize coefficients.
I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it, but let's forget about those first. The goal of the neural network is to learn an N-dimensional variable Y*, given an N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to invert the neural network for the output Y*? That is, how do I get the value X* that would yield Y* (or something close to it) when fed into the neural network?
A major part of the problem is that N is very large, typically on the order of 10000 or 100000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers, that might already be helpful. Thank you.
If you can choose the neural network such that the number of nodes in each layer is the same, the weight matrices are non-singular, and the transfer function is invertible (e.g. leaky ReLU), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplication, bias addition and transfer function. To invert, you just apply the inverse of each operation in reverse order: take the output, apply the inverse transfer function, subtract the bias, multiply by the inverse of the last weight matrix, apply the inverse transfer function again, subtract the next bias, multiply by the inverse of the second-to-last weight matrix, and so on.
This is a task that maybe can be solved with autoencoders. You also might be interested in generative models like Restricted Boltzmann Machines (RBMs) that can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, h of the first layer will be v of the second layer and so on.
zenna is right.
If you are using bijective (invertible) activation functions, you can invert layer by layer: undo the activation, subtract the bias and multiply by the pseudoinverse of the weight matrix (if you have the same number of neurons in every layer, this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), and det(Wi) != 0.
An example:
Y = tanh(W2*tanh(W1*X + b1) + b2)
X = W1p*(tanh^-1(W2p*(tanh^-1(Y) - b2)) - b1)
where W2p and W1p are the pseudoinverse matrices of W2 and W1, respectively.
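A NumPy sketch of this inversion; the size n and the random weights are illustrative, and np.arctanh / np.linalg.pinv supply the inverse activation and the pseudoinverse:

import numpy as np

rng = np.random.default_rng(1)
n = 4
W1, b1 = rng.standard_normal((n, n)), rng.standard_normal(n)
W2, b2 = rng.standard_normal((n, n)), rng.standard_normal(n)

def forward(x):
    return np.tanh(W2 @ np.tanh(W1 @ x + b1) + b2)

def inverse(y):
    # Undo each operation in reverse order; pinv is the exact inverse
    # for square, non-singular weight matrices
    h = np.linalg.pinv(W2) @ (np.arctanh(y) - b2)
    return np.linalg.pinv(W1) @ (np.arctanh(h) - b1)

x = 0.1 * rng.standard_normal(n)
print(np.allclose(inverse(forward(x)), x))  # True for non-singular weights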
The following paper is a case study in inverting a function learned by a neural network. It is a case study from industry and looks like a good starting point for understanding how to set up the problem.
An alternative way of approaching the task of finding an x that yields the desired y is to start with a random x (or a supplied seed) and then repeatedly adjust x through gradient descent until it yields a y close to the desired one. The algorithm is similar to backpropagation, except that instead of computing derivatives of the weights and biases, you compute derivatives of x; mini-batching is also not needed. This approach has the advantage that it accepts a seed (a chosen starting x instead of a random one). I also have a hypothesis that the final x will retain some similarity to the initial x (the seed), which would imply that the algorithm can 'transpose' an input, depending on the context of the neural network application.
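A minimal sketch of this idea in the TF1-style graph API used in the TensorFlow example earlier on this page; the one-layer tanh "network" and its constant weights are stand-ins for your trained model, and y_star is the desired output:

import numpy as np
import tensorflow as tf  # assumes the TF1 graph API, as above

N = 10  # illustrative dimension

# Stand-ins for the trained network's frozen weights and the target y*
W = tf.constant(np.random.randn(N, N).astype(np.float32))
y_star = tf.constant(np.tanh(np.random.randn(1, N)).astype(np.float32))

# x is the only trainable variable; the weights stay fixed (constants here)
x = tf.Variable(tf.random_normal([1, N]))  # random start, or feed in a seed
y = tf.tanh(tf.matmul(x, W))               # stand-in forward pass f(x)

loss = tf.reduce_mean(tf.square(y - y_star))
# Gradient descent on x only: derivatives of the loss w.r.t. x, not the weights
step = tf.train.GradientDescentOptimizer(0.1).minimize(loss, var_list=[x])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        sess.run(step)
    x_found = sess.run(x)  # an input that approximately yields y*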