Confusion about many-to-one, many-to-many LSTM architectures in Keras

My task is the following: I have a (black box) method which computes a sequence starting from an initial element. At each step, my method reads an input from an external source of memory and outputs an action which potentially changes this memory. You can think of this method as a function f: (external state, reading) -> action. I want to train an ANN to learn f(), which means I want to be able to take my trained model, feed it an input, get the predicted action, use it to change the external state and repeat this process indefinitely, one step at a time.
Because of the nature of f() I know that the ANN must be recurrent and stateful, but I'm not so sure about the rest. It makes sense to train it to map sequences of readings into sequences of actions, but it only makes sense if the model is able to fuse each reading with the action outputted in the last step, and I'm not sure how to enforce that.
But most importantly: After training my model with a given sequence length (readings^N -> actions^N), how can I make it output predictions one step at a time (sequence length = 1)? Is this possible?
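One common pattern is sketched below in Keras (all sizes are hypothetical: a feature dimension of 4, an action dimension of 2, training windows of length 20). Train a stateful LSTM on length-N windows, then copy the trained weights into an otherwise identical model whose batch_input_shape uses sequence length 1, and step that model one reading at a time:
import numpy as np
from tensorflow import keras

def build(seq_len, batch_size):
    # stateful LSTM: the hidden state survives across successive calls
    model = keras.Sequential([
        keras.layers.LSTM(32, stateful=True, return_sequences=True,
                          batch_input_shape=(batch_size, seq_len, 4)),
        keras.layers.Dense(2),          # predicted action per step
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

train_model = build(seq_len=20, batch_size=8)
# ... fit train_model on (readings^N -> actions^N) pairs ...

infer_model = build(seq_len=1, batch_size=1)
infer_model.set_weights(train_model.get_weights())  # weights are shape-compatible

reading = np.zeros((1, 1, 4), dtype="float32")      # one step of input
action = infer_model.predict(reading)               # repeat indefinitely, feeding
                                                    # each new reading back in
infer_model.reset_states()                          # reset between episodes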

Related

How can a connection between one gate input and multiple outputs of other gates cause circuit memory?

I'm reading Digital Design and Computer Architecture by David Harris and Sarah Harris. The authors give the following definition of combinational logic:
A combinational circuit’s outputs depend only on the current values of
the inputs; in other words, it combines the current input values to
compute the output... A combinational circuit is memoryless, but a
sequential circuit has memory. The functional specification of a
combinational circuit expresses the output values in terms of the
current input values.
However, they claim this circuit is not combinational:
because "node n6 connects to the output terminals of both I3 and I4". Indeed, it's one of the designated signs when a scheme can not be combinational but, according to the authors:
Certain circuits that disobey these rules are still combinational, so
long as the outputs depend only on the current values of the inputs.
As far as I can tell, the aforementioned circuit is such a case: its output is 1 if and only if both of its inputs are 1; otherwise the output is 0. So the output is defined as a function of the inputs (the AND function).
In fact, there was already a question about this circuit on the Computer Science Stack Exchange, and it has an accepted answer. Here's an excerpt from it:
Circuit (d) cannot be written in this form [of formula], since the
outputs of I3 and I4 are wired together. What is the relation between
the input to the rightmost gate and the outputs of I3 and I4? Not
something that can be described combinatorially.
Unfortunately, I'm still confused, because:
The circuit, regarded as a black box, still fits the definition of combinational logic: its output values depend only on the current values of the inputs;
The relation between the input of the rightmost gate and the outputs of I3 and I4 can be described by the NAND function of the circuit inputs, and this function is quite "memoryless". It's not obvious to me why we can't drive one gate input from multiple outputs of other gates.
I need some elaboration. Maybe things would fall into place if someone provided an example circuit in which two gate outputs are connected to one input and this actually causes "memory" (in contrast to the sample considered here).
Circuit (d) is not combinational because it is not a logic gate circuit at all.
I think it's a very silly example to explain combinational vs sequential circuits.
In a logic circuit, an output wire cannot go to another output wire. You assumed that the outputs, when connected together, will act as a logical OR or AND of themselves.
This is not true (otherwise why would we use AND/OR gates in the first place?).
What will happen depends on the specific implementation of the gates (i.e. specific IC or manufacturing process you used) and this is not something that a logic circuit is meant to model.
A logic circuit must behave the same, no matter what brand you are using.
In circuit (d), the output of I3 will feed both the input of the rightmost NOT and the output of I4 (and vice versa).
Most ICs will break if current flows back into their outputs; others won't, but they will interfere with the ability of the rightmost NOT to sense its input.
Logic circuits are still circuits, so you should, in theory, perform a full circuit analysis, which includes solving differential equations, to solve for their output.
Digital electronics is a branch that abstracts from these "low-level" details but at the cost of making some assumptions, one of which is: outputs are never merged without a gate.
The whole point of a combinational circuit is that you can write out = f(in0, in1, ..., ink) but it's not always possible.
Take an edge detector, for example: it is just f(A) = (NOT A) AND A, which should, by the law of noncontradiction, always output 0.
But it will not, because the NOT A path takes a slightly longer time to reach the AND input.
How can you describe this dynamic behaviour with an f(A) function?
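To make the glitch concrete, here is a toy discrete-time simulation (my own sketch, not from the book) in which the inverter output lags its input by one time step; the AND gate briefly sees two 1s and emits a spurious pulse:
# A rising edge on A at t = 2 produces a one-step output pulse
A     = [0, 0, 1, 1, 1, 1]            # input rises at t = 2
not_A = [1] + [1 - a for a in A]      # NOT with one step of gate delay
pulse = [a & n for a, n in zip(A, not_A)]
print(pulse)                          # [0, 0, 1, 0, 0, 0] -- the glitch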
Don't think too much about it; when you get to sequential circuits you'll spot the difference immediately (if you need a preview, look up "latch circuit").
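As a preview of that, here is a toy simulation (again my own sketch) of an SR latch built from two cross-coupled NOR gates, the classic case where feedback between gate outputs creates memory: with S = R = 0 the output depends on what happened before, not just on the current inputs.
def nor(a, b):
    return int(not (a or b))

q, qn = 0, 1                           # internal state of the latch
for s, r in [(1, 0), (0, 0), (0, 1), (0, 0)]:
    for _ in range(3):                 # let the feedback loop settle
        q, qn = nor(r, qn), nor(s, q)
    print(f"S={s} R={r} -> Q={q}")     # prints Q=1, 1, 0, 0: Q holds when S=R=0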

Difference in Workspace and Simulink step response. Why this difference?

My primary objective was to design a controller for the transfer function 5.551*s^2, and using root locus I made the controller shown below. Analyzing the step response in the workspace using the step() function I got a satisfactory answer, but when I transfer it to Simulink the response behaves differently. At steady state, for example, I want the smallest possible error, as obtained in the workspace, but in Simulink there is a big error. Also, at 8 seconds (the Simulink simulation time) there is a "jump", as shown on the display, and when I change the simulation time the "jump" changes too. I do not understand these differences between one environment and the other.
Step response in Workspace
Step response in Simulink with 8s of simulation
Step response in Simulink with 12s of simulation
Simulink controller
Simulink transfer function
I wanted a controller with an error of less than 5% and an overshoot smaller than 25%. I first made a controller with two integrators to nullify the effect of the zeros at the origin; after that I added two more integrators at the origin to try to decrease the error. For the zero at -0.652 I used the angle condition, and for the gain of 0.240251 I used the magnitude condition.
I wasn't expecting the most optimized behavior possible, just something that meets the minimum imposed conditions, so I didn't worry, for example, about the four integrators at the origin.
I tried using the sisotool() command, thinking that I had done something wrong, but the result also changed a lot when simulating in Simulink, so I discarded this option and kept the controller I made using root locus.
Your MATLAB code and your Simulink model are not the same, and hence the different results.
MATLAB allows you to define the non-causal plant model P_ball, then form the causal closed loop CL, which can have its step response generated.
Simulink does not allow you to model non-causal blocks (even if the overall model is causal) and hence will not allow you to implement s^2, which I assume is why you have used two differentiation blocks. But a numerical differentiation is not the same as a Laplace s operator.
You would have to make the plant causal by incorporating two poles that are fast enough not to adversely affect the overall simulation. So your plant model needs to be something like 5.551*s^2/((s/1000 + 1)(s/1000 + 1)), which can be implemented using a Transfer Function block with a numerator of 5.551*1000*1000*[1 0 0] and a denominator of [1 2*1000 1000*1000].
Alternatively you could just implement PID * P_ball (where you manually do the 2 zero/pole cancellations) which is causal.
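Here is a rough sketch of the first suggestion, in Python with the python-control package (since a Simulink model can't be embedded here); the controller below is just my reading of the one described in the question (gain 0.240251, zero at -0.652, four integrators at the origin), so treat it as a placeholder:
import control

# causal approximation of the plant: 5.551*s^2 * 1e6 / (s + 1000)^2
P = control.tf([5.551 * 1000 * 1000, 0, 0], [1, 2 * 1000, 1000 * 1000])

# hypothetical controller reconstructed from the question's description
C = 0.240251 * control.tf([1, 0.652], [1, 0, 0, 0, 0])

CL = control.feedback(C * P, 1)        # unity-feedback closed loop
t, y = control.step_response(CL)       # should now match the workspace result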

What does the parameter retain_graph mean in the Variable's backward() method?

I'm going through the neural transfer pytorch tutorial and am confused about the use of retain_variables (deprecated, now referred to as retain_graph). The code example shows:
class ContentLoss(nn.Module):

    def __init__(self, target, weight):
        super(ContentLoss, self).__init__()
        self.target = target.detach() * weight
        self.weight = weight
        self.criterion = nn.MSELoss()

    def forward(self, input):
        self.loss = self.criterion(input * self.weight, self.target)
        self.output = input
        return self.output

    def backward(self, retain_variables=True):
        # Why is retain_variables True??
        self.loss.backward(retain_variables=retain_variables)
        return self.loss
From the documentation
retain_graph (bool, optional) – If False, the graph used to compute
the grad will be freed. Note that in nearly all cases setting this
option to True is not needed and often can be worked around in a much
more efficient way. Defaults to the value of create_graph.
So by setting retain_graph= True, we're not freeing the memory allocated for the graph on the backward pass. What is the advantage of keeping this memory around, why do we need it?
@cleros is pretty much on point about the use of retain_graph=True. In essence, it will retain any information necessary to compute a certain variable, so that we can do a backward pass on it.
An illustrative example
Suppose that we have the computation graph built by the code below. The variables d and e are the outputs, and a is the input. For example,
import torch
from torch.autograd import Variable
a = Variable(torch.rand(1, 4), requires_grad=True)
b = a**2
c = b*2
d = c.mean()
e = c.sum()
when we do d.backward(), that is fine. After this computation, the parts of the graph that compute d will be freed by default to save memory. So if we then do e.backward(), an error will pop up. In order to do e.backward(), we have to set the parameter retain_graph to True in d.backward(), i.e.,
d.backward(retain_graph=True)
As long as you use retain_graph=True in your backward method, you can do backward any time you want:
d.backward(retain_graph=True) # fine
e.backward(retain_graph=True) # fine
d.backward() # also fine
e.backward() # error will occur!
More useful discussion can be found here.
A real use case
Right now, a real use case is multi-task learning, where you have multiple losses that may sit at different layers. Suppose that you have two losses, loss1 and loss2, and that they reside in different layers. In order to backpropagate the gradients of loss1 and loss2 w.r.t. the learnable weights of your network independently, you have to use retain_graph=True in the backward() call for the first back-propagated loss.
# suppose you first back-propagate loss1, then loss2 (you can also do the reverse)
loss1.backward(retain_graph=True)
loss2.backward() # now the graph is freed, and next process of batch gradient descent is ready
optimizer.step() # update the network parameters
This is a very useful feature when you have more than one output of a network. Here's a completely made up example: imagine you want to build some random convolutional network that you can ask two questions of: Does the input image contain a cat, and does the image contain a car?
One way of doing this is to have a network that shares the convolutional layers, but that has two parallel classification layers following (forgive my terrible ASCII graph, but this is supposed to be three conv layers, followed by three fully connected layers, one for cats and one for cars):

                    -- FC - FC - FC - cat?
Conv - Conv - Conv -|
                    -- FC - FC - FC - car?
Given a picture that we want to run both branches on, when training the network, we can do so in several ways. First (which would probably be the best thing here, illustrating how bad the example is), we simply compute a loss on both assessments and sum the loss, and then backpropagate.
However, there's another scenario - in which we want to do this sequentially. First we want to backprop through one branch, and then through the other (I have had this use-case before, so it is not completely made up). In that case, running .backward() on one graph will destroy any gradient information in the convolutional layers, too, and the second branch's convolutional computations (since these are the only ones shared with the other branch) will not contain a graph anymore! That means, that when we try to backprop through the second branch, Pytorch will throw an error since it cannot find a graph connecting the input to the output!
In these cases, we can solve the problem by simply retaining the graph on the first backward pass. The graph will then not be consumed; it is only consumed by the first backward pass that does not retain it.
EDIT: If you retain the graph on all backward passes, the implicit graph definitions attached to the output variables will never be freed. There might be a use case here as well, but I cannot think of one. So in general, you should make sure that the last backward pass frees the memory by not retaining the graph information.
As for what happens on multiple backward passes: as you guessed, pytorch accumulates gradients by adding them in place (to a variable's/parameter's .grad property).
This can be very useful, since it means that looping over a batch, processing it one sample at a time and accumulating the gradients at the end, will do the same optimization step as a full batched update (which likewise just sums up all the gradients). While a fully batched update can be parallelized more, and is thus generally preferable, there are cases where batched computation is either very, very difficult to implement or simply not possible. Using this accumulation, however, we can still rely on some of the nice stabilizing properties that batching brings (if not on the performance gain).
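To tie the two answers together, here is a minimal runnable sketch of the cat/car network above (all shapes and hyperparameters are made up), backpropagating the two losses sequentially with retain_graph=True on the first pass:
import torch
import torch.nn as nn

# three shared conv layers, then two parallel heads
shared = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
                       nn.Flatten())
cat_head = nn.Linear(8 * 32 * 32, 1)    # "does it contain a cat?"
car_head = nn.Linear(8 * 32 * 32, 1)    # "does it contain a car?"
opt = torch.optim.SGD([*shared.parameters(), *cat_head.parameters(),
                       *car_head.parameters()], lr=1e-2)

x = torch.rand(4, 3, 32, 32)             # dummy batch of images
cat_t, car_t = torch.rand(4, 1), torch.rand(4, 1)

features = shared(x)                     # one forward pass through shared convs
loss_cat = nn.functional.mse_loss(torch.sigmoid(cat_head(features)), cat_t)
loss_car = nn.functional.mse_loss(torch.sigmoid(car_head(features)), car_t)

loss_cat.backward(retain_graph=True)     # keep the shared graph alive
loss_car.backward()                      # last pass frees the graph
opt.step()                               # gradients from both losses accumulated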

Evaluating neural networks built with Comp Graph dl4j

I am trying to build a complex neural network using the Computation Graph implementation in Deeplearning4J. I need to have multiple outputs, which is why I can't go with the generic MultiLayerConfiguration.
However, my problem is that in this case I do not know how to evaluate my model, and I would like at least to know its accuracy.
Has anybody worked with Comp Graphs in dl4j?
First of all, yes: tons of people use the computation graph. They usually start from our existing examples, though, and tend to use it mainly for things like seq2seq.
As for your question on evaluation, it's conceptually the same as for a multi layer network, but how you evaluate is likely going to be task specific. If you think about where evaluation happens, it's always tied to a task (classification, regression, binary classification, ...) with an output layer. In the most common case you have only one output, which produces a classification; in that case you can just use the first array it outputs.
Otherwise, for multiple outputs, you have to define what you're evaluating. Usually the tasks merge into one path.
If they don't, you have multiple output layers and you want one evaluation object per output.
Computation graphs and multi layer networks both use an .output method to give you raw arrays; that is typically what you pass to eval.eval.

Backpropagation through time in stateful RNNs

If I use a stateful RNN in Keras for processing a sequence of length N divided into N parts (each time step is processed individually),
how is backpropagation handled? Does it only affect the last time step, or does it backpropagate through the entire sequence?
If it does not propagate through the entire sequence, is there a way to do this?
The backpropagation horizon is limited to the second dimension of the input sequence, i.e. if your data is of shape (num_sequences, num_time_steps_per_seq, data_dim), then backprop is done over a time horizon of num_time_steps_per_seq. Take a look at
https://github.com/fchollet/keras/issues/3669
There are a couple of things you need to know about RNNs in Keras. By default, the parameter return_sequences=False in all recurrent layers. This means that by default only the activations of the RNN after processing the entire input sequence are returned as output. If you want to have the activations at every time step and optimize every time step separately, you need to pass return_sequences=True as a parameter (https://keras.io/layers/recurrent/#recurrent).
The next thing that is important to know is that all a stateful RNN does is remember the last activation. So if you have a large input sequence and break it up into smaller sequences (which I believe you are doing), the activation in the network is retained after processing the first sequence and therefore affects the activations when processing the second sequence. This has nothing to do with how the network is optimized; the network simply minimizes the difference between the output and the targets you give it.
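Here is a small sketch of both points in Keras (the shapes are hypothetical): backprop spans only the timesteps dimension of each batch, and stateful=True merely carries the final activation over to the next batch without extending the backprop horizon.
import numpy as np
from tensorflow import keras

T, dim = 10, 3                      # timesteps per chunk, feature size
model = keras.Sequential([
    keras.layers.LSTM(16, stateful=True, batch_input_shape=(1, T, dim)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1, T, dim).astype("float32")
y = np.random.rand(1, 1).astype("float32")
model.train_on_batch(x, y)          # gradients flow back through T steps only;
                                    # the state carried over from the previous
                                    # chunk is treated as a constant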
Regarding Q1, how is backpropagation handled? (An RNN is not only fully connected "vertically", as a basic feed-forward network is, but is also considered deep in time, having horizontal backprop connections in the hidden layer.)
Suppose batch_input_shape=(num_seq, 1, data_dim). Then "backprop will be truncated to 1 timestep, as the second dimension is 1. No gradient updates will be performed further back in time than the second dimension's value." - see here
Thus, if time_steps > 1 there, the gradient WILL be updated further back, over as many time steps as are assigned in the second dimension of the input shape.
Set return_sequences=True for all recurrent layers except the last one (which can output just the final step, or be followed by a Dense layer to produce the needed output). True is needed so that each layer passes the full sequence, rolled forward by +1 in the sliding window, on to the next layer, making it possible to backprop against the already estimated weights.
return_state=True is used to get the states returned: 2 state tensors in an LSTM [output, state_h, state_c = layers.LSTM(64, return_state=True, name="encoder")] or 1 state tensor in a GRU. These "can be used in the encoder-decoder sequence-to-sequence model, where the encoder final state is used as the initial state of the decoder."...
But remember (in any case): stateful training does not allow shuffling and is more time-consuming than stateless training.
P.S.
As you can see here, the state tuple is (c, h) in tf but (h, c) in keras; both h and c are elements of the output, so both become important in batched or multi-threaded training.