Understanding multi-layer LSTM

I'm trying to understand and implement a multi-layer LSTM. The problem is that I don't know how the layers connect. I have two hypotheses in mind:
1. At each timestep, the hidden state H of the first LSTM becomes the input of the second LSTM.
2. At each timestep, the hidden state H of the first LSTM becomes the initial value for the hidden state of the second LSTM, and the input of the first LSTM becomes the input of the second LSTM.
Please help!

TLDR: Each LSTM cell at time t and level l has input x(t) and hidden state h(l, t).
In the first layer, the inputs are the actual sequence input x(t) and the previous hidden state h(l, t-1); in each subsequent layer, the input is the hidden state of the corresponding cell in the previous layer, h(l-1, t).
From https://arxiv.org/pdf/1710.02254.pdf:
To increase the capacity of GRU networks (Hermans and Schrauwen 2013), recurrent layers can be stacked on top of each other.
Since GRU does not have two output states, the same output hidden state h'2 is passed to the next vertical layer. In other words, the h1 of the next layer will be equal to h'2. This forces GRU to learn transformations that are useful along depth as well as time.
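A minimal sketch of this wiring, using PyTorch's LSTMCell (the sizes and variable names here are illustrative, not from the question):

```python
import torch
import torch.nn as nn

input_dim, hidden_dim = 8, 16
cell1 = nn.LSTMCell(input_dim, hidden_dim)
cell2 = nn.LSTMCell(hidden_dim, hidden_dim)  # input size = layer 1's hidden size

x_seq = torch.randn(10, 1, input_dim)        # (time, batch, features)
h1, c1 = torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim)
h2, c2 = torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim)

for x_t in x_seq:
    h1, c1 = cell1(x_t, (h1, c1))  # layer 1 reads the actual input x(t)
    h2, c2 = cell2(h1, (h2, c2))   # layer 2 reads h(l-1, t), not x(t)
```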

I'm taking help from colah's blog post, cutting it down to the specific part you need.
As the diagrams in that post show, LSTMs have a chain-like structure, and each cell contains four neural network layers.
The value we pass to the next timestep and the value we pass up to the next layer are basically the same: the hidden state, which is the desired output. This output is based on our cell state, but is a filtered version of it. First, we run a sigmoid layer that decides what parts of the cell state we're going to output. Then we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to pass.
We also pass the previous cell state (the top arrow) along to the next timestep, and a sigmoid layer (the forget gate) decides how much of that information to keep, taking help of the new input and the input from the previous state.
Hope this helps.
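A rough sketch of a single cell step implementing the gating described above (NumPy, with illustrative weight names; in a real layer the W and b are learned):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W_f, W_i, W_g, W_o, b_f, b_i, b_g, b_o):
    z = np.concatenate([h_prev, x])   # previous hidden state + new input
    f = sigmoid(W_f @ z + b_f)        # forget gate: how much of c_prev to keep
    i = sigmoid(W_i @ z + b_i)        # input gate: how much new info to add
    g = np.tanh(W_g @ z + b_g)        # candidate cell values
    c = f * c_prev + i * g            # new cell state (the top arrow)
    o = sigmoid(W_o @ z + b_o)        # output gate: which parts of c to expose
    h = o * np.tanh(c)                # filtered output, sent up and to the right
    return h, c
```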

In PyTorch, the multilayer LSTM implementation makes clear that the hidden state of the previous layer becomes the input to the next layer, so your first assumption is correct.
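For example, nn.LSTM with num_layers=2 behaves like two stacked single-layer LSTMs, with the first layer's hidden states fed to the second layer as input:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2)

x = torch.randn(10, 1, 8)            # (seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([10, 1, 16]) -- per-step hidden states of the top layer
print(h_n.shape)     # torch.Size([2, 1, 16])  -- final hidden state of each of the 2 layers
```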

There's no definitive answer. It depends on your problem, and you should try different things.
The simplest thing you can do is to pipe the output of the first LSTM (not its final hidden state) into the second LSTM layer as input, instead of applying some loss to it. That should work in most cases.
You can try piping the hidden state as well, but I haven't seen that done very often.
You can also try other combinations. Say, for the second layer, you feed in the output of the first layer together with the original input (one such combination is sketched below). Or you feed in the first layer's output from both the current unit and the previous one.
It all depends on your problem and you need to experiment to see what works for you.
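For instance, the "first layer's output plus the original input" variant might look like this in Keras (a sketch only; the sizes are made up):

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(None, 8))                 # (time, features)
h1 = layers.LSTM(32, return_sequences=True)(inputs)   # first LSTM's output at every step
combined = layers.Concatenate()([h1, inputs])         # second layer sees both
h2 = layers.LSTM(32)(combined)
outputs = layers.Dense(1)(h2)
model = keras.Model(inputs, outputs)
```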

Related

Confusion about many-to-one, many-to-many LSTM architectures in Keras

My task is the following: I have a (black box) method which computes a sequence starting from an initial element. At each step, my method reads an input from an external source of memory and outputs an action which potentially changes this memory. You can think of this method as a function f: (external state, reading) -> action. I want to train an ANN to learn f(), which means I want to be able to take my trained model, feed it an input, get the predicted action, use it to change the external state and repeat this process indefinitely, one step at a time.
Because of the nature of f(), I know the ANN must be recurrent and stateful, but I'm not so sure about the rest. It makes sense to train it to map sequences of readings to sequences of actions, but only if the model is able to fuse each reading with the action it output in the previous step, and I'm not sure how to enforce that.
But most importantly: After training my model with a given sequence length (readings^N -> actions^N), how can I make it output predictions one step at a time (sequence length = 1)? Is this possible?

PReLU Activation Function update rule

I just finished reading the Delving Deep into Rectifiers paper, which proposes a new activation function called PReLU. Maybe it is obvious, since the paper does not mention it, but I want to know: when is the parameter 'a' of PReLU updated? Is it updated before the weight update or after the weight update, or is it updated simultaneously with the weights?
The parameters are all updated sequentially as the error signal propagates back through each layer of the network. So the bias and the 'a' parameter are both updated before the signal is passed on to the next layer, and hence before the weight updates below them in the network.
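For reference, PReLU is f(x) = x for x > 0 and f(x) = a·x otherwise, so the gradient with respect to 'a' is just the negative part of the input. A minimal sketch (not from the paper's code):

```python
import numpy as np

def prelu_forward(x, a):
    return np.where(x > 0, x, a * x)

def prelu_grads(x, a, grad_out):
    grad_x = np.where(x > 0, 1.0, a) * grad_out          # df/dx: 1 above zero, a below
    grad_a = np.sum(np.where(x > 0, 0.0, x) * grad_out)  # df/da = x for x <= 0, else 0
    return grad_x, grad_a
```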

How to fine tune an FCN-32s for interactive object segmentation

I'm trying to implement the model proposed in a CVPR paper (Deep Interactive Object Selection), in which the data set contains 5 channels for each input sample:
1. Red
2. Blue
3. Green
4. Euclidean distance map associated with positive clicks
5. Euclidean distance map associated with negative clicks
To do this, I should fine-tune the FCN-32s network using "object binary masks" as labels.
Since the first conv layer has 2 extra channels, I did net surgery to use the pretrained parameters for the first 3 channels and Xavier initialization for the 2 extra ones.
For the rest of the FCN architecture, I have these questions:
Should I freeze all the layers before "fc6" (except the first conv layer)? If yes, how will the extra channels of the first conv layer be learned? Are the gradients strong enough to reach the first conv layer during training?
What should the kernel size of "fc6" be? Should I keep 7? I saw in the "Caffe net_surgery" notebook that it depends on the output size of the last layer ("pool5").
The main problem is the number of outputs of the "score_fr" and "upscore" layers. Since I'm not doing class segmentation (which would use 21 outputs for 20 classes plus background), how should I change it? What about 2 (one for the object and one for the non-object/background area)?
Should I change the "crop" layer "offset" to 32 to get center crops?
If I change any of these layers, what is the best initialization strategy for them? "bilinear" for "upscore" and "Xavier" for the rest?
Should I convert my binary label matrix values to be zero-centered ({-0.5, 0.5}), or is it OK to use them with values in {0, 1}?
Any useful idea will be appreciated.
PS:
I'm using Euclidean loss, with "1" as the number of outputs for the "score_fr" and "upscore" layers. If I used 2 instead, I guess the loss should be softmax.
I can answer some of your questions.
The gradients will reach the first layer, so it should be possible to learn its weights even if you freeze the other layers.
Change num_output to 2 and fine-tune; you should get a good output.
I think you'll need to experiment with each of the other options and see how the accuracy turns out.
You can use the values {0, 1}.
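For what it's worth, here is a sketch of the net-surgery step described in the question, written in PyTorch rather than Caffe for brevity (the layer shapes are stand-ins, not the actual FCN-32s ones):

```python
import torch
import torch.nn as nn

pretrained_conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # stands in for the pretrained layer
new_conv1 = nn.Conv2d(5, 64, kernel_size=3, padding=1)         # 3 color channels + 2 distance maps

with torch.no_grad():
    new_conv1.weight[:, :3] = pretrained_conv1.weight   # reuse pretrained filters for the first 3 channels
    new_conv1.bias.copy_(pretrained_conv1.bias)
    nn.init.xavier_normal_(new_conv1.weight[:, 3:])     # Xavier init for the 2 extra channels
```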

Backpropagation through time in stateful RNNs

If I use a stateful RNN in Keras for processing a sequence of length N divided into N parts (each time step is processed individually),
how is backpropagation handled? Does it only affect the last time step, or does it backpropagate through the entire sequence?
If it does not propagate through the entire sequence, is there a way to do this?
The backpropagation horizon is limited to the second dimension of the input sequence, i.e. if your data has shape (num_sequences, num_time_steps_per_seq, data_dim), then backprop is done over a time horizon of num_time_steps_per_seq. Take a look at
https://github.com/fchollet/keras/issues/3669
There are a couple of things you need to know about RNNs in Keras. By default, the parameter return_sequences=False in all recurrent layers. This means that by default only the activations of the RNN after processing the entire input sequence are returned as output. If you want to have the activations at every time step and optimize every time step separately, you need to pass return_sequences=True as a parameter (https://keras.io/layers/recurrent/#recurrent).
The next important thing to know is that all a stateful RNN does is remember the last activation. So if you have a large input sequence and break it up into smaller sequences (which I believe you are doing), the activation in the network is retained after processing the first sequence and therefore affects the activations when processing the second sequence. This has nothing to do with how the network is optimized; the network simply minimizes the difference between the output and the targets you give it.
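A sketch of that setup (the shapes are illustrative): backprop runs only within each 20-step window, while stateful=True carries the activations, but not the gradients, across windows.

```python
from tensorflow import keras
from tensorflow.keras import layers

# One long sequence processed in windows of 20 steps.
model = keras.Sequential([
    layers.LSTM(32, stateful=True, return_sequences=True,
                batch_input_shape=(1, 20, 8)),   # (batch, time_steps, features)
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# After a full pass over the long sequence, reset the carried state:
model.reset_states()
```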
On Q1, how is backpropagation handled: an RNN is not only fully connected vertically, as in a basic feed-forward network, but is also deep in time, with horizontal connections in the hidden layer that backprop must traverse.
Suppose batch_input_shape=(num_seq, 1, data_dim). Then "backprop will be truncated to 1 timestep, as the second dimension is 1. No gradient updates will be performed further back in time than the second dimension's value" (see the GitHub issue linked above).
Thus, if time_steps > 1 there, the gradient WILL be propagated further back, over the number of timesteps assigned in the second dimension of input_shape.
Set return_sequences=True for all recurrent layers except the last one (use its output as needed, or add a Dense layer on top for the desired output). This is needed so that each layer passes the full sequence on to the next, making it possible to backprop through the stacked layers.
return_state=True is used to get the states returned: 2 state tensors for an LSTM [output, state_h, state_c = layers.LSTM(64, return_state=True, name="encoder")], or 1 state tensor for a GRU. This "can be used in the encoder-decoder sequence-to-sequence model, where the encoder final state is used as the initial state of the decoder" (see the sketch below).
But remember (in any case): stateful training does not allow shuffling, and is more time-consuming than stateless training.
P.S. Note that TF orders the states as (c, h) while Keras uses (h, c); both h and c are elements of the output, which becomes important in batched or multi-threaded training.
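A sketch of the encoder-decoder use of return_state mentioned above (layer sizes are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

enc_in = keras.Input(shape=(None, 8))
# An LSTM with return_state=True returns its output plus the two state tensors (h, c).
enc_out, state_h, state_c = layers.LSTM(64, return_state=True, name="encoder")(enc_in)

dec_in = keras.Input(shape=(None, 8))
# The encoder's final state seeds the decoder's initial state.
dec_out = layers.LSTM(64, return_sequences=True, name="decoder")(
    dec_in, initial_state=[state_h, state_c])
model = keras.Model([enc_in, dec_in], layers.Dense(8)(dec_out))
```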

Neural network to figure out the meaning of user input

I'm trying to make a neural network figure out the meaning of input (keyboard keys in this case) according to the user.
I have multiple possible output "commands" that the NN can interpret the inputs to mean, and in each state certain outputs count as beneficial while others are a detriment.
When the NN starts up for the first time, no input should have any particular meaning to it, but as time goes on I want the NN to be able to figure out what the user most likely meant.
I've tried a multilayer perceptron with as many input nodes as there are physical inputs, as many output nodes as there are commands, and a single hidden layer with a number of nodes equal to the sum of the other two layers; in this case that makes it a 5-15-10 network.
The NN assumes that the user will only make moves that are in the NN's best interest.
So far it seems the NN is just figuring out which command is most likely to result in a beneficial move, regardless of the input key, rather than figuring out which key should map to which move according to the user.
Because of this I'm wondering (most likely wrongly) if I should build a separate NN for each input to figure out the intended output according to the user.
Is there a different type of NN I should look into that will work better, and is there a recommended configuration for this problem?
I'll be happy with some recommendations of reading material that would help in this particular problem.
I'm at best an amateur in NNs and would like to learn a lot more about the whole field, but I'm trying to focus my efforts on this problem for now.
In my view, you want the output to follow the behaviour of the player, since the number of possible inputs is larger than in the simple case. So there should be some kind of memory of the actions taken by the player, in order to find the patterns. This can be done using a Long Short-Term Memory (LSTM) network.
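A minimal sketch of that idea, assuming sequences of one-hot encoded key presses (the 5 inputs and 10 commands come from the question; everything else is assumed):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.LSTM(32, input_shape=(None, 5)),   # variable-length sequences of 5 possible keys
    layers.Dense(10, activation="softmax"),   # probability over the 10 commands
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```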