LSTM NN: forward propagation - neural-network

I am new to neural nets, and am creating a LSTM from scratch. I have the forward propagation working...but I have a few questions about the moving pieces in forward propagation in the context of a trained model, back propagation, and memory management.
So, right now, when I run forward propagation, I stack the new columns, f_t, i_t, C_t, h_t, etc on their corresponding arrays as I accumulate previous positions for the bptt gradient calculations.
My question is 4 part:
1) How far back in time do I need to back propagate in order to retain reasonably long-term memories? (memory stretching back 20-40 time steps is probably what I need for my system (although I could benefit from a much longer time period--that is just the minimum for decent performance--and I'm only shooting for the minimum right now, so I can get it working)
2) Once I consider my model "trained," is there any reason for me to keep more than the 2 time-steps I need to calculate the next C and h values? (where C_t is the Cell state, and h_t is the final output of the LSTM net) in which case I would need multiple versions of the forward propagation function
3) If I have limited time series data on which to train, and I want to train my model, will the performance of my model converge as I train it on the training data over and over (as versus oscillate around some maximal average performance)? And will it converge if I implement dropout?
4) How many components of the gradient do I need to consider? When I calculate the gradient of the various matrices, I get a primary contribution at time step t, and secondary contributions from time step t-1 (and the calculation recurses all the way back to t=0)? (in other words: does the primary contribution dominate the gradient calculation--will the slope change due to the secondary components enough to warrant implementing the recursion as I back propagate time steps...)

As you have observed, it depends on the dependencies in the data. But LSTM can learn to learn longer term dependencies even though we back propagate only a few time steps if we do not reset the cell and hidden states.
No. Given c_t and h_t, you can determine c and h for the next time step. Since you don't need to back propagate, you can throw away c_t (and even h_t if you are just interested in the final LSTM output)
You might converge if you start over fitting. Using Dropout will definitely help avoiding that, especially along with early stopping.
There will be 2 components of gradient for h_t - one for current output and one from the next time step. Once you add the both, you won't have to worry about any other components

Related

Reinforcement Learning- Won't Converge

I'm working on my bachelor thesis.
My topic is reinforcement learning. The Setup:
Unity3D (C#)
Own neural network framework
Confirmed the network working by testing to training a sine-function.
It can approximate it. Well. there are some values which won't get to their desired value but it's good enough.
When training it with single Values it always converges.
Here is my problem:
I try to teach my network the Q-Value-Function of a simple game,
catch balls:
In this game it just has to catch a ball dropping from random position and with random angle.
+1 if catch
-1 if failed
My network-model has 1 hidden layer with neurons ranging from 45-180 (i tested this numbers with no success)
It uses replay with 32 samples from a 100k memory with a learning-rate of 0.0001
It learns for 50000 frames then tests for 10000 frames. This happens 10 times.
Inputs are PlatformPosX, BallPosX, BallPosY from the last 4 frames
Pseudocode:
Choose action (e-greedy)
Do action,
Store state action, CurrentReward. Done in memory
if in learnphase: Replay
My problem is:
Its actions starts clipping to either 0 or 1 with some variance sometimes.
It never has a ideal policy like if the platform would just follow the ball.
EDIT:
Sorry for cheap info...
My Quality-Function is trained by:
Reward + Gamma(nextEstimated_Reward)
So its discounting.
Why would you possibly expect that to work?
Your training can barely approximate a 1-dimensional function. And now you expect it to solve a 12-dimensional function which involves a differential equation? You should have verified first whether your training does even converge for a multi dimensional function at all, with the chosen training parameters.
Your training, given the little detail you provided, also appears to be unsuitable. There is hardly a chance it ever successfully catches the ball, and even when it does, you are rewarding it mostly for random outputs. Only correlation between in- and output is in the last few frames when the pad can only reach the target in time by a limited set of possible actions.
Then there is the choice of inputs. Don't require your model to differentiate by itself. Relevant inputs would had been x, y, dx, dy. Preferably even x, y relative to pad position, not world. Should have a much better chance to converge. Even if it was only learning to keep x minimal.
Working with absolute world coordinates is pretty much bound to fail, as it would require the training to cover the entire range of possible input combinations. And also the network to be big enough to even store all the combinations. Be aware that the network isn't learning the actual function, it's learning an approximation for every single possible set of inputs. Even if the ideal solution is actually just a linear equation, the non linear properties of the activation function make it impossible to learn it in a generalized form for unbound inputs.

Predictive curve fitting matlab

I have a question about curve fitting, I have many curves like the one in the picture.
X axis : time
Y axis : temperature
Each sample comes out every 30s.
GOAL : predict the value at the end of the transient
What would you do in this situation?
What I am doing is this :
for every new sample I start a new fitting (and so each fitting is independent from the previous one) and check the value of the fitted curve 2 hours (all curves I have set before 2h) after the start of the measurement. If for a number (let's say 5) of subsequent fitting the value in the future stays more or less the same(+-0.2°C) I so assume that the estimation is the right one.
This approach seems to me far too simple and I think I am not exploiting all information. For example the info of the error I am making punctually (e.g. at minute 4:00 I predict and at 4:30 I see that I am doing an error).
In the picture the red part of the curve is excluded (but the real data in the future passes through it). the estimation is the blue one. You see in this case I don't have a good prediction... In general I have also more flat curves.
Based on the comments above, I tried to formulate an answer as no one else is giving some input.
I think your are using a good basic procedure. Better results may be obtained by using a more appropriate fitting curve, which includes all the dominant dynamics, but avoids overfitting of the data. Based on your figure, the simplest form I could think of is:
s + a(1-e^(-t/tau))
with parameters s (the initial temperature), a (amplitude = steady state value) and tau (dominant time constant). As you mentioned yourself, limiting the allowed range for the parameters may avoid overfitting and increase the quality of your estimation.
Using a random high order function, like you are using now, may give good interpolation results, but are dangerous to use for extrapolation, because strange effects may occur outside the fitting region.
Alternatives
Using the error (eg. correcting for the extrapolated error) may be possible, but is tricky and may not always give good results.
Training a neural network to perform the estimation is probably overkill, but may give better results if applied correctly. Note that you need a lot of training data which should be representative for the data for which you will use the neural network later on.

Neural Networks back propogation

I have gone through neural networks and have understood the derivation for back propagation almost perfectly(finally!). However, I had a small doubt.
We are updating all the weights simultaneously, so what is the guarantee that they lead to a smaller cost. If the weights are updated one by one, it would definitely lead to a lesser cost and it would be similar to linear regression. But if you update all the weights simultaneously, might we not cross the minima?
Also, do we update the biases like we update the weights after each forward propagation and back propagation of each test case?
Lastly, I have started reading on RNN's. What are some good resources to understand BPTT in RNN's?
Yes, updating only one weight at the time could result in decreasing error value every time but it's usually infeasible to do such updates in practical solutions using NN. Most of today's architectures usually have ~ 10^6 parameters so one epoch for every parameter could last enormously long. Moreover - because of nature of backpropagation - you usually have to compute loads of different derivatives in order to compute derivative with respect to a parameter given - so you will waste a lot of computations when using such approach.
But the phenomenon which you mention has been noticed a long time ago and there are some ways in dealing with it. There are two most common issues connected with it:
Covariance shift: it's when error and weight updates of a layer given strongly depends on output from previous layer, so when you update it - the results in the next layer might be different. The most common way to deal with this problem right now is Batch normalization.
Nolinear function vs Linear Differentation: it's quite uncommon when you think about BP but derivative is a linear operator which might generate a lot of problems in gradient descent. The most countintuitive example is the fact that if you multiply your input by a constant then every derivative will also be multiplied by the same number. This may lead to a lot of problems but most of recent methods of learning do a great job in dealing with it.
About BPTT I stronly recomend you Geoffrey Hinton course about ANN and especially this video.

How to prevent NN from forgetting old data

I've implemented NN for OCR. My program had quite good percentage of successful recognitions, but recently ( two months ago ) it's performance decreased at ~23%. After analyzing data, I noticed that some some new irregularities in images appeared ( additional twistings, noise ). In other words, my nn needed to learn some new data, but also it was needed to make sure, that it will not forget old data. In order to achieve it I trained NN on mixture of old and new data and very tricky feature that I tried was prevent weights from changing to much ( initially I limited changes not more then 3%, but later accepted 15%). What else can be done in order to help NN not to "forget" old data?
This is a great question that is currently being actively researched.
It sounds to me as if your original implementation had over-learned from its original dataset, making it unable to effectively generalize for new data. There are many techniques available to prevent this from happening:
Make sure that your network is the smallest size that can still solve the problem.
Use some form of regularization technique. One of my favorites (and the current favorite of researchers) is the dropout technique. Basically every time you feed forward, every neuron has a percent chance of returning 0 instead of its typical activation. Other common techniques include L1, L2 and weight decay.
Play with your learning constant. Perhaps your constant is too high.
Finally continue training in the way you described. Create a buffer of all datapoints (new and old) and train on randomly chosen points in a random order. This will help make sure your network is not falling into a local minimum.
Personally I would try these techniques before trying to limit how a neuron can learn on every iteration. If you are using Sigmoid or Tanh activation, then values around .5 (sigmoid) or 0 (tanh) will have a large derivative and will change rapidly which is one of the advantages of these activations. To achieve a similar, but less obtrusive effect: play with your learning constant. I'm not sure of the size of your net, or the amount of samples you have, but try a learning constant of ~.01

3rd-order rate limiter in Simulink? How to generate smooth triggered signals?

First for those, who are not familiar with Simulink, there is a imaginable outside-Simulink partial solution:
I need to create a vector satisfying the following conditions:
known initial value a1
known final value a2
it has a pre-defined step size, but the length is not pre-determined
the first derivative over the whole range is limited to v_max resp. -v_max
the second derivative over the whole range is limited to a_max resp. -a_max
the third derivative over the whole range is limited to j_max resp. -j_max
at the first and the final point all derivatives are zero.
Before you ask "what have you tried so far", I just had the idea to solve it outside Simulink and I tried the whole stuff below ;)
But maybe you guys have a good idea, while I keep working on my own solution.
I'd like to generate smooth ramp signals (3rd derivative limited) based on a trigger signal in Simulink.
To get a triggered step I created a triggered subsystem propagating the trigger output. It looks like that:
But I actually don't want a step, I need a very smooth ramp with limited derivatives up to the 3rd order. The math behind is:
displacement: x
speed: v = x'
acceleration: a = v' = x''
jerk: j = a' = v'' = x'''
(If this looks familiar to you, I once had a very similar question. I thought about a bounty on it, but after the necessary edit of the question both answers would have been invalid)
As there are just rate limiters of 1st order, I used two derivates and a double integration to resolve my problem. But there is a mayor drawback, I can not ignore anymore. For the sake of illustration I chose a relatively big step size of 0.1.
The complete minimal example (Fixed Step, stepsize: 0.1, ode4): Download here
It can be seen, that the signal not even reaches the intended step height of 10 and furthermore is not constant at the end.
Over the development process of my whole model, this approach was satisfactory enough for small step sizes. But I reached the point where I really need the smooth ramp as intended. That means I need a finally constant signal at exactly the value, specified by the step height gain.
I already spent days to resolve the problem, and hope to fine some help here now.
Some of my ideas:
dynamically increase the step height over the actual desired value and saturate the final output. If the rate limits,step height and the simulation step size wouldn't be flexible one could probably find a satisfying solution. But as everything has to be flexible, there are too much cases where the acceleration and jerk limit is violated.
I tried to use the Matlab function block and write my own 3rd order rate limiter. Though it seems possible for me for the trigger moment, I have no solution how to smooth the "deceleration" at the end of the ramp. Also I'd need C-compilers, which would make it hard to use my model on other systems without problems. (At least I think so.)
The solver can not be changed siginificantly (either ode3 or ode4) and a fixed step size is mandatory (0.00001 to 0.01).
Currently used, not really useful approach:
For a dynamic amplification of 1.07 I get the following output (all values normalised on their limits):
Though the displacement looks nice, the violation of the acceleration limit is very harmful.
For a dynamic amplification of 1.05 I get the following output (all values normalised on their limits):
The acceleration stays in its boundaries, but the displacement does not reach the intended value. (not really clear in the picture) The jerk is still to big. (I could live with that, but it's not nice)
So it appears to me that a inside-Simulink solutions is far from reality. Any ideas how to create a well-behaving custom function block?
Simulation step size, step height, and the rate limits are known before the simulation starts. (But I have a lot of these triggered smooth ramps in a row, it should feed a event-discrete control). So I could imagine to create the whole smooth ramp outside simulink and save it as a timeseries object and append it on the current signal when the trigger is activated.
The problems you see are because the difference is not conditioned very well.
Taking the difference amplifies the numerical that exists in your simulation.
Also the jerk will always be large if you try to apply an actual step.
I guess for your approach it would be better to work the other way around:
i.e. make a jerk, acceleration and velocity with which your step is achieved.
I think your looking for something like the ref3 block:
http://www.dct.tue.nl/home_of_ref3.htm
Note the disclaimer on the site and that it is a little cumbersome to use.
An easy (yet to be improved) way is to use a rate limiter and then a state space model with a filter. From the filter you get the velocity, which in turn you can apply a rate limiter to. You continue with rate-limiter and filters until you have the desired curve.
Otherwise you can come up with numerical rate-limiters of higher order using ie runge kutta formulas or finite differences. However it was pointed out, that they may suffer from bad conditioning.
What I usually do is to use one rate limiter and a filter of 3rd Order and just tune the time constant (1 tripple pole), such that my needs are met. This works well, esp
Integrator chains of length > 1 are unstable!
There is a huge field of research dealing with trajectory planning. The easiest way might be to use FIR filters (Biagotti et al) or to implement an online trajectory planner (Ezair et al 2014 / Knierim et al 2012).