I have gone through neural networks and have understood the derivation for back propagation almost perfectly(finally!). However, I had a small doubt.
We are updating all the weights simultaneously, so what is the guarantee that they lead to a smaller cost. If the weights are updated one by one, it would definitely lead to a lesser cost and it would be similar to linear regression. But if you update all the weights simultaneously, might we not cross the minima?
Also, do we update the biases like we update the weights after each forward propagation and back propagation of each test case?
Lastly, I have started reading on RNN's. What are some good resources to understand BPTT in RNN's?
Yes, updating only one weight at the time could result in decreasing error value every time but it's usually infeasible to do such updates in practical solutions using NN. Most of today's architectures usually have ~ 10^6 parameters so one epoch for every parameter could last enormously long. Moreover - because of nature of backpropagation - you usually have to compute loads of different derivatives in order to compute derivative with respect to a parameter given - so you will waste a lot of computations when using such approach.
But the phenomenon which you mention has been noticed a long time ago and there are some ways in dealing with it. There are two most common issues connected with it:
Covariance shift: it's when error and weight updates of a layer given strongly depends on output from previous layer, so when you update it - the results in the next layer might be different. The most common way to deal with this problem right now is Batch normalization.
Nolinear function vs Linear Differentation: it's quite uncommon when you think about BP but derivative is a linear operator which might generate a lot of problems in gradient descent. The most countintuitive example is the fact that if you multiply your input by a constant then every derivative will also be multiplied by the same number. This may lead to a lot of problems but most of recent methods of learning do a great job in dealing with it.
About BPTT I stronly recomend you Geoffrey Hinton course about ANN and especially this video.
Related
I've started working on Forward and back propagation of neural networks. I've coded it as-well and works properly too. But i'm confused in the algorithm itself. I'm new to Neural Networks.
So Forward propagation of neural networks is finding the right label with the given weights?
and Back-propagation is using forward propagation to find the most error free parameters by minimizing cost function and using these parameters to help classify other training examples? And this is called a trained Neural Network?
I feel like there is a big blunder in my concept if there is please let me know where i'm wrong and why i am wrong.
I will try my best to explain forward and back propagation in a detailed yet simple to understand manner, although it's not an easy topic to do.
Forward Propagation
Forward propagation is the process in a neural network where-by during the runtime of the network, values are fed into the front of the neural network, (the inputs). You can imagine that these values then travel across the weights which multiply the original value from the inputs by themselves. They then arrive at the hidden layer (neurons). Neurons vary quite a lot based on different types of networks, but here is one way of explaining it. When the values reach the neuron they go through a function where every single value being fed into the neuron is summed up and then fed into an activation function. This activation function can be very different depending on the use-case but let's take for example a linear activation function. It essentially gets the value being fed into it and then it rounds it to a 0 or 1. It is then fed through more weights and then it is spat out into the outputs. Which is the last step into the network.
You can imagine this network with this diagram.
Back Propagation
Back propagation is just like forward propagation except we work backwards from where we were in forward propagation.
The aim of back propagation is to reduce the error in the training phase (trying to get the neural network as accurate as possible). The way this is done is by going backwards through the weights and layers. At each weight the error is calculated and each weight is individually adjusted using an optimization algorithm; optimization algorithm is exactly what it sounds like. It optimizes the weights and adjusts their values to make the neural network more accurate.
Some optimization algorithms include gradient descent and stochastic gradient descent. I will not go through the details in this answer as I have already explained them in some of my other answers (linked below).
The process of calculating the error in the weights and adjusting them accordingly is the back-propagation process and it is usually repeated many times to get the network as accurate as possible. The number of times you do this is called the epoch count. It is good to learn the importance of how you should manage epochs and batch sizes (another topic), as these can severely impact the efficiency and accuracy of your network.
I understand that this answer may be hard to follow, but unfortunately this is the best way I can explain this. It is expected that you might not understand this the first time you read it, but remember this is a complicated topic. I have a linked a few more resources down below including a video (not mine) that explains these processes even better than a simple text explanation can. But I also hope my answer may have resolved your question and have a good day!
Further resources:
Link 1 - Detailed explanation of back-propagation.
Link 2 - Detailed explanation of stochastic/gradient-descent.
Youtube Video 1 - Detailed explanation of types of propagation.
Credits go to Sebastian Lague
I am using Matlab for this project. I have introduced some modifications to the ode45 solver.
I am using sometimes up to 64 components, all in the [0,1] interval and the components sum up to 1.
At some intervals I halt the integration process in order to run a quick check to see whether further integration is needed and I am looking for some clever way to efficiently figure this one.
I have found four cases and I should be able to detect each of them during a check:
1: The system has settled into an equilibrium and all components are unchanged.
2: Three or more components are wildly fluctuating in a periodic manner.
3: One or two components are changing very rapidly with low amplitude and short frequency.
4: None of the above is true and the integration must be continued.
To give an idea: I have found it to be a good practice to use the last ~5k states generated by the ode45 solver to a function for this purpose.
In short: how does one detect equilibrium or a nonchanging periodic pattern during ODE integration?
Steady-state only occurs when the time derivatives your model function computes are all 0. A periodic solution like you described corresponds rather to a limit cycle, i.e. oscillations around an unstable equilibrium. I don't know if there are methods to detect these cycles. I might update my answers to give more info on that. Maybe an idea would be to see if the last part of the signal correlates with itself (with a delay corresponding to the cycle period).
Note that if you are only interested in the steady state, an implicit method like ode15s may be more efficient, as it can "dissipate" all the transient fluctuations and use much larger time steps than explicit methods, which must resolve the transient accurately to avoid exploding. However, they may also dissipate small-amplitude limit cycles. A pragmatic solution is then to slightly perturb the steady-state values and see if an explicit integration converges towards the unperturbed steady-state.
Something I often do is to look at the norm of the difference between the solution at each step and the solution at the last step. If this difference is small for a sufficiently high number of steps, then steady-state is reached. You can also observe how the norm $||frac{dy}{dt}||$ converges to zero.
This question is actually better suited for the computational science forum I think.
I am currently developing a deep reinforcement learning network however, I have a small doubt about the number of q-values I will have at the output of the NN. I will have a total of 150 q-values, which personally seems excessive to me. I have read on several papers and books that this could be a problem. I know that it will depend from the kind of NN I will build, but do you guys think that the number of q-values is too high? should I reduce it?
There is no general principle what is "too much". Everything depends solely on the problem and throughput one can get in learning. In particular number of actions does not have to matter as long as internal parametrisation of Q(a, s) is efficient. To give some example lets assume that the neural network is actually of form NN(a, s) = Q(a, s), in other words it accepts action as an input, together with the state, and outputs the Q value. If such an architecture can be trained in a problem considered, than it might be able to scale to big action spaces; on the other hand if the neural net basically has independent output per action, something of form NN(s)[a] = Q(a, s) then many actions can lead to relatively sparse learning signal for the model and thus lead to slow convergence.
Since you are asking about reducing action space it sounds like the true problem has complex control (maybe it is a continuous control domain?) and you are looking for some discretization to make it simpler to learn. If this is the case you will have to follow the typical approach of trial and error - try with simple action space, observe the dynamics, and if the results are not satisfactory - increase the complexity of the problem. This allows making iterative improvements, as opposed to going in the opposite direction - starting with too complex setting to get any results and than having to reduce it without knowing what are the "reasonable values".
I've implemented NN for OCR. My program had quite good percentage of successful recognitions, but recently ( two months ago ) it's performance decreased at ~23%. After analyzing data, I noticed that some some new irregularities in images appeared ( additional twistings, noise ). In other words, my nn needed to learn some new data, but also it was needed to make sure, that it will not forget old data. In order to achieve it I trained NN on mixture of old and new data and very tricky feature that I tried was prevent weights from changing to much ( initially I limited changes not more then 3%, but later accepted 15%). What else can be done in order to help NN not to "forget" old data?
This is a great question that is currently being actively researched.
It sounds to me as if your original implementation had over-learned from its original dataset, making it unable to effectively generalize for new data. There are many techniques available to prevent this from happening:
Make sure that your network is the smallest size that can still solve the problem.
Use some form of regularization technique. One of my favorites (and the current favorite of researchers) is the dropout technique. Basically every time you feed forward, every neuron has a percent chance of returning 0 instead of its typical activation. Other common techniques include L1, L2 and weight decay.
Play with your learning constant. Perhaps your constant is too high.
Finally continue training in the way you described. Create a buffer of all datapoints (new and old) and train on randomly chosen points in a random order. This will help make sure your network is not falling into a local minimum.
Personally I would try these techniques before trying to limit how a neuron can learn on every iteration. If you are using Sigmoid or Tanh activation, then values around .5 (sigmoid) or 0 (tanh) will have a large derivative and will change rapidly which is one of the advantages of these activations. To achieve a similar, but less obtrusive effect: play with your learning constant. I'm not sure of the size of your net, or the amount of samples you have, but try a learning constant of ~.01
After profiling my Neural Nets' code I've realized that the method, which computes the weight changes for each arc in the network (-rate*gradient + momentum*previous_delta - decay*rate*weight), already given the gradient, is the bottleneck (55% inclusive samples).
Is there any trick to compute these values in a efficient manner?
This is normal behaviour. I am assuming that you are using an iterative process to solve the weights at each evolution step (such as backpropagation?). If the number of neurons is large and the training (back-testing) algorithm is short, then it is normal that weight mutation such as this will consume a larger fraction of compute time during training of the neural network.
Did you get this result using a simple XOR problem or similar? If so, you will probably find that if you start to solve more complex problems (such as pattern detection in multidimensional arrays, image processing, etc.) that those functions will begin to consume an insignificant fraction of compute time.
If you are profiling, I would suggest you profile with a problem that is closer to the purpose for which the neural network is designed (I am guessing you didn't design it to solve XOR or play tic tac toe) and you will probably find that optimising code such as -rate*gradient + momentum*previous_delta - decay*rate*weight is more or less a waste of time, at least this is my experience.
If you do find that this code is compute-intensive in real-world applications then I would suggest trying to reduce the number of times this line of code is executed via structural changes. Neural network optimization is a rich field and I can't possibly give you useful advise from such a broad question, but I will say that if your program is unusually slow, you're unlikely to see significant improvements by tinkering at such low-level code. I will however suggest the following from my own experience:
Consider parallelisation. Many search algorithms such as those implemented in back-propagation techniques are amenable to parallel attempts to improve convergence. As weight-adjustments are identical in terms of computation demand for a given network, think static loops in Open MP.
Modify the convergence criterion (the critical convergence rate before you stop adjustments of weights) to perform less of these calculations
Consider an alternative to deterministic solutions such as back-propagations, which are slightly more prone to local optimisation anyway. Consider gaussian mutation (All things being equal gaussian mutation will 1) reduce time spent on mutation relative to backtesting 2) increase convergence time and 3) be less prone to getting caught in local minima of the error search space)
Please note that this is a non-technical answer to what I have interpreted as a non-technical question.