When to reset the lottery ticket hypothesis algorithm to find initialization? - neural-network

According to The Lottery Ticket Hypothesis paper, there are two types of pruning strategies, one-shot pruning, and iterative pruning. Both are explained on page 2. Finding initialization for one-shot pruning is easy because we train the network for j iterations and then, reset the weights to the initialization using the obtained mask. What I do not understand is the iterative pruning. On page 2, it says:
we focus on iterative pruning, which repeatedly trains, prunes, and
resets the network over n rounds;
What does resets the network over n rounds mean? Does it mean, at each round of pruning we reset the network weights to the initialization using the obtained mask for the current level of pruning? Or it means, we do train and prune the network iteratively without resetting to the initialization, then after n levels of pruning, we will reset to the initialization using the last mask we have?

The weights are reset to the initial values every time.
The Lottery Ticket Hypothesis relies on the initial weights staying constant. If the starting weights are changed, then the masked subnetwork is no longer effective. So, they must be reset every time.
The authors demonstrated this point experimentally, and summarized on page 5.
This experiment supports the lottery ticket hypothesis’ emphasis on initialization:
the original initialization withstands and benefits from pruning, while the random reinitialization’s performance immediately suffers and diminishes steadily.

Related

Episodic Semi-gradient Sarsa with Neural Network

While trying to implement the Episodic Semi-gradient Sarsa with a Neural Network as the approximator I wondered how I choose the optimal action based on the currently learned weights of the network. If the action space is discrete I can just calculate the estimated value of the different actions in the current state and choose the one which gives the maximimum. But this seems to be not the best way of solving the problem. Furthermore, it does not work if the action space can be continous (like the acceleration of a self-driving car for example).
So, basicly I am wondering how to solve the 10th line Choose A' as a function of q(S', , w) in this pseudo-code of Sutton:
How are these problems typically solved? Can one recommend a good example of this algorithm using Keras?
Edit: Do I need to modify the pseudo-code when using a network as the approximator? So, that I simply minimize the MSE of the prediction of the network and the reward R for example?
I wondered how I choose the optimal action based on the currently learned weights of the network
You have three basic choices:
Run the network multiple times, once for each possible value of A' to go with the S' value that you are considering. Take the maximum value as the predicted optimum action (with probability of 1-ε, otherwise choose randomly for ε-greedy policy typically used in SARSA)
Design the network to estimate all action values at once - i.e. to have |A(s)| outputs (perhaps padded to cover "impossible" actions that you need to filter out). This will alter the gradient calculations slightly, there should be zero gradient applied to last layer inactive outputs (i.e. anything not matching the A of (S,A)). Again, just take the maximum valid output as the estimated optimum action. This can be more efficient than running the network multiple times. This is also the approach used by the recent DQN Atari games playing bot, and AlphaGo's policy networks.
Use a policy-gradient method, which works by using samples to estimate gradient that would improve a policy estimator. You can see chapter 13 of Sutton and Barto's second edition of Reinforcement Learning: An Introduction for more details. Policy-gradient methods become attractive for when there are large numbers of possible actions and can cope with continuous action spaces (by making estimates of the distribution function for optimal policy - e.g. choosing mean and standard deviation of a normal distribution, which you can sample from to take your action). You can also combine policy-gradient with a state-value approach in actor-critic methods, which can be more efficient learners than pure policy-gradient approaches.
Note that if your action space is continuous, you don't have to use a policy-gradient method, you could just quantise the action. Also, in some cases, even when actions are in theory continuous, you may find the optimal policy involves only using extreme values (the classic mountain car example falls into this category, the only useful actions are maximum acceleration and maximum backwards acceleration)
Do I need to modify the pseudo-code when using a network as the approximator? So, that I simply minimize the MSE of the prediction of the network and the reward R for example?
No. There is no separate loss function in the pseudocode, such as the MSE you would see used in supervised learning. The error term (often called the TD error) is given by the part in square brackets, and achieves a similar effect. Literally the term ∇q(S,A,w) (sorry for missing hat, no LaTex on SO) means the gradient of the estimator itself - not the gradient of any loss function.

How can I improve the performance of a feedforward network as a q-value function approximator?

I'm trying to navigate an agent in a n*n gridworld domain by using Q-Learning + a feedforward neural network as a q-function approximator. Basically the agent should find the best/shortest way to reach a certain terminal goal position (+10 reward). Every step the agent takes it gets -1 reward. In the gridworld there are also some positions the agent should avoid (-10 reward, terminal states,too).
So far I implemented a Q-learning algorithm, that saves all Q-values in a Q-table and the agent performs well.
In the next step, I want to replace the Q-table by a neural network, trained online after every step of the agent. I tried a feedforward NN with one hidden layer and four outputs, representing the Q-values for the possible actions in the gridworld (north,south,east, west).
As input I used a nxn zero-matrix, that has a "1" at the current positions of the agent.
To reach my goal I tried to solve the problem from the ground up:
Explore the gridworld with standard Q-Learning and use the Q-map as training data for the Network once Q-Learning is finished
--> worked fine
Use Q-Learning and provide the updates of the Q-map as trainingdata
for NN (batchSize = 1)
--> worked good
Replacy the Q-Map completely by the NN. (This is the point, when it gets interesting!)
-> FIRST MAP: 4 x 4
As described above, I have 16 "discrete" Inputs, 4 Output and it works fine with 8 neurons(relu) in the hidden layer (learning rate: 0.05). I used a greedy policy with an epsilon, that reduces from 1 to 0.1 within 60 episodes.
The test scenario is shown here. Performance is compared beetween standard qlearning with q-map and "neural" qlearning (in this case i used 8 neurons and differnt dropOut rates).
To sum it up: Neural Q-learning works good for small grids, also the performance is okay and reliable.
-> Bigger MAP: 10 x 10
Now I tried to use the neural network for bigger maps.
At first I tried this simple case.
In my case the neural net looks as following: 100 input; 4 Outputs; about 30 neurons(relu) in one hidden layer; again I used a decreasing exploring factor for greedy policy; over 200 episodes the learning rate decreases from 0.1 to 0.015 to increase stability.
At frist I had problems with convergence and interpolation between single positions caused by the discrete input vector.
To solve this I added some neighbour positions to the vector with values depending on thier distance to the current position. This improved the learning a lot and the policy got better. Performance with 24 neurons is seen in the picture above.
Summary: the simple case is solved by the network, but only with a lot of parameter tuning (number of neurons, exploration factor, learning rate) and special input transformation.
Now here are my questions/problems I still haven't solved:
(1) My network is able to solve really simple cases and examples in a 10 x 10 map, but it fails as the problem gets a bit more complex. In cases where failing is very likely, the network has no change to find a correct policy.
I'm open minded for any idea that could improve performace in this cases.
(2) Is there a smarter way to transform the input vector for the network? I'm sure that adding the neighboring positons to the input vector on the one hand improve the interpolation of the q-values over the map, but on the other hand makes it harder to train special/important postions to the network. I already tried standard cartesian two-dimensional input (x/y) on an early stage, but failed.
(3) Is there another network type than feedforward network with backpropagation, that generally produces better results with q-function approximation? Have you seen projects, where a FF-nn performs well with bigger maps?
It's known that Q-Learning + a feedforward neural network as a q-function approximator can fail even in simple problems [Boyan & Moore, 1995].
Rich Sutton has a question in the FAQ of his web site related with this.
A possible explanation is the phenomenok known as interference described in [Barreto & Anderson, 2008]:
Interference happens when the update of one state–action pair changes the Q-values of other pairs, possibly in the wrong direction.
Interference is naturally associated with generalization, and also happens in conventional supervised learning. Nevertheless, in the reinforcement learning paradigm its effects tend to be much more harmful. The reason for this is twofold. First, the combination of interference and bootstrapping can easily become unstable, since the updates are no longer strictly local. The convergence proofs for the algorithms derived from (4) and (5) are based on the fact that these operators are contraction mappings, that is, their successive application results in a sequence converging to a fixed point which is the solution for the Bellman equation [14,36]. When using approximators, however, this asymptotic convergence is lost, [...]
Another source of instability is a consequence of the fact that in on-line reinforcement learning the distribution of the incoming data depends on the current policy. Depending on the dynamics of the system, the agent can remain for some time in a region of the state space which is not representative of the entire domain. In this situation, the learning algorithm may allocate excessive resources of the function approximator to represent that region, possibly “forgetting” the previous stored information.
One way to alleviate the interference problem is to use a local function approximator. The more independent each basis function is from each other, the less severe this problem is (in the limit, one has one basis function for each state, which corresponds to the lookup-table case) [86]. A class of local functions that have been widely used for approximation is the radial basis functions (RBFs) [52].
So, in your kind of problem (n*n gridworld), an RBF neural network should produce better results.
References
Boyan, J. A. & Moore, A. W. (1995) Generalization in reinforcement learning: Safely approximating the value function. NIPS-7. San Mateo, CA: Morgan Kaufmann.
André da Motta Salles Barreto & Charles W. Anderson (2008) Restricted gradient-descent algorithm for value-function approximation in reinforcement learning, Artificial Intelligence 172 (2008) 454–482

Neural Networks back propogation

I have gone through neural networks and have understood the derivation for back propagation almost perfectly(finally!). However, I had a small doubt.
We are updating all the weights simultaneously, so what is the guarantee that they lead to a smaller cost. If the weights are updated one by one, it would definitely lead to a lesser cost and it would be similar to linear regression. But if you update all the weights simultaneously, might we not cross the minima?
Also, do we update the biases like we update the weights after each forward propagation and back propagation of each test case?
Lastly, I have started reading on RNN's. What are some good resources to understand BPTT in RNN's?
Yes, updating only one weight at the time could result in decreasing error value every time but it's usually infeasible to do such updates in practical solutions using NN. Most of today's architectures usually have ~ 10^6 parameters so one epoch for every parameter could last enormously long. Moreover - because of nature of backpropagation - you usually have to compute loads of different derivatives in order to compute derivative with respect to a parameter given - so you will waste a lot of computations when using such approach.
But the phenomenon which you mention has been noticed a long time ago and there are some ways in dealing with it. There are two most common issues connected with it:
Covariance shift: it's when error and weight updates of a layer given strongly depends on output from previous layer, so when you update it - the results in the next layer might be different. The most common way to deal with this problem right now is Batch normalization.
Nolinear function vs Linear Differentation: it's quite uncommon when you think about BP but derivative is a linear operator which might generate a lot of problems in gradient descent. The most countintuitive example is the fact that if you multiply your input by a constant then every derivative will also be multiplied by the same number. This may lead to a lot of problems but most of recent methods of learning do a great job in dealing with it.
About BPTT I stronly recomend you Geoffrey Hinton course about ANN and especially this video.

LSTM NN: forward propagation

I am new to neural nets, and am creating a LSTM from scratch. I have the forward propagation working...but I have a few questions about the moving pieces in forward propagation in the context of a trained model, back propagation, and memory management.
So, right now, when I run forward propagation, I stack the new columns, f_t, i_t, C_t, h_t, etc on their corresponding arrays as I accumulate previous positions for the bptt gradient calculations.
My question is 4 part:
1) How far back in time do I need to back propagate in order to retain reasonably long-term memories? (memory stretching back 20-40 time steps is probably what I need for my system (although I could benefit from a much longer time period--that is just the minimum for decent performance--and I'm only shooting for the minimum right now, so I can get it working)
2) Once I consider my model "trained," is there any reason for me to keep more than the 2 time-steps I need to calculate the next C and h values? (where C_t is the Cell state, and h_t is the final output of the LSTM net) in which case I would need multiple versions of the forward propagation function
3) If I have limited time series data on which to train, and I want to train my model, will the performance of my model converge as I train it on the training data over and over (as versus oscillate around some maximal average performance)? And will it converge if I implement dropout?
4) How many components of the gradient do I need to consider? When I calculate the gradient of the various matrices, I get a primary contribution at time step t, and secondary contributions from time step t-1 (and the calculation recurses all the way back to t=0)? (in other words: does the primary contribution dominate the gradient calculation--will the slope change due to the secondary components enough to warrant implementing the recursion as I back propagate time steps...)
As you have observed, it depends on the dependencies in the data. But LSTM can learn to learn longer term dependencies even though we back propagate only a few time steps if we do not reset the cell and hidden states.
No. Given c_t and h_t, you can determine c and h for the next time step. Since you don't need to back propagate, you can throw away c_t (and even h_t if you are just interested in the final LSTM output)
You might converge if you start over fitting. Using Dropout will definitely help avoiding that, especially along with early stopping.
There will be 2 components of gradient for h_t - one for current output and one from the next time step. Once you add the both, you won't have to worry about any other components

Solving ODEs in NetLogo, Eulers vs R-K vs R solver

In my model each agent solves a system of ODEs at each tick. I have employed Eulers method (similar to the systems dynamics modeler in NetLogo) to solve these first order ODEs. However, for a stable solution, I am forced to use a very small time step (dt), which means the simulation proceeds very slowly with this method. I´m curious if anyone has advice on a method to solve the ODEs more quickly? I am considering implementing Runge-Kutta (with a larger time step?) as was done here (http://academic.evergreen.edu/m/mcavityd/netlogo/Bouncing_Ball.html). I would also consider using the R extension and using an ODE solver in R. But again, the ODEs are solved by each agent, so I don´t know if this is an efficient method.
I´m hoping someone has a feel for the performance of these methods and could offer some advice. If not, I will try to share what I find out.
In general your idea is correct. For a method of order p to reach a global error level tol over an integration interval of length T you will need a step size in the magnitude range
h=pow(tol/T,1.0/p).
However, not only the discretization error accumulates over the N=T/h steps, but also the floating point error. This gives a lower bound for useful step sizes of magnitude h=pow(T*mu,1.0/(p+1)).
Example: For T=1, mu=1e-15 and tol=1e-6
the Euler method of order 1 would need a step size of about h=1e-6 and thus N=1e+6 steps and function evaluations. The range of step sizes where reasonable results can be expected is bounded below by h=3e-8.
the improved Euler or Heun method has order 2, which implies a step size 1e-3, N=1000 steps and 2N=2000 function evaluations, the lower bound for useful step sizes is 1e-3.
the classical Runge-Kutta method has order 4, which gives a required step size of about h=3e-2 with about N=30 steps and 4N=120 function evaluations. The lower bound is 1e-3.
So there is a significant gain to be had by using higher order methods. At the same time the range where step size reduction results in a lower global error also gets significantly narrower for increasing order. But at the same time the achievable accuracy increases. So one has to knowingly care when the point is reached to leave well enough alone.
The implementation of RK4 in the ball example, as in general for the numerical integration of ODE, is for an ODE system x'=f(t,x), where x is the, possibly very large, state vector
A second order ODE (system) is transformed to a first order system by making the velocities members of the state vector. x''=a(x,x') gets transformed to [x',v']=[v, a(x,v)]. The big vector of the agent system is then composed of the collection of the pairs [x,v] or, if desired, as the concatenation of the collection of all x components and the collection of all v components.
In an agent based system it is reasonable to store the components of the state vector belonging to the agent as internal variables of the agent. Then the vector operations are performed by iterating over the agent collection and computing the operation tailored to the internal variables.
Taking into consideration that in the LOGO language there are no explicit parameters for function calls, the evaluation of dotx = f(t,x) needs to first fix the correct values of t and x before calling the function evaluation of f
save t0=t, x0=x
evaluate k1 = f_of_t_x
set t=t0+h/2, x=x0+h/2*k1
evaluate k2=f_of_t_x
set x=x0+h/2*k2
evaluate k3=f_of_t_x
set t=t+h, x=x0+h*k3
evaluate k4=f_of_t_x
set x=x0+h/6*(k1+2*(k2+k3)+k4)