Neural network architecture for q learning

Neural network architecture for q learning - neural-network

Question: What is the correct approach to getting the right architecture and hyperparameters for getting an appropriate neural network for a simple grid game? And how can it be scaled to make it work in a version of the game with a larger grid?
Context: Most tutorials and papers written about using neural networks in Q learning make use of convolutional neural networks to be able to handle the screen inputs from different games. But I am experimenting with a far simpler game with raw data:
Simple Matrix Game
in which the possible moves for the agent are: up, down, right, left.
The notebook with the complete code can be found here: http://151.80.61.13/ql.html
All of the tested neural networks didn't achieve better than doing random moves. The reward went up to an average of 8.5 (out of 30 points) after ~1000 episodes and then started decreasing. Mostly eventually just spamming the same action for every move.
I know that for a small game as this a Q table would achieve better, but this is for learning to implement deep Q learning and after it working in a small example I want to scale it to a larger grid.
Current neural network (Keras) and solutions I have tried:
model = Sequential()
model.add(Dense(grid_size**2,input_shape=(grid_size, grid_size)))
model.add(Activation('relu'))
model.add(Dense(48))
model.add(Flatten())
model.add(Activation('linear'))
model.add(Dense(4))
adam = Adam(lr=0.1)
model.compile(optimizer=adam, loss='mse')
return model
Different hidden layer sizes: [512,256,100,48,32,24]
Varying number of hidden layers: [1,2,3]
Different learning rates: [3, 1, 0.8, 0.5, 0.3, 0.1, 0.01]
Testing variety of activation functions: [linear, sigmoid, softmax, relu]
Number of episodes and degree of epsilon decay
Trying with and without target network
Tried different networks from tutorials which were written voor OpenAI gym CartPole, FrozenLake and Flappy Bird.

As in any machine learning task there is no perfect way to choose your hyperparams, but there are a few pieces of advice I can give you.
The number of neurons in each layer must be to small to fit your model and not to big to not overfit your model (also if the number of neurons is a power of two it can better parallel on your gpu. The only rule you should follow here: more complex game - more neurons
The same rules work for the number of layers in your net, but if you're training any type of recurrent net it is better to go deeper than having more neurons.
Your learning rate depends on your optimiser, but it is always better to have a smaller learning rate as the model converges better with a low learning rate (though it converges longer)
There also is no rule on choosing your activation functions, but you're training any kind of generative model you should use Leaky ReLU, Softplus or Elu

Related

Use a trained neural network to imitate its training data

I'm in the overtures of designing a prose imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can, with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not part. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and stupid way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and do gradient descent optimisation to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?

How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a
group of researchers at the University of Montreal lead by Ian
Goodfellow (now at OpenAI). The main idea behind a GAN is to have two
competing neural network models. One takes noise as input and
generates samples (and so is called the generator). The other model
(called the discriminator) receives samples from both the generator
and the training data, and has to be able to distinguish between the
two sources. These two networks play a continuous game, where the
generator is learning to produce more and more realistic samples, and
the discriminator is learning to get better and better at
distinguishing generated data from real data. These two networks are
trained simultaneously, and the hope is that the competition will
drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.

Reinforcement Learning training

I am new to reinforcement learning and doing a project about chess. I use neural network and temporal difference learning to train the engine learn the game.
The neural network has one input layer (of 385 features), two hidden layers and one output layer, whose range is [-1,1] where -1 means lose and 1 win (0 draw). I use TD-lambda to self-learn the chess, and the default case is to only consider next 10 moves. All the weights are initialized in the range [-1, 1].
I use forward propagation to estimate the value of state, but most of the values are very close to either 1 or -1, even the result is draw, which I think the engine doesn't learn well. I think some values are large and dominate the result, changing small weights does no help. I change the size of two hidden layers but it doesn't work (However, I have tried a toy example with small size and dimension, it can converge and the estimated value is very close to the target one after dozens of iteration). I don't know how to fix this, could someone give me some advice?
Thank you.
Some references are listed below
Giraffe: Using Deep Reinforcement Learning to Play Chess
KnightCap
Web Stanford

Q-learning using neural networks

I'm trying to implement the Deep q-learning algorithm for a pong game.
I've already implemented Q-learning using a table as Q-function. It works very well and learns how to beat the naive AI within 10 minutes. But I can't make it work
using neural networks as a Q-function approximator.
I want to know if I am on the right track, so here is a summary of what I am doing:
I'm storing the current state, action taken and reward as current Experience in the replay memory
I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units. for the input -> hidden layer I am using a sigmoid activation function. For hidden -> output layer I'm using a linear activation function
A state is represented by the position of both players and the ball, as well as the velocity of the ball. Positions are remapped, to a much smaller state space.
I am using an epsilon-greedy approach for exploring the state space where epsilon gradually goes down to 0.
When learning, a random batch of 32 subsequent experiences is selected. Then I
compute the target q-values for all the current state and action Q(s, a).
forall Experience e in batch
if e == endOfEpisode
target = e.getReward
else
target = e.getReward + discountFactor*qMaxPostState
end
Now I have a set of 32 target Q values, I am training the neural network with those values using batch gradient descent. I am just doing 1 training step. How many should I do?
I am programming in Java and using Encog for the multilayer perceptron implementation. The problem is that training is very slow and performance is very weak. I think I am missing something, but can't figure out what. I would expect at least a somewhat decent result as the table approach has no problems.

I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units.
Might be too big. Depends on your input / output dimensionality and the problem. Did you try fewer?
Sanity checks
Can the network possibly learn the necessary function?
Collect ground truth input/output. Fit the network in a supervised way. Does it give the desired output?
A common error is to have the last activation function something wrong. Most of the time, you will want a linear activation function (as you have). Then you want the network to be as small as possible, because RL is pretty unstable: You can have 99 runs where it doesn't work and 1 where it works.
Do I explore enough?
Check how much you explore. Maybe you need more exploration, especially in the beginning?
See also
My DQN agent
keras-rl

Try using ReLu (or better Leaky ReLu)-Units in the hidden layer and a Linear-Activision for the output.
Try changing the optimizer, sometimes SGD with propper learning-rate-decay helps.
Sometimes ADAM works fine.
Reduce the number of hidden units. It might be just too much.
Adjust the learning rate. The more units you have, the more impact does the learning rate have as the output is the weighted sum of all neurons before.
Try using the local position of the ball meaning: ballY - paddleY. This can help drastically as it reduces the data to: above or below the paddle distinguished by the sign. Remember: if you use the local position, you won't need the players paddle-position and the enemies paddle position must be local too.
Instead of the velocity, you can give it the previous state as an additional input.
The network can calculate the difference between those 2 steps.

How can I improve the performance of a feedforward network as a q-value function approximator?

I'm trying to navigate an agent in a n*n gridworld domain by using Q-Learning + a feedforward neural network as a q-function approximator. Basically the agent should find the best/shortest way to reach a certain terminal goal position (+10 reward). Every step the agent takes it gets -1 reward. In the gridworld there are also some positions the agent should avoid (-10 reward, terminal states,too).
So far I implemented a Q-learning algorithm, that saves all Q-values in a Q-table and the agent performs well.
In the next step, I want to replace the Q-table by a neural network, trained online after every step of the agent. I tried a feedforward NN with one hidden layer and four outputs, representing the Q-values for the possible actions in the gridworld (north,south,east, west).
As input I used a nxn zero-matrix, that has a "1" at the current positions of the agent.
To reach my goal I tried to solve the problem from the ground up:
Explore the gridworld with standard Q-Learning and use the Q-map as training data for the Network once Q-Learning is finished
--> worked fine
Use Q-Learning and provide the updates of the Q-map as trainingdata
for NN (batchSize = 1)
--> worked good
Replacy the Q-Map completely by the NN. (This is the point, when it gets interesting!)
-> FIRST MAP: 4 x 4
As described above, I have 16 "discrete" Inputs, 4 Output and it works fine with 8 neurons(relu) in the hidden layer (learning rate: 0.05). I used a greedy policy with an epsilon, that reduces from 1 to 0.1 within 60 episodes.
The test scenario is shown here. Performance is compared beetween standard qlearning with q-map and "neural" qlearning (in this case i used 8 neurons and differnt dropOut rates).
To sum it up: Neural Q-learning works good for small grids, also the performance is okay and reliable.
-> Bigger MAP: 10 x 10
Now I tried to use the neural network for bigger maps.
At first I tried this simple case.
In my case the neural net looks as following: 100 input; 4 Outputs; about 30 neurons(relu) in one hidden layer; again I used a decreasing exploring factor for greedy policy; over 200 episodes the learning rate decreases from 0.1 to 0.015 to increase stability.
At frist I had problems with convergence and interpolation between single positions caused by the discrete input vector.
To solve this I added some neighbour positions to the vector with values depending on thier distance to the current position. This improved the learning a lot and the policy got better. Performance with 24 neurons is seen in the picture above.
Summary: the simple case is solved by the network, but only with a lot of parameter tuning (number of neurons, exploration factor, learning rate) and special input transformation.
Now here are my questions/problems I still haven't solved:
(1) My network is able to solve really simple cases and examples in a 10 x 10 map, but it fails as the problem gets a bit more complex. In cases where failing is very likely, the network has no change to find a correct policy.
I'm open minded for any idea that could improve performace in this cases.
(2) Is there a smarter way to transform the input vector for the network? I'm sure that adding the neighboring positons to the input vector on the one hand improve the interpolation of the q-values over the map, but on the other hand makes it harder to train special/important postions to the network. I already tried standard cartesian two-dimensional input (x/y) on an early stage, but failed.
(3) Is there another network type than feedforward network with backpropagation, that generally produces better results with q-function approximation? Have you seen projects, where a FF-nn performs well with bigger maps?

It's known that Q-Learning + a feedforward neural network as a q-function approximator can fail even in simple problems [Boyan & Moore, 1995].
Rich Sutton has a question in the FAQ of his web site related with this.
A possible explanation is the phenomenok known as interference described in [Barreto & Anderson, 2008]:
Interference happens when the update of one state–action pair changes the Q-values of other pairs, possibly in the wrong direction.
Interference is naturally associated with generalization, and also happens in conventional supervised learning. Nevertheless, in the reinforcement learning paradigm its effects tend to be much more harmful. The reason for this is twofold. First, the combination of interference and bootstrapping can easily become unstable, since the updates are no longer strictly local. The convergence proofs for the algorithms derived from (4) and (5) are based on the fact that these operators are contraction mappings, that is, their successive application results in a sequence converging to a fixed point which is the solution for the Bellman equation [14,36]. When using approximators, however, this asymptotic convergence is lost, [...]
Another source of instability is a consequence of the fact that in on-line reinforcement learning the distribution of the incoming data depends on the current policy. Depending on the dynamics of the system, the agent can remain for some time in a region of the state space which is not representative of the entire domain. In this situation, the learning algorithm may allocate excessive resources of the function approximator to represent that region, possibly “forgetting” the previous stored information.
One way to alleviate the interference problem is to use a local function approximator. The more independent each basis function is from each other, the less severe this problem is (in the limit, one has one basis function for each state, which corresponds to the lookup-table case) [86]. A class of local functions that have been widely used for approximation is the radial basis functions (RBFs) [52].
So, in your kind of problem (n*n gridworld), an RBF neural network should produce better results.
References
Boyan, J. A. & Moore, A. W. (1995) Generalization in reinforcement learning: Safely approximating the value function. NIPS-7. San Mateo, CA: Morgan Kaufmann.
André da Motta Salles Barreto & Charles W. Anderson (2008) Restricted gradient-descent algorithm for value-function approximation in reinforcement learning, Artificial Intelligence 172 (2008) 454–482

Problems in reinforcement learning: bug, parameters tuning, and training period

I am currently training a reinforcement learning agent using a simple Neural Network with 100 hidden elements to solve 2048 game. I am using DQN's reinforcement learning algorithm (i.e. Q-learning with replay memory), but with 2 layers Neural Network instead of Deep Neural Network.
However, I left it trained on my laptop overnight (~7 hours, ~1000 games played, > 100000 steps) and the score does not seem to increase. I suspect there might be 3 sources of errors in my code: bug, parameters tuned badly, or maybe I just don't wait long enough.
Is there any method to figure out what is wrong with the code?
And what is the best practice to improve the training results?

I'll talk about all three of your hypothesis.
If you are using a standard DL framework like caffe or tensorflow, the chance of it being a bug is small.
Try decreasing the learning rate. Maybe you set it too high for the network to converge.
The training time of 100000 steps is not that long. For a simple pong game, you need to train around 500000 steps to get a good accuracy. So you can try training it for longer.
Also, 2048 is a fairly complicated game, so maybe you network is not deep enough to learn how to play it. Two layers is not much for such a complicated game. Try increasing the number of hidden layers. Perhaps you can use the network provided here

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse