Policy network for the game 2048 - neural-network

I'm trying to implement a policy network agent for the game 2048 according to Karpathy's RL tutorial. I know the algorithm will need to play some batch of games, remember the inputs and actions taken, normalize and mean center the ending scores. However, I got stuck at the design of the loss function. How to correctly encourage actions that lead to the better final scores and discourage those that lead to worse scores?
When using softmax at the output layer, I devised something along this:
loss = sum((action - net_output) * reward)
where action is in one hot format. However, this loss doesn't seem to do much, the network doesn't learn. My full code (without the game environment) in PyTorch is here.

For the policy network in your code, I think you want something like this:
loss = -(log(action_probability) * reward)
Where action_probability is your network's output for the action performed in that timestep.
For example, if your network outputted a 10% chance of taking that action, but it provided a reward of 10, your loss would be -(log(0.1) * 10), which is equal to 10 with a base-10 logarithm (about 23.0 with the natural logarithm that PyTorch's log uses).
But if your network already thought that was a good move and outputted a 90% chance of taking that action, you would have -(log(0.9) * 10), which is roughly 0.46 in base 10 (about 1.05 with the natural logarithm), affecting the network less.
It's worth noting that taking the log of a softmax output isn't numerically stable, so you might be better off using log-softmax (LogSoftmax in PyTorch) in the final layer of your network.
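As a rough illustration of the loss above, here is a minimal pure-Python sketch (no PyTorch needed to run it). The function names `log_softmax` and `policy_loss` and the example logits are assumptions made for illustration, not part of the asker's code. It computes the log-probability directly from the raw outputs in a stable way, which is the same idea as putting log-softmax in the final layer:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def policy_loss(logits, action, reward):
    """Loss for one timestep: -(log pi(action) * reward)."""
    return -(log_softmax(logits)[action] * reward)

# Hypothetical raw outputs for the four 2048 moves (up, down, left, right).
logits = [2.0, 0.5, -1.0, 0.1]
loss_good = policy_loss(logits, action=0, reward=1.0)   # positive normalized score
loss_bad = policy_loss(logits, action=0, reward=-1.0)   # negative normalized score
```

Minimizing `loss_good` raises the probability of the action, while minimizing `loss_bad` lowers it. In PyTorch the equivalent would be to apply `torch.nn.functional.log_softmax` to the final layer's output and index the result with the action taken.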


Suggestions for optimising FPGA design

I need to figure out optimisations for this FPGA design. I've got a few ideas and I'd like to know if they sound reasonable for my design. I'd also like to ask if anyone has any other ideas to improve my design's efficiency.
The design I have to optimise is an ensemble of neurons, I've included two images below.
My current ideas
Add pipeline registers between each neuron and each adder
'Register the inputs and outputs' by inserting registers in-between each logic block
Convert the adder tree into an adder chain
Use time division multiplexers to share the LUTs between logic blocks
Do my current ideas for improving performance make any sense? I don't know very much about FPGAs at all, so I'm not sure if my optimisations will make much of an improvement or if they even make sense.
Any help would be greatly appreciated.
Links to PDFs of my neuron and ensemble (the image quality is higher):
https://francismcnamee.com/pdfs/neuron_ensemble.pdf
https://francismcnamee.com/pdfs/single_neuron.pdf
Ensemble of neurons (Each subsystem is a single neuron, the design of each neuron is shown below)
A single neuron
To start with "less area usage and/or faster speed": forget about the "and". You can optimise for area or for speed, but not for both.
"Use time division multiplexers to share the LUTs between logic blocks"
Multiplexers are also built from LUTs, so you lose area before you gain some. The TDM also needs a controller, and the interim results need to be stored and retrieved. All in all it is not trivial, and I would only do that if you are rather good at logic design. You may gain area, but you will lose speed.
"Convert the adder tree into an adder chain"
No, you don't touch the adder tree. The FPGA synthesis tool will select the optimal adder configuration for you. It will balance area against speed and come up with a much better result than you can yourself.
In fact this applies to every part of the design: let the synthesis tool do its work. You will not be able to outperform it.
"Add pipeline registers between each neuron and each adder"
"'Register the inputs and outputs' by inserting registers in-between each logic block"
Sorry again but: Nope! Working with registers is not that simple.
You need to balance the registers. Ideally the logic delay between each pipeline stage should be the same.
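Let's say your multiplier takes** 10 ns and an adder takes 3 ns. Then you should place one pipeline stage after the multiplier and another after a set of 3 adders (3 x 3 ns = 9 ns, close to the 10 ns of the multiplier). With two stages clocked at 10 ns, the delay will be ~20 ns. If you placed a pipeline stage after each adder, you would have four stages still clocked at 10 ns, and the total delay would be ~40 ns.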
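The arithmetic in this example can be checked with a small sketch. The helper `pipeline_timing` is a hypothetical name, the delays are the made-up numbers from the text, and register set-up time is ignored for simplicity:

```python
def pipeline_timing(stage_delays_ns):
    """Return (clock_period_ns, latency_ns) for a pipeline.

    The clock period is set by the slowest stage; latency is the number
    of stages times the clock period. Register set-up time is ignored.
    """
    period = max(stage_delays_ns)
    latency = period * len(stage_delays_ns)
    return period, latency

MULT_NS, ADD_NS = 10, 3  # made-up delays, as in the text

# One register after the multiplier, one after the three adders.
balanced = pipeline_timing([MULT_NS, 3 * ADD_NS])                 # (10, 20)
# One register after the multiplier and one after every adder.
unbalanced = pipeline_timing([MULT_NS, ADD_NS, ADD_NS, ADD_NS])   # (10, 40)
```

Both layouts are limited to the same 10 ns clock by the multiplier, so the extra registers in the second layout only add latency.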
Now you get to the core of speeding up a design: do you use 4 pipeline stages so you can run at 200 MHz, or 2 pipeline stages and run at 100 MHz? In both cases the latency is the same (20 ns); the difference is how often you can present new data.
Beware that each register stage also costs you time: you need to meet the set-up time of the register. As such, the fastest design is the one with no registers: the data falls through at the maximum speed. But then you may need to wait a long time before you can present the next set of data.
As you may gather, balancing registers is not easy and is rather an art. The best way would be to run the design without any registers through the synthesis tool. Then run a timing analysis on it and look at the worst-case timing path. From that, try to figure out where to put the register stages. But again, that is easier said than done. To me reading those timing analysis reports is easy, but to a novice they might all seem like abracadabra.
Sorry if I leave you hanging here, but unfortunately there is no "magic trick" in these cases. Ideally you could let an experienced designer play with your code for a few hours and see what (s)he can do.
**The numbers I use have been made up.

What to do with not enough training data?

My problem is that I don't have enough training data for my NN. It is trying to predict the result of a soccer game given the previous games, which I would say is a regression task.
The training data are results of soccer games of the last 15 seasons (which are about 4500 games). Getting to new data would be hard and would take a lot of time.
What should I do now?
Is it good to duplicate the data?
Should I input randomized data? (Maybe noise but I'm not quite sure what that is)
If there is no way of creating more data,
I should probably turn up the learning rate right? (I have it sitting at 0.01 and the momentum at 0.9)
I am using mini batches of 32 training samples. Since I don't have a lot of training data, I don't have a lot of mini batches. Should I stop using them?
To start from the beginning: This is a very theoretical question and is not directly related to programming, so in future I recommend posting such questions over at the Data Science Stack Exchange.
To go into your problem: 4500 samples is not as bad as it sounds, depending on the exact task at hand. Are you trying to predict the match results (i.e. which team is the winner?), are you looking for more specific predictions (across a lot of different, specific teams)?
If you can make sure that you have a reasonable amount of data per class, one can work with a number of samples lower than what you have. Simply duplicating the data will not help you much, since you are very likely to just overfit on the samples you are seeing, without much of an improvement; or rather, you will get the same results as training over a longer period (since essentially you see every sample twice per epoch, instead of once).
Again, what usually happens after long training periods is overfitting, so nothing gained here.
Your second suggestion is generally called data augmentation. Instead of simply copying samples, you alter them enough to make it look "different" to the network. But be careful! Data augmentation works well for some inputs, like images, since the change in input is significant enough to not represent the same sample, but still contains meaningful information about the class (a horizontally mirrored image of a cat still shows a "valid cat", unlike a vertically mirrored image, which is more unrealistic in the real world).
Essentially, it depends on your input features to determine where it makes sense to add noise. If you are only changing the results of the previous game, a minor change in input (adding/subtracting one goal at random) can significantly change the prediction you make.
If you slightly scramble Elo scores by a random number, on the other hand, the input value will not be too different, but "different enough" to use it as a novel example.
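A minimal sketch of that kind of augmentation, assuming each sample is a hypothetical `(home_elo, away_elo, label)` tuple; the function name `jitter_elo` and the noise level `sigma` are illustrative choices, not values from the question:

```python
import random

def jitter_elo(samples, sigma=10.0, copies=1, seed=None):
    """Return the original samples plus noisy copies.

    Each sample is (home_elo, away_elo, label). Gaussian noise with
    standard deviation `sigma` is added to the ratings; the label is kept.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for home, away, label in samples:
        for _ in range(copies):
            augmented.append((home + rng.gauss(0, sigma),
                              away + rng.gauss(0, sigma),
                              label))
    return augmented

# Hypothetical games; real features would come from your own data set.
games = [(1650.0, 1580.0, "home_win"), (1500.0, 1720.0, "away_win")]
augmented = jitter_elo(games, sigma=10.0, copies=2, seed=0)
```

The noise should stay small relative to typical rating differences, and it is usually applied to the training split only, so that validation still measures performance on unaltered games.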
Turning up the learning rate is not a good idea, since you are essentially just letting the network converge more towards the specific samples. On the contrary, I would argue that the current learning rate is still too high, and you should certainly not increase it.
Regarding mini batches, I think I have referenced this a million times now, but always consider smaller minibatches. From a theoretical point of view, you are more likely to converge to a local minimum.

Matlab Simulink: while loop with subtraction

I am hoping somebody here will be able to help me out with a small issue in a piece of Simulink/Matlab code. It is quite similar to the problem that I’ve discussed earlier, but a little bit more complicated, and now it is more a Simulink issue than a Matlab one.
So I have a turbine whose speed is controlled by the gate’s opening, hence the control voltage. By controlling the gate opening I am accelerating the turbine, and at some point in time I need to introduce a saturation effect (since I am testing the code now, it will be done by an external signal). This effect won’t change the control voltage, but it affects other components of the system, hence at the same control voltage the turbine’s speed will go up. But at the same time, I need to keep the speed at the same value as it was before the saturation effect (let’s say it was 320 rpm). To do so I need to decrease the control voltage and should keep doing it until I reach the speed as it was before. There is no need to do it instantly (this approach will later be introduced in hardware), but it would be nice to check the algorithm in these synthetic tests.
In terms of the model, I was planning to use a while loop with the speed requirement “if speed > 320” again, now just to simplify things. To decrease the control voltage I was planning to subtract from the original 50 (% opening) - 0.25 (u2) at first and after that increasing this value by 0.25 until I decrease the speed below 320. I can’t know the exact opening when this requirement will be satisfied, hence I need some kind of algorithm to “track” this voltage.
So it should be something like this:
u2 = 0;
while speed > 320
    u2 = u2 + 0.25;
end
u2 is initially zero since we have a predefined initial control voltage. And obviously, when we reach the motor’s speed below 320, I need to keep the latest value of the u2 (and control voltage).
Overall, it is a small code and should be done in Simulink (don’t want to introduce any other Fcn function into the model). I’ve never used while and if blocks in Simulink, but so far I came up with this system. It’s a simplified version of my model, but the control principle is the same.
We are getting the motor speed of 350, compared with 320 (the speed before “saturation”), and if our speed after saturation is higher, we need to reduce the control voltage. To trigger the while loop block I’ve decided to use a simple switch. The while block meanwhile is:
Definitely not the best implementation but I was trying a lot of different combinations and without any real success. I am always getting the same error:
I also tried using a step signal instead of the constant “7” to model acceleration of the motor, and got the same error at the moment the speed rose above the 320 threshold. So it looks like the approach is almost right, but mathematically it fails to find the most suitable solution. I’ve tried to implement a transport delay in the memory part of the while subsystem but kept getting errors during compilation.
Are there any obvious (and not so) mistakes? Or maybe from the beginning, I should have chosen another approach… I really hope that somebody will be able to help. Thank you in advance and have a great day.
I do not think that you have used the While block correctly.
This is what I have done: I used a "Matlab function" block instead of the "While" block, as follows.
The function in the Matlab function block is
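function u2 = fcn(speed, u2d)
% u2d is presumably the previous (delayed) value of u2, fed back into the block
if speed > 320
    u2 = u2d + 0.25;  % speed still too high: step the correction up
else
    u2 = u2d;         % otherwise hold the last value
end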
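To see why a per-step update with memory behaves like the intended loop, here is a small Python simulation of the same logic. The plant model `speed = gain * (opening - u2)` is an assumption chosen only so that an opening of 50 with a gain of 7 gives the 350 mentioned in the question; it is not the asker's model, and the feedback of `u2` through a one-step delay is likewise assumed:

```python
def fcn(speed, u2d):
    """Same logic as the MATLAB function: one update per time step."""
    return u2d + 0.25 if speed > 320 else u2d

def simulate(steps=40, gain=7.0, opening=50.0):
    """Toy plant (an assumption): speed = gain * (opening - u2)."""
    u2d = 0.0  # output of the delay element, initially zero
    history = []
    for _ in range(steps):
        speed = gain * (opening - u2d)
        u2 = fcn(speed, u2d)
        history.append((speed, u2))
        u2d = u2  # the delay feeds this value back on the next step
    return history

history = simulate()
final_speed, final_u2 = history[-1]
```

With these toy numbers the correction grows by 0.25 per step until the speed drops below 320 and then holds its last value, which is the "tracking" behaviour the question asks for.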
And the results I have got, Scope 1
Scope
Edit
As you prefer a function free model, the following may do the same.

What is the policy gradient when multiple actions are possible?

I am trying to program a reinforcement learning algorithm using policy gradients, as inspired by Karpathy's blog article. Karpathy's example has only two actions UP or DOWN, so a single output neuron is sufficient (high activation=UP, low activation=DOWN). I want to extend this to multiple actions, so I believe I need a softmax activation function on the output layer. However, I am not certain about what the gradient for the output layer should be.
If I was using a cross-entropy loss function with the softmax activation in a supervised learning context, the gradient for output neuron i would simply be:
g[i] = a[i] - target[i]
where target[i] = 1 for the desired action and 0 for all others.
To use this for reinforcement learning I would multiply g[i] by the discounted reward before back-propagating.
However, it seems that reinforcement learning uses negative log-likelihood as the loss instead of cross-entropy. How does that change the gradient?
Note: something that I think will get you on the right track:
The negative log likelihood is also known as the multiclass cross-entropy (Pattern Recognition and Machine Learning).
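Since the two losses coincide, the gradient with respect to the pre-softmax activations keeps the form from the question, scaled by the reward: reward * (a[i] - target[i]). The sketch below checks this numerically with finite differences; the function names and the example values are illustrative assumptions:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, action, reward):
    """Policy-gradient loss for one step: -reward * log softmax(z)[action]."""
    return -reward * math.log(softmax(z)[action])

def analytic_grad(z, action, reward):
    """Gradient w.r.t. the pre-softmax values: reward * (a[i] - target[i])."""
    a = softmax(z)
    return [reward * (a[i] - (1.0 if i == action else 0.0)) for i in range(len(z))]

def numeric_grad(z, action, reward, eps=1e-6):
    """Central finite differences, used only to check the formula."""
    grads = []
    for i in range(len(z)):
        hi = z[:i] + [z[i] + eps] + z[i + 1:]
        lo = z[:i] + [z[i] - eps] + z[i + 1:]
        grads.append((loss(hi, action, reward) - loss(lo, action, reward)) / (2 * eps))
    return grads

z = [0.3, -1.2, 2.0]  # hypothetical pre-softmax activations
g_formula = analytic_grad(z, action=1, reward=0.7)
g_check = numeric_grad(z, action=1, reward=0.7)
```

The two gradients agree up to numerical error, so multiplying g[i] by the (discounted) reward before back-propagating is consistent with minimizing the negative log-likelihood.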
EDIT: misread the question. I thought this was talking about Deep Deterministic Policy Gradients
It would depend on your domain, but with a softmax, you are getting a probability across all output nodes. To me that doesn't really make sense in most domains when you think about DDPG. For example, if you are controlling the extension of robotic arms and legs, it wouldn't make sense to have limb extension measured as [.25, .25, .25, .25], if you wanted to have all limbs extended. In this case, .25 could mean fully extended, but what happens if the vector of outputs is [.75,.25,0,0]? So in this way, you could have a separate sigmoid function from 0 to 1 for all action nodes, where then you could represent it as [1,1,1,1] for all arms being extended. I hope that makes sense.
Since the actor network is what determines the actions in DDPG, we could then represent our network like this for our robot (rough keras example):
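from keras.layers import Input, Dense
from keras.models import Model

# your_state_shape is a placeholder for the size of your state vector
state = Input(shape=[your_state_shape])
hidden_layer = Dense(30, activation='relu')(state)
all_limbs = Dense(4, activation='sigmoid')(hidden_layer)
model = Model(inputs=state, outputs=all_limbs)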
Then, your critic network will have to account for the action dimensions.
from keras.layers import Input, Dense, Add

state = Input(shape=[your_state_shape])
action = Input(shape=[4])
state_hidden = Dense(30, activation='relu')(state)
state_hidden_2 = Dense(30, activation='linear')(state_hidden)
action_hidden = Dense(30, activation='linear')(action)
# merge(..., mode='sum') is the old Keras 1 call; Add() is the Keras 2 equivalent
combined = Add()([state_hidden_2, action_hidden])
squasher = Dense(30, activation='relu')(combined)
output = Dense(4, activation='linear')(squasher)  # number of actions
Then you can use your target functions from there. Note, I don't know if this is working code, as I haven't tested it, but hopefully you get the idea.
Source: https://arxiv.org/pdf/1509.02971.pdf
Awesome blog on this with TORCS (not created by me): https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html
In the above blog, they also show how to use different output functions, such as one tanh and two sigmoid functions for actions.

Neural Network playing Tic Tac Toe doesn't learn

I have a neural network playing tic-tac-toe. (I know there are other better methods for this, but I want to learn about NN)
So the NN plays against a random AI. First, it should learn to make an allowed move, i.e. not choosing a field that is already occupied.
It doesn't get very far with this, however.
When the NN chooses an illegal move, I optimize the weights such that the distance to another, randomly chosen (legal) field is minimized. (There is one output which should have values between 1 and 9.)
My problem is: in changing the weights, a formerly optimized outcome is now also changed. So I have this kind of overfitting: every time I backpropagate to optimize the weights for one particular situation, the decision for every other situation becomes worse!
I know I should probably have 9 output neurons instead of 1 and should probably not use a random field as the target, as I assume this can mess things up. I am starting to change this.
Still, the issue seems to remain. Obviously. How can I improve the decision in one situation without forgetting every other situation?
One solution I came up with is to "remember" every game played and optimize simultaneously over all games played.
However, after a while this becomes very demanding computationally. Also, it seems to go in the direction of a complete enumeration of all possible board situations. This might be possible for Tic Tac Toe, but if I move to another game, say Go, this becomes infeasible.
Where is my mistake? How do I generally tackle this problem? Or where could I read about it? Thanks a lot!
To tackle this problem efficiently, you should consider Reinforcement Learning methods instead of what you are currently doing. What you are trying to do is to learn the behaviour of an agent playing Tic Tac Toe. The agent gets a high reward when it wins a game, a high penalty when it loses and an even higher penalty when it performs an illegal move. My guess is that using methods such as Q-learning with neural networks will work perfectly, even with very simple neural nets. One useful paper on the topic could be: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf, or earlier papers on TD-Gammon (I think you can easily find tutorials on the topic using the keywords TD-Gammon, Q-learning, ...).
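To make the reward scheme concrete, here is a sketch of a single Q-learning update. It uses a plain lookup table instead of a neural network to keep the example short, which is a simplification of what is suggested above; the state names, reward values and the helper `q_update` are all hypothetical:

```python
def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a)).
    `next_actions` is empty when the game has ended.
    """
    best_next = max((q.get((next_state, a), 0.0) for a in next_actions),
                    default=0.0)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]

q = {}
# Hypothetical rewards: +1 for a win, -1 for a loss, -2 for an illegal move.
q_update(q, state="board_A", action=4, reward=-2.0,
         next_state=None, next_actions=[])   # illegal move ends the episode
q_update(q, state="board_A", action=0, reward=1.0,
         next_state=None, next_actions=[])   # winning move
```

Each update only nudges the value of one state-action pair towards its target. With a neural network in place of the table, the same target is used as the regression label for that output.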
By the way, a more down-to-earth answer to why your model might not work is that you are seemingly using one single unit to represent categorical outputs: if you want to represent an integer between 1 and N, you should represent it using N output neurons with values between 0 and 1, and pick the neuron with the highest value as your answer. Using a single neuron with a value between 1 and 9 creates an unnatural asymmetry between your outputs: for example, when the expected value is 3, your network gets a higher error for outputting a 9 than a 2. This should obviously not be the case: all wrong answers are equally wrong.
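A short sketch of the nine-output representation, with cells indexed 0 to 8. Restricting the choice to empty cells in `pick_move` is an extra assumption on top of "pick the neuron with the highest value", and the board and output values are made up:

```python
def pick_move(outputs, board):
    """Pick the legal cell with the highest output.

    `outputs` holds 9 values (one per cell); `board` holds 9 entries,
    with None marking an empty cell.
    """
    legal = [i for i, cell in enumerate(board) if cell is None]
    return max(legal, key=lambda i: outputs[i])

def one_hot(index, n=9):
    return [1.0 if i == index else 0.0 for i in range(n)]

def squared_error(prediction, target):
    return sum((p - t) ** 2 for p, t in zip(prediction, target))

board = ["X", None, "O", None, None, None, None, None, None]
outputs = [0.9, 0.1, 0.3, 0.7, 0.2, 0.1, 0.1, 0.1, 0.1]
move = pick_move(outputs, board)  # cell 0 scores highest but is occupied

# With one-hot targets every wrong cell is equally wrong...
err_cell_1 = squared_error(one_hot(1), one_hot(2))
err_cell_8 = squared_error(one_hot(8), one_hot(2))
# ...whereas a single numeric output penalises 9 far more than 2 for target 3.
err_single_2 = (2 - 3) ** 2
err_single_9 = (9 - 3) ** 2
```

The one-hot errors are identical for any wrong cell, while the single-output errors depend on the arbitrary numbering of the cells.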
Hope this helps,
Best