What is the policy gradient when multiple actions are possible?

I am trying to program a reinforcement learning algorithm using policy gradients, inspired by Karpathy's blog article. Karpathy's example has only two actions, UP or DOWN, so a single output neuron is sufficient (high activation = UP, low activation = DOWN). I want to extend this to multiple actions, so I believe I need a softmax activation function on the output layer. However, I am not certain what the gradient for the output layer should be.
If I was using a cross-entropy loss function with the softmax activation in a supervised learning context, the gradient for output neuron i would simply be:
g[i] = a[i] - target[i]
where target[i] = 1 for the desired action and 0 for all others.
To use this for reinforcement learning I would multiply g[i] by the discounted reward before back-propagating.
However, it seems that reinforcement learning uses negative log-likelihood as the loss instead of cross-entropy. How does that change the gradient?

Note: something that I think will get you on the right track:
The negative log likelihood is also known as the multiclass cross-entropy (Pattern Recognition and Machine Learning).
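In other words, the gradient at the output layer doesn't change: for a softmax policy, the derivative of -log(p[action]) with respect to the pre-softmax inputs is exactly a[i] - target[i] with a one-hot target, and you scale it by the return. A minimal numpy sketch (the function and variable names are mine, just for illustration):

import numpy as np

def policy_gradient_output(logits, action, discounted_reward):
    # Gradient of -log p[action] * reward w.r.t. the pre-softmax logits.
    # For a softmax policy this is p[i] - 1{i == action}, i.e. the same
    # expression as the cross-entropy gradient with a one-hot target.
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    g = p.copy()
    g[action] -= 1.0                   # a[i] - target[i]
    return g * discounted_reward       # scale by the return, as in REINFORCE

# example: 4 actions, action 2 was taken and earned a discounted reward of 1.5
print(policy_gradient_output(np.array([0.1, 0.2, 0.5, 0.2]), 2, 1.5))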
EDIT: misread the question. I thought this was talking about Deep Deterministic Policy Gradients
It would depend on your domain, but with a softmax you get a probability distribution across all output nodes. To me that doesn't make sense in most domains once you think about DDPG. For example, if you are controlling the extension of robotic arms and legs, it wouldn't make sense to have limb extension measured as [.25, .25, .25, .25] if you wanted all limbs extended. In this case, .25 could mean fully extended, but what happens if the vector of outputs is [.75, .25, 0, 0]? So instead, you could have a separate sigmoid function from 0 to 1 for each action node, so that [1, 1, 1, 1] represents all arms being extended. I hope that makes sense.
Since the actor network is what determines the actions in DDPG, we could then represent our network like this for our robot (rough Keras example):
from keras.layers import Input, Dense, Add
from keras.models import Model

state = Input(shape=[your_state_shape])
hidden_layer = Dense(30, activation='relu')(state)
all_limbs = Dense(4, activation='sigmoid')(hidden_layer)
model = Model(inputs=state, outputs=all_limbs)
Then, your critic network will have to account for the action dimensions.
state = Input(shape=[your_state_shape])
action = Input(shape=[4])
state_hidden = Dense(30, activation='relu')(state)
state_hidden_2 = Dense(30, activation='linear')(state_hidden)
action_hidden = Dense(30, activation='linear')(action)
combined = Add()([state_hidden_2, action_hidden])  # sums the two branches (the old merge(..., mode='sum'))
squasher = Dense(30, activation='relu')(combined)
output = Dense(1, activation='linear')(squasher)   # a single Q-value for the state-action pair
critic = Model(inputs=[state, action], outputs=output)
Then you can use your target functions from there. Note, I don't know if this is working code, as I haven't tested it, but hopefully you get the idea.
Source: https://arxiv.org/pdf/1509.02971.pdf
Awesome blog on this with TORCS (not created by me): https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html
In the above blog, they also show how to use different output activations, such as one tanh and two sigmoid functions for the actions.
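For a rough idea of what such a mixed-activation output could look like, here is a hedged Keras sketch (the 29-dimensional state and the layer sizes are my guesses at the blog's TORCS setup, not verified):

from keras.layers import Input, Dense, Concatenate
from keras.models import Model

state = Input(shape=(29,))                       # TORCS-style state vector (assumed size)
hidden = Dense(300, activation='relu')(state)
steering = Dense(1, activation='tanh')(hidden)   # one tanh output in [-1, 1]
accel = Dense(1, activation='sigmoid')(hidden)   # two sigmoid outputs in [0, 1]
brake = Dense(1, activation='sigmoid')(hidden)
actions = Concatenate()([steering, accel, brake])
actor = Model(inputs=state, outputs=actions)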

Related

Ways to Improve Universal Differential Equation Training with sciml_train

About a month ago I asked a question about strategies for better convergence when training a neural differential equation. I've since gotten that example to work using the advice I was given, but when I applied the same advice to a more difficult model, I got stuck again. All of my code is in Julia, primarily making use of the DiffEqFlux library. In an effort to keep this post as brief as possible, I won't share all of my code for everything I've tried, but if anyone wants access to it to troubleshoot, I can provide it.
What I'm Trying to Do
The data I'm trying to learn comes from an SIRx model:
function SIRx!(du, u, p, t)
    β, μ, γ, a, b = Float32.([280, 1/50, 365/22, 100, 0.05])
    S, I, x = u
    du[1] = μ*(1-x) - β*S*I - μ*S
    du[2] = β*S*I - (μ+γ)*I
    du[3] = a*I - b*x
    nothing
end;
The initial condition I used was u0 = Float32.([0.062047128, 1.3126149f-7, 0.9486445]);. I generated data from t=0 to 25, sampled every 0.02 (in training, I only use every 20 points or so for speed, and using more doesn't improve results). (figure: training data)
The UDE I'm training is
function SIRx_ude!(du, u, p, t)
    μ, γ = Float32.([1/50, 365/22])
    S, I, x = u
    du[1] = μ*(1-x) - μ*S + ann_dS(u, @view p[1:lenS])[1]
    du[2] = -(μ+γ)*I + ann_dI(u, @view p[lenS+1:lenS+lenI])[1]
    du[3] = ann_dx(u, @view p[lenS+lenI+1:end])[1]
    nothing
end;
Each of the neural networks (ann_dS, ann_dI, ann_dx) is defined using FastChain(FastDense(3, 20, tanh), FastDense(20, 1)). I tried using a single neural network with 3 inputs and 3 outputs, but it was slower and didn't perform any better. I also tried normalizing the inputs to the network first, but it doesn't make a significant difference beyond slowing things down.
What I've Tried
Single shooting
The network just fits a line through the middle of the data. This happens even when I weight the earlier datapoints more heavily in the loss function. (figure: single-shot training result)
Multiple Shooting
The best result I had was with multiple shooting. As seen in the result (figure: multiple shooting result), it's not simply fitting a straight line, but it's not exactly fitting the data either. I've tried continuity terms ranging from 0.1 to 100 and group sizes from 3 to 30, and neither makes a significant difference.
Various Other Strategies
I've also tried iteratively growing the fit, 2-stage training with a collocation, and mini-batching as outlined here: https://diffeqflux.sciml.ai/dev/examples/local_minima, https://diffeqflux.sciml.ai/dev/examples/collocation/, https://diffeqflux.sciml.ai/dev/examples/minibatch/. Iteratively growing the fit works well for the first couple of iterations, but as the length increases it goes back to fitting a straight line again. 2-stage collocation training works really well for stage 1, but it doesn't actually improve performance in the second stage (I've tried both single and multiple shooting for the second stage). Finally, mini-batching worked about as well as single shooting (which is to say not very well), but much more quickly.
My Question
In summary, I have no idea what to try. There are so many strategies, each with so many parameters that can be tweaked. I need a way to diagnose the problem more precisely so I can better decide how to proceed. If anyone has experience with this sort of problem, I'd appreciate any advice or guidance I can get.
This isn't a great SO question because it's more exploratory. Did you lower your ODE tolerances? That would improve your gradient calculation which could help. What activation function are you using? I would use something like softplus instead of tanh so that you don't have the saturating behavior. Did you scale the eigenvalues and take into account the issues explored in the stiff neural ODE paper? Larger neural networks? Different learning rates? ADAM? Etc.
This is much better suited for a forum for discussion like the JuliaLang Discourse. We can continue there since walking through this will not be fruitful without some back and forth.

Policy network for the game 2048

I'm trying to implement a policy network agent for the game 2048 according to Karpathy's RL tutorial. I know the algorithm will need to play some batch of games, remember the inputs and actions taken, and normalize and mean-center the ending scores. However, I got stuck at the design of the loss function. How do I correctly encourage actions that lead to better final scores and discourage those that lead to worse ones?
When using softmax at the output layer, I devised something along these lines:
loss = sum((action - net_output) * reward)
where action is in one-hot format. However, this loss doesn't seem to do much; the network doesn't learn. My full code (without the game environment) in PyTorch is here.
For the policy network in your code, I think you want something like this:
loss = -(log(action_probability) * reward)
Where action_probability is your network's output for the action performed in that timestep.
For example, if your network gave a 10% chance of taking an action that provided a reward of 10, your loss would be -(log(0.1) * 10), which is roughly 23.
But if your network already thought that was a good move and gave a 90% chance of taking that action, you would have -(log(0.9) * 10), which is roughly 1.05, affecting the network much less.
It's worth noting that taking the log of the softmax output directly isn't numerically stable, and you might be better off using log_softmax in the final layer of your network.
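As a hedged sketch of how that loss could look in PyTorch (the tensor names and shapes are mine, not taken from the linked code):

import torch
import torch.nn.functional as F

def reinforce_loss(logits, actions, rewards):
    # logits: (T, num_actions) raw network outputs over T timesteps
    # actions: (T,) indices of the actions actually taken
    # rewards: (T,) normalized, mean-centered discounted returns
    log_probs = F.log_softmax(logits, dim=1)   # numerically stable log-probabilities
    chosen = log_probs[torch.arange(len(actions)), actions]
    return -(chosen * rewards).sum()           # -log pi(a|s) * reward, summed over time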

What does the parameter retain_graph mean in the Variable's backward() method?

I'm going through the neural transfer PyTorch tutorial and am confused about the use of retain_variable (deprecated, now referred to as retain_graph). The code example shows:
class ContentLoss(nn.Module):

    def __init__(self, target, weight):
        super(ContentLoss, self).__init__()
        self.target = target.detach() * weight
        self.weight = weight
        self.criterion = nn.MSELoss()

    def forward(self, input):
        self.loss = self.criterion(input * self.weight, self.target)
        self.output = input
        return self.output

    def backward(self, retain_variables=True):
        # Why is retain_variables True??
        self.loss.backward(retain_variables=retain_variables)
        return self.loss
From the documentation
retain_graph (bool, optional) – If False, the graph used to compute
the grad will be freed. Note that in nearly all cases setting this
option to True is not needed and often can be worked around in a much
more efficient way. Defaults to the value of create_graph.
So by setting retain_graph= True, we're not freeing the memory allocated for the graph on the backward pass. What is the advantage of keeping this memory around, why do we need it?
@cleros is pretty on point about the use of retain_graph=True. In essence, it retains any information necessary to calculate a given variable, so that we can do a backward pass on it.
An illustrative example
Suppose that we have the computation graph built by the code below. The variables d and e are the outputs, and a is the input. For example,
import torch
from torch.autograd import Variable
a = Variable(torch.rand(1, 4), requires_grad=True)
b = a**2
c = b*2
d = c.mean()
e = c.sum()
when we do d.backward(), that is fine. After this computation, the parts of the graph that compute d will be freed by default to save memory. So if we then do e.backward(), an error message will pop up. In order to do e.backward(), we have to set the parameter retain_graph to True in d.backward(), i.e.,
d.backward(retain_graph=True)
As long as you use retain_graph=True in your backward method, you can do backward any time you want:
d.backward(retain_graph=True) # fine
e.backward(retain_graph=True) # fine
d.backward() # also fine
e.backward() # error will occur!
More useful discussion can be found here.
A real use case
Right now, a real use case is multi-task learning, where you have multiple losses at possibly different layers. Suppose that you have two losses, loss1 and loss2, residing in different layers. In order to backprop the gradients of loss1 and loss2 w.r.t. the learnable weights of your network independently, you have to use retain_graph=True in the backward() call of the first back-propagated loss.
# suppose you first back-propagate loss1, then loss2 (you can also do the reverse)
loss1.backward(retain_graph=True)
loss2.backward() # now the graph is freed, and next process of batch gradient descent is ready
optimizer.step() # update the network parameters
This is a very useful feature when you have more than one output of a network. Here's a completely made up example: imagine you want to build some random convolutional network that you can ask two questions of: Does the input image contain a cat, and does the image contain a car?
One way of doing this is to have a network that shares the convolutional layers, but that has two parallel classification layers following (forgive my terrible ASCII graph, but this is supposed to be three convlayers, followed by three fully connected layers, one for cats and one for cars):
                    |-- FC - FC - FC - cat?
Conv - Conv - Conv -|
                    |-- FC - FC - FC - car?
Given a picture that we want to run both branches on, when training the network, we can do so in several ways. First (which would probably be the best thing here, illustrating how bad the example is), we simply compute a loss on both assessments and sum the loss, and then backpropagate.
However, there's another scenario, in which we want to do this sequentially: first we want to backprop through one branch, and then through the other (I have had this use-case before, so it is not completely made up). In that case, running .backward() on one branch will also free the graph of the shared convolutional layers (since these are the only parts shared with the other branch), and the second branch's computations will not have a graph anymore. That means that when we try to backprop through the second branch, PyTorch will throw an error since it cannot find a graph connecting the input to the output!
In these cases, we can solve the problem by simply retaining the graph on the first backward pass. The graph will then not be consumed; it is only freed by the first backward pass that does not retain it.
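Here is a minimal sketch of that sequential pattern (the network and sizes are made up for illustration):

import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten())  # shared conv layers
cat_head = nn.Linear(8 * 30 * 30, 1)
car_head = nn.Linear(8 * 30 * 30, 1)

x = torch.randn(1, 3, 32, 32)
features = trunk(x)                   # one shared graph for both branches
loss_cat = cat_head(features).sum()
loss_car = car_head(features).sum()

loss_cat.backward(retain_graph=True)  # keep the shared graph alive for the second pass
loss_car.backward()                   # last pass frees the graph; trunk grads accumulate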
EDIT: If you retain the graph on every backward pass, the implicit graph definitions attached to the output variables will never be freed. There might be a use case here as well, but I cannot think of one. So in general, you should make sure that the last backward pass frees the memory by not retaining the graph information.
As for what happens on multiple backward passes: as you guessed, PyTorch accumulates gradients by adding them in-place (to a parameter's .grad attribute).
This can be very useful, since it means that looping over a batch, processing it one piece at a time and accumulating the gradients at the end, will do the same optimization step as a full batched update (which only sums up all the gradients as well). While a fully batched update can be parallelized more, and is thus generally preferable, there are cases where batched computation is either very, very difficult to implement or simply not possible. Using this accumulation, however, we can still rely on some of the nice stabilizing properties that batching brings (if not on the performance gain).
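A minimal sketch of that accumulation pattern (the model and data are placeholders; reduction='sum' makes the accumulated gradients match a single full-batch pass):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # placeholder model
criterion = nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
micro_batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(8)]

optimizer.zero_grad()
for x, y in micro_batches:                    # process the batch one piece at a time
    loss = criterion(model(x), y)
    loss.backward()                           # gradients are summed into .grad in-place
optimizer.step()                              # one update, same as a full batched step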

Character Recognition Using Back Propagation Algorithm Testing

Recently I've been working on character recognition using the back propagation algorithm. I've taken the images and reduced them to 5x7 size, giving me 35 pixels, and trained a network on those pixels with 35 input neurons, 35 hidden nodes, and 10 output nodes. The training completed successfully and I have the weights I need. Now I'm stuck. I have my test set and I know I should feed it forward through the network, but I don't know what to do exactly. My test set will be 4 samples of 1x35, and my output layer has 10 neurons. How do I distinguish the characters from the output I get? I want to know how this testing works. Please guide me through this stage. Thanks in advance.
One vs All
A common approach for testing these types of neural networks is the "one-vs-all" approach. We view each of the output nodes as its own classifier giving the probability of the sample being that class vs. not being that class.
For instance, if your network outputs [1, 0, ..., 0], then class 1 has a high probability of being class 1 vs. not being class 1, class 2 has a low probability of being class 2 vs. not being class 2, and so on.
Ties
In the case of a tie, it is common (in research) to have a random function break the tie. If you get [1, 1, 1, ..., 1], then the function would pick a number from 1-10 and that is your prediction. In practice, an expert system is sometimes used to break ties instead; perhaps class 1 is more expensive than class 2, so we break ties in favor of class 2.
Steps
So the steps are:
Split dataset into test/train set
Train weights on train set
Pass test set forward through the neural network
For each sample, choose the argmax (the output with highest value) as your prediction
In case of a tie, choose randomly between all tying classes (see the sketch below)
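A minimal sketch of steps 4 and 5 (numpy, with made-up test outputs):

import numpy as np

def predict(outputs):
    # argmax with random tie-breaking over all tying classes
    outputs = np.asarray(outputs)
    best = np.flatnonzero(outputs == outputs.max())
    return np.random.choice(best)

test_outputs = np.random.rand(4, 10)          # 4 test samples x 10 output neurons
predictions = [predict(row) for row in test_outputs]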
Aside
In your particular case, I imagine implementing this strategy will result in a network that barely beats random performance (10% accuracy).
I would suggest some reconsidering of the network architecture.
If you look at your 5x7 images, can you tell what number the image was originally? It seems likely that scaling the image down to this size loses so much information that the network cannot distinguish between classes.
Debugging
From what you've described, I would look at the following when debugging your network.
Is your data preprocessing (down-scaling) leaching out too much information? Check this by manually inspecting a few of the images and seeing if you can tell what the image should be.
Does your one-hot algorithm work? When you convert your targets for training, does it successfully convert 1 -> [1, 0, 0, ..., 0]?
Is your back-prop / gradient descent algorithm correct? You should see a (roughly) monotonic decrease in your loss function during training. Try printing the loss you are optimizing at every step (or every few steps). Or, for a very simple gut check, print the mean squared error: (P-Y)^2.
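For the one-hot and loss checks above, a small sanity-check sketch (assuming 0-indexed labels; if yours start at 1, subtract 1 first):

import numpy as np

def one_hot(label, num_classes=10):
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

assert (one_hot(0) == np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])).all()

def mse(P, Y):
    # gut-check loss: mean squared error between predictions P and targets Y
    return np.mean((np.asarray(P) - np.asarray(Y)) ** 2)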

Neural networks, how do they look in coding?

Basically I know the concept of a neural network and what it is, but I can't figure out what it looks like when you code it, or how you store the data. I went through many tutorials that I found on Google, but couldn't find any piece of code, just concepts and algorithms.
Can anyone give me a piece of code of a simple neural network, something like "Hello World!"?
What you mainly need is an object representing a single neuron, with its associations to other neurons (representing synapses) and their weights.
A single neuron in a typical OOP language will be something like
class Synapse
{
Neuron sending;
Neuron receiving;
float weight;
}
class Neuron
{
ArrayList<Synapse> toSynapses;
ArrayList<Synapse> fromSynapses;
Function threshold;
}
where threshold represents the function that is applied to the weighted sum of the inputs to decide whether the neuron activates and propagates the signal.
Of course then you will need the specific algorithm to feed-forward the net or back-propagate the learning that will operate on this data structure.
The simplest thing you could start implementing would be a simple perceptron; you can find some info here.
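As a concrete starting point, here is a minimal perceptron sketch in Python that learns AND (the learning rate and epoch count are arbitrary choices of mine):

weights = [0.0, 0.0]
bias = 0.0
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

def predict(x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0                  # simple threshold activation

for _ in range(20):                           # perceptron learning rule
    for x, target in data:
        error = target - predict(x)
        weights = [w + 0.1 * error * xi for w, xi in zip(weights, x)]
        bias += 0.1 * error

print([predict(x) for x, _ in data])          # expected: [0, 0, 0, 1]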
You said you're already familiar with neural networks, but since there are many different types of differing complexity (convolutional networks, Hebbian networks, Kohonen maps, etc.), I'll go over a simple feed-forward neural network again, just to make sure we're on the same page.
A basic neural network consists of the following things
Neurons
Input Neuron(s)
Hidden Neurons (optional)
Output Neuron(s)
Links between Neurons (sometimes called synapses in analogy to biology)
An activation function
The Neurons have an activation value. When you evaluate a network, the input nodes' activation is set to the actual input. The links from the input nodes lead to nodes closer to the output, usually to one or more layers of hidden nodes. At each neuron, the input activation is processed using an activation function. Different activation functions can be used, and sometimes they even vary within the neurons of a single network.
The activation function processes the activation of the neuron into its output. Early experiments usually used a simple threshold function (i.e. activation > 0.5 ? 1 : 0); nowadays a sigmoid function is often used.
The output of the activation function is then propagated over the links to the next nodes. Each link has an associated weight it applies to its input.
Finally, the output of the network is extracted from the activation of the output neuron(s).
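To make that evaluation concrete, here is a tiny sketch of a forward pass (the weights and layer sizes are made up, and per-neuron biases are omitted):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, activations):
    # layers is a list of weight matrices: layers[k][j][i] is the weight of
    # the link from neuron i in layer k to neuron j in layer k+1
    for weights in layers:
        activations = [sigmoid(sum(w * a for w, a in zip(row, activations)))
                       for row in weights]
    return activations

# 2 inputs -> 2 hidden -> 1 output, with made-up weights
print(forward([[[1.0, -1.0], [0.5, 0.5]], [[1.0, 1.0]]], [1.0, 0.0]))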
I've put together a very simple (and very verbose...) example here. It's written in Ruby and computes AND, which is about as simple as it gets.
A much trickier question is how to actually create a network that does something useful. The trivial network of the example was created manually, but that is infeasible with more complex problems. There are two approaches I am aware of, with the most common being backpropagation. Less used is neuroevolution, where the weights of the links are determined using a genetic algorithm.
AI-Junkie has a great tutorial on (A)NNs and they have the code posted there.
Here is a neuron (from ai-junkie):
struct SNeuron
{
//the number of inputs into the neuron
int m_NumInputs;
//the weights for each input
vector<double> m_vecWeight;
//ctor
SNeuron(int NumInputs);
};
Here is a neuron layer (ai-junkie):
struct SNeuronLayer
{
//the number of neurons in this layer
int m_NumNeurons;
//the layer of neurons
vector<SNeuron> m_vecNeurons;
SNeuronLayer(int NumNeurons, int NumInputsPerNeuron);
};
Like I mentioned before... you can find all of the code with the ai-junkie (A)NN tutorial.
This is the NuPIC programmer's guide. NuPIC is a framework implementing Numenta's theory, HTM (Hierarchical Temporal Memory), which is based on the structure and operation of the neocortex.
This is how they define HTM:
HTM technology has the potential to solve many difficult problems in machine learning, inference, and prediction. Some of the application areas we are exploring with our customers include recognizing objects in images, recognizing behaviors in videos, identifying the gender of a speaker, predicting traffic patterns, doing optical character recognition on messy text, evaluating medical images, and predicting click through patterns on the web.
This is a simple net with nument 1.5:
from nupic.network import *
from nupic.network.simpledatainterface import WideDataInterface

def TheNet():
    net = SimpleHTM(
        levelParams=[
            { # Level 0
            },
            { # Level 1
                'levelSize': 8, 'bottomUpOut': 8,
                'spatialPoolerAlgorithm': 'gaussian',
                'sigma': 0.4, 'maxDistance': 0.05,
                'symmetricTime': True, 'transitionMemory': 1,
                'topNeighbors': 2, 'maxGroupSize': 1024,
                'temporalPoolerAlgorithm': 'sumProp'
            },
            { # Level 2
                'levelSize': 4, 'bottomUpOut': 4,
                'spatialPoolerAlgorithm': 'product',
                'symmetricTime': True, 'transitionMemory': 1,
                'topNeighbors': 2, 'maxGroupSize': 1024,
                'temporalPoolerAlgorithm': 'sumProp'
            },
            { # Level 3
                'levelSize': 1,
                'spatialPoolerAlgorithm': 'product',
                'mapperAlgorithm': 'sumProp'
            },
        ],
    )

    Data = WideDataInterface('Datos/__Categorias__.txt',
                             'Datos/Datos_Entrenamiento%d.txt', numDataFiles=8)
    net.createNetwork(Data)
    net.train(Data)  # was net.train(Datos), an undefined name

if __name__ == '__main__':
    print "Creating HTM Net..."
    TheNet()