State representation for grid world - neural-network

I'm new to reinforcement learning and Q-learning and I'm trying to understand the concepts and implement them. Most of the material I have found uses CNN layers to process image input. I would rather start with something simpler than that, so I use a grid world.
This is what I have implemented so far. I built an environment following the MDP formulation, with a 5x5 grid, a fixed agent position (A) and a fixed target position (T). The start state could look like this:
-----
---T-
-----
-----
A----
Currently I represent my state as a 1-dimensional vector of length 25 (5x5), with a 1 at the agent's position and 0 everywhere else. For example, the state above would be represented as the vector
[1, 0, 0, ..., 0]
I have successfully implemented solutions with a Q-table and a simple NN with no hidden layer.
Now I want to go a little further and make the task more complicated by placing the target at a random position each episode. Because there is now no correlation between my current state representation and the actions, my agent acts randomly. To solve this, I first need to adjust my state representation to contain some information like the distance to the target, the direction, or both. The problem is that I don't know how to represent my state now. I have come up with some ideas (a quick sketch of the first one follows the list):
[x, y, distance_T]
[distance_T]
two 5x5 vectors, one for Agent's position, one for Target's position
[1, 0, 0, ..., 0], [0, 0, ..., 1, 0, ..., 0]
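For concreteness, here is a quick, untested sketch of the first idea (my own; it assumes NumPy and (row, col) coordinates, and uses the offset to the target rather than my absolute position, since the relative position is what actually correlates with the best action):

import numpy as np

def feature_state(agent_pos, target_pos):
    # Offset from agent to target plus the Euclidean distance.
    a = np.asarray(agent_pos, dtype=np.float32)
    t = np.asarray(target_pos, dtype=np.float32)
    delta = t - a
    return np.array([delta[0], delta[1], np.linalg.norm(delta)], dtype=np.float32)

# e.g. agent at (4, 0), target at (1, 3) -> [-3., 3., 4.2426]
print(feature_state((4, 0), (1, 3)))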
I know that even if I figure out the state representation, my current model will not be able to solve the problem and I will need to move toward hidden layers, experience replay, a frozen target network and so on, but for now I only want to verify the model's failure.
In conclusion, I want to ask how to represent such a state as an input for a neural network. If there are any sources of information, articles, papers etc. which I have missed, feel free to post them.
Thank you in advance.

In Reinforcement Learning there is no single right state representation, but there are wrong ones. At the very least, Q-learning and other RL techniques make a certain assumption about the state representation.
It is assumed that the states are states of a Markov Decision Process (MDP). In an MDP, everything you need to know to 'predict' the next state (even in a probabilistic way) is available in the current state. That is to say, the agent must not need memory of past states to make a decision.
It is very rarely the case in real life that you have a Markov decision process. But many times you have something close, which has been empirically shown to be enough for RL algorithms.
As a "state designer" you want to create a state that makes your task as close as possible to an MDP. In your specific case, if you have the distance as your state there is very little information to predict the next state, that is the next distance. Some thing like the current distance, the previous distance and the previous action is a better state, as it gives you a sense of direction. You could also make your state be the distance and the direction to which the target is at.
Your last suggestion of two matrices is the one I like most, because it describes the whole state of the task without giving away its actual goal. It also maps well to convolutional networks.
The distance approach will probably converge faster, but I consider it a bit like cheating, because you practically tell the agent what it needs to look for. In more complicated cases this will rarely be possible.
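A minimal sketch of that two-grid representation (assuming NumPy and (row, col) coordinates; the channel ordering is an arbitrary choice):

import numpy as np

def grid_state(agent_pos, target_pos, size=5):
    # Channel 0 marks the agent, channel 1 marks the target.
    state = np.zeros((2, size, size), dtype=np.float32)
    state[0][agent_pos] = 1.0
    state[1][target_pos] = 1.0
    return state                      # shape (2, 5, 5) for a conv net

s = grid_state((4, 0), (1, 3))        # agent bottom-left, target in row 1
flat = s.reshape(-1)                  # length-50 vector for a plain MLP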

Your last suggestion is the most general way to represent states as an input for function approximators, especially for neural networks. With that representation you can also add more dimensions that stand for non-accessible blocks and even other agents. So you generalize the representation and can apply it to other RL domains. You will also have the chance to try convolutional NNs for bigger grids.

Related

Reinforcement learning. Driving around objects with PPO

I am working on driving industrial robots with neural nets and so far it is working well. I am using the PPO algorithm from OpenAI Baselines, and so far I can drive easily from point to point using the following reward strategy:
I calculate the normalized distance between the target and the current position, then compute the distance reward with:
rd = 1-(d/dmax)^a
For each time step, I give the agent a penalty calculated by:
yt = 1-(t/tmax)*b
a and b are hyperparameters to tune.
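For reference, the two terms in code (a rough sketch; how they are combined into the final reward, and the example values of a and b, are simplified placeholders, not exactly what I use):

import numpy as np

def distance_reward(d, d_max, a):
    return 1.0 - (d / d_max) ** a          # rd = 1 - (d/dmax)^a

def time_penalty(t, t_max, b):
    return 1.0 - (t / t_max) * b           # yt = 1 - (t/tmax)*b

def step_reward(position, target, d_max, t, t_max, a=2.0, b=0.5):
    d = np.linalg.norm(np.asarray(target) - np.asarray(position))
    return distance_reward(d, d_max, a) + time_penalty(t, t_max, b)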
As I said, this works really well if I want to drive from point to point. But what if I want to drive around something? For my work I need to avoid collisions, so the agent needs to drive around objects. If the object is not directly in the way of the shortest path this works OK: the robot can adapt and drive around it. But it gets increasingly difficult, up to impossible, to drive around objects which are directly in the way.
See this image :
I already read a paper which combines PPO with NES to create some Gaussian noise for the parameters of the neural network but I can't implement it by myself.
Does anyone have some experience with adding more exploration to the PPO algorithm? Or does anyone have general ideas on how I can improve my reward strategy?
What you describe is actually one of the most important research areas of Deep RL: the exploration problem.
The PPO algorithm (like many other "standard" RL algorithms) tries to maximise a return, which is a (usually discounted) sum of the rewards provided by your environment:
R = sum_t { gamma^t * r_t }
In your case you have a deceptive gradient problem: the gradient of your return points directly at your objective point (because your reward is based on the distance to your objective), which discourages your agent from exploring other areas.
Here is an illustration of the deceptive gradient problem from this paper. The reward is computed like yours, and as you can see the gradient of the return function points directly at the objective (the little square in this example). If your agent starts in the bottom-right part of the maze, it is very likely to get stuck in a local optimum.
There are many ways to deal with the exploration problem in RL. In PPO, for example, you can add some noise to your actions; other approaches like SAC try to maximize both the reward and the entropy of the policy over the action space. In the end, however, you have no guarantee that adding exploration noise in your action space will result in efficient exploration of your state space (which is what you actually want to explore, i.e. the (x, y) positions of your environment).
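As a rough illustration of the simplest of these options, action-space noise, here is a minimal NumPy sketch (the noise scale and clipping range are assumptions to adapt to your own action space, and as said above this gives no guarantee of good state-space coverage):

import numpy as np

def noisy_action(action, sigma=0.1, low=-1.0, high=1.0):
    # Perturb a continuous action with zero-mean Gaussian noise,
    # then clip it back into the valid action range.
    action = np.asarray(action, dtype=np.float64)
    noise = np.random.normal(0.0, sigma, size=action.shape)
    return np.clip(action + noise, low, high)

# a = noisy_action(policy_action)   # policy_action = your policy's output (placeholder)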
I recommend reading the Quality Diversity (QD) literature, which is a very promising field aiming to solve the exploration problem in RL.
Here are two great resources:
A website gathering all information about QD
A talk from ICML 2019
Finally, I want to add that the problem is not your reward function: you should not try to engineer a complex reward function so that your agent behaves exactly how you want. The goal is to have an agent that is able to solve your environment despite pitfalls like the deceptive gradient problem.

Error function and ReLu in a CNN

I'm trying to get a better understanding of neural networks by programming a Convolutional Neural Network myself.
So far I'm keeping it pretty simple by not using max-pooling and using plain ReLU activation. I'm aware of the disadvantages of this setup, but the point is not to make the best image detector in the world.
Now I'm stuck understanding the details of the error calculation, propagating it back, and how it interacts with the activation function used when calculating the new weights.
I read this document (A Beginner's Guide To Understand CNN), but it doesn't help me understand much. The formula for calculating the error already confuses me.
This sum function doesn't have defined start and end points, so I basically can't read it. Maybe you can simply provide me with the correct one?
After that, the author assumes a variable L that is just "that value" (I assume he means E_total?) and gives an example of how to define the new weight:
where W is the weights of a particular layer.
This confuses me, as I was always under the impression that the activation function (ReLU in my case) played a role in how to calculate the new weights. Also, this seems to imply I simply use the error for all layers. Doesn't the error value I propagate back into the next layer somehow depend on what I calculated in the previous one?
Maybe all of this is just incomplete and you can point me in the direction that helps me best in my case.
Thanks in advance.
You do not backpropagate errors, but gradients. Whether the activation function plays a role in calculating the new weight depends on whether the weight in question is before or after said activation, and whether they are connected. If a weight w comes after your non-linearity f, then the gradient dL/dw won't depend on f. But if w comes before f and they are connected, then dL/dw will depend on f. For example, suppose w is the weight vector of a fully connected layer, and assume that f directly follows this layer. Then,
dL/dw = (dL/df) * (df/dw)   // just the chain rule; the exact notation changes
                            // according to the shapes of the tensors/matrices/
                            // vectors you chose
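To make the "before vs. after the non-linearity" point concrete, here is a tiny NumPy sketch of a toy fully connected layer followed by a ReLU, with an MSE-style loss (my own toy example, not the network from the guide):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input
y = rng.normal(size=2)            # target
W = rng.normal(size=(2, 3))       # weights that sit *before* the ReLU
b = np.zeros(2)

z = W @ x + b                     # linear layer
f = np.maximum(0.0, z)            # ReLU
L = 0.5 * np.sum((f - y) ** 2)    # loss

# Backward pass: the ReLU derivative shows up in dL/dW precisely
# because W comes before the non-linearity.
dL_df = f - y
df_dz = (z > 0).astype(float)     # derivative of ReLU
dL_dz = dL_df * df_dz
dL_dW = np.outer(dL_dz, x)        # gradient used to update W
dL_db = dL_dz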
As for your cost function, it is correct. Many people write these formulas in this non-formal style so that you get the idea but can adapt it to your own tensor shapes. By the way, this sort of MSE function is better suited to continuous label spaces. You might want to use a softmax or an SVM loss for image classification (I'll come back to that). Anyway, since you requested a correct form for this function, here is an example. Imagine you have a neural network that predicts a vector field of some kind (like surface normals). Assume that it takes a 2D pixel x_i and predicts a 3D vector v_i for that pixel. In your training data, x_i will already have a ground-truth 3D vector (i.e. a label), which we'll call y_i. Then your cost function will be (the index i runs over all data samples):
sum_i{(y_i - v_i)^T (y_i - v_i)} = sum_i{||y_i - v_i||^2}
But as I said, this cost function works if the labels form a continuous space (here, R^3). This is also called a regression problem.
Here's an example if you are interested in (image) classification. I'll explain it with a softmax loss; the intuition for other losses is more or less similar. Assume we have n classes, and imagine that in your training set each data point x_i has a label c_i that indicates the correct class. Now, your neural network should produce a score for each possible class, which we'll denote s_1, ..., s_n, and we'll write the score of the correct class of a training sample x_i as s_{c_i}. If we use a softmax function, the intuition is to transform the scores into a probability distribution and maximise the probability of the correct classes. That is, we maximise the function
sum_i { exp(s_{c_i}) / sum_j(exp(s_j)) }
where i runs over all training samples, and j = 1, ..., n over all class labels.
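Here is a small NumPy sketch of that softmax intuition (the scores and labels are made up; in practice you minimise the negative log of these probabilities, i.e. the cross-entropy, rather than maximising the raw sum):

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))   # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 0.5, -1.0],     # 4 training samples, 3 classes
                   [0.1, 0.2,  0.3],
                   [1.5, 1.4,  0.0],
                   [-2.0, 0.0, 2.0]])
labels = np.array([0, 2, 1, 2])          # correct class c_i of each sample

probs = softmax(scores)                               # exp(s_j) / sum_k exp(s_k)
correct = probs[np.arange(len(labels)), labels]       # probability of class c_i
loss = -np.log(correct).mean()                        # cross-entropy to minimise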
Finally, I don't think the guide you are reading is a good starting point. I recommend this excellent course instead (essentially the Andrej Karpathy parts, at least).

Basic intuition for neural networks?

There are lots of "introduction to neural networks" articles online, but most are an introduction to the math of artificial neural networks and not an introduction to the actual underlying concepts (even though they should be one and the same). How does a simple network of artificial neurons actually work?
This answer is roughly based on the beginning of "Neural Networks and Deep Learning" by M. A. Nielsen which is definitely worth reading - it's online and free.
The fundamental idea behind all neural networks is this: Each neuron in a neural network makes a decision. Once you understand how they do that, everything else will make sense. Let’s walk through a simple situation which will help us arrive at that understanding.
Let’s say you are trying to decide whether or not to wear a hat today. There are a number of factors which will affect your decision, and perhaps the most important ones are:
Is it sunny?
Do I have a hat to wear?
Would a hat suit my outfit?
For simplicity, we’ll assume these are the only three factors that you’re weighing up during this decision. Forgetting about neural networks for a second, let’s just try to build a ‘decision maker’ to help us answer this question.
First, we can see each question has a certain level of importance, and so we’ll need to use this relative importance of each question, along with the corresponding answer to each question, to make our decision.
Secondly, we’ll need to have some component which interprets each (yes or no) answer along with its importance to produce the final answer. This sounds simple enough to put into an equation, right? Let’s do it. We simply decide how important each factor is and multiply that importance (or ‘weight’) by the answer to the question (which can be 0 or 1):
3a + 5b + 2c > 6
The numbers 3, 5 and 2 are the ‘weights’ of question a, b and c, respectively. a, b and c, themselves can be either zero (the answer to the question was ‘no’), or one (the answer to the question was ‘yes’). If the above equation is true, then the decision is to wear a hat, and if it is false, the decision is to not wear a hat. The equation says that we’ll only wear a hat if the sum of our weights multiplied by our factors is greater than some threshold value. Above, I chose a threshold value of 6. If you think about it, this means that if I don’t have a hat to wear (b=0), no matter what the other answers are, I won’t be wearing a hat today. That is,
3a + 2c > 6
is never true, since a and c can only be 0 or 1. This makes sense – our simple decision model tells us not to wear a hat if we don’t have one! So the weights of 3, 5 and 2, and the threshold value of 6, seem like good choices for our simple “should I wear a hat” decision-maker. It also means that, as long as I have a hat to wear, the sun shining (a=1) OR the hat suiting my outfit (c=1) is enough to make me wear a hat today. That is,
5 + 3 > 6 and 5 + 2 > 6
are both true. Good! You can see that by adjusting the weighting of each factor and the threshold, and by adding more factors, we can adjust our ‘decision maker’ to approximately model any decision-making process. What we have just demonstrated is the functionality of a simple neuron (a decision-maker!). Let’s put the above equation into ‘neuron-form’:
A neuron which processes 3 factors: a, b, c, with corresponding importance weightings of 3, 5, 2, and with a decision threshold of 6.
The neuron has 3 input connections (the factors) and 1 output connection (the decision). Each input connection has a weighting which encodes the importance of that connection. If the weighting of that connection is low (relative to the other weights), then it won’t have much effect on the decision. If it’s high, the decision will heavily depend on it.
This is great, we’ve got a fully working neuron that weights inputs and makes decisions. So here’s the next thought: What if the output (our decision) was fed into the input of another neuron? That neuron would be using our decision about our hat to make a more abstract decision. And what if the inputs a, b and c are themselves the outputs of other neurons which compute lower-level decisions? We can see that neural networks can be interpreted as networks which compute decisions about decisions, leading from simple input data to more and more complex ‘meta-decisions’. This, to me, is an incredible concept. All the complexity of even the human brain can be modelled using these principles. From the level of photons interacting with our cone-cells right up to our pondering of the meaning of life, it’s just simple little decision-making neurons.
Below is a diagram of a simple neural network which essentially has 3 layers of abstraction:
A simple neural network with 2 inputs and 2 outputs.
As an example, the above inputs could be 2 infrared distance sensors, and the outputs might control the on/off switch for 2 motors which drive the wheels of a robot.
In our simple hat example, we could pick the weights and the threshold quite easily, but how do we pick the weights and thresholds in this example so that, say, the robot can follow things that move? And how do we know how many neurons we need to solve this problem? Could we solve it with just 1 neuron, maybe 2? Or do we need 20? And how do we organise them? In layers? Modules? These are the central questions in the field of neural networks. Techniques such as ‘backpropagation’ and (more recently) ‘neuroevolution’ are used effectively to answer some of these troubling questions, but these are outside the scope of this introduction – Wikipedia and Google Scholar and free online textbooks like “Neural Networks and Deep Learning” by M. A. Nielsen are great places to start learning about these concepts.
Hopefully you now have some intuition for how neural networks work, but if you’re interested in actually implementing a neural network there are a few optimisations and extensions to our concept of a neuron which will make our neural nets more efficient and effective.
Firstly, notice that if we set the threshold value of the neuron to zero, we can always adjust the weightings of the inputs to account for this – only, we’ll also need to allow negative values for our weights. This is great since it removes one variable from our neuron. So we’ll allow negative weights and from now on we won’t need to worry about setting a threshold – it’ll always be zero.
Next, we’ll notice that the weights of the input connections are all relative to one-another, so we can actually normalise these to a value between -1 and 1. Cool. That simplifies things a little.
We can make a further, more substantial improvement to our decision-maker by realising that the inputs themselves (a, b and c in the above example) need not just be 0 or 1. For example, what if today is really sunny? Or maybe there are scattered clouds, so it’s intermittently sunny? We can see that by allowing values between 0 and 1, our neuron gets more information and can therefore make a better decision – and the good news is, we don’t need to change anything in our neuron model!
So far, we’ve allowed the neuron to accept inputs between 0 and 1, and we’ve normalised the weights between -1 and 1 for convenience.
The next question is: why do we need such certainty in our final decision (i.e. the output of the neuron)? Why can’t it, like the inputs, also be a value between 0 and 1? If we did allow this, the decision of whether or not to wear a hat would become a level of certainty that wearing a hat is the right choice. But if this is a good idea, why did I introduce a threshold at all? Why not just directly pass on the sum of the weighted inputs to the output connection? Well, because, for reasons beyond the scope of this simple introduction to neural networks, it turns out that a neural network works better if the neurons are allowed to make something like an ‘educated guess’, rather than just presenting a raw probability. A threshold gives the neurons a slight bias toward certainty and allows them to be more ‘assertive’, and doing so makes neural networks more efficient. So in that sense, a threshold is good. But the problem with a threshold is that it doesn’t let us know when the neuron is uncertain about its decision – that is, if the sum of the weighted inputs is very close to the threshold, the neuron makes a definite yes/no answer where a definite yes/no answer is not ideal.
So how can we overcome this problem? Well it turns out that if we replace our “greater than zero” condition with a continuous function (called an ‘activation function’), then we can choose non-binary and non-linear reactions to the neuron’s weighted inputs. Let’s first look at our original “greater than zero” condition as a function:
‘Step’ function representing the original neuron’s ‘activation function’.
In the above activation function, the x-axis represents the sum of the weighted inputs and the y-axis represents the neuron’s output. Notice that even if the inputs sum to 0.01, the output is a very certain 1. This is not ideal, as we’ve explained earlier. So we need another activation function that only has a bias towards certainty. Here’s where we welcome the ‘sigmoid’ function:
The ‘sigmoid’ function; a more effective activation function for our artificial neural networks.
Notice how it looks like a halfway point between a step function (which we established as too certain) and a linear x=y line that we’d expect from a neuron which just outputs the raw probability that some decision is correct. The equation for this sigmoid function is:
σ(x) = 1 / (1 + e^(-x))
where x is the sum of the weighted inputs.
And that’s it! Our new-and-improved neuron does the following (a small Python sketch follows the list):
Takes multiple inputs between 0 and 1.
Weights each one by a value between -1 and 1.
Sums them all together.
Puts that sum into the sigmoid function.
Outputs the result!
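In code, the improved neuron is only a few lines (the weights below are just the hat example’s 3, 5 and 2 scaled down into the [-1, 1] range):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights):
    # Weight each input, sum, and squash the sum with the sigmoid.
    total = sum(w * a for w, a in zip(weights, inputs))
    return sigmoid(total)

# "How sunny is it?", "Do I have a hat?", "How well does it suit me?"
print(neuron([0.8, 1.0, 0.3], [0.3, 0.5, 0.2]))   # ≈ 0.69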
It's deceptively simple, but by combining these simple decision-makers together and finding ideal connection weights, we can make arbitrarily complex decisions and calculations which stretch far beyond what our biological brains allow.

LSTM NN: forward propagation

I am new to neural nets and am creating an LSTM from scratch. I have forward propagation working... but I have a few questions about the moving pieces in forward propagation in the context of a trained model, backpropagation, and memory management.
So, right now, when I run forward propagation, I stack the new columns, f_t, i_t, C_t, h_t, etc on their corresponding arrays as I accumulate previous positions for the bptt gradient calculations.
My question is 4 part:
1) How far back in time do I need to backpropagate in order to retain reasonably long-term memories? (Memory stretching back 20-40 time steps is probably what I need for my system, although I could benefit from a much longer time period; that is just the minimum for decent performance, and I'm only shooting for the minimum right now so I can get it working.)
2) Once I consider my model "trained," is there any reason for me to keep more than the 2 time steps I need to calculate the next C and h values (where C_t is the cell state and h_t is the final output of the LSTM net)? In that case I would need multiple versions of the forward propagation function.
3) If I have limited time-series data to train on, will the performance of my model converge as I train it on the training data over and over (as opposed to oscillating around some maximal average performance)? And will it converge if I implement dropout?
4) How many components of the gradient do I need to consider? When I calculate the gradient of the various matrices, I get a primary contribution at time step t, and secondary contributions from time step t-1 (and the calculation recurses all the way back to t=0). In other words: does the primary contribution dominate the gradient calculation, or does the slope change due to the secondary components enough to warrant implementing the recursion as I backpropagate through time steps?
1) As you have observed, it depends on the dependencies in the data. But an LSTM can learn longer-term dependencies even if we backpropagate only a few time steps, as long as we do not reset the cell and hidden states.
2) No. Given C_t and h_t, you can determine C and h for the next time step. Since you don't need to backpropagate, you can throw away C_t afterwards (and even h_t if you are only interested in the final LSTM output). (See the sketch after this list for the bookkeeping.)
3) Performance on the training data will likely converge, but that can simply mean you are starting to overfit. Using dropout will definitely help avoid that, especially along with early stopping.
4) There will be 2 components of the gradient for h_t: one from the current output and one from the next time step. Once you add both, you won't have to worry about any other components.
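To make points 1) and 2) concrete, here is a self-contained NumPy sketch of the bookkeeping with a toy single-layer LSTM (the sizes, the truncation window K and the gate ordering are arbitrary choices): during training only the last K time steps of gate values are cached for truncated BPTT, and at inference time only the current C and h are carried forward.

import numpy as np
from collections import deque

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_hid, K = 4, 8, 40
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))  # gates stacked as f, i, o, g
b = np.zeros(4 * n_hid)

cache = deque(maxlen=K)        # older entries are dropped automatically (point 1)

def lstm_step(x_t, C_prev, h_prev, training=True):
    z = W @ np.concatenate([x_t, h_prev]) + b
    f, i, o = (sigmoid(z[k * n_hid:(k + 1) * n_hid]) for k in range(3))
    g = np.tanh(z[3 * n_hid:])
    C_t = f * C_prev + i * g
    h_t = o * np.tanh(C_t)
    if training:               # cache gate values only while training (for BPTT)
        cache.append((f, i, o, g, C_t, h_t))
    return C_t, h_t            # at inference time this is all you keep (point 2)

C, h = np.zeros(n_hid), np.zeros(n_hid)
for t in range(100):
    C, h = lstm_step(rng.normal(size=n_in), C, h)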

Neural network for approximation function for board game

I am trying to make a neural network to approximate some unknown function (for my neural-network course). The problem is that this function has very many variables, but many of them are not important (for example, in f(x,y,z) = x+y, z is not important). How could I design (and train) a network for this kind of problem?
To be more specific, the function is an evaluation function for a board game with unknown rules, and I need to somehow learn these rules from the agent's experience. After each move a score is given to the agent, so it actually needs to find out how to get the maximum score.
I tried to pass the agent's neighborhood to the network, but there are too many variables which are not important for the score, and the agent keeps finding very local solutions.
If you have a sufficient amount of data, your ANN should be able to ignore the noisy inputs. You may also want to try other learning approaches like scaled conjugate gradient, or simple heuristics like momentum or early stopping, so your ANN isn't over-learning the training data.
If you think there may be multiple, local solutions, and you think you can get enough training data, then you could try a "mixture of experts" approach. If you go with a mixture of experts, you should use ANNs that are too "small" to solve the entire problem on their own, to force the model to use multiple experts.
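If you go the early-stopping route, a minimal, library-agnostic helper could look like this (train_one_epoch and validate in the usage comment are placeholders for your own code):

class EarlyStopping:
    # Stop training once the validation score has not improved
    # for `patience` consecutive epochs.
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_score):
        if val_score > self.best:
            self.best, self.bad_epochs = val_score, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience      # True -> stop

# stopper = EarlyStopping(patience=10)
# for epoch in range(max_epochs):
#     train_one_epoch(net)                # your training code here (placeholder)
#     if stopper.step(validate(net)):     # your validation score here (placeholder)
#         break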
So, you are given a set of states and actions and your target values are the score after the action is applied to the state? If this problem gets any hairier, it will sound like a reinforcement learning problem.
Does this game have discrete actions? Does it have a discrete state space? If so, maybe a decision tree would be worth trying?