Reinforcement learning. Driving around objects with PPO

I am working on driving industrial robots with neural nets, and so far it is working well. I am using the PPO algorithm from OpenAI Baselines, and I can already drive easily from point to point with the following reward strategy:
I calculate the normalized distance between the target and the current position, then compute the distance reward as:
rd = 1 - (d / dmax)^a
For each time step, I give the agent a penalty calculated as:
yt = 1 - (t / tmax) * b
a and b are hyperparameters to tune.
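In code, this reward shaping looks roughly like the following (a minimal sketch; the robot and environment interfaces are assumed, and the names mirror the formulas above):

```python
import numpy as np

def distance_reward(position, target, d_max, a):
    """rd = 1 - (d / dmax)^a, where d is the current distance to the target."""
    d = np.linalg.norm(np.asarray(target) - np.asarray(position))
    return 1.0 - (d / d_max) ** a

def time_penalty(t, t_max, b):
    """yt = 1 - (t / tmax) * b, applied at every time step."""
    return 1.0 - (t / t_max) * b
```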
As I said, this works really well when driving from point to point. But what if I want to drive around something? For my application I need to avoid collisions, so the agent has to drive around objects. If the object is not directly on the shortest path, this works OK: the robot can adapt and drive around it. But it becomes increasingly difficult, up to impossible, to drive around objects that lie directly in the way.
I have already read a paper that combines PPO with NES (Natural Evolution Strategies) to add Gaussian noise to the parameters of the neural network, but I can't implement it by myself.
Does anyone have experience with adding more exploration to the PPO algorithm? Or does anyone have general ideas on how I can improve my reward strategy?

What you describe is actually one of the most important research areas of Deep RL: the exploration problem.
The PPO algorithm (like many other "standard" RL algorithms) tries to maximise a return, which is a (usually discounted) sum of the rewards provided by your environment: R = r_0 + γ*r_1 + γ^2*r_2 + ...
In your case, you have a deceptive gradient problem: the gradient of your return points directly at your target point (because your reward is based on the distance to your target), which discourages your agent from exploring other areas.
Here is an illustration of the deceptive gradient problem from this paper. The reward is computed like yours, and as you can see, the gradient of the return function points directly at the objective (the little square in this example). If your agent starts in the bottom-right part of the maze, it is very likely to get stuck in a local optimum.
There are many ways to deal with the exploration problem in RL. In PPO, for example, you can add some noise to your actions; other approaches like SAC try to maximise both the reward and the entropy of the policy over the action space. But in the end you have no guarantee that adding exploration noise in your action space will result in efficient exploration of your state space (which is what you actually want to explore: the (x, y) positions of your environment).
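For instance, a minimal sketch of adding Gaussian noise in the action space could look like this (the policy object and its sample_action method are placeholders, not the Baselines API):

```python
import numpy as np

def noisy_action(policy, observation, sigma=0.1, low=-1.0, high=1.0):
    """Sample an action from the policy and perturb it with Gaussian noise.

    sigma controls how far the agent deviates from its current policy; as noted
    above, noise in action space gives no guarantee of covering the state space.
    """
    action = policy.sample_action(observation)          # placeholder policy call
    action = action + np.random.normal(0.0, sigma, size=np.shape(action))
    return np.clip(action, low, high)                   # keep the action in bounds
```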
I recommend reading the Quality Diversity (QD) literature, a very promising field that aims to solve the exploration problem in RL.
Here are two great resources:
A website gathering all information about QD
A talk from ICML 2019
Finally, I want to add that the problem is not your reward function: you should not try to engineer a complex reward function just so that your agent behaves the way you want. The goal is to have an agent that can solve your environment despite pitfalls like the deceptive gradient problem.

Related

Genetic Algorithm A.I. repetitive behavior

I am writing a C# Windows Forms application which simulates a simple environment (a grid) with two types of objects: plants and herbivores. The herbivores have neural networks which take the contents of the few surrounding cells as input and decide which direction to move in. The idea is to train the herbivores to eat the plants using a fitness function and a genetic algorithm.
My problem is that if there is nothing surrounding a herbivore, it will decide to move in a particular direction; then, if there is still nothing around it, it will move in the same direction again. What I end up with is a few herbivores that just move in straight lines and never actually encounter any plants at all.
Would adding a clock signal as an input (with each bit as an individual input to the neural network) change this behavior, or is this not recommended? I have also thought about adding an input which is just random data (from a Gaussian distribution) to add some unpredictability, but I don't know whether this would help or hurt. Another idea I am not sure about is whether having inputs for the past few moves (as a sort of memory) might solve this issue.
I think you need a recurrent network. You can keep track of the last N decisions the network has made and use them as extra inputs, so it has some knowledge of where it was going and for how long. At some point it could evolve in such a way that it starts doing some sort of path finding.
What Can_Alper said is definitely good. Also take a look at LSTMs.
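A rough Python sketch of the "last N decisions as extra inputs" idea (your project is in C#, but the structure carries over; network.decide is a placeholder for however your network is evaluated):

```python
from collections import deque
import numpy as np

class MemoryAgent:
    """Feeds the herbivore's last N decisions back in as extra network inputs."""

    def __init__(self, network, n_actions=4, n_memory=4):
        self.network = network                       # placeholder: maps inputs to an action id
        self.n_actions = n_actions
        # one-hot encodings of the last n_memory moves, initially all zeros
        self.memory = deque([np.zeros(n_actions)] * n_memory, maxlen=n_memory)

    def act(self, sensors):
        # concatenate the sensory inputs with the remembered moves
        inputs = np.concatenate([np.asarray(sensors, dtype=float), *self.memory])
        action = self.network.decide(inputs)         # placeholder network call
        one_hot = np.zeros(self.n_actions)
        one_hot[action] = 1.0
        self.memory.append(one_hot)
        return action
```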

Can neural network actually learn?

I'm creating an evolution-artificial-life-simulation game in 2D (purely for fun purposes). It combines neural networks (for behaviour controlling) and genetic algorithm (for breeding and mutations).
As inputs I give them the X, Y position of the nearest food (normalized) and the X, Y of their "look at" vector.
Currently they fly around, and when they collide with food (let's call it "eating apples") their fitness index is increased by one and the apple's position is re-randomized; after 2000 turns the GA kicks in and does its magic.
After about 100 generations they learn that eating apples is good and try to fly to the nearest ones.
But my question, as a neural network newbie, is: if I created a room where apples spawn much more frequently than on the rest of the map, would they learn and understand that? Would they fly to that room more often? And is it possible to tell how many generations it would take for them to learn?
What they can learn and how fast depends a lot on the information you give them access to. For instance, if they have no way of knowing that they are in the room where food generates more frequently, then there is no way for them to evolve to go there more frequently.
It's not entirely clear from your question what the "look at" vector is. If it, for instance, shows them what's directly in front of them, then it might be enough information for them to figure out that they're in the room of plenty, particularly if that room "looks" distinctive somehow. A more useful input to give them might be their current X and Y coordinates. If you did that, then I would definitely expect them to evolve to be in the good room more frequently (in proportion to how good it is, of course), because it would be possible for them to take action to go to and stay in that room.
As for how many generations it will take, that is incredibly hard to predict (especially without knowing more about your setup). If it takes them 100 generations to learn to eat food, then I would expect it to be on the order of hundreds. But the best way to find out is just to try it.
If it's all about location, they may keep a map of the environment in mind, and simple statistics will let them learn where the food is likely to be located. Neural nets are overkill there.
If locations have other features (for example colour, smell, height, etc.), then mapping those features to the label (food present or not) is a good fit for neural nets, especially if some of the features are randomly unavailable or unreliable at any given moment.
If they need many decisions to reach the goal, you will need reinforcement learning. For example, they may go in a direction that is good for a while but takes them away from resources they will need later.
I believe that a recurrent neural network could learn to expect apples to spawn in a certain region.

particle swarm optimization inertia factor

I am reading about soft computing algorithms, currently "Particle Swarm Optimization". I understand the technique in general, but I am stuck on the mathematical/physical part that I cannot picture or understand: how it works and how it affects the flight. That part is the first term in the velocity-update equation, which is called the "Inertia Factor".
The complete velocity-update equation is:
v(t+1) = w*v(t) + c1*r1*(pbest - x(t)) + c2*r2*(gbest - x(t))
I read in one article, in section 2.3 "Inertia Factor", that:
"This variation of the algorithm aims to balance two possible PSO tendencies (de-
pendent on parameterization) of either exploiting areas around known solutions
or explore new areas of the search space. To do so this variation focuses on the
momentum component of the particles' velocity equation 2. Notice that if you
remove this component the movement of the particle has no memory of the pre-
vious direction of movement and it will always explore close to a found solution.
On the other hand if the velocity component is used, or even multiplied by a w
(inertial weight, balances the importance of the momentum component) factor
the particle will tend to explore new areas of the search space since it cannot
easily change its velocity towards the best solutions. It must rst \counteract"
the momentum previously gained, in doing so it enables the exploration of new
areas with the time \spend counteracting" the previous momentum. This vari-
ation is achieved by multiplying the previous velocity component with a weight
value, w."
The full PDF is at: http://web.ist.utl.pt/~gdgp/VA/data/pso.pdf
But I can't picture how this happens physically or numerically, and how this factor moves the search from exploration to exploitation, so I need a numerical example to see and understand how it works.
Also, in genetic algorithms there is the schema theorem, which is a proof of the GA's ability to find an optimal solution; is there such a theorem for PSO?
It's not easy to explain PSO using mathematics (see Wikipedia article for example).
But you can think like this: the equation has 3 parts:
particle speed = inertia + local memory + global memory
So you control the 'importance' of these components by varying the coefficients in each part.
There's no analytical way to see this, unless you make the stochastic part constant and ignore things like particle-particle interaction.
Exploit: take advantage of the best known solutions (local and global).
Explore: search in new directions, but don't ignore the best known solutions.
In a nutshell, you control how much importance to give to the particle's current speed (inertia), the particle's memory of its own best known solution, and the particle's memory of the swarm's best known solution.
I hope this helps!
Inertia was not part of the original PSO algorithm introduced by Kennedy and Eberhart in 1995. It took three more years until Shi and Eberhart published this extension and showed (to some extent) that it works better.
One can set that value to a constant (supposedly [0.8 to 1.2] is best).
However, the point of the parameter is to balance exploitation and exploration of the search space, and the authors got the best results when they defined the parameter as a linear function that decreases over time from 1.4 to 0.
Their rationale was that at first one should explore to find a good seed and later exploit the area around that seed.
My feeling about it is that the closer you are to 0, the more chaotic turns particles make.
For a detailed answer refer to Shi, Eberhart 1998 - "A modified Particle Swarm Optimizer".
Inertia controls the influence of the previous velocity.
When it is high, the cognitive and social components are less relevant (the particle keeps going its way, exploring new portions of the space).
When it is low, the particle searches more thoroughly around the space where the best-so-far optimum has been found.
Inertia can change over time: start high, decrease later.
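To make this concrete, here is a minimal sketch of the standard velocity update with a linearly decreasing inertia weight (c1 = c2 = 2.0 and the 1.4 → 0 schedule are just the example values mentioned above):

```python
import numpy as np

def velocity_update(v, x, p_best, g_best, w, c1=2.0, c2=2.0):
    """One PSO velocity update: inertia + cognitive (local) + social (global) term."""
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    return (w * v                        # inertia: keep part of the previous direction
            + c1 * r1 * (p_best - x)     # pull towards this particle's best position
            + c2 * r2 * (g_best - x))    # pull towards the swarm's best position

def inertia(t, t_max, w_start=1.4, w_end=0.0):
    """Linearly decreasing inertia weight: explore early (high w), exploit late (low w)."""
    return w_start + (w_end - w_start) * t / t_max
```

With w close to 1.4 the first term dominates and particles overshoot past known good solutions, exploring new regions; as w approaches 0 the memory terms take over and the swarm converges around the best positions found so far.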

Neural network for approximation function for board game

I am trying to build a neural network to approximate some unknown function (for my neural network course). The problem is that this function has very many variables, but many of them are not important (for example, in f(x, y, z) = x + y, z is not important). How could I design (and train) a network for this kind of problem?
To be more specific, the function is an evaluation function for some board game with unknown rules, and I need to somehow learn these rules from the agent's experience. After each move a score is given to the agent, so it actually needs to find out how to maximize the score.
I tried to pass the agent's neighborhood to the network, but there are too many variables that are not important for the score, and the agent keeps finding very local solutions.
If you have a sufficient amount of data, your ANN should be able to ignore the noisy inputs. You may also want to try other training approaches such as scaled conjugate gradient, or simple heuristics like momentum or early stopping, so your ANN doesn't over-learn the training data.
If you think there may be multiple, local solutions, and you think you can get enough training data, then you could try a "mixture of experts" approach. If you go with a mixture of experts, you should use ANNs that are too "small" to solve the entire problem to force it to use multiple experts.
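A minimal sketch of the early-stopping heuristic mentioned above (train_one_epoch and validation_loss are placeholders for whatever training loop you already have; get_weights/set_weights assume a Keras-like weight interface):

```python
def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=1000, patience=10):
    """Stop training once the validation loss stops improving for `patience` epochs."""
    best_loss, best_weights, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)             # placeholder: one pass over the training data
        loss = validation_loss(model)      # placeholder: loss on a held-out validation set
        if loss < best_loss:
            best_loss, bad_epochs = loss, 0
            best_weights = model.get_weights()   # assumes a Keras-like weight API
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break                      # the network has started over-learning
    if best_weights is not None:
        model.set_weights(best_weights)    # roll back to the best model seen
    return model
```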
So, you are given a set of states and actions and your target values are the score after the action is applied to the state? If this problem gets any hairier, it will sound like a reinforcement learning problem.
Does this game have discrete actions? Does it have a discrete state space? If so, maybe a decision tree would be worth trying?

Pathfinding algorithm with only partial knowledge of graph

I need to program an algorithm to navigate a robot through a "maze" (a rectangular grid with a starting point, a goal, empty spaces and uncrossable spaces or "walls"). It can move in any cardinal direction (N, NW, W, SW, S, SE, E, NE) with constant cost per move.
The problem is that the robot doesn't "know" the layout of the map. It can only view its 8 surrounding spaces and store them (it memorizes the surrounding tiles of every space it visits). The only other input is the cardinal direction in which the goal lies on every move.
Is there any researched algorithm that I could implement to solve this problem? The typical ones like Dijkstra's or A* aren't trivially adapted to the task, as I can't go back and revisit previous nodes in the graph without cost (retracing the robot's steps to switch to a better path would cost those moves again), and I can't think of a way to make a reasonable heuristic for A*.
I probably could come up with something reasonable, but I just wanted to know if this was an already solved problem, and I need not reinvent the wheel :P
Thanks for any tips!
The problem isn't solved, but like with many planning problems, there is a large amount of research already available.
Most of the work in this area is based on the original work of R. E. Korf in the paper "Real-time heuristic search". That paper seems to be paywalled, but the preliminary results from the paper, along with a discussion of the Real-Time A* algorithm are still available.
The best recent publications on discrete planning with hidden state (path-finding with partial knowledge of the graph) are by Sven Koenig. This includes the significant work on the Learning Real-Time A* algorithm.
Koenig's work also includes some demonstrations of a range of algorithms on theoretical experiments that are far more challenging than anything that would be likely to occur in a simulation. See in particular "Easy and Hard Testbeds for Real-Time Search Algorithms" by Koenig and Simmons.
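As a rough illustration, the core of an LRTA*-style step looks roughly like this (a sketch, assuming you can enumerate the currently visible neighbours and maintain a heuristic table h, e.g. initialised from the direction-to-goal information):

```python
def lrta_star_step(state, neighbours, h, move_cost=1.0):
    """One Learning Real-Time A* step.

    neighbours: states reachable from `state` in a single move (the visible cells).
    h: dict mapping states to their current heuristic estimate, initialised
       optimistically (e.g. a straight-line distance implied by the goal direction).
    Returns the neighbour to move to, after updating h[state].
    """
    # score each neighbour by the move cost plus its current heuristic estimate
    scored = [(move_cost + h.get(n, 0.0), n) for n in neighbours]
    best_f, best_next = min(scored, key=lambda pair: pair[0])
    # learning step: raise this state's estimate to the best f-value found,
    # so repeatedly revisited dead ends become increasingly unattractive
    h[state] = max(h.get(state, 0.0), best_f)
    return best_next
```

The key point is the update of h[state]: the cost of physically retracing steps is exactly what the learning rule accounts for, which is why this family of algorithms fits the partial-knowledge setting better than plain A*.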