Observation Space for race strategy development - Reinforcement learning - simulation

I refrained from asking for help until now, but as my thesis' deadline creeps ever closer and I do not know anybody with experience in RL, I'm trying my luck here.
TL;DR
I have not found an academic/online resource which helps me understand the correct representation of the environment as an observation space. I would be very thankful for any links or for giving me a starting point of how to model the specifics of my environment in an observation space.
Short thematic introduction
The goal of my research is to determine the viability of RL for strategy development in motorsports. This is currently achieved by simulating (lots of!) races and calculating the resulting race time (and thus finishing position) for different strategic decisions (the timing of pit stops plus the number of laps to refuel for). This demands manual input of the expected inlaps (the laps on which pit stops occur) for all participants, which implicitly limits the possible strategies to what humans can imagine, as well as the number of feasible simulations.
Use of RL
A trained RL agent could decide on its own when to perform a pit stop and how much fuel should be added, in order to minimize the race time and react to probabilistic events in the simulation.
The action space is Discrete(4) and represents the options to continue, or to pit and refuel for 2, 4, or 6 laps, respectively.
Problem
The environment is of POMDP nature, and the observation space needs to model the agent's current race position (which I hope is enough?). How would I implement the observation space accordingly?
The training is performed using OpenAI's Gym framework, but a general explanation or a link to an article/publication would also be very much appreciated!

Your observation could be just an integer representing the lap or the position the agent is in. That alone is obviously not a sufficient representation, so you need to add more information.
A better observation could be the agent's race position x1, the lap the agent is in x2, and the current fuel in the tank x3. All three of these can be represented by a real number. Then you can create your observation by concatenating them into a vector obs = [x1, x2, x3].
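For concreteness, here is a minimal Gym sketch of those two spaces. The bounds (20 cars, 60 laps, 100 litres of fuel) are placeholder assumptions, not values from your question:

    import numpy as np
    import gym
    from gym import spaces

    class RaceStrategyEnv(gym.Env):
        # Sketch of the spaces only; the race simulation itself is omitted.
        def __init__(self):
            # 0 = continue, 1/2/3 = pit and refuel for 2/4/6 laps
            self.action_space = spaces.Discrete(4)
            # obs = [race position x1, current lap x2, fuel in tank x3]
            self.observation_space = spaces.Box(
                low=np.array([1.0, 0.0, 0.0], dtype=np.float32),
                high=np.array([20.0, 60.0, 100.0], dtype=np.float32),
                dtype=np.float32,
            )

        def _get_obs(self, position, lap, fuel):
            return np.array([position, lap, fuel], dtype=np.float32)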

Related

Machine Learning to predict time-series multi-class signal changes

I would like to predict the switching behavior of time-dependent signals. Currently the signal has 3 states (1, 2, 3), but it could be that this will change in the future. For the moment, however, it is absolutely okay to assume three states.
I can make the following assumptions about these states (see picture):
the signals repeat periodically, possibly with variations concerning the time of day.
the duration of state 2 is always constant and relatively short for all signals.
the durations of states 1 and 3 are also constant, but vary between the different signals.
the switching sequence is always the same: 1 --> 2 --> 3 --> 2 --> 1 --> [...]
there is a constant but unknown time reference between the different signals.
There is no constant time reference between my observations for the different signals. They are simply measured one after the other, but always at different times.
I am able to rebuild my model periodically after I have obtained more samples.
I have the following problems:
I can only observe one signal at a time.
I can only observe the signals at different times.
I cannot trigger my measurement with the state transition. That means, when I measure, I am always "in the middle" of a state. Therefore I don't know when this state has started and also not exactly when this state will end.
I cannot observe a certain signal for a long duration, so I am not able to observe a complete period.
My samples (observations) are spread widely in time.
I would like to get a prediction of either the state change or the current state for the current time. It is likely that I will never have measured my signals at the requested time.
So far I have tested the TimeSeriesPredictor from the ML.NET Toolbox, as it seemed suitable to me. However, in my opinion, this algorithm requires that you always pass only the data of one signal. This means that assumption 5 (the constant time reference between signals) is not included in the prediction, which is probably suboptimal. In this case I also had problems with the prediction not changing, although it should change time-dependently when I query multiple predictions. This behavior led me to believe that only the order of the values entered the model, but not the associated timestamps. If I have understood everything correctly, then exactly this timestamp is my most important "feature"...
So far, I have not run any tests on regression-based approaches, e.g. FastTree, since my data is not linear but keeps switching states. Maybe this assumption is not valid and regression-based methods could also be suitable?
I also don't know if a multiclass classifier is required, because I had understood that the TimeSeriesPredictor would also be suitable for this, since it works with the Single (float) data type. Whether the prediction is 1.3 or exactly 1.0 would be fine for me.
To sum it up:
I am looking for an algorithm which is able to recognize the switching patterns based on loose, widely spaced samples. It would be okay to define bounds, e.g. state 3 of signal 1 will never last longer than 30 s, or state 1 of signal 3 will never last longer than 60 s.
Then, after the algorithm has obtained an approximate model of the switching behaviour, I would like to request a prediction of a certain signal's state at a certain time.
Which methods can I use to get the best prediction, preferably using the ML.NET toolbox or based on matlab?
Not sure if this is quite what you're looking for, but if you want to detect spikes and changes in signals, check out the anomaly detection algorithms in ML.NET. Here are two tutorials that show how to use them.
Detect anomalies in product sales
Spike detection
Change point detection
Detect anomalies in time series
Detect anomaly period
Detect anomaly
One way to approach this would be to first determine the periodicity of each of the signals independently. This could be done by looking at the frequency distribution of time differences between measurements of state 2 only and separately for each signal.
This will give a multinomial distribution. The shortest time difference will be the duration of the switching event (after discarding time differences less than the max duration of state 2). The second shortest peak will be the duration between the end of one switching event and the start of the next.
When you have the 3 calculations of periodicity you can simply calculate the difference between each of them. Given you have the timestamps of the measurements of state 2 for each signal you should be able to calculate the time of switching for all other signals.
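To illustrate the first step, here is a rough sketch, assuming (hypothetically) that you collect, per signal, the timestamps at which state 2 was observed; the lag bounds and bin width are made-up parameters. All pairwise time differences between state-2 observations should cluster near integer multiples of the true period, so a peak in their histogram gives a period estimate:

    import numpy as np

    def estimate_period(t2, min_lag=5.0, max_lag=600.0, bin_width=1.0):
        # t2: timestamps (seconds) at which state 2 was observed for ONE
        # signal. Differences below min_lag (the max duration of state 2)
        # are discarded, as described above.
        t2 = np.sort(np.asarray(t2, dtype=float))
        diffs = (t2[None, :] - t2[:, None]).ravel()
        diffs = diffs[(diffs > min_lag) & (diffs <= max_lag)]
        counts, edges = np.histogram(diffs, bins=np.arange(0.0, max_lag, bin_width))
        peak = np.argmax(counts)  # crude peak pick: the most populated lag bin
        return 0.5 * (edges[peak] + edges[peak + 1])

A real implementation would want a smarter peak picker (e.g. taking the shortest strong peak rather than the tallest bin), but the histogram of lags is the core idea.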

Average result of 50 Netlogo Simulation_Agent Based Simulation

I ran an infectious disease spread model similar to the "Virus" model in the Models Library, changing the "infectiousness".
I did 20 runs each for infectiousness values of 98%, 95%, and 93%, and the maximum infected count was 74.05, 73, and 78.9 respectively (the peak was at tick 38 for all three infectiousness values).
[I took the average of the infected count for each tick and took the maximum of these averages as the "maximum infected".]
I was expecting the maximum infected count to decrease when the infectiousness is reduced, but it didn't. As I understand it, this happens because I considered the average values across the simulation runs (it is as if I am considering a new simulation run whose infected count at each tick is the average).
To be clear, I want to take all 20 simulation runs into account. Is there a way to do that other than taking the average, as I did?
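To make the two aggregation orders concrete, here is a minimal sketch, assuming the per-tick infected counts from all runs are exported (e.g. via a BehaviorSpace experiment) into a runs-by-ticks array; the numbers here are synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    counts = rng.poisson(lam=70, size=(20, 100))  # placeholder: 20 runs x 100 ticks

    # Scheme I used: average across runs at each tick, then take the maximum
    # of that averaged curve. Averaging first smooths out run-to-run
    # differences in when the peak occurs.
    max_of_mean = counts.mean(axis=0).max()

    # Alternative: take each run's own peak, then average those peaks.
    # This preserves each run's peak height; it is never smaller than
    # the value above.
    mean_of_max = counts.max(axis=1).mean()

    print(max_of_mean, mean_of_max)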
In the Models Library Virus model, with the other parameters at their default settings and infectiousness at those high values, what I see when I run the model is a periodic variation in the numbers of the three classes of people. Look at the plot in the lower left corner, and you'll see this. What is happening, I believe, is this:
When there are many healthy, non-immune people, that means that there are many people who can get infected, so the number of infected people goes up, and the number of healthy people goes down.
Soon after that, the number of sick, infectious people goes down, because they either die or become immune.
Since there are now more immune people, and fewer infectious people, the number of non-immune healthy people grows; they are reproducing. (See "How it works" in the Info tab.) But now we have returned to the situation in step 1, ... so the cycle continues.
If your model is sufficiently similar to the Models Library Virus model, I'd bet that this is part of what's happening. If you don't have a plot window like the Virus model, I recommend adding it.
Also, you didn't say how many ticks you are running the model for. If you run it for only a small number of ticks, you won't notice the periodic behavior, but that doesn't mean it hasn't begun.
What this all means is that increasing infectiousness wouldn't necessarily increase the maximum number infected: a faster rate of infection means that the number of individuals who can be infected drops faster. I'm not sure that the maximum number infected over the whole run is an interesting number, with this model and a high infectiousness value. It depends on what you are trying to understand.
One of the great things about NetLogo and some other ABM systems is that you can watch the system evolve over time, using various tools such as plots, monitors, etc. as well as just looking at the agents move around or change states over time. This can help you understand what is going on in a way that a single number like an average won't. Then you can use this insight to figure out a more informative way of measuring what is happening.
Another model where you can see a similar kind of periodic pattern is Wolf-Sheep Predation. I recommend looking at that. It may be easier to understand the pattern. (If you are interested in mathematical models of this kind of phenomenon, look up Lotka-Volterra models.)
(Real virus transmission can be more complicated, because a person (or other animal) is a kind of big "island" where viruses can reproduce quickly. If they reproduce too quickly, this can kill the host, and prevent further transmission of the virus. Sometimes a virus that reproduces more slowly can harm more people, because there is time for them to infect others. This blog post by Elliott Sober gives a relatively simple mathematical introduction to some of the issues involved, but his simple mathematical models don't take into account all of the complications involved in real virus transmission.)
EDIT: You added a comment Lawan, saying that you are interested in modeling COVID-19 transmission. This paper, Variation and multilevel selection of SARS‐CoV‐2 by Blackstone, Blackstone, and Berg, suggests that some of the dynamics that I mentioned in the preceding remarks might be characteristic of COVID-19 transmission. That paper is about six months old now, and it offered some speculations based on limited information. There's probably more known now, but this might suggest avenues for further investigation.
If you're interested, you might also consider asking general questions about virus transmission on the Biology Stackexchange site.

How to make a reinforcement learning agent learn an endless runner?

I'm trying to train a reinforcement learning agent to play an endless runner game using Unity-ML.
The game is simple: an obstacle is approaching from the side and the agent has to jump at the right timing to overcome it.
As the observation, I have the distance to the next obstacle. Possible actions are 0 - idle; 1 - jump. Rewards are given for longer playtime.
Unfortunately, the agent fails to learn to overcome even the first obstacle reliably. I guess this is due to the high imbalance between the two actions, as the ideal policy would be doing nothing (0) most of the time and jumping (1) only at very specific points in time. Additionally, all actions during a jump are meaningless, since the agent cannot jump while in the air.
How can I improve the learning such that it converges nevertheless? Any suggestions on what to look into?
Current trainer config:
EndlessRunnerBrain:
    gamma: 0.99
    beta: 1e-3
    epsilon: 0.2
    learning_rate: 1e-5
    buffer_size: 40960
    batch_size: 32
    time_horizon: 2048
    max_steps: 5.0e6
Thanks!
It's difficult to say without seeing the exact code that's being used for the reinforcement learning algorithm. Here are some steps worth exploring:
How long are you letting the agent train? Depending on the complexity of the game environment, it very well may take thousands of episodes for the agent to learn to avoid its first obstacle.
Experiment with the Frameskip property of the Academy object. This permits the agent to only take an action after a number of frames have passed. Increasing this value may increase the speed of learning in simpler games.
Adjust the learning rate. The learning rate determines how heavily the agent weights new information versus old information. You're using a very small learning rate; try increasing it by a couple of orders of magnitude (e.g. 1e-5 to 1e-3).
Adjust epsilon. Epsilon determines how often a random action is taken. Given a state and an epsilon rate of 0.2, your agent will take a random action 20% of the time. The other 80% of the time, it will choose the (state, action) pair with the highest associated reward. You can try reducing or increasing this value to see if you get better results. Since you know you'll want more random actions at the beginning of training, you can even "decay" epsilon with each episode: if you start with an epsilon value of 0.5, then after each game episode is completed, reduce epsilon by a small value, say 0.00001 or so (see the sketch after this list).
Change the way the agent is rewarded. Instead of rewarding the agent for each frame it stays alive, perhaps you could reward the agent for each obstacle it successfully jumps over.
Are you sure that the given time_horizon and max_steps provide enough runway for the game to complete an episode?
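A minimal sketch of that decay schedule (generic epsilon-greedy pseudocode, not ML-Agents-specific; the schedule constants are placeholders). One caveat: if the Brain above is trained with ML-Agents' PPO trainer, the epsilon in that config is the policy-update clipping threshold rather than an exploration rate, so a decaying exploration epsilon as sketched here applies to value-based setups:

    import random

    EPS_START, EPS_MIN, EPS_DECAY = 0.5, 0.05, 1e-5  # placeholder schedule

    def select_action(q_values, epsilon):
        # Epsilon-greedy: explore with probability epsilon, otherwise
        # pick the action with the highest estimated value.
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda a: q_values[a])

    epsilon = EPS_START
    for episode in range(100_000):
        # ... play one episode, choosing actions with select_action ...
        epsilon = max(EPS_MIN, epsilon - EPS_DECAY)  # decay once per episode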
Hope this helps, and best of luck!

How to make simulated electric components behave nicely?

I'm making a simple electric circuit simulator. It will (at least initially) only feature batteries, wires and resistors in series and parallel. However, I'm at a loss as to how best to simulate said circuit.
Specifically, I will have batteries and resistors with two contact points each, and wires that go between two contact points. I assume that each component will have a field for its resistance, the current through it and the voltage across it (current and voltage will, of course, be signed). Each component is given a resistance, and the batteries are given a voltage. The goal of the simulation is to assign correct values to all the other fields in real time as the player connects and disconnects components and wires.
These are the requirements:
It must be correct, including Ohm's and Kirchhoff's laws (I'm modeling real world circuits, and there is little point if the model does something completely different)
It must be numerically stable (we can't have uncontrolled oscillations or something just because two neighbouring resistors can't make up their minds together)
It should stabilize relatively quickly for, let's say, fewer than 30 components (having to wait a few seconds before the values are correct doesn't really satisfy "real time", but I really don't plan on using it for more than 10 or maybe 20 components)
The optimal formulation for me (how I envision this in my head) would be if I could assign a script to each component that took care of that component only, possibly by communicating field values with neighbouring components, and each component script works in parallel and adjusts as is needed
I only see problems here and no solutions. The biggest problem, I think, is Kirchhoff's voltage law (going around any sub-circuit, the voltages across all components, including signs, add up to 0), because it is a global law (it says something about a whole circuit and not just a single component or connection point). There is a mathematical reformulation saying that there exists a potential function on the points in the circuit (for instance, the voltage measured against the + pole of the battery), which is a bit more local, but I still don't see how to let a component know how much the voltage/potential drops across it.
Kirchhoff's current law (the net current flow into an intersection is 0) might also be trouble. It seems to force me to make intersections into separate objects to enforce it. I originally thought that I could just let each component have two lists (a left list and a right list) containing every other component that is connected to it at that point, but that might not make KCL easily enforceable.
I know there are circuit simulators out there, and they must have solved this exact problem somehow. I just can't find an explanation because if I try googling it, I only find the already made simulators and no explanations anywhere.
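From what I can piece together, they seem to solve it globally rather than per component: pin the battery's poles to fixed potentials, write one KCL equation (using Ohm's law) at every remaining junction, and solve the resulting linear system. A rough sketch of that idea, with a made-up topology and made-up values (a 9 V battery feeding 100 Ω in series with a 200 Ω || 300 Ω pair):

    import numpy as np

    # Made-up circuit: battery between node 0 (+) and ground (-1);
    # resistors as (node_a, node_b, ohms).
    V = 9.0
    resistors = [(0, 1, 100.0), (1, -1, 200.0), (1, -1, 300.0)]

    pinned = {-1: 0.0, 0: V}                    # potentials fixed by the battery
    unknown = [1]                                # every other junction
    idx = {n: i for i, n in enumerate(unknown)}
    G = np.zeros((len(unknown), len(unknown)))   # conductance matrix (KCL rows)
    rhs = np.zeros(len(unknown))                 # current injected via pinned nodes

    for a, b, r in resistors:
        g = 1.0 / r                              # Ohm's law: I = g * (Va - Vb)
        for n, m in ((a, b), (b, a)):
            if n in pinned:
                continue
            i = idx[n]
            G[i, i] += g
            if m in pinned:
                rhs[i] += g * pinned[m]
            else:
                G[i, idx[m]] -= g

    v = np.linalg.solve(G, rhs)                  # one KCL equation per unknown node
    print({n: v[idx[n]] for n in unknown})       # node 1 ends up at ~4.91 V

Solving globally like this would also sidestep the stability requirement, since there is no iterative per-component relaxation that could oscillate.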

Finding Conditional Moments in a Markov Process

This question combines math and programming. I will first describe the general problem and then give an example that is (hopefully) simpler to understand.
General Question: Consider a Markov-chain process of N states with transition matrix Π. Each state is associated with a value x_n (n in {1,…,N}). Our goal is to find the unconditional average of the first two moments (mean and var) along T-period paths conditional on (i) the path starting in a subset of states, N_0, (ii) it ending in a subset of states, N_T, and (iii) it not going through a subset of states, N_not, in any of the periods between 1 and T-1. By the unconditional average of these two moments, I basically mean what the average of these two moments would be under the stationary distribution. To be more concrete, let me illustrate the goal of the exercise in a simple case.
Simple Example: Consider a 3-state Markov-chain process with transition matrix Π, and let the three states be denoted by A, B, and C. Each of these states is associated with some value (x_A, x_B, and x_C, respectively). We are interested in what happens along paths that satisfy the following condition: the path starts at state A, is in either state B or C after 3 periods, and never goes through state A again between periods 1 and 3. Denote this condition by (#). So, for example, a path we are interested in would be {A,B,B,C} with the associated values {x_A, x_B, x_B, x_C}. We are interested in the average and standard deviation along such paths. In particular, we would like to find the unconditional average of these first two moments over paths that satisfy (#).
Let me now propose a solution based on simulating the process. Since both T and N are quite large, this solution is too slow for my purpose.
Simulation Solution: Starting from some initial point simulate the process for a very long time period, and drop the first τ periods. Extract all paths along the simulation that satisfy condition (#) and compute the mean and std along each of these paths. Finally, simply take the average across these paths.
I'm hoping there is a better and more efficient way to achieve the goal. Since I want the solution to be accurate, and T and N are large, the simulation takes a long time.
I would love to hear your thoughts and if you know of efficient methods to achieve this goal. Please let me know if something is not clear and I'll try to clarify it.
Thank you!!!
I think I know how to do this if N_0 consists of one state, let's call that state A.
The long run probability of being in A is pi(A) and can be obtained by solving pi = pi*P, with P the transition matrix.
The other thing you need to calculate is the probability of those transient paths. You probably need to introduce a modified P, where all states i in the set N_not are absorbing (i.e. P[i,i]=1 and P[i,j]=0 for j ≠ i). Then, starting from a vector p(0) which has a 1 in the element corresponding to state A and 0 otherwise, you can keep calculating p(n) = p(n-1)*P to get the probabilities of your transient paths.
Multiply the result of that by pi(A) to get the unconditional probability.
You can probably do something like this as well when N_0 is a set, but I don't know how you should select p(0) in that case.
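A small numpy sketch of this recipe (the chain, the sets, and T are toy placeholders; zeroing the N_not entries after each intermediate step discards exactly the mass that the absorbing modification would trap):

    import numpy as np

    # Toy 3-state chain (A=0, B=1, C=2); matrix, sets and T are placeholders.
    P = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])
    A, N_not, N_T, T = 0, [0], [1, 2], 3

    # Stationary distribution pi, solving pi = pi*P (left eigenvector).
    w, vl = np.linalg.eig(P.T)
    pi = np.real(vl[:, np.argmax(np.real(w))])
    pi /= pi.sum()

    # Propagate p(n) = p(n-1)*P, zeroing the N_not entries at periods
    # 1..T-1 so only paths that avoid N_not in between survive.
    p = np.zeros(len(P)); p[A] = 1.0
    for t in range(1, T + 1):
        p = p @ P
        if t < T:
            p[N_not] = 0.0

    prob_given_A = p[N_T].sum()            # P(condition # | start in A)
    unconditional = pi[A] * prob_given_A   # weighted by the long-run P(A)
    print(prob_given_A, unconditional)

Note this gives the path probabilities only; getting the conditional mean and variance of the x values would additionally require tracking the values visited along the way, which the recipe above does not yet cover.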