What is a good fitness function for an AI of a zero-sum game? - swift

I am making an AI for a zero-sum 4-player board game. (It's actually not strictly zero-sum: the 4 players "die" when they lose all their lives, so there is a player who died first, second, third, and a player who survived. However, I am telling the AI that only surviving counts as a win and anything else counts as a loss.) After some research, I figured I would use a minimax algorithm in combination with a heuristic function. I came across this question and decided to do the same as the OP of that question: write an evolutionary algorithm that gives me the best weights.
However, my heuristic function is different from the one the OP of that question had. Mine takes 9 weights and is a lot slower, so I can't let the agents play 1000 games (it takes too much time) or breed them with the crossover method (how would I do a crossover with 9 weights?).
So I decided to come up with my own method of determining fitness and breeding. And this question is only about the fitness function.
Here are my attempts at this.
First Attempt
For each agent A in a randomly generated population of 50 agents, select 3 more agents from the population (with replacement, but never A itself) and let the 4 agents play a game where A is the first player. Select another 3 and play a game where A is the second player, and so on. For each of these 4 games, if A died first, its fitness does not change. If A died second, its fitness is increased by 1. If it died third, its fitness is increased by 2. If it survived, its fitness is increased by 3. Therefore, I concluded that the highest fitness one can get is 12 (surviving/winning all 4 games -> 3 + 3 + 3 + 3).
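In case it's useful, here is a minimal Python sketch of this scoring scheme (play_game is a hypothetical stand-in for my actual game; it is assumed to return the four players ordered from first to die up to the survivor):

import random

def evaluate(population, play_game):
    fitness = {id(agent): 0 for agent in population}
    for agent in population:
        for seat in range(4):  # one game in each of the 4 seats
            # Opponents drawn with replacement, but never the agent itself.
            opponents = random.choices([a for a in population if a is not agent], k=3)
            players = opponents[:seat] + [agent] + opponents[seat:]
            death_order = play_game(players)
            # Died first -> +0, second -> +1, third -> +2, survived -> +3.
            fitness[id(agent)] += death_order.index(agent)
    return fitness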
I ran this for 7 generations, and starting from the first generation, the highest fitness was already as high as 10. I also calculated the average fitness of the top 10 agents, but that average didn't increase at all throughout the 7 generations. It even decreased a little.
I think the reason this didn't work is that a few agents must have gotten lucky and drawn some poorly performing agents as their opponents.
Second Attempt
The game setups are the same as in my first attempt, but instead of measuring the result of each game, I decided to measure how many moves the agent made before it died.
After 7 generations, the average fitness of the top 10 does increase, but still not as much as I think it should.
I think the reason this failed is that the game is finite, so there is a finite number of moves you can make before you die, and the top-performing agents pretty much reached that limit; there is no room for growth. Another reason is that the fitness of the player who survived and the fitness of the player who died third differ very little.
What I want
From my understanding of EAs (correct me if I'm wrong), the average fitness should increase and the top performing individual's fitness should not decrease over time.
My two attempts failed at both of these. Since the opponents are randomly selected, the top performing agent in generation 1 might get stronger opponents in the next generation, and thus its fitness decreases.
Notes
In my attempts, the agents play 200 games per generation, and each generation takes up to 3 hours, so I don't want to let them play too many games.
How can I write a fitness function like this?

Seven generations doesn't seem like nearly enough to get a useful result. Especially for a game, I would expect something like 200+ generations to be more realistic. You could do a number of things:
Implement elitism in order to ensure the survival of the best individual(s).
The strength of evolution stems from repeated mutation and crossover, so I'd recommend letting the agents play only a few games per generation (say, 5 ~ 10), at least at the beginning, and then evolve the population. You might even want to do only one game per generation.
In this regard, you could adopt a continuous evolution strategy. What this means is that as soon as an agent dies, they are subjected to mutation, and as soon as an agent wins, they can produce offspring. Or any combination of the two. The point is that the tournament is ongoing, everyone can play against anyone else. This is a little more "organic" in the sense that it does not have strictly defined generations, but it should speed up the process (especially if you can parallelise the evaluation).
I hope that helps. The accepted answer in the post you referenced has a good suggestion about the way you could implement crossover.
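To make the crossover point concrete: with a flat vector of 9 weights, uniform crossover (each weight inherited from a randomly chosen parent) plus Gaussian mutation and elitism is one common recipe. A minimal sketch, assuming you already have (weights, fitness) pairs; the parameter values are illustrative, not tuned:

import random

def uniform_crossover(a, b):
    # Each of the 9 weights is copied from a randomly chosen parent.
    return [random.choice(pair) for pair in zip(a, b)]

def mutate(weights, sigma=0.1, rate=0.2):
    # Add small Gaussian noise to each weight with probability `rate`.
    return [w + random.gauss(0.0, sigma) if random.random() < rate else w
            for w in weights]

def next_generation(scored, elite=2):
    # `scored` is a list of (weights, fitness) pairs for the population.
    scored = sorted(scored, key=lambda p: p[1], reverse=True)
    new_pop = [w for w, _ in scored[:elite]]             # elitism: keep the best
    parents = [w for w, _ in scored[:len(scored) // 2]]  # top half may breed
    while len(new_pop) < len(scored):
        mum, dad = random.sample(parents, 2)
        new_pop.append(mutate(uniform_crossover(mum, dad)))
    return new_pop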

Related

Observation Space for race strategy development - Reinforcement learning

I refrained from asking for help until now, but as my thesis' deadline creeps ever closer and I do not know anybody with experience in RL, I'm trying my luck here.
TL;DR
I have not found an academic/online resource which helps me understand the correct representation of the environment as an observation space. I would be very thankful for any links or for giving me a starting point of how to model the specifics of my environment in an observation space.
Short thematic introduction
The goal of my research is to determine the viability of RL for strategy development in motorsports. This is currently achieved by simulating (lots of!) races and calculating the resulting race time (and thus final position) of different strategic decisions (which are the timing of pit stops plus the number of laps to refuel for). This demands manual input of the expected in-laps (the lap on which a pit stop occurs) for all participants, which implicitly limits the possible strategies by human imagination, as well as the number of possible simulations.
Use of RL
A trained RL agent could decide on its own when to perform a pit stop and how much fuel to add, in order to minimize the race time and react to probabilistic events in the simulation.
The action space is Discrete(4) and represents the options to continue, or to pit and refuel for 2, 4 or 6 laps, respectively.
Problem
The observation space is of a POMDP nature and needs to model the agent's current race position (which I hope is enough?). How would I implement the observation space accordingly?
The training is performed using OpenAI's Gym framework, but a general explanation or a link to an article/publication would also be very much appreciated!
Your observation could be just an integer which represents the lap or position the agent is in. This is obviously not a sufficient representation, so you need to add more information.
A better observation could be the agent's race position x1, the lap the agent is in x2, and the current fuel in the tank x3. All three of these can be represented by a real number. Then you can create your observation by concatenating these into a vector obs = [x1, x2, x3].
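With OpenAI Gym, such a vector observation is usually declared as a Box space. A minimal sketch (the class name and the bounds are assumptions you would adapt to your race simulation):

import numpy as np
import gym
from gym import spaces

class RaceStrategyEnv(gym.Env):
    def __init__(self, n_cars=20, n_laps=60, tank_size=100.0):
        # Action 0: continue; actions 1-3: pit and refuel for 2/4/6 laps.
        self.action_space = spaces.Discrete(4)
        # Observation: [race position x1, current lap x2, fuel in tank x3].
        self.observation_space = spaces.Box(
            low=np.array([1.0, 0.0, 0.0], dtype=np.float32),
            high=np.array([n_cars, n_laps, tank_size], dtype=np.float32),
            dtype=np.float32,
        )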

Average result of 50 NetLogo simulations - Agent-Based Simulation

I ran an infectious disease spread model similar to the "Virus" model in the Models Library, changing the "infectiousness".
I did 20 runs each for the infectiousness values 98%, 95% and 93%, and the maximum infected count was 74.05, 73 and 78.9 respectively (the peak was at tick 38 for all 3 infectiousness values).
[I took the average of the infected count for each tick, and took the maximum of these averages as the "maximum infected".]
I was expecting the maximum infected count to decrease when the infectiousness is reduced, but it didn't. As far as I understand, this happens because I considered the average values across the simulation runs (it is as if I am considering a new simulation run whose infected count at each tick is the average).
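To illustrate the difference in NumPy terms, here is a minimal sketch contrasting what I computed with an alternative (counts is placeholder data standing in for my real per-run, per-tick infected counts):

import numpy as np

# counts[i, t] = infected count in run i at tick t, shape (20 runs, ticks)
counts = np.random.poisson(40, size=(20, 100))  # placeholder data

max_of_means = counts.mean(axis=0).max()   # what I computed: average per tick, then peak
mean_of_maxes = counts.max(axis=1).mean()  # alternative: each run's own peak, then average

# mean_of_maxes is always >= max_of_means; when runs peak at different
# ticks, averaging per tick flattens the peaks.
print(max_of_means, mean_of_maxes)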
What I want is to take all 20 simulation runs into account. Is there a way to do that other than the averaging I used?
In the Models Library Virus model, with those high infectiousness values and default settings for the other parameters, what I see when I run the model is a periodic variation in the numbers of people in the three classes. Look at the plot in the lower left corner, and you'll see this. What is happening, I believe, is this:
When there are many healthy, non-immune people, that means that there are many people who can get infected, so the number of infected people goes up, and the number of healthy people goes down.
Soon after that, the number of sick, infectious people goes down, because they either die or become immune.
Since there are now more immune people, and fewer infectious people, the number of non-immune healthy people grows; they are reproducing. (See "How it works" in the Info tab.) But now we have returned to the situation in step 1, ... so the cycle continues.
If your model is sufficiently similar to the Models Library Virus model, I'd bet that this is part of what's happening. If you don't have a plot window like the Virus model's, I recommend adding one.
Also, you didn't say how many ticks you are running the model for. If you run it for a small number of ticks, you won't notice the periodic behavior, but that doesn't mean it hasn't begun.
What this all means is that increasing infectiousness wouldn't necessarily increase the maximum number infected: a faster rate of infection means that the number of individuals who can be infected drops faster. I'm not sure that the maximum number infected over the whole run is an interesting number with this model and a high infectiousness value. It depends on what you are trying to understand.
One of the great things about NetLogo and some other ABM systems is that you can watch the system evolve over time, using various tools such as plots, monitors, etc. as well as just looking at the agents move around or change states over time. This can help you understand what is going on in a way that a single number like an average won't. Then you can use this insight to figure out a more informative way of measuring what is happening.
Another model where you can see a similar kind of periodic pattern is Wolf-Sheep Predation. I recommend looking at that. It may be easier to understand the pattern. (If you are interested in mathematical models of this kind of phenomenon, look up Lotka-Volterra models.)
(Real virus transmission can be more complicated, because a person (or other animal) is a kind of big "island" where viruses can reproduce quickly. If they reproduce too quickly, this can kill the host, and prevent further transmission of the virus. Sometimes a virus that reproduces more slowly can harm more people, because there is time for them to infect others. This blog post by Elliott Sober gives a relatively simple mathematical introduction to some of the issues involved, but his simple mathematical models don't take into account all of the complications involved in real virus transmission.)
EDIT: You added a comment, Lawan, saying that you are interested in modeling COVID-19 transmission. This paper, Variation and multilevel selection of SARS‐CoV‐2 by Blackstone, Blackstone, and Berg, suggests that some of the dynamics I mentioned in the preceding remarks might be characteristic of COVID-19 transmission. That paper is about six months old now, and it offered some speculations based on limited information. There's probably more known now, but it might suggest avenues for further investigation.
If you're interested, you might also consider asking general questions about virus transmission on the Biology Stackexchange site.

How to make a reinforcement learning agent learn an endless runner?

I'm trying to train a reinforcement learning agent to play an endless runner game using Unity-ML.
The game is simple: an obstacle is approaching from the side and the agent has to jump at the right timing to overcome it.
As the observation, I have the distance to the next obstacle. Possible actions are 0 - idle; 1 - jump. Rewards are given for longer playtime.
Unfortunately, the agent fails to learn to overcome even the first obstacle reliably. I guess this is due to the high imbalance between the two actions, as the ideal policy is to do nothing (0) most of the time and jump (1) only at very specific points in time. Additionally, all actions during a jump are meaningless, since the agent cannot jump while in the air.
How can I improve the learning so that it converges nevertheless? Any suggestions on what to look into?
Current trainer config:
EndlessRunnerBrain:
gamma: 0.99
beta: 1e-3
epsilon: 0.2
learning_rate: 1e-5
buffer_size: 40960
batch_size: 32
time_horizon: 2048
max_steps: 5.0e6
Thanks!
It's difficult to say without seeing the exact code that's being used for the reinforcement learning algorithm. Here are some steps worth exploring:
How long are you letting the agent train? Depending on the complexity of the game environment, it very well may take thousands of episodes for the agent to learn to avoid its first obstacle.
Experiment with the Frameskip property of the Academy object. This permits the agent to only take an action after a number of frames have passed. Increasing this value may increase the speed of learning in more simple games.
Adjust the learning rate. The learning rate determines how heavily the agent weights new information versus old information. You're using a very small learning rate; try increasing it by a couple of orders of magnitude.
Adjust epsilon. Epsilon determines how often a random action is taken. Given a state and an epsilon rate of 0.2, your agent will take a random action 20% of the time. The other 80% of the time, it will choose the (state, action) pair with the highest associated reward. You can try reducing or increasing this value to see if you get better results. Since you know you'll want more random actions at the beginning of training, you can even "decay" epsilon with each episode (see the sketch after this list). If you start with an epsilon value of 0.5, after each game episode is completed, reduce epsilon by a small value, say 0.00001 or so.
Change the way the agent is rewarded. Instead of rewarding the agent for each frame it stays alive, perhaps you could reward the agent for each obstacle it successfully jumps over.
Are you sure that the given time_horizon and max_steps provide enough runway for the game to complete an episode?
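For the epsilon-decay suggestion above, a minimal framework-agnostic Python sketch (the constants are illustrative, not tuned):

import random

epsilon = 0.5        # start exploratory
epsilon_min = 0.05   # keep a little exploration forever
decay = 1e-5         # subtracted after every episode

def choose_action(best_action):
    # 0 = idle, 1 = jump; explore with probability epsilon, else exploit.
    if random.random() < epsilon:
        return random.randint(0, 1)
    return best_action

def on_episode_end():
    global epsilon
    epsilon = max(epsilon_min, epsilon - decay)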
Hope this helps, and best of luck!

Economy Producer/Consumer Simulation

Hello wonderful community!
I'm currently writing a small game in my spare time. It takes place in a large galaxy, where the player has control of some number of Stars. On these stars you can construct Buildings, each of which has some number (0..*) of inputs and produces some number of outputs. These buildings have a maximum capacity/throughput, and scaling down a building's inputs scales down its outputs by an equal amount. I'd like to find a budgeting algorithm that optimizes (or approximates) the throughput of all the buildings. It seems like some kind of max-flow problem, but none of the flow optimization algorithms I've read about handle differing types of inputs or dependent outputs.
The toy "tech tree" I've been playing with is:
Solar plant - none => 2 energy
Extractor - 1 energy => 1 ore
Refinery - 1 energy, 1 ore => 1 metal
Shipyard - 1 metal, 2 energy => 1 ship
I'm willing to accept sub-optimal algorithms, and I'm willing to guarantee that the inputs/outputs have no cycles (they form a DAG from building to building). The idea is to allow reasonable throughput and tech-tree complexity without player intervention, because at the scale of hundreds or thousands of stars, having the player manually define the budgeting strategy isn't fun and gives players who no-life it a distinct advantage.
My current strategy is to build up the DAG and give the resources a total ordering (ships are better than metal is better than ore is better than energy). Then, looping through the resources, I find the most "descendant" building which produces the current resource and allow it to greedily grab from its inputs recursively (a shipyard would take 2 energy and 1 metal, then the refinery would grab 1 energy and 1 ore, etc.). Next I find any "liars" in the graph (the solar plant is providing 4 energy when its maximum is 2), scale down their production, and propagate the changes forward. Once everything is resolved for the DAG, I remove the terminal element (the shipyard) from the graph, subtract the current throughput of each edge from the maximum throughput of the building, and repeat the process for the next type of resource. I thought I'd ask people far more intelligent than me if there's a better way. :)
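For concreteness, here is a rough Python sketch of the data model plus a naive single forward pass in topological order. It is not my backward, priority-ordered method; in fact it shows why the prioritization matters, since the naive pass starves the shipyard of energy:

from dataclasses import dataclass

@dataclass
class Building:
    name: str
    inputs: dict    # resource -> amount consumed per unit of operation
    outputs: dict   # resource -> amount produced per unit of operation
    capacity: float # maximum units of operation per tick

# Toy tech tree from above, already in producer-before-consumer order.
buildings = [
    Building("solar", {}, {"energy": 2}, 10),
    Building("extractor", {"energy": 1}, {"ore": 1}, 10),
    Building("refinery", {"energy": 1, "ore": 1}, {"metal": 1}, 10),
    Building("shipyard", {"metal": 1, "energy": 2}, {"ship": 1}, 10),
]

pool = {}
for b in buildings:
    # Run as hard as capacity and the currently available inputs allow.
    rate = b.capacity
    for res, amt in b.inputs.items():
        rate = min(rate, pool.get(res, 0.0) / amt)
    for res, amt in b.inputs.items():
        pool[res] = pool.get(res, 0.0) - rate * amt
    for res, amt in b.outputs.items():
        pool[res] = pool.get(res, 0.0) + rate * amt

print(pool)  # 0 ships: upstream buildings greedily consumed all the energy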

Simulating sports matches in online game

In an online manager game (like Hattrick), I want to simulate matches between two teams.
A team consists of 11 players. Every player has a strength value between 1 and 100. I take these strength values of the defensive players for each team and calculate the average. That's the defensive quality of a team. Then I take the strengths of the offensive players and I get the offensive quality.
For each attack, I do the following:
$offFactor = ($attackerTeam_offensive-$defenderTeam_defensive)/max($attackerTeam_offensive, $defenderTeam_defensive);
$defFactor = ($defenderTeam_defensive-$attackerTeam_offensive)/max($defenderTeam_defensive, $attackerTeam_offensive);
At the moment, I don't know why I divide by the higher of the two values. But this formula should give you a factor for the quality of offense and defense, which is needed later.
Then I have nested conditional statements for each event that could happen, e.g.: does the attacking team get a scoring chance?
if ((mt_rand((-10+$offAdditionalFactor-$defAdditionalFactor), 10)/10)+$offFactor >= 0)
{ ... // the attack succeeds
These additional factors could be tactical values for example.
Do you think this is a good way of calculating a game? My users say that they aren't satisfied with the quality of the simulations. How can I improve them? Do you have different approaches which could give better results? Or do you think that my approach is good and I only need to adjust the values in the conditional statements and experiment a bit?
I hope you can help me. Thanks in advance!
Here is a way I would do it.
Offensive/Defensive Quality
First, let's work out the average strength of the entire team:
Team.Strength = SUM(Players.Strength) / 11
Now we want to split our side in two and work out the average for our defensive players and our offensive players.
Defense.Strength = SUM(Defensive_Players.Strength)/Defensive_Players.Count
Offense.Strength = SUM(Offense_Players.Strength)/Offense_Players.Count
Now we have three values. The first, our team average, is going to be used to calculate our odds of winning. The other two are going to calculate our odds of defending and our odds of scoring.
A team with a high offensive average is going to have more chances; a team with a high defensive average is going to have more chance of saving.
Now say we have two teams; let's call them A and B.
Team A has an average of 80, an offensive score of 85 and a defensive score of 60.
Team B has an average of 70, an offensive score of 50 and a defensive score of 80.
Now, based on the average, Team A should have a better chance of winning. But by how much?
Scoring and Saving
Let's work out how many goals Team A should score:
A.Goals = (A.Offensive / B.Defensive) + RAND()
= (85/80) + 0.8;
= 1.86
I have assumed the random value adds anything between -1 and +1, although you can adjust this.
As we can see, the formula indicates Team A should score 1.86 goals. We can either round this up or down, or give Team A 1 goal and randomly decide whether the second one is allowed.
Now for Team B
B.Goals = (B.Offensive / A.Defensive) + RAND()
= (50/60) + 0.2;
= 1.03
So we have A scoring 1 and B scoring 1. But remember, we want to weight this in A's favour because, overall, they are the better team.
So what is the chance A will win?
Chance A Will Win = (A.Average / B.Average)
= 80 / 70
= 1.14
So we can see the odds are 14% (0.14) in favor of A winning the match. We can use this value to see if there is any change in the final score:
if Rand() <= 0.14 then Final Score = A 2 - 1 B, otherwise A 1 - 1 B
If our random number was 0.8, then the match is a draw.
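Putting the walk-through together, a minimal Python sketch (the constants mirror the example above and would need tuning for your game):

import random

def simulate_match(a_avg, a_off, a_def, b_avg, b_off, b_def):
    # Expected goals: own offense vs. opposing defense, plus noise in [-1, 1].
    a_goals = max(0, int(a_off / b_def + random.uniform(-1, 1)))
    b_goals = max(0, int(b_off / a_def + random.uniform(-1, 1)))
    # Tilt the result toward the overall stronger team.
    edge = a_avg / b_avg - 1.0          # e.g. 80/70 - 1 = 0.14
    if edge > 0 and random.random() <= edge:
        a_goals += 1
    elif edge < 0 and random.random() <= -edge:
        b_goals += 1
    return a_goals, b_goals

print(simulate_match(80, 85, 60, 70, 50, 80))  # e.g. (2, 1)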
Rounding Up and Further Thoughts
You will definitely want to play around with the values. Remember, game mechanics are very hard to get right. Talk to your players, ask them why they are dissatisfied. Are there teams always losing? Are the simulations always stagnant? etc.
The above outline is deeply affected by the randomness of the selection. You will want to normalise it so that the chance of a team scoring an extra 5 goals is very, very rare. But a little randomness is a great way to add some variety to the game.
There are ways to extend this method as well. For example, instead of the number of goals, you could use the goal figure as the number of scoring chances, and then have another function that works out the number of goals based on other factors (i.e. choose a random striker, and use that player's individual stats, and the goalie's, to work out if there is a goal).
I hope this helps.
The most basic tactical decision in football is picking the formation, which is a set of three numbers assigning the 10 outfield players to defence, midfield and attack, respectively, e.g. 4/4/2.
If you use average player strength, you don't merely lose that tactic, you have it going backwards: the strongest defence is one with a single very good player, and giving him any help will make it more likely the other team scores. If you have one player with a rating of 10, the average is 10. Add another with rating 8, and the average drops (to 9). But assigning more people to defence should make it stronger, not weaker.
So the first thing you want is to base everything on the total, not the average. The ratio between the totals is a good scale-independent way of determining which team is stronger and by how much. Ratios tend to be better than differences, because they work in a predictable way with teams of any range of strengths. You can set up a combat results table that says how many goals are scored (per game, per half, per move, or whatever).
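Such a combat results table, keyed on the ratio of totals, might look like this Python sketch (the thresholds and goal weights are invented purely for illustration):

import random

# (minimum attack/defence ratio, relative weights for scoring 0/1/2/3 goals)
RESULTS_TABLE = [
    (1.5, [1, 3, 4, 2]),   # much stronger attack
    (1.0, [2, 4, 3, 1]),   # roughly even
    (0.0, [5, 3, 1, 0]),   # weaker attack
]

def goals_scored(attack_total, defence_total):
    ratio = attack_total / defence_total
    for threshold, weights in RESULTS_TABLE:
        if ratio >= threshold:
            return random.choices([0, 1, 2, 3], weights=weights)[0]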
The next tactical choice is whether it is better to have one exceptional player or several good ones. You can make that matter by setting up scenarios that represent things that happen in a game, e.g. a 1-on-1, a corner, or a long ball. The players involved in a scenario are first randomly chosen, then the result of the scenario is rolled for. One result can be that another scenario starts (a midfield pass leads to a cross, which leads to a header chance).
The final step, which would bring you pretty much up to the level of actual football manager games, is to give players more than one type of strength rating, e.g., heading, passing, shooting, and so on. Then you use the strength rating appropriate to the scenario they are in.
The division in your example is probably a bad idea, because it changes the scale of the output variable depending on which side is better. Generally when comparing two quantities you either want interval data (subtract one from the other) or ratio data (divide one by the other) but not both.
A better approach in this case would be to simply divide the offensive score by the defensive score. If both are equal, the result will be 1. If the attacker is better than the defender, it will be greater than 1, and if the defender is stronger, it will be less than one. These are easy numbers to work with.
Also, instead of averaging the whole team, average parts of the team depending on the formations or tactics used. This will allow teams to choose to play offensively or defensively and see the pros and cons of this.
And write yourself some better random number generation functions: one that returns floating-point values between -1 and 1 and one that works from 0 to 1, for starters. Use these in your calculations and you can avoid all those confusing 10s everywhere!
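Concretely, those two helpers and the ratio comparison could look like this sketch (the 0.5 noise scale is an arbitrary assumption to tune):

import random

def rand_signed():   # uniform float in [-1, 1)
    return random.uniform(-1.0, 1.0)

def rand_unit():     # uniform float in [0, 1)
    return random.random()

def attack_succeeds(offense, defense, tactics_bonus=0.0):
    # Ratio 1.0 means evenly matched; noise and tactics shift the threshold.
    return offense / defense + tactics_bonus + rand_signed() * 0.5 >= 1.0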
You might also want to ask the users what it is about the simulation they don't like. It's possible that, rather than just seeing the final outcome of the game, they want to know how many times their team had an opportunity to attack but the defense regained control. So instead of
"Your team wins 2-1"
They want to see match highlights:
"Your team wins 2-1:
- scored at minute 15,
- the other team took control and tried for a goal at minute 30,
but the shot was intercepted,
- we took control again and $PLAYER1 scored a beautiful goal!
... etc
You can use something like what Jamie suggests as a starting point, choose the times at random, and maybe pick who scored the goal based on a weighted sampling of the offensive players (i.e. a player with a higher score gets a higher chance of being the one who scored). You can have fun and add random low-probability events like a red card on a player, someone injuring themselves, streakers across the field...
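For the weighted pick of the scorer, Python's random.choices accepts per-player weights directly; a minimal sketch with made-up players and ratings:

import random

offensive_players = {"Adams": 88, "Berg": 75, "Costa": 62}  # name -> strength

def pick_scorer(players):
    # Scoring probability is proportional to the player's strength rating.
    names = list(players)
    return random.choices(names, weights=[players[n] for n in names])[0]

print(pick_scorer(offensive_players))  # "Adams" most often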
The divisor for the average should be the number of players... dividing by the max instead means that if you have 3-player teams:
[4 4 4]
[7 4 1]
The second one would be considered weaker. Is that what you want? I think you would rather do something like:
(Total Score / Total Players) + (Max Score / Total Players), so in the above example it would make the second team slightly better.
I guess it depends on how you feel the teams should be balanced.