AlphaGo Zero board evaluation function uses multiple time steps as an input... Why? - neural-network

According to AlphaGo Cheat Sheet, AlphaGo Zero uses a sequence of consecutive board configurations to encode its game state.
In theory, all the necessary information is contained in the latest state, and yet they include the previous 7 configurations.
Why did they choose to inject so much complexity ?
What are they listening for ??
AlphaGoZero

The sole reason is because in all games - Go, Chess, and Shogi - there is a repetition rule. What this means is that the game is not fully observable from the current board position. In other words, there may be two identical positions with two very different evaluations. For example in one Go position there may be a winning move, but in an identical Go position that move is either illegal or one of the next few moves in the would-be-winning continuation creates an illegal position.
You could try feeding in only the current board position and handling repetitions in the tree only. But I think this would be weaker because the evaluation function would be wrong in some cases, leading to a horizon effect if that branch of the tree had not been explored deeply enough to correct the problem.

Related

Is there a way to record when (and how often) Transporters get within a certain distance of each other?

I have an AnyLogic simulation model using trucks and forklifts as agents (Transporter type), and among other things would like to identify each time one of them becomes within a certain distance of another one (example within 5 meters). I will record this count as a variable within the main space of the model. I am using path-guided navigation.
I have seen a method "agentsInRange" which will probably be able to do the trick, but am not sure where to call this from. I assume I should be able to use the AL functionality of "Min distance to obstacle" (TransporterFleet) and "Collision detection timeout" (TransporterControl)?
Thanks in advance!
Since there don't seem to be pre-built functions for this, afaik, the easiest way is to:
add an int variable to your transporter agent type counter
add an event to your transporter type checkCollision that triggers every second or so
in the event, loop across the entire population of transporters and count the number that are closer than X meters (use distanceTo(otherTransporter) and write your own custom code)
add that number to counter
Note that this might be very inefficient computationally as it is quite brute-force. But might be good enough :)

How to make simulated electric components behave nicely?

I'm making a simple electric circuit simulator. It will (at least initially) only feature batteries, wires and resistors in series and parallel. However, I'm at a loss how best to simulate said circuit in a good way.
Specifically, I will have batteries and resistors with two contact points each, and wires that go between two contact points. I assume that each component will have a field for its resistance, the current through it and the voltage across it (current and voltage will, of course, be signed). Each component is given a resistance, and the batteries are given a voltage. The goal of the simulation is to assign correct values to all the other fields in real time as the player connects and disconnects components and wires.
These are the requirements:
It must be correct, including Ohm's and Kirchhoff's laws (I'm modeling real world circuits, and there is little point if the model does something completely different)
It must be numerically stable (we can't have uncontrolled oscillations or something just because two neighbouring resistors can't make up their minds together)
It should stabilize relatively quickly for, let's say, fewer than 30 components (having to wait a few seconds before the values are correct doesn't really satisfy "real time", but I really don't plan on using it for more than 10 or maybe 20 components)
The optimal formulation for me (how I envision this in my head) would be if I could assign a script to each component that took care of that component only, possibly by communicating field values with neighbouring components, and each component script works in parallel and adjusts as is needed
I only see problems here and no solutions. The biggest problem, I think, is Kirchhoff's voltage law (going around any sub-circuit, the voltage across all components, including signs, add up to 0), because that's a global law (it says somehting about a whole circuit and not just a single component / connection point). There is a mathematical reformulation saying that there exists a potential function on the points in the circuit (for instance, the voltage measured against the + pole of the battery), which is a bit more local, but I still don't see how to let a component know how much the voltage / potential drops across it.
Kirchhoff's current law (the net current flow into an intersection is 0) might also be trouble. It seems to force me to make intersections into separate objects to enforce it. I originally thought that I could just let each component have two lists (a left list and a right list) containing every other component that is connected to it at that point, but that might not make KCL easily enforcable.
I know there are circuit simulators out there, and they must have solved this exact problem somehow. I just can't find an explanation because if I try googling it, I only find the already made simulators and no explanations anywhere.

Implementing reinforcement learning in NetLogo (Learning in multi-agent models)

I am thinking to implement a learning strategy for different types of agents in my model. To be honest, I still do not know what kind of questions should I ask first or where to start.
I have two types of agents which I want them to learn by experience, they have a pool of actions which each has different reward based on specific situations that might happen.
I am new to reinforcement Learning methods, therefore any suggestions on what kind of questions should I ask myself is welcomed :)
Here is how I am going forward to formulate my problem:
Agents have a lifetime and they keep track of a few things that matter for them and these indicators are different for different agents, for example, one agent wants to increase A another wants B more than A.
States are points in an agent's lifetime which they
Have more than one option (I do not have a clear definition for
States as they might happen a few times or not happen at all because
Agents move around and they might never face a situation)
The reward is the an increase or decrease in an indicator that agents can get from an action in a specific State, and agent do not know what would be the gain if he chose another action.
The gain is not constant, the states are not well defined and there is no formal transition of one state into another,
For example agent can decide to share with one of the co-located agent (Action 1) or with all of the agents at the same location(Action 2) If certain conditions hold true Action A will be more rewarding for that agent, while in other conditions Action 2 will have higher reward; my problem is I did not see any example with unknown rewards since sharing in this scenario also depends on the other agent's characteristics (which affects the conditions of reward system) and in different states it will be different.
In my model there is no relationship between the action and the following state,and that makes me wonder if its ok to think about RL in this situation at all.
What I am looking to optimize here is the ability for my agents to reason about current situation in a better way and not only respond to their need which is triggered by their internal states. They have a few personalities which can define their long term goal and can affect their decision making in different situations, but I want them to remember what action in a situation helped them to increase their preferred long term goal.
In my model there is no relationship between the action and the following state,and that makes me wonder if its ok to think about RL in this situation at all.
This seems strange. What do actions do if not change state? Note that agents don't have to necessarily know how their actions will change their state. Similarly, actions could change the state imperfectly (a robots treads could skid out so it doesn't actually move when it tries to). In fact, some algorithms are specifically designed for this uncertainty.
In any case, even if the agents are moving around the state space without having any control, it can still learn the rewards for the different states. Indeed, many RL algorithms involve moving around the state space semi-randomly to figure out what the rewards are.
I do not have a clear definition for States as they might happen a few times or not happen at all because Agents move around and they might never face a situation
You might consider expanding what goes into what you consider to be a "state". For instance, the position seems like it should definitely go into the variables identifying a state. Not all states need to have rewards (although good RL algorithms typically infer a measure of goodness of neutral states).
I would recommend clearly defining the variables that determine an agent's state. For instance, the state space could be current-patch X internal-variable-value X other-agents-present. In the simplest case, the agent can observe all of the variables that make up their state. However, there are algorithms that don't require this. An agent should always be in a state, even if the state has no reward value.
Now, concerning unknown reward. That's actually totally okay. Reward can be a random variable. In that case, a simple way to apply standard RL algorithms would be to use the expected value of the variable when making decisions. If the distribution is unknown, then the algorithm could just use the mean of the rewards observed so far.
Alternatively, you could include the variables that determine the reward in the definition of the state. That way, if the reward changes, then it is literally in a different state. For example, suppose a robot is on top of a building. It needs to get to the top of the building in front of it. If it just moves forward, it falls to ground. Thus, that state has a very low reward. However, if it first places a plank that goes from one building to the other, and then moves forward, the reward changes. To represent this, we could include plank-in-place as a variable so that putting the board in place actually changes the robot's current state and the state that would result from moving forward. Thus, the reward itself has not changed; it's just in a different state.
Hopefully this helps!
UPDATE 2/7/2018: A recent upvote reminded me of the existence of this question. In the years since it was asked, I've actually dived into RL in NetLogo to a much greater extent. In particular, I've made a python extension for NetLogo, primarily to make it easier to integrate machine learning algorithms in with model. One of the demos of the extension trains a collection of agents using deep Q-learning as the model runs.

How is the Conway's Game of Life implemented in Conway's Game of Life?

I just came across http://www.youtube.com/watch?feature=player_embedded&v=xP5-iIeKXE8 , which is an implementation of Conway's Game of Life in ... Conway's Game of Life.
I think it is possible to do this in theory because the Game of Life is Turing complete, but how is it implemented in this case?
jwz has a recent blog entry that talks about this construction: Turtles, all the way down. Or gliders. Or glider turtles. This quote about the "Outer Totalistic Cellular Automata Meta-Pixel" pretty much says it all, in terms that I certainly don't understand:
The rule is encoded in two columns, each of nine eaters, where one column corresponds to the 'Birth' rule and the other corresponds to 'Survival'. The nine eaters correspond to the nine different quantities of on cells (0 through 8). The presence or absence of the eater indicates whether the cell should be on in the next meta-generation. The state of the eater is read by the collision of two antiparallel LWSSes, which radiates two antiparallel gliders (not unlike an electron-positron reaction in a PET scanner). These gliders then collide into beehives, which are restored by a passing LWSS in Brice's elegant honeybit reaction. If the eater is present, the beehive would remain in its original state, thereby allowing the LWSS to pass unaffected; if the eater is absent, the beehive would be restored, consuming the LWSS in the process. Equivalently, the state of the eater is mapped onto the state of the LWSS.

In FSMs does one State last one clock cycle or more?

Need to design a simple one for school.
More specifically a Moore FSM. Im not sure how state transitions happen, is it next state each clock?
I need to know because im wondering if i can shift a register and add a value to it, all in the same state... Could use wave edges?
EDIT:
I have to design the ALU part with registers as a schematic from gate-level, so no target CPU.
I made the algorith diagram, then put states to function blocks according Moore FSM rules. each block of operations gets one state.
For instance in a state S1, i have the following operations: y0 = shift Reg1 left; y1 = Reg1 = Reg1 + Reg2. So the microcommand that the control part of Moore FSM outputs would be 0000011 (yn...y1,y0). this microcommand should be the input to the ALU part which i need to design. Now i realized y1,y0 will conflict eachother, since both are using Reg1.
Its problematic since I dont actually have the Control part, I have to imagine the core FSM and design only the ALU with registers. This is why i was wondering if i get more than one clock cycle, so i can sequence y0,y1 or do i have to complete the entire operation in one clock?
I plan on making parallel-in, parallel-out non-shift registers, obviously i cant do the two operations of the microcommand at the same time. So what can i do:
1. make extra states? which i really dont want to do
2.use edges of a single clock? (might cause problems?)
3.Assume i get a preset amount of ticks from the clock to complete the microcommand ?
This would make the most sense, but i dont know if its the case.
The control unit does "know" the algorithm and thus how many operations need to be performed
I have to note again, that the control part is totally abstract and i have no idea how this is handled in practice.
A FSM itself has no inherent notion of time (although it can be defined). A Moore machine is simplified model and lacks the ability to even formally represent an ever progressing "time" (without, of course, implementing the counting entirely with states); remember, there is only a finite set of states.
In any case, time can be introduced in an implementation detail of a particular FSM and the amount of time might required to change between particular states might not be constant. (A particular FSM might also map differently to different implementations.) In the case of a clocked system it would require looking into how each "clock" is defined in the implementation; it might be leading edge, trailing edge, both, or something entirely different.
Instead of looking at the FSM here for guidance (it is just the logical progression of states), look at the opcodes (or whatever the implementation is) that the FSM represents, and how the CPU (or whatever the implementation is) in question "executes" them.
(What do the books say? ;-)