How should one set up the immediate reward in an RL program?

I want my RL agent to reach the goal as quickly as possible and, at the same time, to minimize the number of times it uses a specific resource T (which is sometimes necessary, though).
I thought of setting up the immediate rewards as -1 per step, an additional -1 if the agent uses T, and 0 if it reaches the goal.
But the additional -1 is completely arbitrary. How do I decide how much punishment the agent should get for using T?

You should use a reward function that mimics your own values. If the resource is expensive (valuable to you), then the punishment for consuming it should be harsh. The same goes for time (which is also a resource, if you think about it).
If the ratio between the two punishments (the one for time consumption and the one for resource consumption) is in accordance with how you value these resources, then the agent will act precisely in your interest. If you get it wrong (because maybe you don't know the precise cost of the resource or the precise cost of being slow), then it will strive for a pseudo-optimal solution rather than an optimal one, which in many cases is okay.
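For instance, here is a minimal sketch of such a reward function in Python (the names and the 3:1 cost ratio are placeholders to be tuned to your own valuation of time versus the resource):

    # Per-step reward built from a step cost and a resource cost.
    # COST_OF_T / COST_PER_STEP encodes how much worse using T is than
    # losing one step of time; the 3:1 ratio here is an arbitrary example.
    COST_PER_STEP = 1.0
    COST_OF_T = 3.0

    def immediate_reward(reached_goal: bool, used_T: bool) -> float:
        if reached_goal:
            return 0.0
        reward = -COST_PER_STEP   # every step costs one unit of time
        if used_T:
            reward -= COST_OF_T   # extra punishment for consuming the resource
        return reward

If the true ratio is unknown, a common workaround is to treat it as a tunable parameter and check whether the learned behaviour matches your intent.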


Does OptaPlanner have a "built-in" way to perform multi-unit score normalization?

At the moment, my problem has four metrics. Each of these measures something entirely different (each has different units, a different range, etc.) and each is weighted externally. I am using Drools for scoring.
I have only one score level (SimpleLongScore) and I have to find a way to appropriately combine the individual scores of these metrics into one long value.
The most significant problem at the moment is that the range of values for the metrics can be wildly different.
So if, for example, a move improves the score of a metric with a small possible range by, say, 10%, that could be completely dwarfed by an alternative move that improves the score of a metric with a much larger range by only 1%, because (to my knowledge) OptaPlanner only considers the actual score value rather than the possible range of values and how changes relate to that range proportionally.
So, is there a way to handle this cleanly that is already part of OptaPlanner and that I simply haven't found?
Is the only feasible solution to implement Pareto scoring? Because that seems like a hack-y nightmare.
So far I have code/math to compute the best-possible and worst-possible scores for a metric, which I access from within Drools, and then I can compute where in that range a move puts us, but this also feels quite hack-y and will cause issues with incremental scoring if we want to scale non-linearly within that range.
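Roughly, the kind of thing I mean (a simplified Python sketch with made-up metric names, bounds, and weights; the real version lives in Drools):

    # Normalize each metric onto [0, 1] using its own best/worst bounds,
    # apply the external weight, then scale to an integer so the combined
    # value fits into a single long-based score. All numbers are hypothetical.
    SCALE = 1_000_000  # fixed-point scale for the final long value

    metrics = {
        # name: (raw_value, worst_possible, best_possible, external_weight)
        "travel_distance": (420.0, 1000.0, 0.0, 3.0),
        "overtime_minutes": (35.0, 240.0, 0.0, 5.0),
    }

    def normalized(value, worst, best):
        # 0.0 at the worst possible value, 1.0 at the best possible value
        return (value - worst) / (best - worst)

    combined = sum(
        weight * normalized(value, worst, best)
        for value, worst, best, weight in metrics.values()
    )
    combined_long = round(combined * SCALE)

Any non-linear scaling within the range (the part that worries me for incremental scoring) would then have to live inside normalized().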
I keep coming back to thinking I should just bite the bullet and implement Pareto scoring.
Thanks!
Take a look at @ConstraintConfiguration and @ConstraintWeight in the docs.
Also take a look at the chapter "Explaining the score", which shows how to tell exactly which constraint had which score impact on the best solution found.
If, however, you need Pareto optimization (that is, multiple best solutions that don't dominate each other), then know that OptaPlanner doesn't support that yet, but I know of two cases that implemented it in OptaPlanner by hacking BestSolutionRecaller.
That being said, 99% of the cases that think of Pareto optimization are 100% happy with @ConstraintWeight instead, because users don't want multiple best solutions (except during simulations); they just want one in production.

For a Single-Cycle CPU, How Much Energy Is Required to Execute an ADD Command?

The question is exactly as specified in the title; I am wondering about this. Can any expert help?
OK, this was going to be a long answer, so long that I may write an article about it instead. Strangely enough, I've been working on experiments that are closely related to your question -- determining performance per watt for a modern processor. As Paul and Sneftel indicated, it's not really possible with any real architecture today. You can probably compute this if you look only at the execution of that instruction, given a certain silicon technology and a certain ALU design, by calculating gate leakage, switching currents, voltages, etc. But that isn't a useful value, because there is always something going on (from a HW perspective) in any processor newer than an 8086, and instructions haven't been executed in isolation since the pipeline first came into being.
Today, we have multi-function ALUs, out-of-order execution, multiple pipelines, hyperthreading, branch prediction, memory hierarchies, etc. What does this have to do with the execution of one ADD command? The energy used to execute one ADD command is different from the energy used to execute multiple ADD commands. And if you wrap a program around it, then it gets really complicated.
SORT-OF-AN-ANSWER:
So let's look at what you can do.
Statistically measure running a given add over and over again (a rough sketch of the arithmetic for this follows below). Remember that there are many different types of adds, such as integer adds, floating-point, double precision, adds with carries, and even simultaneous adds (SIMD), to name a few. Limits: OSs and other apps are always there, though you may be able to run on bare metal if you know how; it varies with different hardware, silicon technologies, architectures, etc.; it is probably not useful because it is so far from reality that it means little; limits of measurement equipment (in-processor PMUs, wall-power meters, interposer sockets, etc.); the memory hierarchy; and more.
Statistically measure an integer/floating-point/double-based workload kernel. This is beginning to have some meaning because it means something to the community. Limits: still not real; still varies with architecture, silicon technology, hardware, etc.; measurement-equipment limits; etc.
Statistically measure a real application. Limits: same as above, but at least it means something to the community; power states come into play during periods of idle; potentially cluster issues come into play.
When I say "Limits", that just means you need to well define the constraints of your answer / experiment, not that it isn't useful.
SUMMARY: it is possible to come up with a value for one add, but it doesn't really mean anything anymore. A value that means anything is way more complicated to obtain, but it is useful and requires a lot of work to find.
By the way, I do think it is a good and important question -- in part because it is so deceptively simple.

Implementing reinforcement learning in NetLogo (Learning in multi-agent models)

I am thinking of implementing a learning strategy for different types of agents in my model. To be honest, I still do not know what kind of questions I should ask first or where to start.
I have two types of agents that I want to learn by experience; they have a pool of actions, each of which has a different reward depending on the specific situations that might happen.
I am new to reinforcement learning methods, so any suggestions on what kind of questions I should ask myself are welcome :)
Here is how I am going about formulating my problem:
Agents have a lifetime and keep track of a few things that matter to them. These indicators differ between agents; for example, one agent wants to increase A, another wants B more than A.
States are points in an agent's lifetime at which they have more than one option. (I do not have a clear definition for states, as they might happen a few times or not at all, because agents move around and might never face a given situation.)
The reward is an increase or decrease in an indicator that an agent can get from an action in a specific state, and the agent does not know what the gain would have been had it chosen another action.
The gain is not constant, the states are not well defined, and there is no formal transition of one state into another.
For example, an agent can decide to share with one of the co-located agents (Action 1) or with all of the agents at the same location (Action 2). If certain conditions hold true, Action 1 will be more rewarding for that agent, while in other conditions Action 2 will have the higher reward. My problem is that I have not seen any example with unknown rewards, since sharing in this scenario also depends on the other agent's characteristics (which affect the conditions of the reward system), and in different states it will be different.
In my model there is no relationship between the action and the following state, and that makes me wonder whether it is OK to think about RL in this situation at all.
What I am looking to optimize here is the ability of my agents to reason about the current situation in a better way, and not only respond to the needs triggered by their internal states. They have a few personality traits that define their long-term goals and affect their decision making in different situations, but I want them to remember which action in a situation helped them to increase their preferred long-term goal.
In my model there is no relationship between the action and the following state, and that makes me wonder whether it is OK to think about RL in this situation at all.
This seems strange. What do actions do, if not change state? Note that agents don't necessarily have to know how their actions will change their state. Similarly, actions could change the state imperfectly (a robot's treads could skid out so that it doesn't actually move when it tries to). In fact, some algorithms are specifically designed for this uncertainty.
In any case, even if the agents are moving around the state space without having any control, they can still learn the rewards for the different states. Indeed, many RL algorithms involve moving around the state space semi-randomly to figure out what the rewards are.
I do not have a clear definition for states as they might happen a few times or not happen at all because agents move around and they might never face a situation
You might consider expanding what goes into what you consider to be a "state". For instance, the position seems like it should definitely go into the variables identifying a state. Not all states need to have rewards (although good RL algorithms typically infer a measure of goodness of neutral states).
I would recommend clearly defining the variables that determine an agent's state. For instance, the state space could be current-patch X internal-variable-value X other-agents-present. In the simplest case, the agent can observe all of the variables that make up their state. However, there are algorithms that don't require this. An agent should always be in a state, even if the state has no reward value.
Now, concerning unknown reward. That's actually totally okay. Reward can be a random variable. In that case, a simple way to apply standard RL algorithms would be to use the expected value of the variable when making decisions. If the distribution is unknown, then the algorithm could just use the mean of the rewards observed so far.
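To make that concrete, here is a minimal tabular Q-learning sketch in Python (not NetLogo; the action names, state variables, and parameter values are hypothetical). Because each update nudges the estimate toward the observed reward, a stochastic reward is handled automatically and the estimate drifts toward its mean:

    import random
    from collections import defaultdict

    ALPHA = 0.1      # learning rate: how far each observation moves the estimate
    GAMMA = 0.9      # discount factor for future reward
    EPSILON = 0.1    # exploration rate

    ACTIONS = ["share_with_one", "share_with_all"]   # hypothetical actions
    q_table = defaultdict(float)                     # (state, action) -> estimated value

    def choose_action(state):
        # epsilon-greedy: mostly exploit current estimates, sometimes explore
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: q_table[(state, a)])

    def update(state, action, reward, next_state):
        # Standard Q-learning update; with a noisy reward the estimate
        # converges toward the expected reward of (state, action).
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
        target = reward + GAMMA * best_next
        q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])

    # A state can be any hashable tuple of the variables that matter,
    # e.g. (current_patch, internal_need_level, other_agents_present).

In NetLogo itself, a similar table could be kept in an agent-owned variable (for instance via the table extension), keyed by state/action pairs.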
Alternatively, you could include the variables that determine the reward in the definition of the state. That way, if the reward changes, the agent is literally in a different state. For example, suppose a robot is on top of a building and needs to get to the top of the building in front of it. If it just moves forward, it falls to the ground; thus, that state has a very low reward. However, if it first places a plank that goes from one building to the other and then moves forward, the reward changes. To represent this, we could include plank-in-place as a variable, so that putting the board in place actually changes the robot's current state and the state that would result from moving forward. Thus, the reward itself has not changed; the robot is just in a different state.
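Continuing the same kind of sketch (names hypothetical), the plank example just means the reward-relevant flag becomes part of the state key:

    from collections import defaultdict

    q_table = defaultdict(float)

    # Same position, two different states, because the reward-determining
    # variable plank_in_place is part of the state key.
    state_no_plank = ("roof_of_building_A", ("plank_in_place", False))
    state_with_plank = ("roof_of_building_A", ("plank_in_place", True))

    # After enough experience, the estimates for the same action differ:
    # moving forward without the plank keeps earning a large negative reward,
    # moving forward with the plank keeps earning a positive one.
    q_table[(state_no_plank, "move_forward")] = -100.0
    q_table[(state_with_plank, "move_forward")] = 10.0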
Hopefully this helps!
UPDATE 2/7/2018: A recent upvote reminded me of the existence of this question. In the years since it was asked, I've actually dived into RL in NetLogo to a much greater extent. In particular, I've made a Python extension for NetLogo, primarily to make it easier to integrate machine learning algorithms with models. One of the demos of the extension trains a collection of agents using deep Q-learning as the model runs.

Controlling events in a hybrid Modelica model

I am confused by the hybrid modelling paradigm in Modelica. On the one hand, events are useful; on the other hand, they are to be avoided. Let me explain my case:
I have a large model consisting of multiple buildings in a neighborhood that is simulated over 1 year. Initially, the model ran very slowly. Adding noEvent() around as many if-conditions as possible drastically improved the speed.
As development continued, the control of the model got more complicated, and I again have many events, sometimes at very short intervals. To give an idea:
Number of (model) time events : 28170
Number of (U) time events : 0
Number of state events : 22572
Number of step events : 0
These events blow up the output (for correct post-processing I need the variables at events) and slow down the simulation. Moreover, I have the feeling that some of the noEvent(if ...) constructs lead to unexpected behavior.
I wonder whether it would be a solution to force my events at certain points in time and prohibit them in between. Ideally, I would like to trigger these 'forced events' based on certain conditions. For example: during the day they should occur every 15 minutes, at high solar radiation every minute, and during the night I don't want any events at all.
Is this a good idea? I guess it will be faster, as many of the state events will become time events. How can this be done with Modelica 3.2 (in Dymola)?
Thanks in advance for all answers.
Roel
A few comments.
First, if you have a simulation with lots of events (relative to the total duration of the simulation), the first thing I would encourage you to do is use a lower-order integrator. The point here is that higher-order integrators normally allow you to take longer time steps. But if those steps are constantly truncated by events, they just end up being really expensive.
Second, you could try a fixed-step integrator. Depending on the tool, it may implement this kind of "pool events and fire them all at once" approach in the context of fixed-time-step integration. But the specification doesn't really say anything about how tools should deal with events that occur between fixed time steps.
Third, another way to approach this would be to "pool" your events yourself. The simplest way I could imagine doing this would be to take all the statements that currently generate events and wrap them in a "when sample(...,...) then" statement. This way, you could make sure that the events are only triggered at specific intervals. This would be more portable than the fixed-time-step approach. I think this is what you were actually proposing in your question, but it is important to point out that it should not be based on time steps (the model has no concept of a time step) but rather on a model-specified sampling interval (which will, in practice, be completely independent of the time steps).
As you point out, using "sample(...,...)" will turn these into time events and, yes, this should be faster.

How to find the time value of an operation to optimize a new algorithm design?

My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)
On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach to estimating the processor load is OK, though, and sensible: make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative times are usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one as thorough for other platforms, except processor/instruction-set manuals, but those deal with assembly instructions.)
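As a rough sketch of how you would use such a table, here is a back-of-the-envelope weighted count in Python; the relative costs and the per-iteration operation mix are made up, so substitute numbers measured on (or documented for) your target hardware:

    # Weight each operation count by a relative cost from a table like the
    # one referenced above. All costs below are placeholders.
    RELATIVE_COST = {
        "int_add": 1.0,
        "int_mul": 3.0,
        "int_div": 20.0,
        "branch": 2.0,   # an 'if' is roughly a compare plus a possibly mispredicted branch
    }

    def estimated_cost(op_counts):
        """Unitless cost estimate for one pass of an algorithm."""
        return sum(RELATIVE_COST[op] * count for op, count in op_counts.items())

    # Comparing the naive n*(n-1) variant against the other variant's best case (n):
    n = 1000
    naive = estimated_cost({"int_add": n * (n - 1), "branch": n * (n - 1)})
    best_case = estimated_cost({"int_add": n, "branch": n})
    print(naive, best_case)

This only ranks alternatives against each other; it says nothing about absolute time on a particular chip.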
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.
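A quick way to see this is to traverse the same data with two different access patterns and time them. The sketch below is Python, where interpreter overhead hides much of the cache effect; in compiled code on the same hardware the gap is typically far larger:

    import time

    N = 2000
    matrix = [[1] * N for _ in range(N)]   # row-major list of lists

    def sum_row_major(m):
        total = 0
        for i in range(N):          # walks each inner list sequentially
            for j in range(N):
                total += m[i][j]
        return total

    def sum_column_major(m):
        total = 0
        for j in range(N):          # jumps between inner lists on every access
            for i in range(N):
                total += m[i][j]
        return total

    for f in (sum_row_major, sum_column_major):
        start = time.perf_counter()
        f(matrix)
        print(f.__name__, time.perf_counter() - start)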
Xcode has Instruments to measure the performance of each function/operation; you can simply use them.