Distance between deterministic policies that are not probability distributions

This question asks whether there is a way to measure the distance between policies that are probability distributions.
In the case of continuous control with deterministic policies that take a state as input and return an action vector, what would be the best method to measure how close two policies are to each other?
A naive approach that came to my mind would be to:
Run both policies A and B to produce a trajectory each and record all states visited.
For each state encountered by policy A, ask policy B which action it would take (and vice-versa). Hence we would have, for every state encountered, both A and B action vectors.
For each state, compare action vectors of A and B by using a common distance (Euclidean distance?)
Take the average (maybe maximum) of those distances.
Does it make any sense from a theoretical point of view?
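For concreteness, here is a minimal sketch of this naive approach (the `env`, `policy_a`, and `policy_b` names are hypothetical placeholders; a policy is any function mapping a state array to an action array, and a gym-style `env.step` returning (state, reward, done, info) is assumed):

```python
import numpy as np

def rollout_states(env, policy, max_steps=1000):
    """Run one episode with a deterministic policy and record every state visited."""
    states = []
    state = env.reset()
    for _ in range(max_steps):
        states.append(state)
        state, _, done, _ = env.step(policy(state))
        if done:
            break
    return states

def policy_distance(env, policy_a, policy_b, aggregate=np.mean):
    """Average (or max, via aggregate=np.max) Euclidean distance between the
    actions of the two policies over the states visited by either of them."""
    states = rollout_states(env, policy_a) + rollout_states(env, policy_b)
    gaps = [np.linalg.norm(np.asarray(policy_a(s)) - np.asarray(policy_b(s)))
            for s in states]
    return aggregate(gaps)
```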

Related

Is PhysX a good match for running lots of similar, short simulations?

I want to use a simplified model of the human body plus some rigid attachments in the prediction portion of an Unscented Kalman Filter. In other words, I will have a few thousand candidate sets of parameters (joint positions, velocities, muscle tensions, etc.), and I will simulate one short time step with each. Then I will take the resulting parameters at the end of the time step and do some linear algebra after adding some information from my sensors. The algebra will generate a new group of parameter sets, allowing me to repeat the process.
The candidate parameter sets will be similar to one another. (They will be points on the surface of a hyperellipsoid aligned with its axes, plus the hyperellipsoid's centroid. Or, to put it another way, they will be the mean and the mean +/- N standard deviations of a high-dimensional Gaussian.) But they won't have any other relation to one another.
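For what it's worth, a minimal sketch of how such candidate sets (the sigma points of the unscented transform) are usually generated; the scaling constant `lam` and the function name are illustrative, not part of the question:

```python
import numpy as np

def sigma_points(mean, cov, lam=1.0):
    """Return 2n+1 sigma points: the mean, plus the mean offset along each
    column of a scaled matrix square root of the covariance."""
    n = mean.size
    sqrt_cov = np.linalg.cholesky((n + lam) * cov)  # lower-triangular square root
    points = [mean]
    for i in range(n):
        points.append(mean + sqrt_cov[:, i])
        points.append(mean - sqrt_cov[:, i])
    return np.array(points)  # each row is one candidate set to push through the physics step
```

Each of those rows would then be stepped forward independently in the physics engine, which is the workload the question is asking about.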
I'm thinking of using PhysX, but after reading the introductory docs, I can't tell whether it will be a good fit for my problem. Is the simulation portion above an appropriate workload for PhysX?

Output range for continuous control policy network

I tried to implement the simple vanilla policy gradient (REINFORCE) on a continuous control problem by adapting this PyTorch implementation to the continuous case, and I stumbled upon the following issue.
Usually, when the action space is discrete, the output of the policy network is bounded in (0,1)^n by the softmax function, which gives the probability that the agent picks a certain action given the state (the input to the network). However, when the action space is continuous, for example if we have K actions such that each action a_k has lower and upper bounds l_k and u_k, I haven't found a way (empirical or theoretical) to limit the output of the network (which is usually the means and standard deviations of the action probability distribution given the state) using l_k and u_k.
From the few trials I made, without constraining the output of the policy network it was very hard, if not impossible, to learn a good policy, but I might be doing something wrong since I am new to reinforcement learning.
My intuition suggests limiting the means and standard deviations output by the policy network using, for example, a sigmoid, and then scaling them by the absolute difference between l_k and u_k. I'm not quite sure how to do this properly, though, especially since the sampled action could exceed whatever bound you impose on the distribution parameters when using, for example, a Gaussian distribution.
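To illustrate that intuition (this is a sketch, not an established recipe): squash the raw network outputs with a sigmoid, scale them into [l_k, u_k], and clamp whatever you sample, assuming a PyTorch setup with per-action bound tensors `low` and `high`:

```python
import torch

def bounded_gaussian_action(raw_mean, raw_std, low, high):
    """Map unconstrained network outputs to a Gaussian whose mean lies in
    [low, high], then sample and clamp the action to the same bounds."""
    span = high - low
    mean = low + span * torch.sigmoid(raw_mean)   # mean squashed into [low, high]
    std = span * torch.sigmoid(raw_std) + 1e-6    # std scaled to the action range
    dist = torch.distributions.Normal(mean, std)
    action = dist.sample()
    log_prob = dist.log_prob(action).sum()        # what REINFORCE would use
    # A Gaussian sample can still land outside the bounds, so clamp before use.
    action = torch.min(torch.max(action, low), high)
    return action, log_prob
```

Another common trick in the literature is to sample from an unbounded Gaussian and squash the sample itself with tanh (rescaling to the bounds afterwards), which avoids the hard clamp at the cost of a correction term in the log-probability.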
Am I missing something? Are there established ways to limit the output of the policy network for continuous action spaces, or is there no need to do that at all?
I am not sure this is the right place for this question; if not, I will be glad if you point me to a better place.

How do Markov Chains work and what is memorylessness?

How do Markov chains work? I have read the Wikipedia article on Markov chains, but the thing I don't get is memorylessness. Memorylessness states that:
The next state depends only on the current state and not on the sequence of events that preceded it.
If a Markov chain has this kind of property, then what is the use of the chain in a Markov model?
Can you explain this property?
You can visualize Markov chains like a frog hopping from lily pad to lily pad on a pond. The frog does not remember which lily pad(s) it has previously visited. It also has a given probability for leaping from lily pad Ai to lily pad Aj, for all possible combinations of i and j. The Markov chain allows you to calculate the probability of the frog being on a certain lily pad at any given moment.
If the frog was a vegetarian and nibbled on the lily pad each time it landed on it, then the probability of it landing on lily pad Ai from lily pad Aj would also depend on how many times Ai was visited previously. Then, you would not be able to use a Markov chain to model the behavior and thus predict the location of the frog.
The idea of memorylessness is fundamental to the success of Markov chains. It does not mean that we don't care about the past. On the contrary, it means that we retain only the most relevant information from the past for predicting the future and use that information to define the present.
This nice article provides a good background on the subject
http://www.americanscientist.org/issues/pub/first-links-in-the-markov-chain
There is a trade-off between the accuracy of your description of the past and the size of the associated state space. Say there are three pubs in the neighborhood and every evening you choose one. If you choose those pubs at random, this is not a Markov chain (or only a trivial, zero-order one): each evening's choice is an independent random variable (modeling dependence was fundamental to the ideas underlying Markov chains).
In your choice of pubs you can factor in your last choice, i.e., which pub you went to the night before. For example, you might want to avoid going to the same pub two days in a row. While in reality this implies remembering where you were yesterday (and thus remembering the past!), at your modeling level your unit of time is one day, so your current state is the pub you went to yesterday. This is your classical (first-order) Markov model with three states and a 3 by 3 transition matrix that provides the conditional probability for each ordered pair of pubs (if yesterday you went to pub I, what is the chance that today you "hop" to pub J).
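A tiny sketch of this first-order model (the transition probabilities are made-up numbers, purely to show the mechanics):

```python
import numpy as np

# Row-stochastic matrix: P[i, j] = probability of going to pub j tonight
# given that last night's pub was i. Numbers are invented; the small
# diagonal reflects a reluctance to visit the same pub two days in a row.
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.1, 0.4],
              [0.4, 0.5, 0.1]])

dist = np.array([1.0, 0.0, 0.0])      # we know last night was pub 0
for night in range(1, 4):
    dist = dist @ P                   # propagate the distribution one night forward
    print(f"night {night}: {dist}")
```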
However, you can define a model in which you "remember" the last two days. In this second-order Markov model the "present" state includes the pub from last night and the pub from the night before. Now you have 9 possible states describing your present, and therefore a 9 by 9 transition matrix. Thankfully, this matrix is not fully populated.
To understand why, consider a slightly different setup in which you are so well organized that you make firm plans for your pub choices for both today and tomorrow based on the last two visits. Then you can select any possible pair of pubs for the next two days. The result is a fully populated 9 by 9 matrix that maps your choices for the last two days into those for the next two days. However, in our original problem we make the decision every day, so our future state is constrained by what happened today: at the next time step (tomorrow) today becomes yesterday, but it is still part of the "present" state at that time step, and relevant to what happens the following day. The situation is analogous to moving averages, or receding-horizon procedures. As a result, from a given state you can only move to three possible states (one for each pub you might choose next), which means that each row of your transition matrix has only three non-zero entries.
Let us tally up the number of parameters that characterize each problem: the zero-order Markov model with three states has two independent parameters (the probabilities of hitting the first and the second pub, since the probability of visiting the third pub is the complement of the first two). The first-order Markov model has a fully populated 3 by 3 matrix with each row summing to one (again indicating that one of the pubs will always be visited on any given day), so we end up with six independent parameters. The second-order Markov model has a 9 by 9 matrix with each row having only three non-zero entries that sum to one, so we have 18 independent parameters.
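And a sketch of the second-order bookkeeping (the conditional probabilities are again invented), just to show where the 9 states, the sparsity, and the 18 free parameters come from:

```python
import numpy as np
from itertools import product

pubs = [0, 1, 2]
states = list(product(pubs, pubs))       # (night before last, last night): 9 states

# cond[s] = made-up probabilities of tonight's pub given the last two nights.
rng = np.random.default_rng(0)
cond = {s: rng.dirichlet(np.ones(3)) for s in states}

# Build the 9 by 9 matrix between (night before last, last night) pairs.
P = np.zeros((9, 9))
for si, (before, last) in enumerate(states):
    for tonight in pubs:
        sj = states.index((last, tonight))   # next state is (last night, tonight)
        P[si, sj] = cond[(before, last)][tonight]

print((P > 0).sum(axis=1))   # 3 non-zero entries per row
print(P.sum(axis=1))         # each row sums to one
# Free parameters: 9 states x (3 - 1) = 18.
```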
We can continue defining higher-order models, and our state space will grow accordingly.
Importantly, we can further refine the concept by identifying important features of the past and using only those features to define the present, i.e., compressing the information about the past. This is what I referred to at the beginning. For example, instead of remembering the entire history we can keep track of only a few memorable events that impact our choice, and use these "sufficient statistics" to construct the model.
It all boils down to the way you define the relevant variables (the state space); the Markov concepts then follow naturally from the underlying fundamental mathematics. First-order (linear) relationships (and the associated linear algebra operations) are at the core of most current mathematical applications. You can look at an n-th order equation in a single variable, or you can define an equivalent first-order (linear) system of n equations by introducing auxiliary variables. Similarly, in classical mechanics you can either use the second-order Lagrange equations or choose canonical coordinates that lead to the (first-order) Hamiltonian formulation: http://en.wikipedia.org/wiki/Hamiltonian_mechanics
Finally, a note on steady-state vs. transient solutions of Markov problems. An overwhelming number of practical applications (e.g., PageRank) rely on finding steady-state solutions. Indeed, the existence of such convergence to a steady state was A. Markov's original motivation for creating his chains, in an effort to extend the central limit theorem to dependent variables. The transient behavior (such as hitting times) of Markov processes is significantly less studied and more obscure. However, it is perfectly valid to consider Markov predictions of the outcome at a specific point in the future (and not only the converged, "equilibrium" solution).
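As a small illustration of the two viewpoints, reusing the made-up 3-pub matrix from above: a transient prediction a few steps ahead versus the steady-state distribution (the eigenvector of the transposed matrix for eigenvalue 1):

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.1, 0.4],
              [0.4, 0.5, 0.1]])

# Transient: distribution after 5 nights, starting from pub 0.
dist = np.array([1.0, 0.0, 0.0])
for _ in range(5):
    dist = dist @ P
print("after 5 nights:", dist)

# Steady state: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()
print("steady state:", pi)
```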

Process for comparing two datasets

I have two datasets at a time (in the form of vectors), and I plot them on the same axes to see how they relate to each other, specifically noting and looking for places where both graphs have a similar shape (i.e., places where both have a seemingly positive/negative gradient over approximately the same intervals). Example:
So far I have been working through the data graphically, but since the amount of data is so large, plotting the sets every time I want to check how they correlate will take far too much time.
Are there any ideas, scripts, or functions that might be useful to automate this process somewhat?
The first thing you have to think about is the nature of the criteria you want to use to establish similarity. There is a wide variety of ways to measure similarity, and the more precisely you can describe what "similar" means in your problem, the easier it will be to implement, regardless of the programming language.
Having said that, here are some of the things you could look at:
correlation of the two datasets
difference of the derivatives of the datasets (but I don't think this would be robust enough)
spectral analysis, as mentioned by #thron of three
etc. ...
Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.
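For instance, a small numpy sketch of the first two ideas (assuming the two datasets are equal-length 1-D arrays `a` and `b`; the window size is a placeholder to tune):

```python
import numpy as np

def global_correlation(a, b):
    """Pearson correlation of the whole series."""
    return np.corrcoef(a, b)[0, 1]

def windowed_correlation(a, b, window=50):
    """Correlation in sliding windows, to localize where the two series
    move together instead of summarizing them with a single number."""
    return np.array([np.corrcoef(a[i:i + window], b[i:i + window])[0, 1]
                     for i in range(len(a) - window + 1)])

def gradient_difference(a, b):
    """Pointwise absolute difference of the numerical derivatives."""
    return np.abs(np.gradient(a) - np.gradient(b))
```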
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data, either with a simple averaging filter (Matlab 'smooth') or with the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace).
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
Note: a) You could do the smoothing after step 3 rather than in step 1. b) Regarding 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', with the vectors descending to baseline before ascending in the next hill. Then 5) would misidentify the hill as coming between the initial descent and the subsequent ascent. To avoid this, you could also require that the points in A and B between the two points of velocity similarity have high absolute values. (A rough numpy translation of these steps is sketched below.)
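A rough numpy translation of steps 1-4 (the smoothing window and threshold are placeholders to tune by eye, as noted above; step 5 is left as a comment since it depends on inspecting C):

```python
import numpy as np

def similar_velocity_candidates(a, b, window=11, threshold=1.0):
    """Follow steps 1-4: smooth A and B, differentiate, add, and threshold |C|."""
    kernel = np.ones(window) / window
    a_s = np.convolve(a, kernel, mode="same")     # 1) smooth
    b_s = np.convolve(b, kernel, mode="same")
    c = np.diff(a_s) + np.diff(b_s)               # 2) differentiate, 3) add into C
    hits = np.flatnonzero(np.abs(c) > threshold)  # 4) indices with similar velocity
    # 5) Stretches lying between a strongly positive and a strongly negative
    #    value of C (or vice versa) are the candidate regions where A and B
    #    rise and fall together.
    return c, hits
```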

Why can't we apply Dijkstra's algorithm for a graph with negative weights?

Why can't we apply Dijkstra's algorithm for a graph with negative weights?
What does it mean to find the least expensive path from A to B, if every time you travel from C to D you get paid?
If there is a negative weight between two nodes, the "shortest path" is to loop backwards and forwards between those two nodes forever. The more hops, the "shorter" the path gets.
This has nothing to do with the algorithm, and everything to do with the impossibility of answering such a question.
Edit:
The above claim assumes bidirectional links. If there are no cycles with an overall negative weight, you do not have a way to loop around forever, being paid.
In such a case, Dijkstra's algorithm may still fail:
Consider two paths:
an optimal path that racks up a cost of 100, before crossing the final edge which has a -25 weight, giving a total of 75, and
a suboptimal path that has no negatively-weighted edges with a total cost of 90.
Dijkstra's algorithm will investigate the suboptimal path first, and will declare itself finished when it finds it. It will never follow up on the partial path whose cost is already worse than the first solution found.
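A small sketch of that scenario: a hypothetical three-node graph built from the 100, -25, and 90 costs above, fed to a textbook Dijkstra that finalizes each node as soon as it is popped:

```python
import heapq

def dijkstra(graph, source):
    """Textbook Dijkstra: a node's distance is final once it is popped."""
    dist = {source: 0}
    done = set()
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)                                  # u is finalized here
        for v, w in graph.get(u, []):
            if v not in done and d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# S -> A -> T costs 100 + (-25) = 75, but the direct edge S -> T costs 90.
graph = {"S": [("A", 100), ("T", 90)], "A": [("T", -25)]}
print(dijkstra(graph, "S")["T"])   # prints 90: T is finalized before A is expanded
```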
I will give you a counterexample. Consider the following graph:
http://img853.imageshack.us/img853/7318/3fk8.png
Suppose you begin at vertex A and you want the shortest path to D. Dijkstra's algorithm would perform the following steps:
Mark A as visited and add vertices B and C to the queue.
Fetch the vertex with minimal distance from the queue. It is B.
Mark B as visited and add vertex D to the queue.
Fetch from the queue. Now it is vertex D.
Mark D as visited.
Dijkstra says the shortest path from A to D has length 2, but it is obviously not true.
Imagine you had a directed graph with a directed cycle in it, and the total "distance" around that cycle was a negative weight. If on your way from the Start to the End vertex you could pass through that directed cycle, you could simply go around and around it an arbitrary number of times.
And that means you could make your path across the graph have an infinitely negative distance (or effectively so).
However, as long as there are no such directed cycles in your graph, you could get away with using Dijkstra's algorithm without anything exploding on you.
All that being said, if you have a graph with negative weights, you could use the Bellman-Ford algorithm. Because of the generality of this algorithm, however, it is a bit slower: Bellman-Ford takes O(V·E) time, whereas Dijkstra's takes O(E + V log V).
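For comparison, a minimal Bellman-Ford sketch (same hypothetical graph as above, written as an edge list) that returns the correct 75 and can also detect a reachable negative cycle:

```python
def bellman_ford(nodes, edges, source):
    """edges: list of (u, v, w). Relax every edge |V|-1 times; if a further
    pass still improves a distance, a negative-weight cycle is reachable."""
    dist = {n: float("inf") for n in nodes}
    dist[source] = 0
    for _ in range(len(nodes) - 1):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a reachable negative-weight cycle")
    return dist

nodes = ["S", "A", "T"]
edges = [("S", "A", 100), ("S", "T", 90), ("A", "T", -25)]
print(bellman_ford(nodes, edges, "S")["T"])   # 75, the true shortest distance
```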