How do Markov Chains work and what is memorylessness? - markov-chains

How do Markov chains work? I have read the Wikipedia article on Markov chains, but the one thing I don't get is memorylessness. Memorylessness states that:
The next state depends only on the current state and not on the
sequence of events that preceded it.
If a Markov chain has this property, then what is the use of the chain in a Markov model?
Could someone explain this property?

You can visualize a Markov chain as a frog hopping from lily pad to lily pad on a pond. The frog does not remember which lily pad(s) it has previously visited. It has a given probability of leaping from lily pad Ai to lily pad Aj, for every possible combination of i and j. The Markov chain allows you to calculate the probability of the frog being on a certain lily pad at any given moment.
If the frog were a vegetarian and nibbled on the lily pad each time it landed on one, then the probability of it landing on lily pad Ai from lily pad Aj would also depend on how many times Ai had been visited previously. In that case, you would not be able to use a Markov chain to model the behavior and thus predict the location of the frog.
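To make the frog picture concrete, here is a minimal sketch (assuming a hypothetical 3-pad pond with made-up hop probabilities) of how memorylessness lets you propagate the frog's location distribution one hop at a time:

```python
import numpy as np

# Hypothetical 3-lily-pad pond: P[i, j] is the probability of
# hopping from pad i to pad j. Each row sums to 1.
P = np.array([
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
    [0.5, 0.3, 0.2],
])

# Start with the frog on pad 0 for certain.
dist = np.array([1.0, 0.0, 0.0])

# Memorylessness means the distribution after each hop is just one
# matrix multiplication away -- no history of past pads is needed.
for _ in range(5):
    dist = dist @ P

print(dist)  # probability of finding the frog on each pad after 5 hops
```

The vegetarian frog breaks exactly this step: if the hop probabilities depended on past visit counts, a fixed matrix `P` could no longer describe the next hop.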

The idea of memorylessness is fundamental to the success of Markov chains. It does not mean that we don't care about the past. On the contrary, it means that we retain only the most relevant information from the past for predicting the future, and use that information to define the present.
This nice article provides a good background on the subject:
http://www.americanscientist.org/issues/pub/first-links-in-the-markov-chain
There is a trade-off between the accuracy of your description of the past and the size of the associated state space. Say there are three pubs in the neighborhood and every evening you choose one. If you choose those pubs randomly, this is not a Markov chain (or at best a trivial, zero-order one) – the outcome is random. More precisely, it is an independent random variable (modeling dependence was fundamental to the ideas underlying Markov chains).
In your choice of pubs you can factor in your last choice, i.e., which pub you went to the night before. For example, you might want to avoid going to the same pub two days in a row. While in reality this implies remembering where you were yesterday (and thus remembering the past!), at your modeling level your unit of time is one day, so your current state is the pub you went to yesterday. This is your classical (first-order) Markov model, with three states and a 3 by 3 transition matrix that provides the conditional probability for each pair of pubs (if yesterday you went to pub I, what is the chance that today you “hop” to pub J?).
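As a sketch of this first-order model (with made-up probabilities; the zero diagonal encodes "never the same pub two days in a row"):

```python
import numpy as np

# Hypothetical 3-pub transition matrix:
# T[i, j] = P(today = pub j | yesterday = pub i). Rows sum to 1.
T = np.array([
    [0.0, 0.5, 0.5],   # after pub 0, never pub 0 again today
    [0.7, 0.0, 0.3],
    [0.6, 0.4, 0.0],
])

yesterday = 0                # we went to pub 0 last night
today_probs = T[yesterday]   # conditional distribution over tonight's pub
print(today_probs)           # [0.0, 0.5, 0.5]
```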
However, you can define a model in which you “remember” the last two days. In this second-order Markov model, the “present” state includes the knowledge of the pub from last night and from the night before. Now you have 9 possible present states, and therefore a 9 by 9 transition matrix. Thankfully, this matrix is not fully populated.
To understand why, consider a slightly different setup in which you are so well-organized that you make firm plans for your pub choices for both today and tomorrow based on the last two visits. Then you can select any possible combination of pubs visited for two days in a row. The result is a fully populated 9 by 9 matrix that maps your choices for the last two days into those for the next two days. However, in our original problem we make the decision every day, so our future state is constrained by what happens today: at the next time step (tomorrow), today becomes yesterday, but it will still be part of your definition of “today” at that time step, and relevant to what happens the following day. The situation is analogous to moving averages, or to receding-horizon procedures. As a result, from a given state you can only move to three possible states (reflecting today's choice of pub), which means that each row of your transition matrix will have only three non-zero entries.
Let us tally up the number of parameters that characterize each problem: the zero-order Markov model with three states has two independent parameters (the probabilities of hitting the first and the second pub, as the probability of visiting the third pub is the complement of the first two). The first-order Markov model has a fully populated 3 by 3 matrix with each row summing to one (again, indicating that one of the pubs will always be visited on any given day), so we end up with six independent parameters. The second-order Markov model has a 9 by 9 matrix with each row having only 3 non-zero entries and summing to one, so we have 18 independent parameters.
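The parameter tally above can be checked with a small sketch: an order-k model with n states has n^k conditioning histories, each carrying n - 1 free probabilities (the last one is the complement):

```python
def markov_params(n_states, order):
    """Independent parameters of an order-k Markov model with n states.

    There are n_states**order possible histories to condition on, and each
    history's distribution over the next state has n_states - 1 free entries.
    """
    return n_states ** order * (n_states - 1)

print(markov_params(3, 0))  # 2  (zero-order: one distribution over 3 pubs)
print(markov_params(3, 1))  # 6  (first-order: 3 rows x 2 free entries)
print(markov_params(3, 2))  # 18 (second-order: 9 histories x 2 free entries)
```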
We can continue defining higher-order models, and our state space will grow accordingly.
Importantly, we can further refine the concept by identifying important features of the past and using only those features to define the present, i.e., compressing the information about the past. This is what I referred to in the beginning. For example, instead of remembering all of the history, we can keep track only of some memorable events that impact our choice, and use these “sufficient statistics” to construct the model.
It all boils down to the way you define the relevant variables (state space); the Markov concepts then follow naturally from underlying fundamental mathematical concepts. First-order (linear) relationships (and the associated linear algebra operations) are at the core of most current mathematical applications. You can look at an n-th order polynomial equation in a single variable, or you can define an equivalent first-order (linear) system of n equations by introducing auxiliary variables. Similarly, in classical mechanics you can either use the second-order Lagrange equations or choose canonical coordinates that lead to the (first-order) Hamiltonian formulation http://en.wikipedia.org/wiki/Hamiltonian_mechanics
Finally, a note on steady-state vs. transient solutions of Markov problems. The overwhelming majority of practical applications (e.g., PageRank) rely on finding steady-state solutions. Indeed, the presence of such convergence to a steady state was A. Markov's original motivation for creating his chains, in an effort to extend the application of the central limit theorem to dependent variables. The transient effects (such as hitting times) of Markov processes are significantly less studied and more obscure. However, it is perfectly valid to use a Markov model to predict outcomes at a specific point in the future (and not only the converged, “equilibrium” solution).
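The transient vs. steady-state distinction can be sketched as follows (with a made-up row-stochastic matrix; power iteration of this kind is essentially what PageRank does at scale):

```python
import numpy as np

# Hypothetical row-stochastic transition matrix.
P = np.array([
    [0.5, 0.25, 0.25],
    [0.2, 0.6,  0.2 ],
    [0.3, 0.3,  0.4 ],
])

def after_k_steps(start, k):
    """Transient solution: the distribution after exactly k steps."""
    dist = start.copy()
    for _ in range(k):
        dist = dist @ P
    return dist

# Steady-state solution: iterate until the distribution stops changing.
dist = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    nxt = dist @ P
    if np.allclose(nxt, dist, rtol=0, atol=1e-12):
        break
    dist = nxt

print(dist)                                          # stationary distribution
print(after_k_steps(np.array([0.0, 1.0, 0.0]), 3))   # a transient prediction
```

Note that the steady state here no longer depends on the starting distribution, while the transient answer does.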

Related

Is PhysX a good match for running lots of similar, short simulations?

I want to use a simplified model of the human body plus some rigid attachments in the prediction portion of an Unscented Kalman Filter. In other words, I will have a few thousand candidate sets of parameters (joint positions, velocities, muscle tensions, etc.), and I will simulate one short time step with each. Then I will take the resulting parameters at the end of the time step and do some linear algebra after adding some information from my sensors. The algebra will generate a new group of parameter sets, allowing me to repeat the process.
The elements of each candidate parameter set will be similar. (They will be points on the surface of a hyperellipsoid aligned with its axes plus the hyperellipsoid's centroid. Or, to put it another way, they will be the mean and the mean +/- N standard deviations of a high-dimensional Gaussian.) But they won't have any other relation to one another.
I'm thinking of using PhysX, but after reading the introductory docs, I can't tell whether it will be a good fit for my problem. Is the simulation portion above an appropriate workload for PhysX?

Is it better to have 1 or 10 output neurons?

Is it better to have:
1 output neuron that outputs a value between 0 and 15, which would be my ultimate value,
or
16 output neurons that each output a value between 0 and 1, representing the probability of that value?
Example: We want to find out the grade (ranging from 0 to 15) a student gets by inputting the number of hours he studied and his IQ.
TL;DR: I think your problem would be better framed as a regression task, so use one output neuron, but it is worth trying both.
I don't quite like the broadness of your question in contrast to the very specific answers, so I am going to go a little deeper and explain what the proper formulation should be.
Before we start, we should clarify the two big tasks that classical Artificial Neural Networks perform:
Classification
Regression
They are inherently very different from one another; in short, classification tries to put a label on your input (e.g., the input image shows a dog), whereas regression tries to predict a numerical value (e.g., the input data corresponds to a house with an estimated worth of US$1.5 million).
Obviously, you can see that predicting a numerical value requires (trivially) only one output value. Note, however, that this is only true for this specific example. There are other regression use cases in which you want your output to be more than a single point (0-dimensional), but instead 1D or 2D.
A common example is image colorization, which, interestingly enough, we can also frame as a classification problem. The provided link shows examples of both. In this case you would obviously have to regress (or classify) every pixel, which leads to more than one output neuron.
Now, to get to your actual question, I want to elaborate a little more on the reasoning why one-hot encoded outputs (i.e. output with as many channels as classes) are preferred for classification tasks over a single neuron.
Since we could argue that a single neuron is enough to predict the class value, we have to understand why it is problematic to get to a specific class that way.
Categorical vs Ordinal vs Interval Variables
One of the main problems is the type of your variable. In your case, there exists a clear order (15 is better than 14 is better than 13, etc.), and even an interval ordering (at least on paper), since the difference between a 15 and 13 is the same as between 14 and 12, although some scholars might argue against that ;-)
Thus, your target is an interval variable and could therefore, in theory, be used for regression. More on that later. But consider, for example, a variable that describes whether the image depicts a cat (0), dog (1), or car (2). Now, arguably, we cannot even order the values (is a car > dog, or car < dog?), nor can we say that there exists an "equal distance" between a cat and a dog (similar, since both are animals?) or a cat and a car (arguably more different from each other). Thus, it becomes really hard to interpret a single output value of the network. Say an input image results in an output of 1.4.
Does this now still correspond to a dog, or is this closer to a car? But what if the image actually depicts a car that has properties of a cat?
On the other hand, having 3 separate neurons that reflect the different probabilities of each class eliminates that problem, since each one can depict a relatively "undisturbed" probability.
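As a sketch of this point: with one neuron per class, a softmax turns the raw outputs into per-class probabilities, so there is no ambiguous "1.4" to interpret. (The logits below are made up.)

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical raw network outputs for classes (cat, dog, car).
probs = softmax([0.3, 2.1, 1.9])
print(probs)  # one probability per class; "dog-ish car" is expressible
```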
Choosing a Loss Function
The other problem is the question of how to backpropagate through the network in the previous example. Classically, classification tasks make use of cross-entropy loss (CE), whereas regression uses mean squared error (MSE) as a measure. The two are inherently different, and in particular the combination of CE and softmax leads to very convenient (and stable) gradients.
Arguably, you could apply rounding to get from 1.4 to a discrete class value (in that case, 1) and then use CE loss, but that could lead to numerical instability; MSE, on the other hand, will never give you a "clear class value", but rather a regressed estimate.
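A toy sketch of the two losses side by side (the predicted values are made up; the true class is 1):

```python
import math

# Classification head: softmax probabilities + cross-entropy against class 1.
probs = [0.1, 0.7, 0.2]          # hypothetical softmax output
ce_loss = -math.log(probs[1])    # penalizes low probability on the true class

# Regression head: a single scalar output + squared error against the index.
prediction = 1.4                 # hypothetical scalar output
mse_loss = (prediction - 1) ** 2

print(ce_loss, mse_loss)
```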
In the end, the question boils down to: do I have a classification or a regression problem? In your case, I would argue that both approaches could work reasonably well. A (classification) network might not recognize the correlation between the different output classes; i.e., a student that has a high likelihood for class 14 basically has zero probability of scoring a 3 or lower. On the other hand, regression might not be able to accurately predict the results for other reasons.
If you have the time, I would highly encourage you to try both approaches. For now, considering the interval type of your target, I would personally go with a regression task, and use rounding after you have trained your network and can make accurate predictions.
It is better to have a single neuron for each class (except in binary classification). This allows for better design in terms of expanding upon an existing design. A simple example is creating a network for recognizing digits 0 through 9, but then extending the design to hexadecimal digits 0 through F.

LSTM NN: forward propagation

I am new to neural nets, and am creating a LSTM from scratch. I have the forward propagation working...but I have a few questions about the moving pieces in forward propagation in the context of a trained model, back propagation, and memory management.
So, right now, when I run forward propagation, I stack the new columns, f_t, i_t, C_t, h_t, etc on their corresponding arrays as I accumulate previous positions for the bptt gradient calculations.
My question is 4 part:
1) How far back in time do I need to backpropagate in order to retain reasonably long-term memories? (Memory stretching back 20-40 time steps is probably what I need for my system, although I could benefit from a much longer time period -- that is just the minimum for decent performance, and I'm only shooting for the minimum right now so I can get it working.)
2) Once I consider my model "trained," is there any reason for me to keep more than the 2 time steps I need to calculate the next C and h values (where C_t is the cell state and h_t is the final output of the LSTM net)? In that case I would need multiple versions of the forward propagation function.
3) If I have limited time series data on which to train, and I want to train my model, will the performance of my model converge as I train it on the training data over and over (as versus oscillate around some maximal average performance)? And will it converge if I implement dropout?
4) How many components of the gradient do I need to consider? When I calculate the gradient of the various matrices, I get a primary contribution at time step t, and secondary contributions from time step t-1 (and the calculation recurses all the way back to t=0). In other words: does the primary contribution dominate the gradient calculation, or will the slope change due to the secondary components enough to warrant implementing the recursion as I backpropagate through time steps?
As you have observed, it depends on the dependencies in the data. But an LSTM can learn longer-term dependencies even if we backpropagate only a few time steps, as long as we do not reset the cell and hidden states.
No. Given c_t and h_t, you can determine c and h for the next time step. Since you don't need to backpropagate, you can throw away c_t (and even h_t, if you are just interested in the final LSTM output).
You might converge once you start overfitting. Using dropout will definitely help avoid that, especially along with early stopping.
There will be 2 components of the gradient for h_t - one from the current output and one from the next time step. Once you add both, you won't have to worry about any other components.
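To illustrate point 2, here is a minimal sketch of a single LSTM forward step using the standard gate equations. At inference time only h and c are carried between steps, so none of the per-step f_t, i_t, etc. arrays need to be stored. (The weight shapes and stacked-gate layout below are one common convention, not necessarily the asker's.)

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One forward LSTM step; W, U, b hold the stacked f, i, o, g parameters."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # (4n,) stacked pre-activations
    f = sigmoid(z[0:n])               # forget gate
    i = sigmoid(z[n:2 * n])           # input gate
    o = sigmoid(z[2 * n:3 * n])       # output gate
    g = np.tanh(z[3 * n:4 * n])       # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c                       # all the state carried to the next step

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in))
U = rng.standard_normal((4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for x in rng.standard_normal((10, n_in)):  # stream inputs one at a time
    h, c = lstm_step(x, h, c, W, U, b)     # no history stack needed at inference
```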

Training HMM - The amount of data required

I'm using HMMs for classification. I came across an example in the Wikipedia article on the Baum–Welch algorithm. I hope someone can help me.
The example is as follows: "Suppose we have a chicken from which we collect eggs at noon every day. Now whether or not the chicken has laid eggs for collection depends on some unknown factors that are hidden. We can however (for simplicity) assume that there are only two states that determine whether the chicken lays eggs."
Note that we have 2 different observations (N and E) and 2 states (S1 and S2) in this example.
My question here is:
How many observations/observed sequences (i.e., how much training data) do we need to best train the model? Is there any way to estimate or test the amount of training data required?
For each variable in your HMM, you need about 10 samples. Using this rule of thumb, you can easily calculate how many samples you need to construct a reliable classifier.
In your example you have two states, which results in a 2 by 2 transition matrix A = [a_00, a_01; a_10, a_11], where a_ij is the transition probability from state S_i to S_j.
Moreover, each of these states generates observations with probability p_S1 and p_S2 respectively, i.e., if we are in state S1, the chicken will lay an egg with probability p_S1 and will not with probability 1-p_S1.
In total you have 6 variables that need to be estimated. It is more or less obvious that it is not possible to estimate them accurately from only two observations. As I mentioned before, it is conventional to assume that at least 10 samples per variable are needed in order to estimate that variable accurately.
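A small sketch of this rule of thumb, counting parameters the way the answer does (all transition entries plus one free emission probability per state, with the complements excluded):

```python
def hmm_sample_estimate(n_states, n_symbols, samples_per_param=10):
    """Rough sample-size estimate via the ~10-samples-per-parameter rule."""
    n_transition = n_states * n_states       # the a_ij entries
    n_emission = n_states * (n_symbols - 1)  # p_S1, p_S2, ... (1-p is implied)
    n_params = n_transition + n_emission
    return n_params, samples_per_param * n_params

params, samples = hmm_sample_estimate(2, 2)
print(params, samples)  # 6 parameters -> roughly 60 training samples
```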

Matlab Hidden Markov Model Data Prediction

I am new to Hidden Markov Models (HMMs) and I am now experimenting with them for data prediction. Consider a sinusoidal wave which has been sampled at non-uniform intervals; I would like to use these data to predict the output at a future instant of time. I am trying to use the Statistics Toolbox in MATLAB.
The problem seems to be that in the examples given, I would need an emission matrix and a transition matrix to even generate an HMM. But based on just the data I have, how do I evaluate these matrices? And how do I train the model on the data I have?
I second slayton's answer.
The transition matrix is simply the list of probabilities that one state will go to another.
A hidden Markov model assumes you can't actually see what the state of the system is (it's hidden). For example, suppose your neighbor has a dog. The dog may be hungry or full; this is the dog's state. You can't ask the dog if it's hungry, and you can't look inside its stomach, so the state is hidden from you (since you only glance outside at the dog briefly each day, you can't keep track of when it runs inside to eat, or how much it ate if so).
You know, however, that after it eats and becomes full, it will become hungry again after some time (depending on how much it ate last, but you don't know that, so it might as well be random), and when it is hungry, it will eventually run inside and eat (though sometimes it will sit outside out of laziness despite being hungry).
Given this system, you cannot see when the dog is hungry and when it is not. However, you can infer it from whether the dog whines. If it's whining, it's probably hungry. If it's happily barking, it's probably full. But just because it's whining doesn't mean it's hungry (maybe its leg hurts) and just the bark doesn't mean full (maybe it was hungry but got excited at something). However, usually a bark comes when it's full, and a whine comes when it's hungry. It may also make no sound at all, telling you nothing about its state.
So this is the emission matrix. The "hungry" state is more likely to "emit a whine", ditto for full and barks. The emission matrix says what you will observe in each given state.
If you use a square identity matrix for your emission matrix, then each state will always emit itself, and you will end up with a non-hidden Markov model.
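A minimal sketch of the dog example as generative code (all probabilities are made up) shows why the state is hidden: we record only the emitted sound, never the state itself:

```python
import numpy as np

# Hidden states: 0 = hungry, 1 = full.
# Observations:  0 = whine, 1 = bark, 2 = silent.
transition = np.array([
    [0.6, 0.4],   # hungry -> hungry / full (it eventually goes in and eats)
    [0.3, 0.7],   # full   -> hungry / full
])
emission = np.array([
    [0.7, 0.1, 0.2],   # a hungry dog mostly whines
    [0.1, 0.7, 0.2],   # a full dog mostly barks
])

rng = np.random.default_rng(1)
state = 0
observations = []
for _ in range(20):
    observations.append(rng.choice(3, p=emission[state]))  # what we actually see
    state = rng.choice(2, p=transition[state])             # hidden from us

print(observations)
```

Replacing `emission` with a 2 by 2 identity matrix would make every sound reveal the state exactly, which is the non-hidden case described above.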
The MATLAB docs do a great job of describing how to use the Statistics Toolbox functions for HMMs. The section "Estimating Transition and Emission Matrices" will probably get you pointed in the right direction.