Matlab Hidden Markov Model Data Prediction

I am new to Hidden Markov Models (HMM) and I am now experimenting with them for data prediction. Consider a sinusoidal wave which has been sampled at non-uniform intervals; I would like to use these data to predict the output at a future instant of time. I am trying to use the Statistics Toolbox in MATLAB.
The problem seems to be that in the examples given, I would need an emission matrix and a transition matrix to even generate an HMM model. But based on just the data I have, how do I estimate these matrices? And how do I train the model based on the data I have?

I second slayton's answer.
The transition matrix is simply the list of probabilities that one state will go to another.
A hidden Markov model assumes you can't actually see what the state of the system is (it's hidden). For example, suppose your neighbor has a dog. The dog may be hungry or full; this is the dog's state. You can't ask the dog if it's hungry, and you can't look inside its stomach, so the state is hidden from you (since you only glance outside at the dog briefly each day, you can't keep track of when it runs inside to eat or how much it ate if it did).
You know, however, that after it ate and became full, it will become hungry again after some time (depending on how much it ate last, but you don't know that so it might as well be random) and when it is hungry, it will eventually run inside and eat (sometimes it will sit outside out of laziness despite being hungry).
Given this system, you cannot see when the dog is hungry and when it is not. However, you can infer it from whether the dog whines. If it's whining, it's probably hungry. If it's happily barking, it's probably full. But just because it's whining doesn't mean it's hungry (maybe its leg hurts), and just because it barks doesn't mean it's full (maybe it was hungry but got excited at something). However, usually a bark comes when it's full, and a whine comes when it's hungry. It may also make no sound at all, telling you nothing about its state.
So this is the emission matrix. The "hungry" state is more likely to "emit a whine", ditto for full and barks. The emission matrix says what you will observe in each given state.
If you use a square identity matrix for your emission matrix, then each state will always emit itself, and you will end up with a non-hidden (ordinary) Markov model.
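To make this concrete, here is a minimal sketch (Python/NumPy rather than MATLAB, with made-up probabilities) of what the transition and emission matrices for the dog example could look like:

import numpy as np

# Hidden states: 0 = hungry, 1 = full (probabilities are made up for illustration).
# trans[i, j] = probability of moving from state i today to state j tomorrow.
trans = np.array([
    [0.3, 0.7],   # hungry -> stays hungry 30%, eats and becomes full 70%
    [0.6, 0.4],   # full   -> becomes hungry 60%, stays full 40%
])

# Observations: 0 = whine, 1 = bark, 2 = silent.
# emis[i, k] = probability of observing symbol k while in state i.
emis = np.array([
    [0.6, 0.1, 0.3],   # hungry: mostly whines
    [0.1, 0.6, 0.3],   # full:   mostly barks
])

# Rows of both matrices must sum to 1.
assert np.allclose(trans.sum(axis=1), 1.0)
assert np.allclose(emis.sum(axis=1), 1.0)

# If emis were the identity matrix, every state would "emit itself"
# and the chain would no longer be hidden.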

The MATLAB docs do a great job describing how to use the Statistics Toolbox functions for HMMs. The section "Estimating Transition and Emission Matrices" will probably get you pointed in the right direction.
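As a rough illustration of what that section describes: when you happen to know the hidden state sequence, the matrices can be estimated simply by counting (which is what hmmestimate does); when you only have observations, as in your case, hmmtrain refines an initial guess with the Baum-Welch algorithm instead. A small Python/NumPy sketch of the counting case, with toy data:

import numpy as np

def estimate_hmm(states, observations, n_states, n_symbols):
    """Count-based estimates of the transition and emission matrices,
    analogous to what hmmestimate does when the state path is known."""
    trans = np.zeros((n_states, n_states))
    emis = np.zeros((n_states, n_symbols))
    for s, s_next in zip(states[:-1], states[1:]):
        trans[s, s_next] += 1          # count state-to-state transitions
    for s, o in zip(states, observations):
        emis[s, o] += 1                # count emissions per state
    # Normalize rows to turn counts into probabilities.
    trans /= trans.sum(axis=1, keepdims=True)
    emis /= emis.sum(axis=1, keepdims=True)
    return trans, emis

# Toy usage: two hidden states, three observed symbols.
states = [0, 0, 1, 1, 0, 1, 1, 0]
obs    = [0, 2, 1, 1, 0, 1, 2, 0]
trans, emis = estimate_hmm(states, obs, n_states=2, n_symbols=3)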

Is it better to have 1 or 10 output neurons?

Is it better to have:
1 output neuron that outputs a value between 0 and 15 which would be my ultimate value
or
16 output neurons that output a value between 0 and 1 which represents the probability of this value?
Example: We want to find out the grade (ranging from 0 to 15) a student gets by inputting the number of hours he learned and his IQ.
TL;DR: I think your problem would be better framed as a regression task, so use one output neuron, but it is worth trying both.
I don't quite like the broadness of your question in contrast to the very specific answers, so I am going to go a little deeper and explain what the proper formulation should be.
Before we start, we should clarify the two big tasks that classical Artificial Neural Networks perform:
Classification
Regression
They are inherently very different from one another; in short, classification tries to put a label on your input (e.g., the input image shows a dog), whereas regression tries to predict a numerical value (e.g., the input data corresponds to a house that has an estimated worth of US$1.5 million).
Obviously, you can see that predicting the numerical value requires (trivially) only one output value. Also note that this is only true for this specific example. There could be other regression use cases in which you want your output to be more than a single point, i.e., 1D or 2D.
A common example is image colorization, which, interestingly enough, can also be framed as a classification problem. The provided link shows examples for both. In this case you would obviously have to regress (or classify) every pixel, which leads to more than one output neuron.
Now, to get to your actual question, I want to elaborate a little more on why one-hot encoded outputs (i.e., outputs with as many channels as classes) are preferred for classification tasks over a single neuron.
Since we could argue that a single neuron is enough to predict the class value, we have to understand why it is problematic to get to a specific class that way.
Categorical vs Ordinal vs Interval Variables
One of the main problems is the type of your variable. In your case, there exists a clear order (15 is better than 14 is better than 13, etc.), and even an interval ordering (at least on paper), since the difference between a 15 and 13 is the same as between 14 and 12, although some scholars might argue against that ;-)
Thus, your target is an interval variable, and could in theory be regressed on. More on that later. But consider for example a variable that describes whether the image depicts a cat (0), dog (1), or car (2). Now, arguably, we cannot even order the variables (is a car > dog, or car < dog?), nor can we say that there exists an "equal distance" between a cat and a dog (similar, since both are animals?) or a cat and a car (arguably more different from each other). Thus, it becomes really hard to interpret a single output value of the network. Say an input image results in an output of 1.4.
Does this now still correspond to a dog, or is this closer to a car? But what if the image actually depicts a car that has properties of a cat?
On the other hand, having 3 separate neurons that reflect the different probabilities of each class eliminates that problem, since each one can depict a relatively "undisturbed" probability.
Choosing a Loss Function
The other problem is the question of how to backpropagate through the network in the previous example. Classically, classification tasks make use of Cross-Entropy Loss (CE), whereas regression uses Mean Squared Error (MSE) as a measure. Those two are inherently different, and especially the combination of CE and softmax leads to very convenient (and numerically stable) gradients.
Arguably, you could apply rounding to get from 1.4 to a concise class value (in that case, 1) and then use CE loss, but that could lead to numerical instability; MSE on the other hand will never give you a "clear class value", but rather a regressed estimate.
In the end, the question boils down to: do I have a classification or a regression problem? In your case, I would argue that both approaches could work reasonably well. A (classification) network might not recognize the correlation between the different output classes; e.g., a student that has a high likelihood for class 14 basically has zero probability of scoring a 3 or lower. On the other hand, regression might not be able to accurately predict the results for other reasons.
If you have the time, I would highly encourage you to try both approaches. For now, considering the interval type of your target, I would personally go with a regression task, and use rounding after you have trained your network and can make accurate predictions.
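If it helps to see the two framings side by side, here is a minimal PyTorch sketch (layer sizes and data are purely illustrative, assuming two input features: hours learned and IQ):

import torch
import torch.nn as nn

n_features = 2   # hours learned, IQ

# Regression framing: one output neuron, trained with MSE.
reg_model = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1))
reg_loss = nn.MSELoss()

# Classification framing: 16 output neurons (grades 0..15), trained with
# cross-entropy (which applies softmax internally in PyTorch).
clf_model = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 16))
clf_loss = nn.CrossEntropyLoss()

x = torch.tensor([[10.0, 110.0]])            # one student (illustrative values)
grade = torch.tensor([12])                   # true grade

loss_clf = clf_loss(clf_model(x), grade)                       # target: class index
loss_reg = reg_loss(reg_model(x), grade.float().unsqueeze(1))  # target: numeric value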
It is better to have a single neuron for each class (except for binary classification). This allows for better design in terms of expanding upon an existing design. A simple example is creating a network for recognizing the digits 0 through 9, but then extending the design to hexadecimal digits 0 through F.

Understanding relation between Neural Networks and Hidden Markov Model

I've read a few papers about speech recognition based on neural networks, Gaussian mixture models and hidden Markov models. In my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea; however, I still have trouble with some details. I would really appreciate it if someone could enlighten me.
As I understand it, the procedure consists of three elements:
Input
The audio stream gets split up into frames of 10 ms and processed by MFCC extraction, which outputs a feature vector.
DNN
The neural network gets the feature vector as an input and processes the features, so that each frame (phone) is distinguishable or rather gives a representation of the phone in context.
HMM
The HMM is a state model, in which each state represents a tri-phone. Each state has a probability of transitioning to each of the other states.
Now the output layer of the DNN produces a feature vector, that tells the current state to which state it has to change next.
What I don't get: How are the features of the output layer (DNN) mapped to the probabilities of the states? And how is the HMM created in the first place? Where do I get all the information about the probabilities?
I don't need to understand every detail; the basic concept is sufficient for my purpose. I just need to confirm that my basic thinking about the process is right.
In my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea; however, I still have trouble with some details.
It is better to read a textbook, not a research paper.
so that each frame (phone) is distinguishable or rather gives a representation of the phone in context.
This sentence does not have a clear meaning, which suggests you are not quite sure yourself. The DNN takes a frame's features and produces the probabilities for the states.
The HMM is a state model, in which each state represents a tri-phone.
Not necessarily a triphone. Usually there are tied triphones, which means several triphones correspond to a certain state.
Now the output layer of the DNN produces a feature vector
No, the DNN produces state probabilities for the current frame; it does not produce a feature vector.
that tells the current state to which state it has to change next.
No, the next state is selected by the HMM Viterbi algorithm based on the current state and the DNN probabilities. The DNN alone does not decide the next state.
What I don't get: How are the features of the output layer (DNN) mapped to the probabilities of the states?
The output layer produces probabilities. It says that phone A is probable at this frame with probability 0.9 and phone B at this frame with probability 0.1.
And how is the HMM created in the first place?
Unlike end-to-end systems, which do not use an HMM, the HMM is usually trained as a GMM/HMM system with the Baum-Welch algorithm before the DNN is initialized. So you first train the GMM/HMM with Baum-Welch, then you train the DNN to improve on the GMM.
Where do I get all the information about the probabilities?
It is hard to understand your last question.
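To tie the pieces above together, here is a rough illustration (all numbers are made up) of how a hybrid system combines the two models: the DNN posteriors are typically divided by the state priors to get scaled likelihoods, and the HMM's Viterbi search then combines those scores with the transition probabilities to pick the state sequence.

import numpy as np

# Illustrative numbers only: 3 tied states, 4 frames.
posteriors = np.array([      # DNN output: p(state | frame), rows sum to 1
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])
priors = np.array([0.5, 0.3, 0.2])    # state priors from training alignments

# Hybrid systems convert posteriors to "scaled likelihoods"
# p(frame | state) ~ p(state | frame) / p(state) before decoding.
scaled = posteriors / priors

trans = np.array([            # HMM transition probabilities (illustrative)
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])

# Viterbi over log scores: the HMM, not the DNN, picks the state sequence.
log_e = np.log(scaled)
log_t = np.log(trans + 1e-12)
n_frames, n_states = log_e.shape
score = np.full((n_frames, n_states), -np.inf)
back = np.zeros((n_frames, n_states), dtype=int)
score[0] = np.log([1.0, 1e-12, 1e-12]) + log_e[0]   # assume we start in state 0
for t in range(1, n_frames):
    cand = score[t - 1][:, None] + log_t            # score of moving i -> j
    back[t] = cand.argmax(axis=0)
    score[t] = cand.max(axis=0) + log_e[t]
path = [int(score[-1].argmax())]
for t in range(n_frames - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
path.reverse()
print(path)   # most likely state sequence given DNN scores and HMM transitions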

State representation for grid world

I'm new to reinforcement learning and Q-learning and I'm trying to understand the concepts and implement them. Most of the material I have found uses CNN layers to process image input. I think I would rather start with something simpler than that, so I use a grid world.
This is what I have already implemented. I built an environment following the MDP formulation and have a 5x5 grid, with a fixed agent position (A) and target position (T). The start state could look like this:
-----
---T-
-----
-----
A----
Currently I represent my state as a 1-dimensional vector of length 25 (5x5) with a 1 at the position where the agent is and 0 elsewhere, so for example
the state above will be represented as the vector
[1, 0, 0, ..., 0]
I have successfully implemented solutions with a Q-table and a simple NN with no hidden layer.
Now, I want to move a little further and make the task more complicated by making the target position random each episode. Because now there is no correlation between my current representation of the state and the actions, my agent acts randomly. In order to solve my problem, I first need to adjust my state representation to contain some information like the distance to the target, the direction, or both. The problem is that I don't know how to represent my state now. I have come up with some ideas:
[x, y, distance_T]
[distance_T]
two 5x5 vectors, one for Agent's position, one for Target's position
[1, 0, 0, ..., 0], [0, 0, ..., 1, 0, ..., 0]
I know that even if I figure out the state representation, my implemented model will not be able to solve the problem and I will need to move toward hidden layers, experience replay, a frozen target network and so on, but for now I only want to verify the model's failure.
In conclusion, I want to ask how to represent such a state as an input for a neural network. If there are any sources of information, articles, papers etc. which I have missed, feel free to post them.
Thank you in advance.
In Reinforcement Learning there is no right state representation. But there are wrong state representations. At least, that is to say that Q-learning and other RL techniques make a certain assumption about the state representation.
It is assumed that the states are states of a Markov Decision Process (MDP). An MDP is one where everything you need to know to 'predict' the next state (even in a probabilistic sense) is available in the current state. That is to say that the agent must not need memory of past states to make a decision.
It is very rarely the case in real life that you have a Markov decision process. But many times you have something close, which has been empirically shown to be enough for RL algorithms.
As a "state designer" you want to create a state that makes your task as close as possible to an MDP. In your specific case, if you have the distance as your state there is very little information to predict the next state, that is the next distance. Some thing like the current distance, the previous distance and the previous action is a better state, as it gives you a sense of direction. You could also make your state be the distance and the direction to which the target is at.
Your last suggestion of two matrices is the one I like most, because it describes the whole state of the task without giving away the actual goal of the task. It also maps well to convolutional networks.
The distance approach will probably converge faster, but I consider it a bit like cheating because you practically tell the agent what it needs to look for. In more complicated cases this will rarely be possible.
Your last suggestion is the most general way to represent states as an input for function approximators, especially for Neural Networks. By that representation, you can also add more dimensions that will stand for non-accessible blocks and even other agents. So, you generalize the representation and might apply it to other RL domains. You will also have the chance to try Convolutional NNs for bigger grids.
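A minimal sketch of that representation (NumPy, with illustrative positions): one plane marks the agent, one marks the target, and you can either flatten it for a dense network or keep the planes as channels for a convolutional one.

import numpy as np

GRID = 5

def encode_state(agent_pos, target_pos):
    """Two 5x5 planes: one marks the agent, one marks the target."""
    state = np.zeros((2, GRID, GRID), dtype=np.float32)
    state[0][agent_pos] = 1.0    # agent channel
    state[1][target_pos] = 1.0   # target channel
    return state

# Agent at the bottom-left, target at row 1, column 3 (as in the example grid).
s = encode_state(agent_pos=(4, 0), target_pos=(1, 3))

flat = s.reshape(-1)        # length-50 vector for a fully connected network
conv_in = s[np.newaxis]     # shape (1, 2, 5, 5) for a convolutional network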

Why do we need stochasticity in deterministic simulations?

Assuming the world were deterministic, why would we still need to introduce stochasticity into our simulations?
In a nutshell, to simplify models.
Let’s go with your assumption, even though I don’t believe it. If the universe is completely deterministic, then in any given scenario you choose to model there is one and only one correct answer. Unless you include the complete state space of absolutely everything that determines that answer, your model is wrong. Wrong, wrong, wrong!!!
For instance, if you want to predict how long it will take to fly from New York to London, you need to know the vector sums of all forces acting on the aircraft, which means you need the complete state (down to the atomic level) of the aircraft itself, the passengers, the atmosphere, fluctuations in the magnetic field of the earth, cosmic rays that can trigger upper atmospheric events, etc, etc, ad nauseam. Exclusion of any aspect of the potential forces involved makes your answer wrong.
Clearly, there’s no way to measure it all, and even if there was, there’s no way to maintain so much state information in any computing device we can build. And so we simplify and acknowledge that there is some degree of uncertainty in our model’s predictions/solutions.
Embracing the existence of uncertainty brings us directly to stochastic solutions. One view of probability is that it is a mathematical formalism for modeling uncertainty. Rather than try to model every physical aspect of an aircraft's flight, we can characterize the likely outcomes based on what proportion of flights require less (or more) than any particular amount of time, i.e., by describing the distribution of possible flight times.
Once you adopt distributional modeling, you can see how distributional behaviors propagate through other parts of a system, either analytically (if your system is sufficiently simple) or by generating realizations of the distributions and using replication and sampling via simulation.
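As a toy illustration of "generating realizations and using replication" (the distributions and parameters below are entirely made up):

import numpy as np

rng = np.random.default_rng(0)

# Made-up model: flight time = nominal block time + random delays.
n_reps = 100_000
block_time = 7.0                                             # hours, nominal NY -> London
headwind = rng.normal(loc=0.0, scale=0.25, size=n_reps)      # hours gained/lost to winds
ground_delay = rng.exponential(scale=0.3, size=n_reps)       # hours of taxi/holding delay

flight_time = block_time + headwind + ground_delay

# Instead of one "true" deterministic answer, we describe the distribution.
print(f"mean  : {flight_time.mean():.2f} h")
print(f"90th %: {np.percentile(flight_time, 90):.2f} h")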

How do Markov Chains work and what is memorylessness?

How do Markov chains work? I have read the Wikipedia article on Markov chains, but the thing I don't get is memorylessness. Memorylessness states that:
The next state depends only on the current state and not on the sequence of events that preceded it.
If a Markov chain has this kind of property, then what is the use of the chain in a Markov model?
Please explain this property.
You can visualize Markov chains like a frog hopping from lily pad to lily pad on a pond. The frog does not remember which lily pad(s) it has previously visited. It also has a given probability for leaping from lily pad Ai to lily pad Aj, for all possible combinations of i and j. The Markov chain allows you to calculate the probability of the frog being on a certain lily pad at any given moment.
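For example, a small NumPy sketch (with made-up hop probabilities) computes exactly that: the distribution over lily pads after a given number of hops.

import numpy as np

# Three lily pads; P[i, j] = probability of hopping from pad i to pad j.
P = np.array([
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
    [0.5, 0.3, 0.2],
])

p = np.array([1.0, 0.0, 0.0])   # the frog starts on pad 0
for _ in range(10):             # distribution after 10 hops
    p = p @ P
print(p)                        # probability of the frog being on each pad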
If the frog was a vegetarian and nibbled on the lily pad each time it landed on it, then the probability of it landing on lily pad Ai from lily pad Aj would also depend on how many times Ai was visited previously. Then, you would not be able to use a Markov chain to model the behavior and thus predict the location of the frog.
The idea of memorylessness is fundamental to the success of Markov chains. It does not mean that we don't care about the past. On the contrary, it means that we retain only the most relevant information from the past for predicting the future and use that information to define the present.
This nice article provides a good background on the subject: http://www.americanscientist.org/issues/pub/first-links-in-the-markov-chain
There is a trade-off between the accuracy of your description of the past and the size of the associated state space. Say there are three pubs in the neighborhood and every evening you choose one. If you choose those pubs randomly, this is not a Markov chain (or only a trivial, zero-order one): the outcome is random. More precisely, it is an independent random variable (modeling dependency was fundamental to Markov's ideas underlying Markov chains).
In your choice of pubs you can factor in your last choice, i.e., which pub you went to the night before. For example, you might want to avoid going to the same pub two days in a row. While in reality this implies remembering where you were yesterday (and thus remembering the past!), at your modeling level your unit of time is one day, so your current state is the pub you went to yesterday. This is your classical (first-order) Markov model with three states and a 3 by 3 transition matrix that provides conditional probabilities for each pair (if yesterday you went to pub I, what is the chance that today you "hop" to pub J).
However, you can define a model where you "remember" the last two days. In this second-order Markov model the "present" state will include the knowledge of the pub from last night and from the night before. Now you have 9 possible states describing your present state, and therefore you have a 9 by 9 transition matrix. Thankfully, this matrix is not fully populated.
To understand why, consider a slightly different setup in which you are so well-organized that you make firm plans for your pub choices both for today and tomorrow based on the last two visits. Then you can select any possible combination of pubs to visit two days in a row. The result is a fully populated 9 by 9 matrix that maps your choices for the last two days into those for the next two days. However, in our original problem we make the decision every day, so our future state is constrained by what happened today: at the next time step (tomorrow) today becomes yesterday, but it will still be a part of your definition of "today" at that time step, and relevant to what happens the following day. The situation is analogous to moving averages, or receding horizon procedures. As a result, from a given state you can only move to three possible states (corresponding to today's choice of pub), which means that each row of your transition matrix will have only three non-zero entries.
Let us tally up the number of parameters that characterize each problem: the zero-order Markov model with three states has two independent parameters (the probabilities of hitting the first and the second pub, as the probability of visiting the third pub is the complement of the first two). The first-order Markov model has a fully populated 3 by 3 matrix with each column summing up to one (again indicating that one of the pubs will always be visited on any given day), so we end up with six independent parameters. The second-order Markov model has a 9 by 9 matrix with each row having only 3 non-zero entries and all columns adding to one, so we have 18 independent parameters.
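A small sketch (NumPy, with random illustrative conditional probabilities) that builds the second-order chain as a first-order chain over pairs of days and checks the structure described above:

import numpy as np

pubs = 3
pairs = [(a, b) for a in range(pubs) for b in range(pubs)]   # 9 "present" states

rng = np.random.default_rng(1)
# q[y, t] = conditional distribution of tomorrow's pub given (yesterday y, today t).
q = rng.dirichlet(np.ones(pubs), size=(pubs, pubs))

# Column-stochastic 9x9 matrix, as in the text: columns sum to one.
P = np.zeros((9, 9))
for j, (y, t) in enumerate(pairs):          # current state = (yesterday, today)
    for i, (t2, m) in enumerate(pairs):     # next state = (today, tomorrow)
        if t2 == t:                          # tomorrow's "yesterday" must be today
            P[i, j] = q[y, t][m]

assert np.allclose(P.sum(axis=0), 1.0)               # columns sum to one
assert all((row != 0).sum() == 3 for row in P)        # 3 non-zero entries per row
# 27 non-zero entries minus 9 normalization constraints = 18 free parameters.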
We can continue defining higher-order models, and our state space will grow accordingly.
Importantly, we can further refine the concept by identifying important features of the past and using only those features to define the present, i.e., compressing the information about the past. This is what I referred to in the beginning. For example, instead of remembering all history we can keep track of only some memorable events that impact our choice, and use these "sufficient statistics" to construct the model.
It all boils down to the way you define the relevant variables (the state space), and the Markov concepts naturally follow from the underlying fundamental mathematical concepts. First-order (linear) relationships (and the associated linear algebra operations) are at the core of most current mathematical applications. You can look at an n-th order equation in a single variable, or you can define an equivalent first-order (linear) system of n equations by introducing auxiliary variables. Similarly, in classical mechanics you can either use second-order Lagrange equations or choose canonical coordinates that lead to the (first-order) Hamiltonian formulation: http://en.wikipedia.org/wiki/Hamiltonian_mechanics
Finally, a note on the steady-state vs. transient solutions of Markov problems. An overwhelming number of practical applications (e.g., PageRank) rely on finding steady-state solutions. Indeed, the presence of such convergence to a steady state was A. Markov's original motivation for creating his chains, in an effort to extend the application of the central limit theorem to dependent variables. The transient effects (such as hitting times) of Markov processes are significantly less studied and more obscure. However, it is perfectly valid to consider Markov prediction of the outcomes at a specific point in the future (and not only the converged, "equilibrium" solution).
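For completeness, here is a tiny sketch (illustrative numbers) of finding a steady-state distribution by repeatedly applying the transition matrix, which is the same idea behind PageRank's power iteration:

import numpy as np

# Row-stochastic transition matrix (illustrative): P[i, j] = P(next = j | current = i).
P = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

pi = np.ones(3) / 3            # start from a uniform distribution
for _ in range(1000):          # power iteration toward the steady state
    pi = pi @ P

print(pi)                      # steady-state distribution
print(pi @ P)                  # unchanged by another step: pi = pi P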