Basic Hidden Markov Model, Viterbi algorithm - classification

I am fairly new to Hidden Markov Models and I am trying to wrap my head around a pretty basic part of the theory.
I would like to use an HMM as a classifier: given a time series of data, I have two classes, background and signal.
How are the emission probabilities estimated for each class? Does the Viterbi algorithm need a template of the background and signal to estimate prob(data|state)? Or have I completely missed the point?

To do classification with Viterbi you need to already know the model parameters.
Background and Signal are your two hidden states. With the model parameters and observed data you want to use Viterbi to calculate the most likely sequence of hidden states.
To quote the hmmlearn documentation:
The HMM is a generative probabilistic model, in which a sequence of
observable X variables is generated by a sequence of internal hidden
states Z. The hidden states cannot be observed directly. The
transitions between hidden states are assumed to have the form of a
(first-order) Markov chain. They can be specified by the start
probability vector π and a transition probability matrix A. The
emission probability of an observable can be any distribution with
parameters θ conditioned on the current hidden state. The HMM is
completely determined by π, A and θ.
There are three fundamental problems for HMMs:
1. Given the model parameters and observed data, estimate the optimal sequence of hidden states.
2. Given the model parameters and observed data, calculate the likelihood of the data.
3. Given just the observed data, estimate the model parameters.
The first and the second problem can be solved by the dynamic
programming algorithms known as the Viterbi algorithm and the
Forward-Backward algorithm, respectively. The last one can be solved
by an iterative Expectation-Maximization (EM) algorithm, known as the
Baum-Welch algorithm.
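For the background/signal setup in the question, a minimal sketch of problems 1 and 2 with hmmlearn might look like the following. I'm assuming hmmlearn ≥ 0.3, where the discrete-emission model is called CategoricalHMM, and all the numbers below are made up, not estimates:

```python
import numpy as np
from hmmlearn import hmm

# Two hidden states (0 = background, 1 = signal) emitting two symbols (0, 1).
# All parameters are made-up illustrations.
model = hmm.CategoricalHMM(n_components=2)
model.startprob_ = np.array([0.9, 0.1])              # pi
model.transmat_ = np.array([[0.95, 0.05],            # A
                            [0.10, 0.90]])
model.emissionprob_ = np.array([[0.8, 0.2],          # theta: P(symbol | state)
                                [0.2, 0.8]])

obs = np.array([[0, 0, 1, 1, 1, 0, 1, 1]]).T         # one observed symbol per row

# Problem 1: most likely hidden-state sequence (Viterbi).
logprob, states = model.decode(obs, algorithm="viterbi")
print(states)

# Problem 2: log-likelihood of the data (forward algorithm).
print(model.score(obs))
```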

So we've got two states for our Hidden Markov model, noise and signal. We also must have something we observe, which could be ones and zeroes. Basically, ones are signal and zeroes are noise, but you get a few zeroes mixed in with your signal and a few ones with the noise. So you need to know:
Probability of 0, 1 when in the "noise" state
Probability of 0, 1 when in the "signal" state
Probability of transitioning to "signal" when in the "noise" state
Probability of transitioning to "noise" when in the "signal" state
So we keep track of the probability of each state for each time slot and, crucially, the most likely route we got there (based on the transition probabilities). Then we assume that the most probable state at the end of the time series is our actual final state, and trace backwards.
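Here is a minimal from-scratch sketch of that bookkeeping (a trellis of log-probabilities plus backpointers) in NumPy; all the probabilities are made up:

```python
import numpy as np

# States: 0 = noise, 1 = signal. All numbers are hypothetical.
start = np.array([0.8, 0.2])                 # P(initial state)
trans = np.array([[0.9, 0.1],                # P(next state | current state)
                  [0.2, 0.8]])
emit = np.array([[0.7, 0.3],                 # P(observing 0/1 | noise)
                 [0.1, 0.9]])                # P(observing 0/1 | signal)

obs = np.array([0, 1, 1, 0, 1, 1, 1, 0])     # observed 0/1 sequence

# delta[t, s]: log-probability of the best path ending in state s at time t
T, S = len(obs), len(start)
delta = np.zeros((T, S))
back = np.zeros((T, S), dtype=int)           # best predecessor of each state
delta[0] = np.log(start) + np.log(emit[:, obs[0]])
for t in range(1, T):
    scores = delta[t - 1][:, None] + np.log(trans)   # [from, to]
    back[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + np.log(emit[:, obs[t]])

# Assume the most probable final state is the actual one, then trace backwards.
path = np.empty(T, dtype=int)
path[-1] = delta[-1].argmax()
for t in range(T - 1, 0, -1):
    path[t - 1] = back[t, path[t]]
print(path)                                  # most likely state sequence
```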

The Viterbi algorithm requires knowing the HMM parameters.
Those parameters can be estimated from data by maximum-likelihood estimation (MLE); for HMMs this is done with the Baum–Welch algorithm.
If you have trouble with the Viterbi algorithm, there's a working implementation here.
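A minimal sketch of that Baum–Welch estimation step with hmmlearn (again assuming version ≥ 0.3; the observation sequence is made up):

```python
import numpy as np
from hmmlearn import hmm

# Estimate the HMM parameters from observations alone via EM (Baum-Welch).
obs = np.array([[0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]]).T
model = hmm.CategoricalHMM(n_components=2, n_iter=100, random_state=0)
model.fit(obs)
print(model.transmat_)        # estimated transition matrix A
print(model.emissionprob_)    # estimated emission probabilities theta
```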

Related

In Bayesian simulation, why use fixed values for parameters that have a prior in the model?

When simulating a Bayesian model, are we supposed to treat the parameters as random variables (with a prior), rather than use fixed values?
For example, we have a Bayesian linear model $y = X\beta + \epsilon$. When simulating it, the literature usually:
1. sets the regression coefficients at fixed values, e.g. (0, 3, -2, 1, 0, ...);
2. simulates the predictors many times;
3. simulates the error term, usually standard normal;
4. generates the response.
If the regression coefficients have a prior (assume they have exchangeable priors), and thus we have posterior distributions, why would we simulate only one set of regression coefficient values? The posterior being a distribution suggests we don't commit to any fixed value, while the truth is indeed a fixed value. Even though the posterior mean is supposed to converge to the OLS estimate under good setups, this still feels difficult to understand.
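For concreteness, here is a minimal sketch of the simulation recipe described above (the coefficient values and sample sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: fix the "true" regression coefficients (hypothetical values).
beta = np.array([0.0, 3.0, -2.0, 1.0])

n_reps, n_obs = 1000, 100
estimates = np.empty((n_reps, beta.size))
for r in range(n_reps):
    X = rng.normal(size=(n_obs, beta.size))   # step 2: simulate the predictors
    eps = rng.normal(size=n_obs)              # step 3: standard-normal errors
    y = X @ beta + eps                        # step 4: generate the response
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]   # e.g. an OLS fit

# Across replications we can check how well a procedure recovers beta.
print(estimates.mean(axis=0))
```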

Kalman filter: how do the measurement noise covariance matrix and the process noise help the Kalman filter work? Can someone explain intuitively?

How do the process noise covariance and the measurement noise covariance help the Kalman filter function better?
Can someone explain intuitively, without significant equations and math, please?
Well, it's difficult to explain mathematical things (like Kalman filters) without mathematics, but here's my attempt:
There are two parts to a Kalman filter, a time update part and a measurement part. In the time update part we estimate the state at the time of observation; in the measurement part we combine (via least squares) our 'predictions' (i.e. the estimate from the time update) with the measurements to get a new estimate of the state.
So far, no mention of noise. There are two sources of noise: one in the time update part (sometimes called process noise) and one in the measurement part (observation noise). In each case what we need is a measure of the 'size' of that noise, i.e. the covariance matrix. These are used when we combine the predictions with the measurements. When we view our predictions as very uncertain (that is, they have a large covariance matrix) the combination will be closer to the measurements than to the predictions; on the other hand, when we view our predictions as very good (small covariance) the combination will be closer to the predictions than to the measurements.
So you could look upon the process and observation noise covariances as saying how much to trust (the parts of) the predictions and observations. Increasing, say, the variance of a particular component of the predictions is to say: trust this prediction less; while increasing the variance of a particular measurement is to say: trust this measurement less. This is mostly an analogy, but it can be made more precise. A simple case is when the covariance matrices are diagonal. In that case the cost, i.e. the contribution to what we are trying to minimise, of a difference between a measurement and the computed value is the square of that difference, divided by that observation's variance. So the higher an observation's variance, the lower the cost.
Note that out of the measurement part we also get a new state covariance matrix; this is used (along with the process noise and the dynamics) in the next time update when we compute the predicted state covariance.
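A minimal 1-D sketch of this trust trade-off (all numbers are hypothetical): the gain that blends the prediction with the measurement is driven entirely by the two noise covariances.

```python
# Minimal 1-D Kalman filter step; all numbers are hypothetical.
def kalman_step(x_pred, p_pred, z, q, r):
    """One predict/update cycle for a scalar random-walk state.

    x_pred, p_pred: predicted state and its variance
    z: measurement; q: process noise variance; r: measurement noise variance
    """
    p_pred = p_pred + q            # time update inflates the uncertainty
    k = p_pred / (p_pred + r)      # gain: prediction vs measurement trust
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new

# Large r (distrust the measurement): the estimate stays near the prediction.
print(kalman_step(x_pred=0.0, p_pred=1.0, z=10.0, q=0.1, r=100.0))
# Small r (trust the measurement): the estimate moves close to the measurement.
print(kalman_step(x_pred=0.0, p_pred=1.0, z=10.0, q=0.1, r=0.01))
```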
I think the question of why the covariance is the appropriate measure of the size of the noise is rather a deep one, as is why least squares is the appropriate way to combine the predictions and the measurements. The shallow answer is that Kalman filtering and least squares have been found, over decades (centuries in the case of least squares), to work well in many application areas. In the case of Kalman filtering I find the derivation of it from hidden Markov models ("From Hidden Markov Models to Linear Dynamical Systems" by T. Minka, though this is rather mathematical) convincing. In hidden Markov models we seek to find the (conditional) probability of the states given the measurements so far; Minka shows that if the measurements are linear functions of the states and the dynamics are linear and all probability distributions are Gaussian, then we get the Kalman filter.

Understanding the relation between Neural Networks and Hidden Markov Models

I've read a few papers about speech recognition based on neural networks, the Gaussian mixture model and the Hidden Markov Model. On my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" from George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea, however I still have trouble with some details. I really would appreciate it if someone could enlighten me.
As I understand it, the procedure consists of three elements:
Input
The audio stream gets split up into frames of 10 ms and processed by MFCC extraction, which outputs a feature vector.
DNN
The neural network gets the feature vector as input and processes the features, so that each frame (phone) is distinguishable or rather gives a represents of the phone in context.
HMM
The HMM is a state model, in which each state represents a tri-phone. Each state has a set of probabilities for transitioning to all the other states.
Now the output layer of the DNN produces a feature vector that tells the current state to which state it has to change next.
What I don't get: how are the features of the output layer (DNN) mapped to the probabilities of the states? And how is the HMM created in the first place? Where do I get all the information about the probabilities?
I don't need to understand every detail, the basic concept is sufficient for my purpose. I just need to be sure that my basic thinking about the process is right.
On my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" from George E. Dahl, Dong Yu, et al. I think I understand most of the presented idea, however I still have trouble with some details.
It is better to read a textbook, not a research paper.
so that each frame (phone) is distinguishable or rather gives a represents of the phone in context.
This sentence does not have a clear meaning, which suggests you are not quite sure yourself. The DNN takes a frame's features and produces the probabilities for the states.
HMM The HMM is a state model, in which each state represents a tri-phone.
Not necessarily a triphone. Usually there are tied triphones, which means several triphones correspond to a certain state.
Now the output layer of the DNN produces a feature vector
No, the DNN produces state probabilities for the current frame; it does not produce a feature vector.
that tells the current state to which state it has to change next.
No, the next state is selected by the HMM Viterbi algorithm based on the current state and the DNN probabilities. The DNN alone does not decide the next state.
What I don't get: how are the features of the output layer (DNN) mapped to the probabilities of the states?
The output layer produces probabilities. It says that phone A at this frame is probable with probability 0.9 and phone B at this frame is probable with probability 0.1.
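To connect this to the decoding step: hybrid systems typically rescale these posteriors into the emission scores the HMM search consumes. A minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical DNN output: p(state | frame) for 3 frames and 2 states;
# each row sums to 1.
posteriors = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.3, 0.7]])

# State priors p(state), e.g. counted from the training alignments.
priors = np.array([0.6, 0.4])

# Hybrid DNN/HMM decoding uses "scaled likelihoods":
# p(frame | state) is proportional to p(state | frame) / p(state),
# and these are what the HMM Viterbi search uses as emission scores.
scaled_likelihoods = posteriors / priors
print(scaled_likelihoods)
```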
And how is the HMM created in the first place?
Unlike end-to-end systems, which do not use an HMM, the HMM is usually trained as a GMM/HMM system with the Baum-Welch algorithm before the DNN is initialized. So you first train the GMM/HMM with Baum-Welch, then you train the DNN to improve on the GMM.
Where do I get all the information about the probabilities?
It is hard to understand your last question.

How to solve 2D Markov Chains with Infinite State Space

I have a two-dimensional Markov chain and I want to calculate steady-state probabilities and then basic performance measures such as the expected number of customers, expected waiting time, etc. You can check the transition rate diagram at the link below:
http://tinypic.com/view.php?pic=2n063dd&s=8
As I searched for solution methods, the matrix-geometric and spectral expansion methods appeared. I tried the matrix-geometric method; however, since my Markov chain is not repetitive, it did not work.
I read some papers (e.g. "Spectral expansion solution for a class of Markov models: application and comparison with the matrix-geometric method"), but I could not figure out how to create the matrices and what the steady-state probabilities are.
Does the spectral expansion method require a "repetitive process" as the matrix-geometric method does? If not, how do I apply it to my problem?
Are there any other methods to compute them?
Thanks for all your help!
Ali
First, there is no stable solution method for a two-way infinite lattice strip. At least one variable should be capacitated.
Second, the following are the best-known solution methods for two-dimensional Markov chains with a semi-infinite or finite state space:
Spectral Expansion Method
Matrix Geometric Method
Block Gauss-Seidel Method
Seelen's Method
All methods require high computational effort. Experimental studies show that for a semi-infinite lattice strip, once the capacitated variable exceeds 50, the solution may not be trustworthy. There is also a state-explosion problem beyond that threshold. To overcome the state-explosion problem, iterative methods are used, such as the Gauss-Seidel and Seelen's methods.
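Once the chain is truncated to a finite state space, the Gauss-Seidel idea is easy to sketch. A minimal version for the steady-state equations πQ = 0, Σπ = 1, assuming an irreducible CTMC with generator matrix Q (the toy rates below are made up, and convergence is not guaranteed in general):

```python
import numpy as np

def steady_state_gauss_seidel(Q, tol=1e-10, max_sweeps=10000):
    """Gauss-Seidel iteration for pi @ Q = 0 with sum(pi) = 1.

    Q is the generator matrix of a finite, irreducible CTMC
    (rows sum to zero, off-diagonal entries are transition rates).
    """
    n = Q.shape[0]
    pi = np.full(n, 1.0 / n)             # uniform initial guess
    for _ in range(max_sweeps):
        old = pi.copy()
        for j in range(n):
            # Balance equation: pi_j * q_jj = -sum_{i != j} pi_i * q_ij.
            # Using already-updated entries of pi is what makes this Gauss-Seidel.
            s = pi @ Q[:, j] - pi[j] * Q[j, j]
            pi[j] = -s / Q[j, j]
        pi /= pi.sum()                    # renormalize each sweep
        if np.max(np.abs(pi - old)) < tol:
            break
    return pi

# Toy example: a 2-state chain with hypothetical rates.
Q = np.array([[-1.0, 1.0],
              [ 2.0, -2.0]])
print(steady_state_gauss_seidel(Q))       # -> [2/3, 1/3]
```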
Regarding my problem, I determined a capacity for both variables. After a search of the literature, the block Gauss-Seidel iterative method seems to be the most appropriate method to apply to my problem.
Thank you.

What is the structure of an indirect (error-state) Kalman filter and how are the error equations derived?

I have been trying to implement a navigation system for a robot that uses an Inertial Measurement Unit (IMU) and camera observations of known landmarks in order to localise itself in its environment. I have chosen the indirect-feedback Kalman Filter (a.k.a. Error-State Kalman Filter, ESKF) to do this. I have also had some success with an Extended KF.
I have read many texts and the two I am using to implement the ESKF are "Quaternion kinematics for the error-state KF" and "A Kalman Filter-based Algorithm for IMU-Camera Calibration" (pay-walled paper, google-able).
I am using the first text because it better describes the structure of the ESKF, and the second because it includes details about the vision measurement model. In my question I will be using the terminology from the first text: 'nominal state', 'error state' and 'true state'; which refer to the IMU integrator, Kalman Filter, and the composition of the two (nominal minus errors).
The diagram below shows the structure of my ESKF implemented in Matlab/Simulink; in case you are not familiar with Simulink I will briefly explain the diagram. The green section is the Nominal State integrator, the blue section is the ESKF, and the red section is the sum of the nominal and error states. The 'RT' blocks are 'Rate Transitions' which can be ignored.
My first question: Is this structure correct?
My second question: How are the error-state equations for the measurement models derived?
In my case I have tried using the measurement model of the second text, but it did not work.
Kind Regards,
Your block diagram combines two indirect methods for bringing IMU data into a KF:
You have an external IMU integrator (in green, labelled "INS", sometimes called the mechanization, and described by you as the "nominal state", but I've also seen it called the "reference state"). This method freely integrates the IMU externally to the KF and is usually chosen so you can do this integration at a different (much higher) rate than the KF predict/update step (the indirect form). Historically I think this was popular because the KF is generally the computationally expensive part.
You have also fed your IMU into the KF block as u, which I am assuming is the "command" input to the KF. This is an alternative to the external integrator. In a direct KF you would treat your IMU data as measurements. In order to do that, the IMU would have to model (position, velocity, and) acceleration and (orientation and) angular velocity: otherwise there is no possible H such that Hx can produce estimated IMU output terms. If you instead feed your IMU measurements in as a command, your predict step can simply act as an integrator, so you only have to model as far as velocity and orientation.
You should pick only one of those options. I think the second one is easier to understand, but it is closer to a direct Kalman filter, and requires you to predict/update for every IMU sample, rather than at the (I assume) slower camera framerate.
Regarding measurement equations for version (1), in any KF you can only predict things you can know from your state. The KF state in this case is a vector of error terms, and thus you can only predict things like "position error". As a result you need to pre-condition your measurements in z to be position errors. So make your measurement the difference between your "estimated true state" and your position from "noisy camera observations". This exact idea may be represented by the xHat input to the indirect KF. I don't know anything about the MATLAB/Simulink stuff going on there.
Regarding real-world considerations for the summing block (in red) I refer you to another answer about indirect Kalman filters.
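To make option (1) concrete, here is a minimal 1-D error-state sketch (all matrices and noise values are hypothetical): the nominal state freely integrates the IMU, the KF only ever estimates the error, and camera fixes arrive pre-conditioned as position-error measurements.

```python
import numpy as np

dt = 0.01
F = np.array([[1.0, dt], [0.0, 1.0]])   # error-state dynamics (pos, vel)
Q = np.diag([1e-6, 1e-4])               # process (IMU) noise covariance
H = np.array([[1.0, 0.0]])              # we can only observe position error
R = np.array([[0.05]])                  # camera measurement noise

def imu_step(x_nom, accel):
    # Free integration of the IMU outside the KF (the "mechanization").
    pos, vel = x_nom
    return np.array([pos + vel * dt, vel + accel * dt])

def predict(dx, P):
    # Indirect KF time update: propagate the error estimate.
    return F @ dx, F @ P @ F.T + Q

def camera_update(x_nom, dx, P, cam_pos):
    # Pre-condition the measurement: z is a *position error*.
    z = np.array([x_nom[0] - cam_pos])
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    dx = dx + K @ (z - H @ dx)
    P = (np.eye(2) - K @ H) @ P
    x_nom = x_nom - dx                  # summing block: true = nominal - error
    return x_nom, np.zeros(2), P        # reset the error state after injection
```

Between camera frames you would call imu_step at the IMU rate and predict at the KF rate; the key point is that the filter never sees raw positions, only errors, and the red summing block corresponds to the final x_nom - dx injection.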
Q1) Your Simulink model looks appropriate. Let me shed some light on quaternion-mechanization-based KFs, which I've worked on for navigation applications.
Since the Kalman filter is an elegant mathematical technique that borrows from the science of stochastics and measurement, it can help you reduce the noise in the system without the need to model the noise elaborately.
All KF systems start with some preliminary understanding of the model that you want to make free of noise. The measurements are fed back to evolve the states better (the measurement equation Y = CX). In your case, the states that you are talking about are errors in quaternions, which would be the four values dq1, dq2, dq3, dq4.
A KF working well in your application would accurately determine the attitude/orientation of the device by controlling the error around the quaternion. Quaternions describe the spatial orientation of any body, understood using a scalar and a vector, more specifically an angle and an axis.
The error equations that you are talking about are covariances, which contribute to the Kalman gain. The covariances denote spread around the mean, and they are useful in understanding how the central/average behavior of the system changes with time. Low covariances denote less deviation from the mean behavior for any system. As the KF cycles run, the covariances keep getting smaller.
The Kalman gain is finally used to compensate for the error between the estimates of the measurements and the actual measurements that are coming in from the camera.
Again, this elegant technique first ensures that the errors in the quaternion values converge around zero.
Q2) The EKF is a great technique to use as long as you have a non-linear measurement construction technique. Be very careful in using the EKF if there are too many transformations in your system, i.e. don't try to reconstruct measurements using transformations on your states; this seriously affects the model's sanctity, and since the noise covariances would not undergo similar transformations, there would be a chance of hitting a singularity as soon as the matrices become non-invertible.
You could look at constant-gain KF schemes, which would save you from covariance propagation and save substantial computational effort and time. These techniques are quite new and look very promising. They actively absorb P (error covariance), Q (model noise covariance) and R (measurement noise covariance) and work well with EKF schemes.