Kalman filter: how do the measurement noise covariance and process noise covariance matrices help the Kalman filter work? Can someone explain intuitively?

How do the process noise covariance and the measurement noise covariance help the Kalman filter work better?
Can someone explain intuitively, without significant equations and math, please?

Well, it's difficult to explain mathematical things (like Kalman filters) without mathematics, but here's my attempt:
There are two parts to a Kalman filter, a time update part and a measurement part. In the time update part we estimate the state at the time of observation; in the measurement part we combine (via least squares) our 'predictions' (i.e. the estimate from the time update) with the measurements to get a new estimate of the state.
So far, no mention of noise. There are two sources of noise: one in the time update part (sometimes called process noise) and one in the measurement part (observation noise). In each case what we need is a measure of the 'size' of that noise, i.e. the covariance matrix. These are used when we combine the predictions with the measurements. When we view our predictions as very uncertain (that is, they have a large covariance matrix) the combination will be closer to the measurements than to the predictions; on the other hand, when we view our predictions as very good (small covariance) the combination will be closer to the predictions than to the measurements.
So you could look upon the process and observation noise covariances as saying how much to trust (parts of) the predictions and observations. Increasing, say, the variance of a particular component of the predictions is to say: trust this prediction less; while increasing the variance of a particular measurement is to say: trust this measurement less. This is mostly an analogy but it can be made more precise. A simple case is when the covariance matrices are diagonal. In that case the cost, i.e. the contribution to what we are trying to minimise, of a difference between a measurement and the computed value is the square of that difference, divided by the observation's variance. So the higher an observation's variance, the lower the cost.
Note that out of the measurement part we also get a new state covariance matrix; this is used (along with the process noise and the dynamics) in the next time update when we compute the predicted state covariance.
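To make the 'how much to trust' idea concrete, here is a minimal scalar Matlab sketch (not from this answer; all numbers are illustrative) that tracks a constant value from noisy measurements. Q is the process noise variance and R the measurement noise variance; changing them shifts the estimate between following the prediction and following the measurements.

rng(0);
N      = 100;
x_true = 1.0;                      % the constant state we are trying to estimate
z      = x_true + 0.5*randn(N,1);  % noisy measurements

Q = 1e-4;                          % process noise variance: how much the state may drift
R = 0.25;                          % measurement noise variance: how noisy each z is

xhat = 0;                          % initial state estimate
P    = 1;                          % initial state variance
est  = zeros(N,1);
for k = 1:N
    P = P + Q;                     % time update: prediction uncertainty grows by Q
    K = P / (P + R);               % gain: near 1 -> trust measurement, near 0 -> trust prediction
    xhat = xhat + K*(z(k) - xhat); % measurement update
    P    = (1 - K)*P;
    est(k) = xhat;
end
plot(1:N, z, '.', 1:N, est, '-'); legend('measurements', 'estimate');

With a large R (or a small Q) the estimate barely reacts to individual measurements; making R small or Q large makes it follow the measurements closely.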
I think the question of why the covariance is the appropriate measure of the size of the noise is rather a deep one, as is why least squares is the appropriate way to combine the predictions and the measurements. The shallow answer is that Kalman filtering and least squares have been found, over decades (centuries in the case of least squares), to work well in many application areas. In the case of Kalman filtering I find the derivation from hidden Markov models ("From Hidden Markov Models to Linear Dynamical Systems" by T. Minka, though this is rather mathematical) convincing. In hidden Markov models we seek the (conditional) probability of the states given the measurements so far; Minka shows that if the measurements are linear functions of the states, the dynamics are linear, and all probability distributions are Gaussian, then we get the Kalman filter.

Related

Fourier transform and filtering the MD trajectory data instead of PCA for dimensionality reduction?

I was using PCA for dimensionality reduction of MD (molecular dynamics) trajectory data from some protein simulations. Basically my data is the xyz coordinates of protein atoms, which change with time (that means I have a lot of frames of these xyz coordinates). The dimension of this data is something like 20000 frames of 200x3 (atoms by coordinates). I implemented PCA using the princomp command in Matlab.
I was wondering if I can do an FFT on my data. I have experience doing FFT on audio signals (1D signals). Here my data has both time and space in the picture. It must be theoretically possible to apply an FFT to my data and then filter it using an LPF (low pass filter), but I am unsure.
Can someone give me some direction/code snippets/references towards implementing FFT on my data?
Why do people prefer PCA more often compared to FFT and filtering? Is it because of the computational efficiency of the algorithm, or because of the statistical nature of the underlying data?
For the first question "Can someone give me some direction/code snippets/references towards implementing FFT on my data?":
I should say that fft is implemented in Matlab and you do not need to implement it on your own. Also, for your case you can use fftn (fft documentation) to transform, apply lowpass filtering designed with designfilt (design filter in Matlab), and then apply ifftn (inverse fft in Matlab) to invert the transform.
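As a rough sketch of the simplest frequency-domain version of this (not from the answer above; it assumes the trajectory has been reshaped into an nFrames-by-nCoords matrix X, and the cutoff is a made-up example), one can low-pass each coordinate's time series directly with fft/ifft:

nFrames = size(X,1);
Xf = fft(X, [], 1);                 % FFT along the time dimension
cutoff = 50;                        % keep the first 50 frequency bins (illustrative)
keep = false(nFrames,1);
keep(1:cutoff+1) = true;            % DC and the low positive frequencies
keep(end-cutoff+1:end) = true;      % their conjugate-symmetric (negative) partners
Xf(~keep, :) = 0;                   % crude brick-wall low-pass filter
Xlp = real(ifft(Xf, [], 1));        % low-pass filtered trajectory

A filter designed with designfilt (as suggested above) and applied with filtfilt is the more careful route; the brick-wall version is only meant to show where the FFT and the low-pass step fit in.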
For the second question "Why do people prefer PCA more often compared to FFT and filtering ...":
I should say that since the filtering with the FFT is done in the frequency domain, after filtering you lose the ability to localize events in time. You can find more details about this drawback in this article.
But Fourier analysis also has some other serious drawbacks. One of them may be that time information is lost in transforming to the frequency domain. When looking at a Fourier transform of a signal, it is impossible to tell when a particular event has taken place. If it is a stationary signal, this drawback isn't very important. However, most interesting signals contain numerous non-stationary or transitory characteristics: drift, trends, abrupt changes, and beginnings and ends of events. These characteristics are often the most important part of the signal, and Fourier analysis is not suitable for detecting them.

Basic Hidden Markov Model, Viterbi algorithm

I am fairly new to Hidden Markov Models and I am trying to wrap my head around a pretty basic part of the theory.
I would like to use a HMM as a classifier, so, given a time series of data I have two classes: background and signal.
How are the emission probabilities estimated for each class? Does the Viterbi algorithm need a template of the background and signal to estimate prob(data|state)? Or have I completely missed the point?
To do classification with Viterbi you need to already know the model parameters.
Background and Signal are your two hidden states. With the model parameters and observed data you want to use Viterbi to calculate the most likely sequence of hidden states.
To quote the hmmlearn documentation:
The HMM is a generative probabilistic model, in which a sequence of observable X variables is generated by a sequence of internal hidden states Z. The hidden states are not observed directly. The transitions between hidden states are assumed to have the form of a (first-order) Markov chain. They can be specified by the start probability vector π and a transition probability matrix A. The emission probability of an observable can be any distribution with parameters θ conditioned on the current hidden state. The HMM is completely determined by π, A and θ.
There are three fundamental problems for HMMs:
Given the model parameters and observed data, estimate the optimal sequence of hidden states.
Given the model parameters and observed data, calculate the likelihood of the data.
Given just the observed data, estimate the model parameters.
The first and the second problem can be solved by the dynamic programming algorithms known as the Viterbi algorithm and the Forward-Backward algorithm, respectively. The last one can be solved by an iterative Expectation-Maximization (EM) algorithm, known as the Baum-Welch algorithm.
So we've got two states for our Hidden Markov model, noise and signal. We also must have something we observe, which could be ones and zeroes. Basically, ones are signal and zeroes are noise, but you get a few zeroes mixed in with your signal and a few ones with the noise. So you need to know:
Probability of 0,1 when in the "noise" state
Probability of 0,1 when in the "signal" state
Probability of transitioning to "signal" when in the "noise" state.
Probability of transitioning to "noise" when in the "signal" state.
So we keep track of the probability of each state for each time slot and, crucially, the most likely route we got there (based on the transition probabilities). Then we assume that the most probable state at the end of the time series is our actual final state, and trace backwards.
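Here is a small Matlab sketch of exactly that bookkeeping for the two-state noise/signal case described above; all the probabilities and the observation sequence are invented for illustration:

pi0 = [0.5 0.5];                 % initial probabilities [noise signal]
A   = [0.9 0.1;                  % transition probabilities, rows = current state
       0.2 0.8];
B   = [0.8 0.2;                  % emission probabilities: P(obs=0), P(obs=1) per state
       0.1 0.9];
obs = [0 0 1 0 1 1 1 0 1 1 0 0 0];
T   = numel(obs);
o   = obs + 1;                   % map symbols 0/1 to column indices 1/2

logd = -inf(2, T);               % best log-probability of ending in each state
back = zeros(2, T);              % backpointers to the most likely previous state
logd(:,1) = log(pi0') + log(B(:, o(1)));
for t = 2:T
    for s = 1:2
        [best, prev] = max(logd(:, t-1) + log(A(:, s)));
        logd(s, t) = best + log(B(s, o(t)));
        back(s, t) = prev;
    end
end
path = zeros(1, T);              % trace back from the most probable final state
[~, path(T)] = max(logd(:, T));
for t = T-1:-1:1
    path(t) = back(path(t+1), t+1);
end
disp(path)                       % 1 = noise, 2 = signal

If you have the Statistics toolbox, hmmviterbi does the same thing, and hmmtrain implements the Baum-Welch estimation mentioned below.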
The Viterbi algorithm requires knowing the HMM.
The HMM can be estimated by maximum-likelihood estimation (MLE); for HMMs the standard algorithm for this is the Baum–Welch algorithm.
If you have trouble with the Viterbi algorithm, there's a working implementation here.

Why does classifier accuracy drop after PCA, even though 99% of the total variance is covered?

I have a 500x1000 feature matrix and principal component analysis says that over 99% of the total variance is covered by the first component. So I replace each 1000-dimensional point by a 1-dimensional point, giving a 500x1 feature vector (using Matlab's pca function). But my classifier accuracy, which was initially around 80% with 1000 features, now drops to 30% with 1 feature, even though more than 99% of the variance is accounted for by this feature. What could be the explanation for this, or are my methods wrong?
(This question partly arises from my earlier question Significance of 99% of variance covered by the first component in PCA)
Edit:
I used Weka's principal components method to perform the dimensionality reduction and a support vector machine (SVM) classifier.
Principal Components do not necessarily have any correlation to classification accuracy. There could be a 2-variable situation where 99% of the variance corresponds to the first PC but that PC has no relation to the underlying classes in the data. Whereas the second PC (which only contributes to 1% of the variance) is the one that can separate the classes. If you only keep the first PC, then you lose the feature that actually provides the ability to classify the data.
In practice, smaller (lower variance) PCs often are associated with noise so there can be benefit in removing them but there is no guarantee of this.
Consider a case where you have two variables: a person's mass (in grams) and body temperature (in degrees Celsius). You want to predict which people have the flu and which do not. In this case, mass has a much greater variance but probably no correlation to the flu, whereas temperature, which has low variance, has a strong correlation to the flu. After the principal components transformation, the first PC will be strongly aligned with mass (since it has much greater variance), so if you dropped the second PC you would be losing almost all of your classification accuracy.
It is important to remember that Principal Components is an unsupervised transformation of the data. It does not consider labels of your training data when calculating the transformation (as opposed to something like Fisher's linear discriminant).
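As a toy illustration of the mass/temperature example above (a sketch with invented numbers, not your data), the first PC can carry essentially all the variance while the class separation lives entirely in the second PC:

rng(1);
n    = 200;
flu  = [zeros(n/2,1); ones(n/2,1)];        % class labels
mass = 70000 + 15000*randn(n,1);           % grams: huge variance, no class information
temp = 36.8 + 0.2*randn(n,1) + 1.5*flu;    % degrees C: tiny variance, very informative

[coeff, score, latent] = pca([mass temp]);
disp(latent / sum(latent))                 % PC1 explains essentially 100% of the variance

% The classes overlap completely along PC1 but separate cleanly along PC2:
subplot(1,2,1); histogram(score(flu==0,1)); hold on; histogram(score(flu==1,1)); title('PC1');
subplot(1,2,2); histogram(score(flu==0,2)); hold on; histogram(score(flu==1,2)); title('PC2');

A classifier trained on PC1 alone would be near chance here, even though PC1 covers over 99% of the variance.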

Matlab: Help in running toolbox for Kalman filter

I have an AR(1) model with $N=500$ data samples that is driven by a random input sequence $x$. The observation $y$ is corrupted with measurement noise $v$ of zero mean. The model is
y(t) = 0.195 y(t-1) + x(t) + v(t), where x(t) is generated as randn(). I am unsure how to represent this as a state space model and how to estimate the parameter $a$ and the states. I thought the state space representation would be
$d(t) = \mathbf{a}^T d(t-1) + x(t)$
$y(t) = \mathbf{h}^T d(t) + \sigma\, v(t)$
with $\sigma = 2$.
I cannot understand how to perform parameter and state estimation. Using the toolbox mentioned below, I checked that the KF equations match those in textbooks. However, the approach for parameter estimation is different. I would appreciate a recommendation on the implementation procedure.
Implementation 1:
I am following the implementation here: Learning Kalman Filter. This implementation does not use Expectation Maximization to estimate the parameters of the AR model, and it estimates the covariance of the process noise. In my case, I don't have process noise, but an input $x$.
Implementation 2: Kalman Filter by Kevin Murphy is another toolbox, which uses EM for parameter estimation of the AR model. Now, it is confusing since the two implementations use different approaches for parameter estimation.
I am having a tough time figuring out the correct approach, the state space model representation and the code. I would appreciate recommendations on how to proceed.
I ran the first implementation with the KalmanARSquareRoot technique and the result is completely different. There is Exponential Moving Average smoothing being performed and an MA filter of length 30 being used. The toolbox runs fine if I run the demo examples, but on changing the model the result is extremely poor. Maybe I am doing something wrong. Do I need to change the equations of the KF for my time series?
In the second implementation, I cannot figure out what and where to change the equations.
In general, if I have to use these tools, do I need to change the KF equations for every time series model? How do I write the equations myself if these toolboxes are inappropriate for the various time series models - AR, MA, ARMA?
I only have a bit of experience with Kalman Filters, so take this with a grain of salt.
It seems you shouldn't need to change the equations at all. Working with the second package (learn_kalman), you can create an A0 matrix of size [length(d(t)) length(d(t))]. C0 is the same, and in your case the initial state probably makes sense to be the identity matrix (unlike your A0). All you need to do is choose a good initial condition.
However, I took a look at your system (plotted an example) and it seems noise dominates your system. KF is an optimal estimator but I have not known it to reject that much noise. It only guarantees a reduced covariance...which means that if your system is mostly dominated by noise, you will calculate a bad model that estimates your system given the noise!
Try plotting [d f], where d is the initial data and f is calculated from the estimated model with no noise, using the recursive formulas
f(t) = A * f(t-1)
y(t) = C * f(t)
That is, pretend there is no noise but use the estimated AR model. You will see that it rejects all the noise and 'technically' models the system well (since the only unique behaviour is at the very beginning).
For example, if you have a system with A = 0.195 and Q = R = 0.1, you will converge to an estimate A = 0.2207, but that still isn't good enough. Here the problem is that your initial state is so low that within a few steps the data is essentially at 0 plus noise. Naturally the KF can converge to a LOT of model solutions that are similar, and any noise will throw off even the best initial condition.
If you increase the resolution of your data in some way (e.g. a larger initial condition, more refined timesteps) you will see a good fit. For example, change your initial condition to 110 and you'll find the two curves similar, though the model is still fairly different.
I am not aware of any approach to model your data well. If the noise variance is in fact 1 and your system converges to 0 that quickly, it seems doomed to not be effectively modelled since you just don't capture any unique behaviour in the dataset.
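For what it's worth, here is a rough Matlab sketch of the experiment suggested above (the initial condition of 110 and the estimate A = 0.2207 are taken from this answer; the noise levels are illustrative): simulate the AR(1) system, then overlay the noise-free response of the estimated model.

rng(0);
N      = 500;
a_true = 0.195;
x = randn(N,1);                    % driving input
v = randn(N,1);                    % measurement noise (unit variance, illustrative)
d = zeros(N,1);
d(1) = 110;                        % large initial condition, as suggested above
for t = 2:N
    d(t) = a_true*d(t-1) + x(t);
end
y = d + v;                         % noisy observations

A_est = 0.2207;                    % estimate quoted above
f = zeros(N,1);
f(1) = d(1);                       % start the model run from the same initial value
for t = 2:N
    f(t) = A_est*f(t-1);           % noise-free recursion f(t) = A*f(t-1)
end
plot(1:N, y, 1:N, f);
legend('noisy data', 'noise-free model response');

After the initial transient both curves sit near zero, which is exactly the 'no unique behaviour left' problem described above.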

Deconvolution of data convolved by a Gaussian response

I have a set of experimental data s(t) which consists of a vector (with 81 points as a function of time t).
From the physics, this is the result of the convolution of the system response e(t) with a probe p(t), which is a Gaussian (actually a laser pulse). In terms of vector, its FWHM covers approximately 15 points in time.
I want to deconvolve this data in Matlab using the convolution theorem: FT{e(t)*p(t)}=FT{e(t)}xFT{p(t)} (where * is the convolution, x the product and FT the Fourier transform).
The procedure itself is no problem: if I suppose a Dirac function as my probe, I recover exactly the initial signal (which makes sense, since measuring a system with a Dirac gives its impulse response).
However, the case of a Gaussian probe, as far as I understand, turns out to be a critical one. When I divide the signal in Fourier space by the FT of the probe, the small values in the wings of the Gaussian strongly amplify those frequencies and I completely lose my initial signal instead of obtaining a deconvolved one.
From your experience, which method could be used here (like Hamming windows or any windowing technique, or ...)? This looks rather simple, but I did not find any easy way to follow in signal processing and this is not my field.
You have noise in your experimental data, don't you? The problem is then ill-posed (not uniquely solvable) and you need regularization.
If the noise is Gaussian the keywords are Tikhonov regularization or Wiener filtering.
Basically, add a positive regularization factor that acts as a lowpass filter. In your notation the estimate o(t) of the true curve e(t) then becomes
o(t) = FT^-1( FT(s) * conj(FT(p)) / ( abs(FT(p))^2 + l ) )
with a suitable l > 0.
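A rough Matlab sketch of that formula (assuming s and p are your measured signal and the Gaussian probe sampled on the same 81-point grid; the value of l is a made-up starting point):

p = p / sum(p);                        % normalize the probe
S = fft(s);
P = fft(p);
l = 1e-2;                              % regularization: larger = smoother, more bias
O = S .* conj(P) ./ (abs(P).^2 + l);   % regularized division in the Fourier domain
o = real(ifft(O));                     % estimated response

% Note: if the probe is centred in the middle of the vector, the result is
% circularly shifted; using ifftshift(p) before the fft avoids this.
% As l -> 0 this reduces to plain division and blows up the noise at the
% frequencies where abs(P) is tiny (the wings of the Gaussian).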
You're trying to do a deconvolution assuming the filter model is a Gaussian blur.
A few notes for doing deconvolution:
Since your data is real (not synthetic) data, it includes some kind of noise.
Hence it is better to use the Wiener filter (even with the assumption of low-variance noise). Otherwise, the "deconvolution filter" will amplify the noise significantly (as it is basically a high-pass filter).
When doing the division in the Fourier domain, zero-pad the signals to the correct size, or better yet create the Gaussian filter in the time domain with the same number of samples as the signal.
Boundaries will create artifacts; windowing might be useful.
There are many more sophisticated methods for deconvolution that define a more elaborate model of the signal and the noise. If you have more prior information about them, you should look for that kind of framework.
You can always set a threshold on the amplification level for certain frequencies; do that if needed.
Use as many samples as you can.
I hope this will assist you.