How does the covariance matrix (P) in a Kalman filter get updated in relation to measurements and the state estimate?

I am in the midst of implementing a Kalman-filter-based AHRS in C++. There's something rather strange to me in the equations of the filter.
I can't find the part where the P (covariance) matrix is actually updated to represent the uncertainty of the predictions.
During the "predict" step, the P estimate is calculated from its previous value, A and Q. From what I understand, A (the system matrix) and Q (the noise covariance) are constant. Then during the "correct" step, P is calculated from K, H and the predicted P. H (the observation matrix) is constant, so the only variable that affects P is K (the Kalman gain). But K is calculated from the predicted P, H and R (the observation noise), which are either constants or P itself. So where is the part of the equations that makes P relate to x? To me it seems like P is looping recursively here, depending only on the constants and the initial value of P. This doesn't make any sense. What am I missing?

You are not missing anything.
It can come as a surprise to realise that, indeed, the state error covariance matrix (P) in a linear Kalman filter does not depend on the data (z). One way to lessen the surprise is to note what the covariance is saying: it is how uncertain you should be in the estimated state, given that the models you are using (effectively A, Q and H, R) are accurate. It is not saying: this is the uncertainty. By judicious tweaking of Q and R you could change P arbitrarily. In particular, you should not interpret P as a 'quality' figure; rather, look at the observation residuals. You could, for example, make P smaller by reducing R, but then the residuals would be larger compared with their computed standard deviations.
When the observations come in at a constant rate, and are always the same set of observations, P will tend to a steady state that could, in principle, be computed ahead of time.
However, there is no difficulty in applying the Kalman filter when you have varying times between observations and varying sets of observations at each time, for example if you have several sensor systems with different sampling periods. In this case you will see more variation in P, though again this could in principle be computed ahead of time.
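As a minimal illustration of that steady-state behaviour, here is a scalar sketch (all numbers made up) showing that P converges without any measurement value ever entering the recursion:
% Scalar Kalman filter covariance recursion: no data appears anywhere.
A = 1; Q = 0.01; H = 1; R = 0.1;  % constant models (made-up values)
P = 1;                            % any positive initial uncertainty
for k = 1:100
    P = A*P*A' + Q;               % predict
    K = P*H'/(H*P*H' + R);        % gain
    P = (1 - K*H)*P;              % correct
end
disp(P)                           % steady-state value, independent of the data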
Further, the Kalman filter can be extended (in various ways, e.g. the extended Kalman filter and the unscented Kalman filter) to handle nonlinear dynamics and nonlinear observations. In that case, because the transition matrix (A) and the observation model matrix (H) have a state dependency, so too will P.

Related

Computing the SVD of a rectangular matrix

I have a matrix M of size K x N, where K = 49152 is the dimension of the problem and N = 52 is the number of observations.
I have tried to use [U,S,V] = svd(M), but doing this I run out of memory.
I found other code which uses [U,S,V] = svd(cov(M)) and it works well. My questions are: what is the meaning of using the cov(M) command inside the SVD, and what is the meaning of the resulting [U,S,V]?
Finding the SVD of the covariance matrix is a way to perform Principal Components Analysis, or PCA for short. I won't get into the mathematical details here, but PCA performs what is known as dimensionality reduction. If you'd like a more formal treatment of the subject, you can read up on my post about it here: What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?. Simply put, dimensionality reduction projects your data stored in the matrix M onto a lower-dimensional surface with the least amount of projection error. In this matrix, we are assuming that each column is a feature or a dimension and each row is a data point. I suspect the reason you are getting more memory occupied by applying the SVD to the actual data matrix M itself rather than to the covariance matrix is that you have a large number of data points with a small number of features. The covariance matrix finds the covariance between pairs of features. If M is an m x n matrix, where m is the total number of data points and n is the total number of features, doing cov(M) would actually give you an n x n matrix, so you are applying the SVD to a small amount of memory in comparison to M.
As for the meaning of U, S and V, for dimensionality reduction specifically, the columns of V are what are known as the principal components. The ordering of V is such that the first column is the first axis of your data, the one that describes the greatest amount of variability possible. As you move from the second column up to the nth column, you introduce more axes into your data and the variability decreases. Eventually, when you hit the nth column, you are essentially describing your data in its entirety without reducing any dimensions. The diagonal values of S denote what is called the variance explained, and they follow the same ordering as V: as you progress through the singular values, they tell you how much of the variability in your data is described by each corresponding principal component.
To perform the dimensionality reduction, you can either take U and multiply by S, or take your mean-subtracted data and multiply by V. In other words, supposing X is the matrix M where each column has had its mean subtracted from it, the following relationship holds:
US = XV
To actually perform the final dimensionality reduction, you take either US or XV and retain the first k columns, where k is the total number of dimensions you want to retain. The value of k depends on your application, but many people choose k to be the total number of principal components that explain a certain percentage of the variability in your data.
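Putting this together, a minimal sketch of the pipeline (toy data and my own variable names) would be:
% PCA via the SVD of the covariance matrix (toy data for illustration)
M = randn(100, 52);                % 100 data points, 52 features
X = bsxfun(@minus, M, mean(M, 1)); % mean-subtract each column
[U, S, V] = svd(cov(M));           % columns of V are the principal components
k = 10;                            % number of dimensions to keep
Z = X * V(:, 1:k);                 % data projected onto the first k components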
For more information about the link between SVD and PCA, please see this post on Cross Validated: https://stats.stackexchange.com/q/134282/86678
Instead of [U, S, V] = svd(M), which tries to build a matrix U that is 49152 by 49152 (= 18 GB 😱!), do svd(M, 'econ'). That returns the “economy-class” SVD, where U will be 52 by 52, S is 52 by 52, and V is also 52 by 52.
cov(M) will remove each dimension’s mean and evaluate the inner product, giving you a 52 by 52 covariance matrix. You can implement your own version of cov, called mycov, as
function [C] = mycov(M)
M = bsxfun(@minus, M, mean(M, 1)); % subtract each dimension's mean over all observations
C = M' * M / (size(M, 1) - 1);     % normalise by N-1 to match MATLAB's cov
end
(You can verify this works by looking at mycov(randn(49152, 52)), which should be close to eye(52), since each element of that array is IID-Gaussian.)
There are a lot of magical linear-algebraic properties and relationships between the SVD and the EVD (i.e., the singular value and eigenvalue decompositions): because the covariance matrix cov(M) is a Hermitian matrix, its left- and right-singular vectors are the same, and are in fact also cov(M)’s eigenvectors. Furthermore, cov(M)’s singular values are also its eigenvalues: so svd(cov(M)) is just an expensive way to get eig(cov(M)) 😂, up to ±1 and reordering.
As @rayryeng explains at length, usually people look at svd(M, 'econ') because they want eig(cov(M)) without needing to evaluate cov(M), since forming cov(M) explicitly is numerically unstable. I recently wrote an answer that showed, in Python, how to compute eig(cov(M)) using svd(M2, 'econ'), where M2 is the zero-mean version of M, used in the practical application of color-to-grayscale mapping, which might help you get more context.
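For instance, here is a toy sketch of that trick (my own example, not from the linked answer):
% Eigen-decomposition of cov(M) from an economy SVD, without forming cov(M)
M = randn(1000, 52);                      % toy data
M2 = bsxfun(@minus, M, mean(M, 1));       % zero-mean version of M
[~, S, V] = svd(M2, 'econ');
eigvals = diag(S).^2 / (size(M2, 1) - 1); % eigenvalues of cov(M)
eigvecs = V;                              % eigenvectors of cov(M)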

Kalman Filter prediction error estimation: why two constants and transposed matrices?

Hi everybody!
I have found a very informative and good tutorial for understanding the Kalman filter. In the end, I would like to understand the extended Kalman filter in the second half of the tutorial, but first I want to clear up every mystery.
Kalman Filter tutorial Part 6.
I think we use a constant for the prediction error because the new value at a given time moment k can be different from the previous one. But why do we multiply by the constant twice? It says:
we multiply twice by a because the prediction error pk is itself a squared error; hence, it is scaled by the square of the coefficient associated with the state value xk.
I can't see the meaning of this sentence.
And later, in the EKF, he creates a matrix and the transpose of that matrix (in Part 12). Why the transposed one?
Thanks a lot.
The Kalman filter maintains error estimates as variances, which are squared standard deviations. When you multiply a Gaussian random variable N(x, p) by a constant a, you scale its standard deviation by a factor of a, which means its variance increases as a^2. He writes this as a*p*a to maintain a parallel structure when he converts from a scalar state to a matrix state. If you have an error covariance matrix P representing state x, then the error covariance of Ax is APA^T, as he shows in Part 12. It's a convenient shorthand for doing that calculation. You can expand the matrix multiplication by hand to see that the coefficients all go in the right place.
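Spelled out: if Var(x) = p, then Var(a*x) = E[(a*x - a*mu)^2] = a^2 * p = a*p*a. The matrix case is the same computation: Cov(A*x) = E[(A*x - A*mu)(A*x - A*mu)'] = A E[(x - mu)(x - mu)'] A' = A*P*A'.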
If any of this is fuzzy to you, I strongly recommend you read a tutorial on Gaussian random variables. Between x and P in a Kalman filter, your success depends a lot more on your understanding of P than of x, even though most people get started by being interested in improving x.

Numerical instability of the Kalman filter in MATLAB

I am trying to run a standard Kalman filter algorithm to calculate likelihoods, but I keep getting the problem of a non-positive-definite variance matrix when calculating normal densities.
I've researched a little and seen that there may in fact be some numerical instability; I have tried some numerical ways to avoid a non-positive-definite matrix, using both the Cholesky decomposition and its variant, the LDL' decomposition.
I am using MATLAB.
Does anyone have any suggestions?
Thanks.
I have encountered what might be the same problem before, when I needed to run a Kalman filter for long periods and over time my covariance matrix would degenerate. It might just be a problem of losing symmetry due to numerical error. One simple way to force your covariance matrix (let's call it P) to remain symmetric is to do:
P = (P + P')/2; % where P' is transpose(P)
right after estimating P.
Post your code.
As a rule of thumb, if the model is not accurate and the regularization (i.e. the model noise matrix Q) is not sufficiently "large", underfitting will occur and the covariance matrix of the estimator will be ill-conditioned. Try fine-tuning your Q matrix.
The Kalman filter implemented in the straightforward textbook form is known to be numerically unstable, as any old-timer who once worked with a single-precision implementation of the filter can tell you. This problem was discovered zillions of years ago and prompted a lot of research into implementing the filter in a stable manner; the Joseph-form covariance update is one such remedy. Probably the best-known implementation is the UD filter, where the covariance matrix is factorized as UDU' and the two factors are updated and propagated using special formulas (see Thornton and Bierman). U is an upper triangular matrix with 1s on its diagonal, and D is a diagonal matrix.
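For reference, a sketch of the Joseph-form update, which preserves symmetry and positive semi-definiteness much better than the plain P = (I - K*H)*P (toy 2-state, 1-measurement example; all numbers made up):
% Joseph-form covariance update (toy example)
P = [1.0 0.1; 0.1 1.0];              % predicted covariance
H = [1 0];                           % observation matrix
R = 0.5;                             % measurement noise variance
K = P*H'/(H*P*H' + R);               % Kalman gain
I = eye(2);
P = (I - K*H)*P*(I - K*H)' + K*R*K'; % symmetric and PSD by construction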

Stochastic spread method for pairs trading by Elliott et al. (2005) - Kalman filter + EM algorithm in MATLAB, am I doing something wrong?

I am implementing the stochastic spread method for pairs trading by Elliott et al. (2005).
The procedure consists of modeling the spread between two stocks, log(P1) - log(P2), as a mean-reverting process, calibrated from market observations.
The hidden state process for the spread can be written like this:
x_{t+1} = A + Bx_t + Ce_{t+1}
The observation process is:
y_t = x_t + D*w_t
Both e_t and w_t are i.i.d. Gaussian N(0,1).
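For context, the model can be simulated like this (parameter values are made up for illustration):
% Simulating the hidden spread and its observation (made-up parameters)
A = 0.1; B = 0.9; C = 0.05; D = 0.05;
T = 252;
x = zeros(T, 1); y = zeros(T, 1);
x(1) = A/(1 - B);                  % start at the long-run mean
y(1) = x(1) + D*randn;
for t = 2:T
    x(t) = A + B*x(t-1) + C*randn; % hidden mean-reverting state
    y(t) = x(t) + D*randn;         % observed spread
end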
Elliott gives the Kalman filter equations in his paper, which I have implemented in my code for the updating step:
function [xt_t,st_t,xt_tm,kt,st_tm]=EMupdate(DATA_t,xt_t_m1,st_t_m1,A,B,C2,D2)
st_tm = B^2*st_t_m1 + C2;           % predicted MSE s_{t|t-1}
kt = st_tm/(st_tm + D2);            % Kalman gain k_t
xt_tm = A + B*xt_t_m1;              % predicted state x_{t|t-1}
xt_t = xt_tm + kt*(DATA_t - xt_tm); % filtered state x_{t|t}
st_t = st_tm - kt*st_tm;            % filtered MSE s_{t|t}
end
where
xt_t is x_{t|t}
xt_t_m1 is x_{t-1|t-1}
xt_tm is x_{t|t-1}
st_t is s_{t|t} (the MSE, denoted as P in e.g. Hamilton (1994))
st_t_m1 is s_{t-1|t-1}
st_tm is s_{t|t-1}
kt is the kalman gain for time t
DATA_t is the observed data for time t, y_t
A, B, C2, D2 are the estimated parameters (which I have estimated using the EM algorithm in another code).
This update step is done every time a new data point arrives. I am storing all the x's, s's and k's in vectors. I am supposed to compare y_t with x_{t|t-1}, and given a large deviation between the two, a trade should be initiated. However, the two follow each other very closely, and I am unsure whether I have done something wrong:
Can someone see if I am doing something wrong?
Please tell me if I should link more of my code.
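For reference, my driving loop is essentially the following (with made-up placeholder values standing in for my EM estimates):
% Driving loop for EMupdate (parameters are placeholders for the EM output)
A = 0; B = 0.9; C2 = 0.01; D2 = 0.01; % made-up parameters
y = randn(252, 1);                    % toy spread; in practice log(p1) - log(p2)
T = numel(y);
xt = zeros(T, 1); st = zeros(T, 1); xtm = zeros(T, 1); kt = zeros(T, 1);
xt(1) = y(1);                         % x_{1|1} = y(1)
st(1) = D2;                           % P_{1|1} = D2
for t = 2:T
    [xt(t), st(t), xtm(t), kt(t)] = EMupdate(y(t), xt(t-1), st(t-1), A, B, C2, D2);
end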
UPDATE: My procedure: (P is the same as s above)
To generate the spread between two stocks, I take the difference between the log-prices: y=log(p1)-log(p2).
I set a training period of 252 days, where I estimate the initial parameters (A, B, C2 and D2) using the EM algorithm. I implement the EM algorithm using all the data for the training period; that is y(1), y(2), ..., y(252) as well as initial guesses for A, B, C2 and D2:
2a. I set x_{1|1}=y(1). Furthermore I set the MSE, P_{1|1}=D2, my initial guess for D^2.
2b. I recursively calculate Kalman filters, x_{t|t}, x_{t+1|t}, P_{t|t}, P_{t+1|t} and k_{t} for all t=1...252 (the entire training period) using my initial guesses for A, B, C2 and D2.
2c. After I have calculated the kalman filters for the entire training period, I (backward) recursively calculate Kalman smoothers for the entire training period as well: t=1...252. These are x_{t|T}, P_{t|T}, P_{t,t-1|T} and j_{t}.
I then compute the log-likelihood value and the updated values for A, B, C2 and D2. Then I repeat the steps from 1 until the log-likelihood converges and I obtain optimal values for A, B, C2 and D2.
Is it correct to calculate Kalman filters for the entire training period before starting to calculate Kalman smoothers? Or should I, for example, calculate Kalman filters up till t=2, then Kalman smoothers for T=2, then Kalman filters up till t=3, then smoothers for T=3 etc.?
Now I have values for A, B, C2 and D2 and can begin my test period, also 252 days. I don't update my estimates for A, B, C2 and D2, but keep them constant. For each new observation I can compute the Kalman filter (the same as in 2b). Finally, I can compare y(t) to x_{t|t-1} for the test period.
My results look like this:
While a paper by Chen, Ren and Lu has the following results:
NB: Not the same security... but the difference is obvious nonetheless.
It seems that either you're underestimating the noise variance from the training data, or your training data is not stationary over your training window. Try increasing the noise variance and you'll see that the filter actually smooths the time series. Your current underestimation of the noise variance leads the Kalman filter to "forget" the past and give your last sample a high weighting.
Checking this is quite easy: increase the measurement noise/error variance (the matrix R in the Kalman filter) and see how it affects the output.
If the model is not linear-Gaussian, the Kalman filter will not be optimal. However, it should still smooth your data, so keep "training" it until it provides acceptable predictions.

local inverse of a neural network

I have a neural network with N input nodes and N output nodes, and possibly multiple hidden layers and recurrences in it, but let's forget about those first. The goal of the neural network is to learn an N-dimensional variable Y*, given an N-dimensional value X. Let's say the output of the neural network is Y, which should be close to Y* after learning. My question is: is it possible to get the inverse of the neural network for the output Y*? That is, how do I get the value X* that would yield Y* when put into the neural network? (Or something close to it.)
A major part of the problem is that N is very large, typically on the order of 10000 or 100000, but if anyone knows how to solve this for small networks with no recurrences or hidden layers, that might already be helpful. Thank you.
If you can choose the neural network such that the number of nodes in each layer is the same, the weight matrices are non-singular, and the transfer function is invertible (e.g. leaky ReLU), then the function will be invertible.
This kind of neural network is simply a composition of matrix multiplication, addition of a bias, and a transfer function. To invert it, you just apply the inverse of each operation in reverse order: take the output, apply the inverse transfer function, subtract the bias, multiply by the inverse of the last weight matrix; then apply the inverse transfer function again, subtract the bias, multiply by the inverse of the second-to-last weight matrix, and so on and so forth.
This is a task that might be solved with autoencoders. You also might be interested in generative models like Restricted Boltzmann Machines (RBMs), which can be stacked to form Deep Belief Networks (DBNs). RBMs build an internal model h of the data v that can be used to reconstruct v. In DBNs, the h of the first layer becomes the v of the second layer, and so on.
zenna is right.
If you are using bijective (invertible) activation functions, you can invert layer by layer: invert the activation, subtract the bias, and multiply by the pseudoinverse (if you have the same number of neurons in every layer, this is also the exact inverse, under some mild regularity conditions).
To repeat the conditions: dim(X) == dim(Y) == dim(layer_i), and det(W_i) != 0.
An example:
Y = tanh( W2*tanh( W1*X + b1 ) + b2 )
X = W1p*( tanh^-1( W2p*( tanh^-1(Y) - b2 ) ) - b1 ), where W1p and W2p denote the pseudoinverse matrices of W1 and W2, respectively.
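A quick numerical check of this inversion (my own toy sizes; random square weights are invertible with probability 1, so the pseudoinverse equals the inverse):
% Verifying the layer-by-layer inversion on a toy two-layer network
W1 = randn(4); b1 = randn(4, 1);
W2 = randn(4); b2 = randn(4, 1);
X = 0.1*randn(4, 1);
Y = tanh(W2*tanh(W1*X + b1) + b2);            % forward pass
Xr = W1 \ (atanh(W2 \ (atanh(Y) - b2)) - b1); % invert layer by layer
norm(X - Xr)                                  % ~0 up to round-off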
The following paper is a case study in inverting a function learned by a neural network. It is a case study from industry and looks like a good starting point for understanding how to set up the problem.
An alternative way of approaching the task of getting a desired x that yields a desired y would be to start with a random x (or an input as a seed), then through gradient descent (a similar algorithm to backpropagation, the difference being that instead of finding the derivatives of the weights and biases, you find the derivatives of x; also, mini-batching is not needed) repeatedly adjust x until it yields a y that is close to the desired y. This approach has the advantage that it allows the input of a seed (a starting x, if not randomly selected). Also, I have a hypothesis that the final x will have some similarity to the initial x (the seed), which would imply that this algorithm has the ability to transpose, depending on the context of the neural network application.
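A minimal sketch of that idea for a one-layer network (everything here, including the step size and iteration count, is made up for illustration):
% Gradient descent on the input x of a fixed network y = tanh(W*x + b)
W = randn(4); b = randn(4, 1);
f = @(x) tanh(W*x + b);
y_target = f(0.1*randn(4, 1));               % a reachable target output
x = randn(4, 1);                             % random seed input
for it = 1:5000
    y = f(x);
    g = W' * ((y - y_target) .* (1 - y.^2)); % grad of 0.5*||y - y_target||^2 w.r.t. x
    x = x - 0.1*g;                           % fixed step size
end
norm(f(x) - y_target)                        % should be small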