Matlab: Help in running toolbox for Kalman filter - matlab

I have an AR(1) model with $N=500$ data samples, driven by a random input sequence $x$. The observation $y$ is corrupted by zero-mean measurement noise $v$. The model is
y(t) = 0.195y(t-1) + x(t) + v(t), where x(t) is generated with randn(). I am unsure how to represent this as a state space model and how to estimate the parameter $a$ and the states. My attempt at the state space representation is
d(t) = \mathbf{a^T} d(t-1) + x(t)
y(t) = \mathbf{h^T} d(t) + \sigma v(t)
with $\sigma = 2$.
I cannot understand how to perform parameter and state estimation. Using the toolboxes mentioned below, I checked that their KF equations match those in textbooks. However, their approaches to parameter estimation differ. I would appreciate a recommendation for the implementation procedure.
Implementation 1:
I am following the implementation here: Learning Kalman Filter. This implementation does not use Expectation Maximization to estimate the parameters of the AR model, and it estimates the covariance of the process noise. In my case, I don't have process noise, but an input $x$.
Implementation 2: Kalman Filter by Kevin Murphy is another toolbox, which uses EM for parameter estimation of the AR model. This is confusing, since the two implementations use different approaches for parameter estimation.
I am having a tough time figuring out the correct approach, the state space model representation and the code. I would appreciate recommendations on how to proceed.
I ran the first implementation with the KalmanARSquareRoot technique and the result is completely different. It performs Exponential Moving Average smoothing and uses an MA filter of length 30. The toolbox runs fine on the demo examples, but when I change the model the result is extremely poor. Maybe I am doing something wrong. Do I need to change the KF equations for my time series?
In the second implementation, I cannot figure out what to change in the equations, or where.
In general, if I have to use these tools, do I need to change the KF equations for every time series model? How do I write the equations myself if these toolboxes are not appropriate for every time series model (AR, MA, ARMA)?

I only have a bit of experience with Kalman Filters, so take this with a grain of salt.
It seems you shouldn't need to change the equations at all. Working with the second package (learn_kalman), you can create an A0 matrix of size [length(d(t)) length(d(t))]. C0 is the same size, and in your case its initial value probably makes sense as the identity matrix (unlike your A0). All you need to do is choose a good initial condition.
However, I took a look at your system (plotted an example) and it seems noise dominates your system. KF is an optimal estimator but I have not known it to reject that much noise. It only guarantees a reduced covariance...which means that if your system is mostly dominated by noise, you will calculate a bad model that estimates your system given the noise!
Try plotting [d f], where d is the original data and f is calculated with the noise-free recursion of the estimated model:
f(t) = A * f(t-1)
y(t) = C * f(t)
That is, pretend as if there is no noise but using the estimated AR model. You will see that it rejects all the noise and 'technically' models the system well (since the only unique behaviour is at the very beginning).
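The comparison described above can be sketched as follows. This is a minimal illustration in plain Python rather than MATLAB. The function names, the unit initial state, the unit noise standard deviation and the estimated coefficient 0.2207 are assumptions chosen for illustration, not output from either toolbox.

```python
import random

def simulate_ar1(a, n, y0, noise_std, seed=0):
    """Simulate y(t) = a*y(t-1) + x(t), with x(t) ~ N(0, noise_std^2)."""
    rng = random.Random(seed)
    y = [y0]
    for _ in range(n - 1):
        y.append(a * y[-1] + rng.gauss(0.0, noise_std))
    return y

def noise_free_response(a_hat, c_hat, n, f0):
    """Iterate f(t) = A*f(t-1) and record y(t) = C*f(t), with no noise terms."""
    f = f0
    out = []
    for _ in range(n):
        out.append(c_hat * f)
        f = a_hat * f
    return out

# Noisy "data" from the true model (assumed values).
d = simulate_ar1(a=0.195, n=500, y0=1.0, noise_std=1.0)
# Noise-free response of a hypothetical estimated model.
f = noise_free_response(a_hat=0.2207, c_hat=1.0, n=500, f0=1.0)
```

With these assumed values the noise-free response decays to almost zero within about five samples, while the data keeps fluctuating with roughly unit variance, so the two curves only share informative behaviour at the very start.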
For example, if you have a system with A = 0.195 and Q = R = 0.1, then you will converge to A = 0.2207, but it still isn't good enough. The problem is that your initial state is so low that within a few steps the data is essentially at 0, apart from the noise. Naturally, KF can converge to a LOT of similar model solutions. Any noise will throw off even the best initial condition.
If you increase the resolution of your data in some way (e.g. a larger initial condition, finer timesteps), you will see a good fit. For example, change your initial condition to 110 and you'll find the two curves similar, though the model is still fairly different.
I am not aware of any approach to model your data well. If the noise variance is in fact 1 and your system converges to 0 that quickly, it seems doomed to not be effectively modelled since you just don't capture any unique behaviour in the dataset.

Related

How to guarantee convergence when training a neural differential equation?

I'm currently working through the SciML tutorials workshop exercises for the Julia language (https://tutorials.sciml.ai/html/exercises/01-workshop_exercises.html). Specifically, I'm stuck on exercise 6 part 3, which involves training a neural network to approximate the system of equations
function lotka_volterra(du,u,p,t)
    x, y = u
    α, β, δ, γ = p
    du[1] = dx = α*x - β*x*y
    du[2] = dy = -δ*y + γ*x*y
end
The goal is to replace the equation for du[2] with a neural network: du[2] = NN(u, p)
where NN is a neural net with parameters p and inputs u.
I have a set of sample data that the network should try to match. The loss function is the squared difference between the network model's output and that sample data.
I defined my network with
NN = Chain(Dense(2,30), Dense(30, 1))
I can get Flux.train! to run, but the problem is that sometimes the initial parameters for the neural network result in a loss on the order of 10^20 and so training never converges. My best try got the loss down from about 2000 initially to about 20 using the ADAM optimizer over about 1000 iterations, but I can't seem to do any better.
How can I make sure my network is consistently trainable, and is there a way to get better convergence?
See the FAQ page on techniques for improving convergence. In a nutshell, the single shooting approach of most ML papers is very unstable and does not work on most practical problems, but there are a litany of techniques to help out. One of the best ones is multiple shooting, which optimizes only short bursts (in parallel) along the time series.
Training on a small interval and then growing the interval also works, as can using more stable optimizers (e.g. BFGS). You can also weight the loss function so that earlier times count more. Lastly, you can minibatch in a way similar to multiple shooting, i.e. start from a data point and only solve to the next one (in fact, if you look at the NumPy code of the original neural ODE paper, they do not run the algorithm as explained but instead use this form of sampling to stabilize the spiral ODE training).
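To make the "start from a data point and only solve a short stretch" idea concrete, here is a rough sketch in plain Python rather than Julia. It is not the DiffEqFlux API. The fixed-step RK4 integrator, the function names, the parameter values and the segment length are all assumptions made for illustration. The known Lotka-Volterra right-hand side stands in for the neural network, and the loss is the simplified segment-wise variant, without the continuity penalty that full multiple shooting adds.

```python
def lotka_volterra(u, p):
    """Right-hand side of the Lotka-Volterra system (stand-in for the model)."""
    x, y = u
    alpha, beta, delta, gamma = p
    return (alpha * x - beta * x * y, -delta * y + gamma * x * y)

def rk4_step(f, u, p, dt):
    """One classical fixed-step Runge-Kutta 4 step."""
    k1 = f(u, p)
    k2 = f(tuple(ui + 0.5 * dt * ki for ui, ki in zip(u, k1)), p)
    k3 = f(tuple(ui + 0.5 * dt * ki for ui, ki in zip(u, k2)), p)
    k4 = f(tuple(ui + dt * ki for ui, ki in zip(u, k3)), p)
    return tuple(
        ui + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for ui, a, b, c, d in zip(u, k1, k2, k3, k4)
    )

def solve(f, u0, p, dt, n_steps):
    """Integrate from u0 and return the trajectory including the start point."""
    traj = [u0]
    for _ in range(n_steps):
        traj.append(rk4_step(f, traj[-1], p, dt))
    return traj

def segment_loss(f, p, data, dt, segment_len):
    """Squared error summed over short segments, each restarted from the data."""
    total = 0.0
    for start in range(0, len(data) - 1, segment_len):
        stop = min(start + segment_len, len(data) - 1)
        pred = solve(f, data[start], p, dt, stop - start)
        for u_pred, u_obs in zip(pred[1:], data[start + 1:stop + 1]):
            total += sum((a - b) ** 2 for a, b in zip(u_pred, u_obs))
    return total

true_p = (1.5, 1.0, 3.0, 1.0)  # assumed parameter values
data = solve(lotka_volterra, (1.0, 1.0), true_p, 0.01, 300)
loss_true = segment_loss(lotka_volterra, true_p, data, 0.01, 20)
loss_wrong = segment_loss(lotka_volterra, (1.2, 1.0, 3.0, 1.0), data, 0.01, 20)
```

Because every segment restarts from an observed state, a poor parameter guess can only drift for `segment_len` steps before being reset, which keeps the loss and its gradients bounded compared with integrating over the whole time span in one shot.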

Kalman Filter: How do the measurement noise covariance matrix and the process noise help a Kalman filter work? Can someone explain intuitively?

How do the process noise covariance and the measurement noise covariance help a Kalman filter function better?
Can someone explain intuitively, without significant equations and math, please?
Well, it's difficult to explain mathematical things (like Kalman filters) without mathematics, but here's my attempt:
There are two parts to a Kalman filter, a time update part and a measurement part. In the time update part we estimate the state at the time of observation; in the measurement part we combine (via least squares) our 'predictions' (i.e. the estimate from the time update) with the measurements to get a new estimate of the state.
So far, no mention of noise. There are two sources of noise: one in the time update part (sometimes called process noise) and one in the measurement part (observation noise). In each case what we need is a measure of the 'size' of that noise, i.e. the covariance matrix. These are used when we combine the predictions with the measurements. When we view our predictions as very uncertain (that is, they have a large covariance matrix) the combination will be closer to the measurements than to the predictions; on the other hand, when we view our predictions as very good (small covariance) the combination will be closer to the predictions than to the measurements.
So you could look upon the process and observation noise covariances as saying how much to trust (parts of) the predictions and observations. Increasing, say, the variance of a particular component of the predictions is to say: trust this prediction less; while increasing the variance of a particular measurement is to say: trust this measurement less. This is mostly an analogy, but it can be made more precise. A simple case is when the covariance matrices are diagonal. In that case the cost, i.e. the contribution to what we are trying to minimise, of a difference between a measurement and the computed value is the square of that difference, divided by the observation's variance. So the higher an observation's variance, the lower the cost.
Note that out of the measurement part we also get a new state covariance matrix; this is used (along with the process noise and the dynamics) in the next time update when we compute the predicted state covariance.
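The "how much to trust" reading can be checked numerically on a scalar filter. The sketch below is plain Python with assumed names and values: a single state with dynamics coefficient `a`, process noise variance `q`, measurement noise variance `r`, and a measurement that observes the state directly. It iterates the covariance recursion until the gain settles and then applies one measurement update.

```python
def steady_state_gain(a, q, r, n_iter=500):
    """Iterate the scalar covariance recursion and return the settled gain."""
    p = 1.0      # initial state variance (assumed)
    gain = 0.0
    for _ in range(n_iter):
        p_pred = a * a * p + q            # time update: predicted variance
        gain = p_pred / (p_pred + r)      # weight given to the measurement
        p = (1.0 - gain) * p_pred         # measurement update: new variance
    return gain

def measurement_update(prediction, measurement, gain):
    """Blend prediction and measurement according to the gain."""
    return prediction + gain * (measurement - prediction)

# Large process noise, small measurement noise: trust the measurement.
gain_trust_measurement = steady_state_gain(a=0.9, q=10.0, r=0.1)
# Small process noise, large measurement noise: trust the prediction.
gain_trust_prediction = steady_state_gain(a=0.9, q=0.01, r=10.0)

est_a = measurement_update(0.0, 1.0, gain_trust_measurement)
est_b = measurement_update(0.0, 1.0, gain_trust_prediction)
```

With a prediction of 0 and a measurement of 1, `est_a` lands close to the measurement and `est_b` stays close to the prediction, which is the behaviour described in words above.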
I think the question of why the covariance is the appropriate measure of the size of the noise is rather a deep one, as is why least squares is the appropriate way to combine the predictions and the measurements. The shallow answer is that Kalman filtering and least squares have been found, over decades (centuries in the case of least squares), to work well in many application areas. In the case of Kalman filtering, I find its derivation from hidden Markov models convincing (From Hidden Markov Models to Linear Dynamical Systems by T. Minka, though this is rather mathematical). In hidden Markov models we seek the (conditional) probability of the states given the measurements so far; Minka shows that if the measurements are linear functions of the states, the dynamics are linear, and all probability distributions are Gaussian, then we get the Kalman filter.

How to smooth rectangular signal with high order rate-limiter in Simulink?

Imagine I have a rectangular reference value for the position/displacement x and I need to smooth it.
The math for translational movement is quite simple:
speed: v = x'
acceleration: a = v' = x''
jerk: j = a' = v'' = x'''
I need to limit all these values. So I thought about using rate limiters in Simulink:
This approach works perfectly for ramp signals, as you can see in the following output:
BUT my reference signals for x are not ramps, they are rectangles/steps. Hence the rate limiters do not work, because the derivatives they are supposed to limit are already infinite and Simulink throws an error. How can I resolve this problem? Is there a more elegant way to implement high-order rate limiters? I guess this approach could be unstable in some cases.
continue reading: related question
Even though it seems absurd, the following approach works: integrating and then immediately differentiating does the trick:
leading to:
More elegant, faster and simpler solutions for the whole smoothing problem are highly appreciated!
It's generally not a good idea to differentiate signals in Simulink because of numerical issues. I would advise starting with the higher-order derivatives (e.g. acceleration) and integrating, which is much more robust numerically. This is what the doc about the Derivative block says:
The Derivative block output might be very sensitive to the dynamics of
the entire model. The accuracy of the output signal depends on the
size of the time steps taken in the simulation. Smaller steps allow a
smoother and more accurate output curve from this block. However,
unlike with blocks that have continuous states, the solver does not
take smaller steps when the input to this block changes rapidly.
Depending on the dynamics of the driving signal and model, the output
signal of this block might contain unexpected fluctuations. These
fluctuations are primarily due to the driving signal output and solver
step size.
Because of these sensitivities, structure your models to use
integrators (such as Integrator blocks) instead of Derivative blocks.
Integrator blocks have states that allow solvers to adjust step size
and improve accuracy of the simulation. See Circuit Model for an
example of choosing the best-form mathematical model to avoid using
Derivative blocks in your models.
See also Best-Form Mathematical Models for more details.
I was trying to do something similar. I was looking for a "Smooth Ramp". Here is what I found:
A simpler approach is to combine a ramp with a second-order lag. The signal then approaches an S-shape, and its derivatives will exist and be smooth as well. The only thing to remember is that the second-order lag must be critically damped.
Y(s) = H(s)*X(s), where H(s) = K*wo^2/(s^2 + 2*zeta*wo*s + wo^2). Here you set zeta = 1.0; the S-shape is then retained for any K and wo. Note that X(s) is the reference after it has already been passed through the ramp. In MATLAB or any other tool, the linear ramp and the second-order lag are standard blocks.
Good luck!
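As a rough check of the critically damped second-order lag, the sketch below integrates H(s) = K*wo^2/(s^2 + 2*zeta*wo*s + wo^2) with forward Euler in plain Python. It is not a Simulink model: the function name, step size, wo = 5 and the end time are assumptions, and for brevity the input is a unit step applied directly rather than a ramp-limited reference.

```python
def second_order_lag_step(gain=1.0, wo=5.0, zeta=1.0, dt=1e-3, t_end=5.0):
    """Forward-Euler step response of K*wo^2 / (s^2 + 2*zeta*wo*s + wo^2)."""
    n_steps = int(round(t_end / dt))
    y, dy = 0.0, 0.0      # output and its first derivative, both start at rest
    u = 1.0               # unit step input
    outputs = [y]
    for _ in range(n_steps):
        # y'' = wo^2 * (K*u - y) - 2*zeta*wo*y'
        ddy = wo * wo * (gain * u - y) - 2.0 * zeta * wo * dy
        y, dy = y + dt * dy, dy + dt * ddy
        outputs.append(y)
    return outputs

response = second_order_lag_step()
```

With zeta = 1.0 the output rises to the final value K without overshoot, and its first derivative starts at zero, so even a hard step at the input gives a finite velocity and acceleration at the output. Jerk is still discontinuous at the step instant, so limiting jerk as well would need an additional lag or a ramped input as the answer suggests.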
I think the 'Transfer Fcn' block is what you're looking for.
If you leave the equation in the default form 1/(s+1) you have a low-pass filter which can be tuned to what you need by changing the numerator and denominator coefficients.

One Class SVM using LibSVM in Matlab - Conceptual

Perhaps this is an easy question, but I want to make sure I understand the conceptual basis of the LibSVM implementation of one-class SVMs and if what I am doing is permissible.
I am using one class SVMs in this case for outlier detection and removal. This is used in the context of a greater time series prediction model as a data preprocessing step. That said, I have a Y vector (which is the quantity we are trying to predict and is continuous, not class labels) and an X matrix (continuous features used to predict). Since I want to detect outliers in the data early in the preprocessing step, I have yet to normalize or lag the X matrix for use in prediction, or for that matter detrend/remove noise/or otherwise process the Y vector (which is already scaled to within [-1,1]). My main question is whether it is correct to model the one class SVM like so (using libSVM):
svmod = svmtrain(ones(size(Y,1),1),Y,'-s 2 -t 2 -g 0.00001 -n 0.01');
[od,~,~] = svmpredict(ones(size(Y,1),1),Y,svmod);
The resulting model does yield performance somewhat in line with what I would expect (99% or so prediction accuracy, meaning 1% of the observations are outliers). I ask because in other questions regarding one-class SVMs, people appear to be using their X matrices where I use Y. Thanks for your help.
What you are doing here is nothing more than a fancy range check. If you are not willing to use X to find outliers in Y (even though you really should), it would be a lot simpler and better to just check the distribution of Y to find outliers instead of this improvised SVM solution (for example remove the upper and lower 0.5-percentiles from Y).
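The percentile-based range check mentioned above could look roughly like this. It is a plain-Python sketch rather than MATLAB; the function name, the nearest-rank percentile definition and the synthetic data are assumptions for illustration.

```python
import math
import random

def percentile_mask(values, lower_pct=0.5, upper_pct=99.5):
    """Return (keep_mask, lo, hi), keeping values within the given percentiles."""
    ordered = sorted(values)
    n = len(ordered)

    def nearest_rank(pct):
        # Nearest-rank percentile: smallest value with at least pct% at or below it.
        rank = max(1, math.ceil(pct * n / 100.0))
        return ordered[rank - 1]

    lo, hi = nearest_rank(lower_pct), nearest_rank(upper_pct)
    keep = [lo <= v <= hi for v in values]
    return keep, lo, hi

rng = random.Random(0)
Y = [rng.gauss(0.0, 0.3) for _ in range(1000)]  # synthetic stand-in for Y
keep, lo, hi = percentile_mask(Y)
Y_clean = [v for v, k in zip(Y, keep) if k]
```

This flags roughly the most extreme 1% of Y, which is essentially what the one-class SVM on Y alone does, but without a kernel, a gamma value, or a trained model.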
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting Y values as outliers without considering any context (e.g. X). Why are you using RBF and how did you come up with that specific value for gamma? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (Y). A kitten dies every time this happens. One-class SVM attempts to build a model which recognizes the training data, it should not be used on the same data it was built with. Please, think of the kittens.
Additionally, note that the nu parameter of one-class SVM controls the amount of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4): "It is proved that nu is an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors." In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace "can" by "should".
So when you say that the resulting model does yield performance somewhat in line with what you would expect: of course it does, by definition. Since you have set nu=0.01, 1% of the data is rejected by the model and thus flagged as an outlier.

Can I use a Neural Network to obtain an estimate of the output series, only knowing the input?

Let's say I have a model
h(t) = F[h(t-1),h(t-2), ... , u(t-1), u(t-2), ...]
where F[] is a non-linear function of the variables included in the function.
So for example, h(t) could be:
h(t) = h(t-1) + u(t-1) + h(t-1)*u(t-1) + h(t-1)*h(t-2)
Now, for the sake of my problem, I only have the data series u(t) available to me. I don't have a series for h(t) nor do I know the model.
Is it possible for me to use the Neural Network Toolbox to generate a good non-linear estimate of h(t) by just providing u(t)? If so, what neural network do I use?
For me this is like teaching children multiplications without ever giving any hint what could be the right solution. You should at least be able to provide some kind of fitness function that estimates how good your ANN performs. Then you could use an evolutionary algorithm (e. g. CMA-ES) to optimize your ANN.
I'm assuming (h(t-1), h(t-2), ...) is a time series. I'll call (h(t-1), h(t-2), ...) time-series h and (u(t-1), u(t-2), ...) time-series u. So you are fitting an ANN model with knowledge of a current value for h called h(t) and a previous historical time series for u (time-series u).
If you could find a function for h(t) without knowing the previous h time-series then you would not have a function of h(t-1), h(t-2), etc. Mathematically this would mean that you do not have a dependence on the historical values for h.
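The dependence on past values of h can be seen by simulating the example model from the question with the same input series but two different starting histories. This is a small plain-Python sketch; the function name, the input values and the initial values are made up for illustration.

```python
def simulate_h(u, h_prev1, h_prev2):
    """Iterate h(t) = h(t-1) + u(t-1) + h(t-1)*u(t-1) + h(t-1)*h(t-2)."""
    out = []
    for u_prev in u:
        h_new = h_prev1 + u_prev + h_prev1 * u_prev + h_prev1 * h_prev2
        out.append(h_new)
        # Shift the history: the new value becomes h(t-1), the old h(t-1) becomes h(t-2).
        h_prev1, h_prev2 = h_new, h_prev1
    return out

u = [0.1, -0.05, 0.2, 0.0, -0.1]       # the only series assumed to be available
h_a = simulate_h(u, 0.0, 0.0)          # one possible unknown history
h_b = simulate_h(u, 0.5, 0.5)          # another possible unknown history
```

The same u produces two different h series, so a model that only sees u has no way of telling which one is the right target.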
It is possible that for certain domains your model could accurately predict h(t) given values of time-series u only but I would not trust such a model given that:
you say that h(t) has a non-linear dependence on previous values for h(t) and
you mention time-series h in the first place
This leads me to believe that you will be using the model in domains where time-series h is important, and because the model is non-linear the error can increase dramatically once you get outside your fitted region. Even worse, without knowledge of the h time-series you won't even know where the "good fit" region is.
If you had the model, there might be some tricky way to get the h time-series given h(t) and the u time-series but I don't think that is what you are asking.