Identify parameters of ARIMA model - matlab

I am trying to build an ARIMA model. I have 144 terms in my standardized time series, which represent the residuals from the original time series. These residuals, on which I would like to build the ARIMA model, are what remains after I subtracted a linear trend and a periodic component from the original series, so they are the stochastic component.
Because of that subtraction I model the residuals as a stationary series (d = 0), so the model is ARIMA(p,d,q) = ARIMA(?,0,?).
The ACF and PACF of my residuals are not as clear-cut as the textbook cases for identifying ARIMA models, and when I choose p and q according to the rule that they are the last lags outside the confidence interval, I get p = 109, q = 97. MATLAB gave me an error for this case:
Error using arima/estimate (line 386)
Input response series has an insufficient number of observations.
On the other hand, when I look at only N/4 of the time series length for identifying the p and q parameters, I get p = 36, q = 34. MATLAB gave me an error for this case:
Warning: Nonlinear inequality constraints are active; standard errors may be inaccurate.
In arima.estimate at 1113
Error using arima/validateModel (line 1306)
The non-seasonal autoregressive polynomial is unstable.
Error in arima/setLagOp (line 391)
Mdl = validateModel(Mdl);
Error in arima/estimate (line 1181)
Mdl = setLagOp(Mdl, 'AR' , LagOp([1 -coefficients(iAR)' ], 'Lags', [0 LagsAR ]));
How do I correctly identify the p and q parameters, and what is wrong here? And what does it mean in this partial autocorrelation diagram, why are the last values so big?

This guide contains a lot of useful information about the correct estimation of the ARIMA p and q parameters.
As far as I can remember from my studies, since the ACF tails off after lag q - p and the PACF tails off after lag p - q, the correct identification of the p and q orders is not always straightforward, and even the best practices provided by the above guide may not be enough to point you in the right direction.
Usually, a failproof approach is to apply an information criterion (like AIC, BIC or FPE) to several models with different orders of p and q. The model that presents the smallest value of the criterion is the best one. Let's say your maximum desired q and p order is 6 and that k is the number of observations; you could proceed as follows:
ll = zeros(6);                       % log-likelihood for each (p,q) pair
pq = zeros(6);                       % total number of AR + MA coefficients
for p = 1:6
    for q = 1:6
        mdl = arima(p, 0, q);        % 'mdl' avoids shadowing the built-in mod()
        [~, ~, logL] = estimate(mdl, Y, 'print', false);
        ll(p,q) = logL;
        pq(p,q) = p + q;
    end
end
ll = reshape(ll, 36, 1);
pq = reshape(pq, 36, 1);
[~, bic] = aicbic(ll, pq + 1, k);    % +1 accounts for the constant term
bic = reshape(bic, 6, 6);
Once this is done, use the indices returned by the min function to find the optimal p and q orders.
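For example, a minimal sketch continuing the code above (ind2sub converts the linear index of the smallest BIC back into the (p, q) pair):
[~, idx] = min(bic(:));              % smallest BIC over all 36 candidate models
[pBest, qBest] = ind2sub(size(bic), idx)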
On a side note, for what concerns your errors... well, the first one is self-explanatory: with only 144 observations you cannot estimate a model with p = 109 and q = 97, as that requires more coefficients than you have data points. The second one basically means that the fitted non-seasonal AR polynomial violates the stationarity constraint, so a valid model cannot be estimated with those orders.

Related

How does covariance matrix (P) in Kalman filter get updated in relation to measurements and state estimate?

I am in the midst of implementing a Kalman filter based AHRS in C++. There's something rather strange to me in the equations of the filter.
I can't find the part where the P (covariance) matrix is actually updated to represent uncertainty of predictions.
During the "predict" step P estimate is calculated from its previous value, A and Q. From what I understand A (system matrix) and Q (covariance of noise) are constant. Then during "Correct" P is calculated from K, H and predicted P. H (observation matrix) is constant, so the only variable that affects P is K (Kalman gain). But K is calculated from predicted P, H and R (observation noise) that are either constants or the P itself. So where is the part of the equations that makes P relate to x? To me it seems like P is recursively looping here depending only on the constants and initial value of P. This doesn't make any sense. What am I missing?
You are not missing anything.
It can come as a surprise to realise that, indeed, the state error covariance matrix (P) in a linear Kalman filter does not depend on the data (z). One way to lessen the surprise is to note what the covariance is saying: it is how uncertain you should be about the estimated state, given that the models you are using (effectively A, Q and H, R) are accurate. It is not saying: this is the uncertainty. By judicious tweaking of Q and R you could change P arbitrarily. In particular, you should not interpret P as a 'quality' figure, but rather look at the observation residuals. You could, for example, make P smaller by reducing R, but then the residuals would be larger compared with their computed standard deviations.
When the observations come in at a constant rate, and are always the same set of observations, P will tend to a steady state that could, in principle, be computed ahead of time.
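As a quick illustration, here is a minimal MATLAB sketch (a hypothetical scalar system with made-up constants A, H, Q, R) showing that the P recursion never touches a measurement and settles to a steady state:
A = 1; H = 1;                            % constant system and observation models
Q = 0.01; R = 0.5;                       % constant noise covariances
P = 1;                                   % initial state error covariance
for k = 1:50
    Ppred = A*P*A' + Q;                  % predict: uses only A and Q
    K = Ppred*H' / (H*Ppred*H' + R);     % gain: uses only P, H and R
    P = (1 - K*H) * Ppred;               % correct: no measurement z appears
end
P                                        % steady-state value, independent of the data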
However, there is no difficulty in applying the Kalman filter when you have varying times between observations and varying sets of observations at each time, for example if you have various sensor systems with different sampling periods. In this case you will see more variation in P, though again in principle this could be computed ahead of time.
Further, the Kalman filter can be extended (in various ways, e.g. the extended Kalman filter and the unscented Kalman filter) to handle nonlinear dynamics and nonlinear observations. In that case, because the transition matrix (A) and the observation model matrix (H) have a state dependency, so too will P.

How do I align two signals in MATLAB [duplicate]

I want to get the offset in samples between two datasets in MATLAB (getting them synced in time), which is a quite common problem. For this I use the cross-correlation function xcorr or the cross-covariance function xcov (both provide similar results in most cases for this purpose). With artificial data it works fine, but I struggle with "real" data, even though it should be pretty much the same. MATLAB always says the offset is zero. I'm using this simple piece of code:
[crossCorr] = xcov(b, c);
[~, peakIndex] = max(crossCorr)
offset = peakIndex - length(b)
I've posted a fully runnable example m-file with a downsampled data excerpt on pastebin:
Code with data on pastebin
EDIT: The downsampled excerpt seems not to be fully suitable for evaluating the effect. Here's a much larger sample with the original frequency; please use this one instead. Unfortunately it was too big for pastebin.
As the plot shows, it should be no problem at all to get the offset via cross-covariance. I also tried scaling the data more nicely in order to avoid numerical problems, but that didn't change anything at all.
Would be great if someone could tell me my mistake.
There's nothing wrong with your method in principle, I used exactly the same approach successfully for temporally aligning different audio recordings of the same signal.
However, it appears that for your time series, correlation (or covariance) is simply not the right measure to compare shifted versions – possibly because they contain components of a time scale comparable to the total length. An alternative is to use residual variance, i.e. the variance of the difference between shifted versions. Here is a (not particularly elegant) implementation of this idea:
lags = -1000 : 1000;                         % candidate lags to test
v = nan(size(lags));                         % residual variance for each lag
for i = 1 : numel(lags)
    lag = lags(i);
    if lag >= 0
        v(i) = var(b(1 + lag : end) - c(1 : end - lag));
    else
        v(i) = var(b(1 : end + lag) - c(1 - lag : end));
    end
end
[~, ind] = min(v);                           % lag with the smallest residual variance
minlag = lags(ind);
For your (longer) data set, this results in minlag = 169. Plotting the residual variance over the lags shows a clear minimum at that lag.
Your data has a minor peak around 5 and a major peak around 101.
If I knew something about my data, then I might window around an acceptable range of offsets, as shown below.
Code for initial exploration:
figure; clc;
subplot(2,1,1)
plot(1:numel(b), b);
hold on
plot(1:numel(c), c, 'r');
legend('b','c')
subplot(2,1,2)
plot(crossCorr,'.b-')
hold on
plot(peakIndex,crossCorr(peakIndex),'or')
legend('crossCorr','peak')
If you zoom into the first peak you can see that it is not only high around 5, but also smooth enough to be fitted well by a low-order polynomial, which allows sub-sample offsets. That is convenient.
Here is what the curve-fitting tool gives as the analytic form for a cubic fitted near that peak:
Linear model Poly3:
f(x) = p1*x^3 + p2*x^2 + p3*x + p4
Coefficients (with 95% confidence bounds):
p1 = 8.515e-013 (8.214e-013, 8.816e-013)
p2 = -3.319e-011 (-3.369e-011, -3.269e-011)
p3 = 2.253e-010 (2.229e-010, 2.277e-010)
p4 = -4.226e-012 (-7.47e-012, -9.82e-013)
Goodness of fit:
SSE: 2.799e-024
R-square: 1
Adjusted R-square: 1
RMSE: 6.831e-013
You can note that the SSE is down at roundoff level, so the cubic fit is essentially exact.
To compute the root of the derivative (near x = 4), you can use the following MATLAB code:
% Coefficients of the fitted cubic
p1 = 8.515e-013;
p2 = -3.319e-011;
p3 = 2.253e-010;
p4 = -4.226e-012;
% Linear model Poly3:
syms x
f = p1*x^3 + p2*x^2 + p3*x + p4;
% The peak sits where the derivative is zero
xz1 = fzero(@(y) double(subs(diff(f), x, y)), 4)
and you get the analytic root at 4.01420240431444.
EDIT:
Hmmm. How about fitting a Gaussian mixture model to the cross-correlation? You sweep through a good range of component counts, you do between 10 and 30 repeats each, and you find which component count has the best (lowest) BIC. So you fit a gmdistribution to the lower subplot of the first figure, then test the covariance at the means of the components in decreasing order.
I would try the offset at each of the means, and just look at the sum of squared errors. I would then pick the offset that has the lowest error.
Procedure (a rough code sketch follows the list):
compute cross correlation
fit cross correlation to Gaussian Mixture model
sweep a reasonable range of components (start with 1-10)
use a reasonable number of repeats (10 to 30 depending on run-to-run variation)
compute the Bayes Information Criterion (BIC) for each component count, and pick the lowest, because it indicates a reasonable balance between error and parameter count
each component has a mean; evaluate that mean as a candidate offset and compute the sum of squared errors (SSE) when you shift by that offset
pick the offset of the component that gives the best (lowest) SSE
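Here is a rough, untested sketch of that procedure (it assumes the Statistics Toolbox fitter fitgmdist; turning the correlation curve into pseudo-samples by weighted resampling of the lags is one possible way, not the only one, to feed a curve to a mixture fit):
cc = max(crossCorr, 0);                       % keep the positive lobe only
lagAxis = (1 : numel(cc)) - length(b);        % lag value of each element
pseudo = randsample(lagAxis, 5000, true, cc / sum(cc));  % weighted resampling
pseudo = pseudo(:);                           % fitgmdist wants a column
bestBIC = inf;
for nComp = 1:10                              % sweep the component count
    gm = fitgmdist(pseudo, nComp, 'Replicates', 10);
    if gm.BIC < bestBIC
        bestBIC = gm.BIC;
        bestGM = gm;
    end
end
candidates = sort(bestGM.mu)                  % component means = candidate offsets
Each candidate would then be rounded and scored by the SSE of the shifted difference, as in the last two steps above.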
Let me know how well that works.
If the two signals are misaligned by a non-integer number of samples, e.g. 3.7 samples, then the xcorr method may find the maximum at 4 samples; it won't be able to find the accurate time shift. In this case, you could try a method called "unified change detection". The paper is available here:
http://www.phmsociety.org/node/1404/
Good Luck.

How do I determine the coefficients for a linear regression line in MATLAB? [closed]

I'm going to write a program where the input is a data set of 2D points and the output is the regression coefficients of the line of best fit, obtained by minimizing the mean squared error (MSE).
I have some sample points that I would like to process:
X Y
1.00 1.00
2.00 2.00
3.00 1.30
4.00 3.75
5.00 2.25
How would I do this in MATLAB?
Specifically, I need to get the following formula:
y = A + Bx + e
A is the intercept and B is the slope while e is the residual error per point.
Judging from the link you provided and my understanding of your problem, you want to calculate the line of best fit for a set of data points, and you want to do this from first principles. This requires some basic calculus as well as some linear algebra for solving a 2 x 2 system of equations. If you recall from linear regression theory, for a set of n data points ([x_1,y_1], [x_2,y_2], ..., [x_n,y_n]) we wish to find the slope m and intercept b that minimize the sum of squared residuals between the line and the data points.
In other words, we wish to minimize the cost function:

F(m, b) = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2

Here m and b are our slope and intercept for this best fit line, while the x_i and y_i are the co-ordinates that form our data set.
This function is convex, so there is an optimal minimum that we can determine. The minimum can be determined by finding the derivative with respect to each parameter and setting these equal to 0. We then solve for m and b. The intuition behind this is that we are simultaneously finding m and b such that the cost function is jointly minimized by these two parameters. In other words:

\frac{\partial F}{\partial m} = 0, \qquad \frac{\partial F}{\partial b} = 0
OK, so let's find the first quantity, \partial F / \partial m:

\frac{\partial F}{\partial m} = \sum_{i=1}^{n} -2 x_i \left( y_i - (m x_i + b) \right) = 0

We can drop the factor 2 from the derivative as the other side of the equation is equal to 0, and we can also do some distribution of terms by multiplying the -x_i term throughout:

-\sum_{i=1}^{n} x_i y_i + m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = 0
Next, let's tackle the next parameter, \partial F / \partial b:

\frac{\partial F}{\partial b} = \sum_{i=1}^{n} -2 \left( y_i - (m x_i + b) \right) = 0

We can again drop the factor of 2 and distribute the -1 throughout the expression:

-\sum_{i=1}^{n} y_i + m \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} b = 0

Knowing that \sum_{i=1}^{n} b is simply nb, we can simplify the above to:

-\sum_{i=1}^{n} y_i + m \sum_{i=1}^{n} x_i + nb = 0
Now, we need to simultaneously solve for m and b with the above two equations. This will jointly minimize the cost function which finds the best line of fit for our data points.
Doing some re-arranging, we can isolate the m and b terms on one side of the equations and the rest on the other:

m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i

m \sum_{i=1}^{n} x_i + nb = \sum_{i=1}^{n} y_i

As you can see, this is a 2 x 2 system of equations in m and b. Specifically, re-arranging the two equations above into matrix form:

\begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix} \begin{bmatrix} m \\ b \end{bmatrix} = \begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix}
With regards to the above, we can solve the problem as a linear system Au = v, where u = [m; b] holds the unknowns. All you have to do is solve for u, which is u = A^{-1} v. For a 2 x 2 matrix

A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}

the inverse is simply

A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}

Therefore, by substituting our quantities into the above equation, carrying out the multiplication and solving for m and b individually, this gives:

m = \frac{\sum x_i \sum y_i - n \sum x_i y_i}{\left(\sum x_i\right)^2 - n \sum x_i^2}, \qquad b = \frac{\sum x_i y_i \sum x_i - \sum y_i \sum x_i^2}{\left(\sum x_i\right)^2 - n \sum x_i^2}
As such, to find the best slope and intercept to best fit your data, you need to calculate m and b using the above equations.
Given your data specified in the link in your comments, we can do this quite easily:
%// Define points
X = 1:5;
Y = [1 2 1.3 3.75 2.25];
%// Get total number of points
n = numel(X);
%// Define the relevant summed quantities
sumxi = sum(X);
sumyi = sum(Y);
sumxiyi = sum(X.*Y);
sumxi2 = sum(X.^2);
%// Determine slope and intercept from the closed-form equations above
m = (sumxi * sumyi - n*sumxiyi) / (sumxi^2 - n*sumxi2);
b = (sumxiyi * sumxi - sumyi * sumxi2) / (sumxi^2 - n*sumxi2);
%// Display them
disp([m b])
... and we get:
0.4250 0.7850
Therefore, the line of best fit that minimizes the error is:
y = 0.4250*x + 0.7850
However, if you want to use built-in MATLAB tools, you can use polyfit (credit goes to Luis Mendo for providing the hint). polyfit determines the line (or rather, the nth order polynomial curve) of best fit by linear regression, minimizing the sum of squared errors between the best fit line and your data points. You call the function like so:
coeff = polyfit(x,y,order);
x and y are the x and y points of your data while order determines the order of the line of best fit you want. As an example, order=1 means that the line is linear, order=2 means that the line is quadratic and so on. Essentially, polyfit fits a polynomial of order order given your data points. Given your problem, order=1. As such, given the data in the link, you would simply do:
X = 1:5;
Y = [1 2 1.3 3.75 2.25];
coeff = polyfit(X,Y,1)
coeff =
0.4250 0.7850
The way coeff works is that these are the coefficients of the regression line, starting from the highest order in decreasing value. As such, the above coeff variable means that the regression line was fitted as:
y = 0.4250*x + 0.7850
The first coefficient is the slope while the second coefficient is the intercept. You'll also see that this matches up with the link you provided.
If you want a visual representation, here's a plot of the data points as well as the regression line that best fits these points:
plot(X, Y, 'r.', X, polyval(coeff, X));
polyval takes an array of coefficients (usually produced by polyfit), and you provide a set of x co-ordinates and it calculates what the y values are given the values of x. Essentially, you are evaluating what the points are along the best fit line.
Edit - Extending to higher orders
If you want to extend this so that you're finding the best fit for any mth order polynomial, I won't go into the details, but it boils down to constructing the following linear system. Given the relationship for the ith point (x_i, y_i):

y_i = c_0 + c_1 x_i + c_2 x_i^2 + \dots + c_m x_i^m + e_i
You would construct the following linear system:

\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_m \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}

Basically, you create a vector of points y, and you construct a matrix X such that each column is your vector of x values raised to a power: the first column is the zeroth power, the second column is the first power, the third column is the second power, and so on up to the mth power. The vector e is the residual error for each point in your set.
Specifically, the formulation of the problem can be written in matrix form as:

y = X c + e

Once you construct this matrix, you find the parameters by least squares by calculating the pseudo-inverse. How the pseudo-inverse is derived you can read up on in the Wikipedia article I linked to, but it is the basis for minimizing a system by least squares. Specifically:

c = \left( X^{T} X \right)^{-1} X^{T} y

(X^{T} X)^{-1} X^{T} is the pseudo-inverse. X itself is a very popular matrix, known as the Vandermonde matrix, and MATLAB has a command called vander to help you compute it. A small note is that vander in MATLAB returns the powers in reverse order, decreasing from n-1 down to 0. If you want to have this reversed, you'd need to call fliplr on that output matrix. Also, you will need to append one more column at the end of it, which is the vector with all of its elements raised to the mth power.
I won't go into how you'd repeat your example for anything higher order than linear; I'm going to leave that to you as a learning exercise. Simply construct the vector y and the matrix X with vander, then find the parameters by applying the pseudo-inverse of X as above to solve for your parameters. A small sketch of this follows.
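For instance, here is a minimal sketch of the quadratic case (m = 2) using vander and the normal-equations form of the pseudo-inverse, on the same sample data as above:
X = 1:5;
Y = [1 2 1.3 3.75 2.25];
m = 2;                              %// polynomial order
V = fliplr(vander(X(:)));           %// columns are x.^0, x.^1, ..., x.^4
V = V(:, 1:m+1);                    %// keep powers 0 through m
c = (V'*V) \ (V'*Y(:))              %// solves for [c0; c1; c2] by least squares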
Good luck!

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower quarter of the 'S' shape that is typical of logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1);, where Y+1 simply turns my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) (or some appropriate input range) and ii = 1:length(loopVal).
The stats parameter has a great correlation coefficient (0.9973), but the p-values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work where glmfit didn't in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p << 0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB document states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is this only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, that means since your input data is one dimensional, that there is some value of x such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional this means you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:

y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}

This will only return values of y = 0 or y = 1 when the expression inside the exponential in the denominator is at negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the label on just one of your data points (so for some index t where y(t) == 0, set y(t) = 1). This will cause your data to no longer be linearly separable, and the learned weight values will be dragged dramatically closer to zero.
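To see the effect, here is a small made-up example (hypothetical data; expect iteration-limit warnings from glmfit on the separable version):
X = [0.1 0.2 0.3 0.7 0.8 0.9]';
Y = [0 0 0 1 1 1]';                 % perfectly separable at x = 0.5
b1 = glmfit(X, Y, 'binomial', 'link', 'logit')   % extreme coefficients
Y(1) = 1;                           % flip one label: no longer separable
b2 = glmfit(X, Y, 'binomial', 'link', 'logit')   % far more moderate weights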

MATLAB: Approximate tomorrow's temperature with 2nd, 3rd and 4th polynomial using the Least Squares method

The following is Exercise 3 of a Numerical Analysis task I have to do as part of my university course on the subject.
Find an approximation of tomorrow's temperature based on the last 23
values of hourly temperature of your city ( Meteorological history for
Thessaloniki {The city of my univ} can be found here:
http://freemeteo.com)
You will approximate the temperature function with a polynomial of
2nd, 3rd and 4th degree, using the Least Squares method. Following
that, you will find the value of the function at the point that
interests you. Compare your approximations qualitatively and make a
note to the time and date you're doing the approximation on.
Maybe it's fatigue from doing the first two tasks without a break, or my lack of experience with numerical analysis, but I am completely stumped. I do not even know where to start.
I know it's disgusting to ask for a solution without even showing signs of effort, but I would appreciate anything. Leads, tutorials, outlines of the things I need to work on, one after the other, anything.
I'd be very much obliged to you.
NOTE: I am not able to use any MATLAB in-built approximation functions.
In general, if y is your data vector belonging to the times t, and c is the coefficient vector you're interested in, then you need to solve the linear system
Ac = y
in a least-squares sense, where
A = bsxfun(@power, t(:), 0:n)
In MATLAB you can do this with mldivide:
c = A\y(:)
Example:
>> t = 0 : 0.1 : 20; %// Define some times
>> y = pi + 0.8*t - 3.2*t.^2; %// Create some synthetic data
>> y = y + randn(size(y)); %// Add some noise for good measure
>>
>> n = 2; %// The order of the polynomial for the fit
>> A = bsxfun(@power, t(:), 0:n); %// Design matrix
>> c = A\y(:) %// Solve for the coefficient matrix
c =
3.142410118189416e+000
7.978077631488009e-001 %// Which works pretty well
-3.199865079047185e+000
But since you are not allowed to use any built-in approximation functions, you can use this simple solution only to check your own results. You'll have to write an implementation of the equations given on (for example) Wolfram's MathWorld.
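For instance, here is a minimal from-scratch sketch under those constraints (placeholder data; substitute your 23 hourly temperatures), building the normal equations A^T A c = A^T y yourself:
t = (1:23)';                         %// the last 23 hours
y = 15 + randn(23, 1);               %// placeholder; use the real temperatures here
n = 2;                               %// polynomial order (repeat for 3 and 4)
A = zeros(numel(t), n + 1);
for j = 0:n
    A(:, j + 1) = t.^j;              %// design matrix column for power j
end
c = (A' * A) \ (A' * y);             %// normal equations; replace \ with your own solver
prediction = sum(c(:)' .* 24.^(0:n)) %// evaluate the polynomial at t = 24 (tomorrow)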