Inconsistency when estimating AR model coefficients in MATLAB

I'm trying to estimate the coefficients of an AR[2] model
x(t) = a_1*x(t-1) + a_2*x(t-2) + e(t), e(t) ~ N(0, sigma^2)
in MATLAB. For a_1 = 2*cos(2*pi/T)*exp(-1/tau), a_2 = -exp(-2/tau), the AR[2] model corresponds to a linear damped oscillator with period T and relaxation time tau. I simulated some data for this process with T = 30 and tau = 100, which corresponds to a_1 = 1.9368, a_2 = -0.9802:
T = 30; tau = 100;
a_1 = 2*cos(2*pi/T)*exp(-1/tau); a_2 = -exp(-2/tau);
simuMdl = arima(2,0,0);
simuMdl.Constant = 0;
simuMdl.Variance = 1e-1;
simuMdl.AR{1} = a_1;
simuMdl.AR{2} = a_2;
data = simulate(simuMdl, 600);
data = data(501:end);
plot(data)
I only take the last 100 time points to make sure the system is no longer influenced by the initial conditions. Now, when trying to estimate the parameters, everything works just fine when using the estimate command, which uses maximum likelihood estimation:
ToEstMdl = arima(2,0,0); ToEstMdl.Constant = 0;
EstMdl = estimate(ToEstMdl, data);
EstMdl.AR
%'[1.9319] [-0.9745]'
However, when I use the Yule-Walker equations implemented in aryule, I get a completely different result that does not match the true parameter values at all:
aryule(data, 2)
%'1.0000 -1.4645 0.5255'
Does anyone have an idea why the Yule-Walker equations fall so far short of the MLE approach here?

Yule-Walker (YW) is a method-of-moments based estimator. As such, its estimates get better with an increasing number of data points. You can check this in your example by using all 600 data points to see the 'best' YW estimate you could have obtained from this run; the MLE would still beat it. If you also increase the number of data points to, say, 5000 instead of 600, you will see that the best YW estimate (the one that uses all 5000 points) starts to approach the MLE estimate.
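For a quick check along those lines (a minimal sketch reusing the simuMdl object from the question; the seed and sample sizes are only illustrative), compare aryule on a short and a long realization:
% Sketch: Yule-Walker estimates for a short vs. a long realization of the same model
rng(1);                                                                 % illustrative seed
dataShort = simulate(simuMdl, 600);   dataShort = dataShort(501:end);   % 100 points
dataLong  = simulate(simuMdl, 5500);  dataLong  = dataLong(501:end);    % 5000 points
aryule(dataShort, 2)   % typically far from     [1 -1.9368  0.9802]
aryule(dataLong, 2)    % should approach        [1 -1.9368  0.9802]
Remember that aryule reports the AR polynomial coefficients, so the true values appear as [1, -a_1, -a_2].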

Related

Iterative quantile estimation in Matlab

I'm trying to implement an iterative algorithm to estimate quantiles in data that is generated from a Monte-Carlo simulation. I want to make it iterative because I have many iterations and variables, so storing all data points and using Matlab's quantile function would take up memory that I actually need for the simulation.
I found some approaches based on the Robbins-Monro process, given by an update of the form q(t) = q(t-1) + c(t) * (p - I(x(t) < q(t-1))), where p is the target probability, x(t) is the new sample, and I(.) is the indicator function.
The implementation with a control sequence c(t) = c / t, where c is a constant, is quite straightforward. In the cited paper, they show that c = 2 * sqrt(2 * pi) gives quite good results, at least for the median. But they also propose an adaptive approach based on an estimate of the histogram. Unfortunately, I haven't figured out how to implement this adaptation yet.
I tested the implementation with a constant c on three test samples with 10,000 data points each. The value c = 2 * sqrt(2 * pi) did not work well for me, but c = 100 looks quite good for the test samples. However, this selection is not very robust and failed in the actual Monte-Carlo simulation, giving results far off the mark.
probabilities = [0.1, 0.4, 0.7];
controlFactor = 100;
quantile = zeros(size(probabilities));      % running quantile estimates (one per probability)
indicator = zeros(size(probabilities));
for index = 1:length(data)
    control = controlFactor / index;        % control sequence c(t) = c / t
    indices = (data(index) >= quantile);
    indicator(indices) = probabilities(indices);
    indices = (data(index) < quantile);
    indicator(indices) = probabilities(indices) - 1;
    quantile = quantile + control * indicator;
end
Is there a more robust solution for iterative quantile estimation or does anyone have an implementation for an adaptive approach with small memory consumption?
After trying some of the adaptive iterative approaches that I found in the literature without great success (not sure if I implemented them correctly), I came up with a solution that gives me good results for my test samples and also for the actual Monte-Carlo simulation.
I buffer a subset of simulation results, compute the sample quantiles and average over all subset sample quantiles in the end. This seems to work quite well and without tuning many parameters. The only parameter is the buffer size which is 100 in my case.
The results converge quite fast and increasing sample size does not improve the results dramatically. There seems to be a small but constant bias that presumably is the averaged error of the subset sample quantiles. And that is the downside of my solution. By choosing the buffer size, one fixes the achievable accuracy. Increasing the buffer size reduces this bias. In the end, it seems to be a memory and accuracy tradeoff.
% Generate data
rng('default');
data = sqrt(0.5) * randn(10000, 1) + 5 * rand(10000, 1) + 10;
% Set parameters
probabilities = 0.2;
% Compute reference sample quantiles
quantileEstimation1 = quantile(data, probabilities);
% Estimate quantiles with computing the mean over a number of subset
% sample quantiles
subsetSize = 100;
quantileSum = 0;
for index = 1:length(data) / subsetSize
    subset = data(((index - 1) * subsetSize + 1):(index * subsetSize));
    quantileSum = quantileSum + quantile(subset, probabilities);
end
quantileEstimation2 = quantileSum / (length(data) / subsetSize);
% Estimate quantiles with iterative computation
quantileEstimation3 = zeros(size(probabilities));
indicator = zeros(size(probabilities));
controlFactor = 2 * sqrt(2 * pi);
for index = 1:length(data)
    control = controlFactor / index;
    indices = (data(index) >= quantileEstimation3);
    indicator(indices) = probabilities(indices);
    indices = (data(index) < quantileEstimation3);
    indicator(indices) = probabilities(indices) - 1;
    quantileEstimation3 = quantileEstimation3 + control * indicator;
end
fprintf('Reference result: %f\nSubset result: %f\nIterative result: %f\n\n', quantileEstimation1, quantileEstimation2, quantileEstimation3);

Reverse-calculating original data from a known moving average

I'm trying to estimate the (unknown) original datapoints that went into calculating a (known) moving average. However, I do know some of the original datapoints, and I'm not sure how to use that information.
I am using the method given in the answers here: https://stats.stackexchange.com/questions/67907/extract-data-points-from-moving-average, but in MATLAB (my code below). This method works quite well for large numbers of data points (>1000), but less well with fewer data points, as you'd expect.
window = 3;
datapoints = 150;
data = 3*rand(1,datapoints)+50;
moving_averages = [];
for i = window:size(data,2)
    moving_averages(i) = mean(data(i+1-window:i));
end
length = size(moving_averages,2)+(window-1);
a = (tril(ones(length,length),window-1) - tril(ones(length,length),-1))/window;
a = a(1:length-(window-1),:);
ai = pinv(a);
daily = mtimes(ai,moving_averages');
x = 1:size(data,2);
figure(1)
hold on
plot(x,data,'Color','b');
plot(x(window:end),moving_averages(window:end),'Linewidth',2,'Color','r');
plot(x,daily(window:end),'Color','g');
hold off
axis([0 size(x,2) min(daily(window:end))-1 max(daily(window:end))+1])
legend('original data','moving average','back-calculated')
Now, say I know a smattering of the original data points. I'm having trouble figuring out how I might use that information to calculate the rest more accurately. Thank you for any assistance.
You should be able to calculate the original data exactly if, at any point, you can determine one window's worth of it, i.e. in this case n-1 of the n samples in a window of length n. In your case, if you know A, B and (A+B+C)/3, you can immediately solve for C. With the next moving average (B+C+D)/3 you can then solve exactly for D. Rinse and repeat. This logic works going backwards too.
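Before the full linear-system example below, here is a minimal sketch of that rinse-and-repeat recursion (hypothetical variable names; it assumes the question's moving_averages and window, and that the first window-1 original samples are known):
% Sketch: exact forward recursion when the first (window-1) samples are known
recovered = zeros(size(data));
recovered(1:window-1) = data(1:window-1);          % the samples assumed to be known
for i = window:numel(data)
    % moving_averages(i) = mean(data(i+1-window:i)), so solve for the newest sample
    recovered(i) = window*moving_averages(i) - sum(recovered(i-window+1:i-1));
end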
Here is a complete example along the same lines, setting up the moving-average system and solving it:
% the actual vector of values
a = cumsum(rand(150,1) - 0.5);
% compute moving average
win = 3; % sliding window length
idx = hankel(1:win, win:numel(a));
m = mean(a(idx));
% coefficient matrix: m(i) = sum(a(i:i+win-1))/win
A = repmat([ones(1,win) zeros(1,numel(a)-win)], numel(a)-win+1, 1);
for i=2:size(A,1)
A(i,:) = circshift(A(i-1,:), [0 1]);
end
A = A / win;
% solve linear system
%x = A \ m(:);
x = pinv(A) * m(:);
% plot and compare
subplot(211), plot(1:numel(a),a, 1:numel(m),m)
legend({'original','moving average'})
title(sprintf('length = %d, window = %d',numel(a),win))
subplot(212), plot(1:numel(a),a, 1:numel(a),x)
legend({'original','reconstructed'})
title(sprintf('error = %f',norm(x(:)-a(:))))
You can see the reconstruction error is very small, even using the data sizes in your example (150 samples with a 3-sample moving average).

Simulation of Markov chains

I have the following Markov chain:
This chain describes the states of a spaceship in an asteroid belt: S1 - serviceable, S2 - broken. 0.12 is the probability that a collision with an asteroid destroys the spaceship; 0.88 is the probability that a collision is not critical. I need to find the probability that the ship is still in a serviceable condition after the third collision.
The analytical solution gives 0.681, but I need to solve this problem by simulation, using a modeling tool (MATLAB Simulink, AnyLogic, Scilab, etc.).
Do you know what components should be used to simulate this process in Simulink or any other simulation environment? Any examples or links would be appreciated.
First, we know the three-step transition probability matrix contains the answer (0.6815).
% MATLAB R2019a
P = [0.88 0.12;
     0    1];
P3 = P*P*P
P3(1,1) % 0.6815
Approach 1: Requires Econometrics Toolbox
This approach uses the dtmc() and simulate() functions.
First, create the Discrete Time Markov Chain (DTMC) with the probability transition matrix, P, and using dtmc().
mc = dtmc(P); % Create the DTMC
numSteps = 3; % Number of collisions
You can get one sample path easily using simulate(). Pay attention to how you specify the initial conditions.
% One Sample Path
rng(8675309) % for reproducibility
X = simulate(mc,numSteps,'X0',[1 0])
% Multiple Sample Paths
numSamplePaths = 3;
X = simulate(mc,numSteps,'X0',[numSamplePaths 0]) % returns a 4 x 3 matrix
The first row is the X0 row for the starting state (initial condition) of the DTMC. The second row is the state after 1 transition (X1). Thus, the fourth row is the state after 3 transitions (collisions).
% 50000 Sample Paths
rng(8675309) % for reproducibility
k = 50000;
X = simulate(mc,numSteps,'X0',[k 0]); % returns a 4 x 50000 matrix
prob_survive_3collisions = sum(X(end,:)==1)/k % 0.6800
We can put a 95% confidence interval on the mean probability of surviving 3 collisions by repeating the simulation: 0.6814 ± 0.00069221, i.e. [0.6807 0.6821], which contains the analytical result.
numTrials = 40;
ProbSurvive_3collisions = zeros(numTrials,1);
for trial = 1:numTrials
    Xtrial = simulate(mc,numSteps,'X0',[k 0]);
    ProbSurvive_3collisions(trial) = sum(Xtrial(end,:)==1)/k;
end
% Mean +/- Halfwidth
alpha = 0.05;
mean_prob_survive_3collisions = mean(ProbSurvive_3collisions)
hw = tinv(1-(0.5*alpha), numTrials-1)*(std(ProbSurvive_3collisions)/sqrt(numTrials))
ci95 = [mean_prob_survive_3collisions-hw mean_prob_survive_3collisions+hw]
% Survival probability as a function of the number of collisions
maxNumCollisions = 10;
numSamplePaths = 50000;
ProbSurvive = zeros(maxNumCollisions,1);
for numCollisions = 1:maxNumCollisions
    Xc = simulate(mc,numCollisions,'X0',[numSamplePaths 0]);
    ProbSurvive(numCollisions) = sum(Xc(end,:)==1)/numSamplePaths;
end
For a more complex system you'll want to use Stateflow or SimEvents, but for this simple example all you need is a single Unit Delay block (output = 0 => S1, output = 1 => S2), with a Switch block, a Random block, and some comparison blocks to construct the logic determining the next value of the state.
Presumably you must execute the simulation a (very) large number of times and average the results to get a statistically significant output.
You'll need to change the "seed" of the random generator each time you run the simulation.
This can be done by setting the seed to be "now" (or something similar to that).
Alternatively you could quite easily vectorize the model so that you only need to execute it once.
If you want to simulate this, it is fairly easy in MATLAB:
serviceable = 1;
t = 0;
while serviceable == 1
    t = t + 1;
    serviceable = rand() <= 0.88;   % the ship survives this collision with probability 0.88
end
Now t represents the number of collisions it takes to break the ship.
Wrap this in a for loop and you can run as many simulations as you like.
Note that this actually gives you the whole distribution of t; if you only want to know the state after 3 collisions, simply add && t < 3 to the while condition.
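For example, a loop wrapper for estimating the three-collision survival probability could look like this (a minimal sketch; nSim is an arbitrary choice):
nSim = 50000;                          % number of simulated ships (arbitrary choice)
survived = zeros(nSim, 1);
for s = 1:nSim
    serviceable = 1;
    t = 0;
    while serviceable == 1 && t < 3
        t = t + 1;
        serviceable = rand() <= 0.88;
    end
    survived(s) = serviceable;         % 1 if still serviceable after 3 collisions
end
probSurvive3 = mean(survived)          % should be close to the analytical 0.6815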

Logistic regression in Matlab, confused about the results

I am testing out logistic regression in Matlab on 2 datasets created from audio files:
The first set is created via wavread by extracting a vector from each file: the set is an 834-by-48116 matrix. Each training example is a 48116-element vector of the wav's frequencies.
The second set is created by extracting the frequencies of 3 formants of the vowels, where each formant (feature) has its own frequency range (for example, F1 ranges over 500-1500 Hz, F2 over 1500-2000 Hz, and so on). Each training example is a 3-element vector of the wav's formants.
I am implementing the algorithm like so:
Cost function and gradient:
h = sigmoid(X*theta);                               % hypothesis
J = sum(y'*log(h) + (1-y)'*log(1-h)) * -1/m;        % unregularized cost
grad = ((h-y)'*X)/m;                                % unregularized gradient
theta_partial = theta;
theta_partial(1) = 0;                               % do not regularize the bias term
J = J + ((lambda/(2*m)) * (theta_partial'*theta_partial));
grad = grad + (lambda/m * theta_partial');
where X is the dataset and y is the output matrix of 8 classes.
Classifier:
initial_theta = zeros(n + 1, 1);
options = optimset('GradObj', 'on', 'MaxIter', 50);
for c = 1:num_labels
    [theta] = fmincg(@(t)(lrCostFunction(t, X, (y==c), lambda)), initial_theta, options);
    all_theta(c, :) = theta';
end
where num_labels = 8 and lambda (the regularization parameter) is 0.1.
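For reference, a one-vs-all prediction step with the resulting all_theta might look like the following (a minimal sketch, not the asker's code; it assumes the same sigmoid helper, that X already contains the bias column, and that y is a vector of class labels 1..8, as the (y==c) comparison suggests):
% Sketch: one-vs-all prediction and accuracy
probs = sigmoid(X * all_theta');          % m x num_labels matrix of class scores
[~, predictions] = max(probs, [], 2);     % most likely class for each example
accuracy = mean(predictions == y) * 100   % percentage of correctly classified examples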
With the first set and MaxIter = 50, I get ~99.8% classification accuracy.
With the second set and MaxIter = 50, the accuracy is poor: 62.589928%.
I thought about increasing MaxIter to improve the performance; however, even with a ridiculous number of iterations, the result doesn't go above 66.546763%. Changing the regularization value (lambda) doesn't seem to improve the results either.
What could be the problem? I am new to machine learning and I can't figure out what exactly causes this drastic difference. The only obvious difference is that the first set's examples are very long vectors, and hence have a much larger number of features, while the second set's examples are short 3-element vectors. Is this not enough data to classify the second set? If so, what can be done to achieve better classification results for it?

Frequency array feeds FFT

The final goal is to generate a ten-minute time series; to get there I have to perform an FFT operation, and that is the step I am stuck on.
The target time series is the sum of two terms: a steady component U(t) and a fluctuating component u'(t). That is,
u(t) = U(t) + u'(t);
So generally, my code follows this procedure:
1) Given data
time = 600 [s];
Nfft = 4096;
L = 340.2 [m];
U = 10 [m/s];
df = 1/600 = 0.00167 Hz;
fn = Nfft/(2*time) = 3.4133 Hz;
This means that my frequency array should be laid out as follows:
f = (-fn+df):df:fn;
But, instead of using the whole f array, I am only making use of the positive half:
fpos = df:df:fn;   % 0.00167 : 0.00167 : 3.4133 Hz (Nfft/2 = 2048 points)
2) Spectrum Definition
I define a certain spectrum shape, applying the following relationship
Su = (6*L*U)./((1 + 6.*fpos.*(L/U)).^(5/3));
3) Random phase generation
Then I have to generate a set of complex samples with a given distribution: in my case, the random phases are drawn from a standard Gaussian distribution (mu = 0, sigma = 1).
In MATLAB I call
nn = complex(normrnd(0,1,[Nfft/2 1]), normrnd(0,1,[Nfft/2 1]));   % Nfft/2 complex Gaussian samples (column vector)
4) Apply random phase
To apply the random phase, I just do this
Hu = Su(:) .* nn;   % element-wise: one random complex sample per positive frequency
This is where my troubles start!
So far I have only generated Nfft/2 = 2048 complex samples, covering the fpos content; the content for the negative half of f is still missing. To overcome this, I was thinking of merging the real and imaginary parts of Hu in order to get a signal Huu with Nfft = 4096 samples and all real values.
But with this merging process the 0-th frequency bin would not be represented, since the imaginary part of Hu is only defined for fpos.
So how can I account for the 0-th bin while keeping a procedure like the one I have proposed so far?
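One common way to handle this (a minimal sketch, not from the original post; it assumes MATLAB's fft/ifft bin ordering and the Hu vector of positive-frequency samples defined above) is to set the 0-th (DC) bin explicitly, treat Hu(end) as the Nyquist bin, and fill the negative frequencies by conjugate symmetry before taking the inverse FFT:
% Sketch: assemble a conjugate-symmetric spectrum so that ifft returns a real signal
% (assumes Hu is the Nfft/2-by-1 vector of positive-frequency samples from above)
Huu               = zeros(Nfft, 1);
Huu(1)            = 0;                            % 0-th (DC) bin: zero-mean fluctuation
Huu(2:Nfft/2)     = Hu(1:Nfft/2-1);               % positive frequencies df ... fn-df
Huu(Nfft/2+1)     = real(Hu(Nfft/2));             % Nyquist bin (fn) must be real
Huu(Nfft/2+2:end) = conj(Huu(Nfft/2:-1:2));       % negative frequencies by symmetry
u_fluct = ifft(Huu, 'symmetric');                 % real-valued fluctuating component u'(t)
With the DC bin set to zero, the fluctuating component u'(t) has zero mean, and the steady part U(t) can simply be added afterwards, as in u(t) = U(t) + u'(t).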