Mystery degree of freedom in VAR coefficient stderr - matlab

I've been testing vector autoregressive coefficient estimation with vgxvarx in Matlab's Econometrics Toolbox. Once the coefficients are determined, vgxdisp gives you the choice of showing the standard errors estimated according to maximum likelihood or minimum bias. The only difference between the two is normalization by the number of observations versus by the degrees of freedom, respectively. Since both are constants, you should be able to verify the two sets of standard errors against each other by converting from one to the other: just un-normalize by one constant and re-normalize by the other.
I tried this and found that the minimum-bias estimate of the standard error seems to be off by one in the degrees of freedom. In the script below, I use vgxvarx to calculate the VAR model coefficients and then request maximum likelihood and minimum bias estimates of their standard errors from vgxdisp (DoFAdj=false and true, respectively). To validate the two, I then convert the standard errors from ML to min bias by un-normalizing by the number of observations (nPoints) and re-normalizing by the degrees of freedom LESS ONE (found by trial and error). These scalings have to be square-rooted because they apply to variances while we're comparing standard errors.
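To make the check concrete, here is the relationship I expected versus the one that actually holds (writing se_ML and se_MB for the two reported standard errors of the same coefficient, with degreeFree = nPoints - NumParam as computed in the script below):

expected: se_MB = se_ML * sqrt( nPoints / degreeFree )
observed: se_MB = se_ML * sqrt( nPoints / (degreeFree - 1) )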
I'm wondering if anyone can point out whether I am missing something basic that explains this mystery degree of freedom?
I originally posted this to Usenet. Here is a modification of the original code that sets the data inline so that it doesn't need to be downloaded from http://www.econ.uiuc.edu/~econ472/eggs.txt.
clear variables
fnameDiary = [mfilename '.out.txt'];
if logical(exist(fnameDiary,'file'))
diary off
delete(fnameDiary)
end % if
diary(fnameDiary) % Also turns on diary
CovarType='full' % 'full'
nMaxLag=3
clf
tbChicEgg=table([
1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 ...
1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 ...
1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 ...
1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 ...
1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 ...
2002 2003 2004 2005 2006 ...
]',[
468491 449743 436815 444523 433937 389958 403446 423921 ...
389624 418591 438288 422841 476935 542047 582197 516497 ...
523227 467217 499644 430876 456549 430988 426555 398156 ...
396776 390708 383690 391363 374281 387002 369484 366082 ...
377392 375575 382262 394118 393019 428746 425158 422096 ...
433280 421763 404191 408769 394101 379754 378361 386518 ...
396933 400585 392110 384838 378609 364584 374000 370000 ...
373000 380000 356000 356000 353000 363000 371000 380000 ...
386000 388000 393000 410000 425000 437000 437000 444000 ...
444000 450000 454000 453000 453000 ...
]',[
3581 3532 3327 3255 3156 3081 3166 3443 3424 3561 3640 3840 ...
4456 5000 5366 5154 5130 5077 5032 5148 5404 5322 5323 5307 ...
5402 5407 5500 5442 5442 5542 5339 5358 5403 5345 5435 5474 ...
5540 5836 5777 5629 5704 5806 5742 5502 5461 5382 5377 5408 ...
5608 5777 5825 5625 5800 5656 5683 5700 5758 5867 5808 5600 ...
5675 5750 5892 5992 6158 6233 6367 6458 6650 6908 7058 7175 ...
7275 7292 7425 7500 7575 ...
]', ...
'VariableNames', {'year' 'chic' 'egg'} ...
);
seriesNames={'chic','egg'};
varChicEgg = vgxset( 'Series', seriesNames, 'n',2 );
chicEgg = table2array(tbChicEgg(:,seriesNames));
dChicEgg = diff(chicEgg);
dChicEgg = bsxfun( @minus, dChicEgg, mean(dChicEgg) ); % Make 0-mean
dChicEgg0 = dChicEgg(1:nMaxLag,:); % Presample-data
dChicEgg = dChicEgg(1+nMaxLag:end,:);
nPoints = length(dChicEgg)
yrs = table2array(tbChicEgg(1+nMaxLag:end,'year'));
yrs = yrs(1:nPoints);
subplot(3,1,1);
plotyy( yrs,dChicEgg(:,1) , yrs,dChicEgg(:,2) );
for DoFAdj = [false true]
% DoFAdj=1 means std err normalizes by df rather than n, where
% n=number of observations and df is n less the number of
% parameters estimated (from vgxdisp or vgxcount's NumParam)
[est.spec, est.stdErr, est.LLF, est.W] = vgxvarx( ...
vgxset( varChicEgg, 'nAR',nMaxLag ), ...
dChicEgg, NaN, dChicEgg0, ...
'StdErrType', 'all', ...
'CovarType', CovarType ...
);
fprintf('-------------------------\nDoFAdj=%g\n',DoFAdj);
subplot(3,1,2+DoFAdj)
plotyy(yrs,est.W(:,1),yrs,est.W(:,2))
vgxdisp(est.spec,est.stdErr,'DoFAdj',DoFAdj);
end
fprintf('\nConvert ML stderr (DoFAdj=false) to min bias (DoFAdj=true):\n');
fprintf('Number of parameters: ')
[~,NumParam]=vgxcount(est.spec)
degreeFree = nPoints - NumParam
fprintf('\n');
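% ML standard errors copied from the DoFAdj=false vgxdisp output above: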
stderr_ML_2_minBias=[
0.148195
21.1939
0.00104974
0.150127
0.160034
22.2911
0.0011336
0.157899
0.147694
20.9146
0.00104619
0.148148
6.43245e+07
381484
3227.54
] ...
* sqrt( nPoints / ( degreeFree - 1 ) );
for iParam = 1:length(stderr_ML_2_minBias)
disp(stderr_ML_2_minBias(iParam));
end
%--------------------------------------------------
diary off
% error('Stopping before return.');
return
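As one more sanity check, both normalizations can be rebuilt directly from the residuals that vgxvarx returns (a sketch, run at the command line after the script above; it assumes the reported innovation covariance is formed from est.W):

Sigma_ML = (est.W.' * est.W) / nPoints    % ML: normalize by n
Sigma_MB = (est.W.' * est.W) / degreeFree % min bias: normalize by df
sqrt( nPoints / degreeFree )              % the ratio se_MB/se_ML I expected to see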

Related

I'm using a version of pmdarima that no longer includes the statsmodels ARIMA or ARMA classes. How do I interpret the SARIMAX results without a (p,d,q) order shown?

auto_arima(df1['Births'],seasonal=False).summary()
SARIMAX Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  120
Model:                        SARIMAX   Log Likelihood                -409.745
Date:                Mon, 23 Aug 2021   AIC                            823.489
Time:                        06:55:06   BIC                            829.064
Sample:                             0   HQIC                           825.753
                                - 120
Covariance Type:                  opg
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept     39.7833      0.687     57.896      0.000      38.437      41.130
sigma2        54.1197      8.319      6.506      0.000      37.815      70.424
===================================================================================
Ljung-Box (L1) (Q):                   0.85   Jarque-Bera (JB):                 2.69
Prob(Q):                              0.36   Prob(JB):                         0.26
Heteroskedasticity (H):               0.80   Skew:                             0.26
Prob(H) (two-sided):                  0.48   Kurtosis:                         2.48
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
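Reading the summary: since it contains only intercept and sigma2 terms (no ar.L*, ma.L*, or differencing rows), auto_arima evidently settled on order (0,0,0), i.e. white noise around a constant mean. Rather than parsing the printed summary, the selected (p,d,q) can also be read from the fitted model object's order attribute (assuming a recent pmdarima; that attribute has been stable across releases).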

Matlab to calculate a spectral line parameter for each layer

I need to calculate a parameter x (defined in my code below) for the given spectral lines in each layer. My atmospheric profile has 10 layers. I know how to calculate x for just one layer; that gives 5 values of x, one for each spectral line (or wavelength).
Suppose I want to do this for all 10 layers. Then my output should have 10 rows and 5 columns, i.e. size (10,5), where 10 is the number of layers and 5 the number of spectral lines. Any suggestion would be greatly appreciated.
wl=[100 200 300 400 500]; %5 wavelengths, 5 spectral lines
br=[0.12 0.56 0.45 0.67 0.89]; % broadening parameter for each wavelength
p=[1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 ]; % pressure for 10 layers
T=[101 102 103 104 105 106 107 108 109 110]; % temperature for 10 layers
%suppose I want to caculate a parameter, x for all the layers
% x is defined as,( wavelength*br*T)/p
%when I do the calculation for the first layer,I have to consider all the
%wavelengths , all the broadening parameters and only the first value of
%pressure and only the first value of temperature
for i=1:5;
x(i)= (wl(i)*br(i)*T(1))/p(1);
end
% x is the x parameter for all the wavelengths in the first layer
%Now I want to calculate the x parameter for all the wavelengths in all 10
%layers
%my output should have 10 rows for 10 layers and 5 columns , size= (10,5)
You don't need a loop for this case; the result is just an outer product:
>> (T./p)'*(wl.*br)
ans =
1.0e+05 *
0.0121 0.1131 0.1364 0.2707 0.4495
0.0136 0.1269 0.1530 0.3037 0.5043
0.0155 0.1442 0.1738 0.3451 0.5729
0.0178 0.1664 0.2006 0.3982 0.6611
0.0210 0.1960 0.2362 0.4690 0.7788
0.0254 0.2374 0.2862 0.5682 0.9434
0.0321 0.2996 0.3611 0.7169 1.1904
0.0432 0.4032 0.4860 0.9648 1.6020
0.0654 0.6104 0.7358 1.4606 2.4253
0.1320 1.2320 1.4850 2.9480 4.8950
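For reference, here is an equivalent explicit-loop version of the same computation (a sketch; it just applies the question's per-layer formula to every layer and confirms it matches the outer product):

% x(k,j) = wl(j)*br(j)*T(k)/p(k) for layer k and spectral line j
x = zeros(numel(p), numel(wl));        % preallocate the (10,5) result
for k = 1:numel(p)
    x(k,:) = (T(k)/p(k)) * (wl .* br); % all 5 spectral lines of layer k
end
max(max(abs(x - (T./p)'*(wl.*br))))    % 0 up to round-off: same result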

Matlab: vectorization and fitlm

I have a vectorization problem with nlinfit.
Let A be an (n,p) matrix of observations and t a (1,p) explanatory variable. For example:
t=[0 1 2 3 4 5 6 7]
and
A=[3.12E-04 7.73E-04 3.58E-04 5.05E-04 4.02E-04 5.20E-04 1.84E-04 3.70E-04
3.38E-04 3.34E-04 3.28E-04 4.98E-04 5.19E-04 5.05E-04 1.97E-04 2.88E-04
1.09E-04 3.64E-04 1.82E-04 2.91E-04 1.82E-04 3.62E-04 4.65E-04 3.89E-04
2.70E-04 3.37E-04 2.03E-04 1.70E-04 1.37E-04 2.08E-04 1.05E-04 2.45E-04
3.70E-04 3.34E-04 2.63E-04 3.21E-04 2.52E-04 2.81E-04 6.25E+09 2.51E-04
3.11E-04 3.68E-04 3.65E-04 2.71E-04 2.69E-04 1.49E-04 2.97E-04 4.70E-04
5.48E-04 4.12E-04 5.55E-04 5.94E-04 6.10E-04 5.44E-04 5.67E-04 4.53E-04
....
]
I want to estimate a linear model for each row of A, avoiding the loop
for i=1:7
ml{i} = fitlm(A(i,:), t);
end
Thanks for your help !
Luc
I believe your problem is about understanding how fitlm works for matrix inputs.
Let's work with the hald example that ships with Matlab:
>> load hald
>> Description
Description =
== Portland Cement Data ==
Multiple regression data
ingredients (%):
column1: 3CaO.Al2O3 (tricalcium aluminate)
column2: 3CaO.SiO2 (tricalcium silicate)
column3: 4CaO.Al2O3.Fe2O3 (tetracalcium aluminoferrite)
column4: 2CaO.SiO2 (beta-dicalcium silicate)
heat (cal/gm):
heat of hardening after 180 days
Source:
Woods,H., H. Steinour, H. Starke,
"Effect of Composition of Portland Cement on Heat Evolved
during Hardening," Industrial and Engineering Chemistry,
v.24 no.11 (1932), pp.1207-1214.
Reference:
Hald,A., Statistical Theory with Engineering Applications,
Wiley, 1960.
>> ingredients
ingredients =
7 26 6 60
1 29 15 52
11 56 8 20
11 31 8 47
7 52 6 33
11 55 9 22
3 71 17 6
1 31 22 44
2 54 18 22
21 47 4 26
1 40 23 34
11 66 9 12
10 68 8 12
>> heat
heat =
78.5000
74.3000
104.3000
87.6000
95.9000
109.2000
102.7000
72.5000
93.1000
115.9000
83.8000
113.3000
109.4000
This means that the ingredients matrix holds, column by column, the % of each ingredient in a sample:
>> sum(ingredients(1,:))
ans =
99 % so it is near 100%
and the rows are the 13 measurements of the product; heat is the vector of heats recorded at each observation.
>> mdl = fitlm(ingredients,heat)
mdl =
Linear regression model:
y ~ 1 + x1 + x2 + x3 + x4
Estimated Coefficients:
Estimate SE tStat pValue
________ _______ ________ ________
(Intercept) 62.405 70.071 0.8906 0.39913
x1 1.5511 0.74477 2.0827 0.070822
x2 0.51017 0.72379 0.70486 0.5009
x3 0.10191 0.75471 0.13503 0.89592
x4 -0.14406 0.70905 -0.20317 0.84407
Number of observations: 13, Error degrees of freedom: 8
Root Mean Squared Error: 2.45
R-squared: 0.982, Adjusted R-Squared 0.974
F-statistic vs. constant model: 111, p-value = 4.76e-07
So in your case it doesn't make sense to fit each observation separately; you simply pass t, which has the same number of elements as there are observations:
take a look here
mdl = fitlm(A,t)
Problem solved using splitapply and findgroups!
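If the goal really is one straight-line fit per row of A, all of the fits can also be obtained in a single least-squares solve, with no loop and no fitlm. A sketch, assuming t is the predictor and each row of A is the response (as the question's description, though not its loop, suggests):

X = [ones(numel(t),1), t(:)]; % shared design matrix: intercept and slope
B = X \ A.';                  % 2-by-n: column i holds [intercept; slope] for row i of A

Each column of B is the coefficient pair of one row's regression; if standard errors or p-values are needed, fitlm in a loop (or arrayfun) is still the way to get them.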

classifier.setOptions(weka.core.Utils.splitOptions(...)) takes only default values even if other values are provided in Matlab

import weka.core.Instances.*
filename = 'C:\Users\Girish\Documents\MATLAB\DRESDEN_NSC.csv';
loader = weka.core.converters.CSVLoader();
loader.setFile(java.io.File(filename));
data = loader.getDataSet();
data.setClassIndex(data.numAttributes()-1);
%% classification
classifier = weka.classifiers.trees.J48();
classifier.setOptions( weka.core.Utils.splitOptions('-C 0.25 -M 2') );
classifier.buildClassifier(data);
classifier.toString()
ev = weka.classifiers.Evaluation(data);
v(1) = java.lang.String('-t');
v(2) = java.lang.String(filename);
v(3) = java.lang.String('-split-percentage');
v(4) = java.lang.String('66');
prm = cat(1,v(1:4));
ev.evaluateModel(classifier, prm)
Result:
Time taken to build model: 0.04 seconds
Time taken to test model on training split: 0.01 seconds
=== Error on training split ===
Correctly Classified Instances 767 99.2238 %
Incorrectly Classified Instances 6 0.7762 %
Kappa statistic 0.9882
Mean absolute error 0.0087
Root mean squared error 0.0658
Relative absolute error 1.9717 %
Root relative squared error 14.042 %
Total Number of Instances 773
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.994 0.009 0.987 0.994 0.990 0.984 0.999 0.999 Nikon
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 Sony
0.981 0.004 0.990 0.981 0.985 0.980 0.999 0.997 Canon
Weighted Avg. 0.992 0.004 0.992 0.992 0.992 0.988 1.000 0.999
=== Confusion Matrix ===
a b c <-- classified as
306 0 2 | a = Nikon
0 258 0 | b = Sony
4 0 203 | c = Canon
=== Error on test split ===
Correctly Classified Instances 358 89.9497 %
Incorrectly Classified Instances 40 10.0503 %
Kappa statistic 0.8482
Mean absolute error 0.0656
Root mean squared error 0.2464
Relative absolute error 14.8485 %
Root relative squared error 52.2626 %
Total Number of Instances 398
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.885 0.089 0.842 0.885 0.863 0.787 0.908 0.832 Nikon
0.993 0.000 1.000 0.993 0.997 0.995 0.997 0.996 Sony
0.796 0.060 0.841 0.796 0.818 0.749 0.897 0.744 Canon
Weighted Avg. 0.899 0.048 0.900 0.899 0.899 0.853 0.938 0.867
=== Confusion Matrix ===
a b c <-- classified as
123 0 16 | a = Nikon
0 145 1 | b = Sony
23 0 90 | c = Canon
The same code with the pruning options changed to -C 0.1 -M 1 follows; it produces an identical result:
import weka.core.Instances.*
filename = 'C:\Users\Girish\Documents\MATLAB\DRESDEN_NSC.csv';
loader = weka.core.converters.CSVLoader();
loader.setFile(java.io.File(filename));
data = loader.getDataSet();
data.setClassIndex(data.numAttributes()-1);
%% classification
classifier = weka.classifiers.trees.J48();
classifier.setOptions( weka.core.Utils.splitOptions('-C 0.1 -M 1') );
classifier.buildClassifier(data);
classifier.toString()
ev = weka.classifiers.Evaluation(data);
v(1) = java.lang.String('-t');
v(2) = java.lang.String(filename);
v(3) = java.lang.String('-split-percentage');
v(4) = java.lang.String('66');
prm = cat(1,v(1:4));
ev.evaluateModel(classifier, prm)
Result:
Time taken to build model: 0.04 seconds
Time taken to test model on training split: 0 seconds
=== Error on training split ===
Correctly Classified Instances 767 99.2238 %
Incorrectly Classified Instances 6 0.7762 %
Kappa statistic 0.9882
Mean absolute error 0.0087
Root mean squared error 0.0658
Relative absolute error 1.9717 %
Root relative squared error 14.042 %
Total Number of Instances 773
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.994 0.009 0.987 0.994 0.990 0.984 0.999 0.999 Nikon
1.000 0.000 1.000 1.000 1.000 1.000 1.000 1.000 Sony
0.981 0.004 0.990 0.981 0.985 0.980 0.999 0.997 Canon
Weighted Avg. 0.992 0.004 0.992 0.992 0.992 0.988 1.000 0.999
=== Confusion Matrix ===
a b c <-- classified as
306 0 2 | a = Nikon
0 258 0 | b = Sony
4 0 203 | c = Canon
=== Error on test split ===
Correctly Classified Instances 358 89.9497 %
Incorrectly Classified Instances 40 10.0503 %
Kappa statistic 0.8482
Mean absolute error 0.0656
Root mean squared error 0.2464
Relative absolute error 14.8485 %
Root relative squared error 52.2626 %
Total Number of Instances 398
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.885 0.089 0.842 0.885 0.863 0.787 0.908 0.832 Nikon
0.993 0.000 1.000 0.993 0.997 0.995 0.997 0.996 Sony
0.796 0.060 0.841 0.796 0.818 0.749 0.897 0.744 Canon
Weighted Avg. 0.899 0.048 0.900 0.899 0.899 0.853 0.938 0.867
=== Confusion Matrix ===
a b c <-- classified as
123 0 16 | a = Nikon
0 145 1 | b = Sony
23 0 90 | c = Canon
The result is the same with both option strings, and it matches the result for the J48 classifier's default options, i.e. -C 0.25 -M 2.
Please help! I've been stuck on this for a long time and have tried different approaches, but nothing has worked.
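One thing worth checking (a sketch, not a confirmed diagnosis): ev.evaluateModel(classifier, prm) with a -t argument runs Weka's command-line training driver, which can rebuild the classifier itself and so may not honor options set earlier through setOptions. Doing the 66% split by hand and calling the instance method evaluateModel(classifier, test) guarantees that the classifier you configured and built is the one being evaluated:

data.randomize(java.util.Random(1)); % fixed seed (an arbitrary choice)
trainSize = floor(0.66 * data.numInstances());
train = weka.core.Instances(data, 0, trainSize);
test = weka.core.Instances(data, trainSize, data.numInstances() - trainSize);
classifier = weka.classifiers.trees.J48();
classifier.setOptions(weka.core.Utils.splitOptions('-C 0.1 -M 1'));
classifier.buildClassifier(train);   % built with YOUR options
ev = weka.classifiers.Evaluation(train);
ev.evaluateModel(classifier, test);  % instance method: no retraining
disp(char(ev.toSummaryString()))

If the two option strings now produce different trees (compare classifier.toString()), the options were being ignored by the command-line path rather than by setOptions.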

Matlab: Merge year into interval

I have a matrix of decades (note that some decades are deliberately missing):
decades = [1910 1920; 1921 1930; 1931 1940; 1951 1960]
and a vector with some years (max. 1 per decade) with a piece of information (let's say accidents):
years = [1916 35; 1923 77; 1939 28; 1941 40; 1951 32]
Is there a way to combine the information to the decades, other than using a loop?
result = [1910 1920 35; 1921 1930 77; 1931 1940 28; 1951 1960 32]
Assumptions (taken from OP's comments):
a. There won't be more than one year per decade.
b. decade and years are always sorted.
Code
% Slightly different inputs to verify the correctness of the code across
% general cases, but within the above-mentioned assumptions
decades = [1910 1920; 1921 1930; 1931 1940; 1951 1960; 1971 1980]
years = [1916 35; 1939 28; 1941 40; 1951 32]
cm = bsxfun(@ge,years(:,1),decades(:,1)') & bsxfun(@le,years(:,1),decades(:,2)')
select_years = any(cm,2)
select_decades = any(cm,1)
% If the result is needed such that decades which do not have an entry in
% years must be logged with the third column as 0:
result = [decades sum(bsxfun(@times,cm,years(:,2)))']
% If the result is needed such that decades which do not have an entry in
% years must be skipped:
result = [decades(select_decades,:) years(select_years,2)]
Output
decades =
1910 1920
1921 1930
1931 1940
1951 1960
1971 1980
years =
1916 35
1939 28
1941 40
1951 32
result =
1910 1920 35
1921 1930 0
1931 1940 28
1951 1960 32
1971 1980 0
result =
1910 1920 35
1931 1940 28
1951 1960 32
This accumulates values if there is more than one per decade. It also handles the case when a decade doesn't have any value.
[~, bin] = histc(years(:,1),reshape(decades.',[],1)); % find bin of each value
bin = (bin+1)/2; % non-integers here indicate intervals between decades
bin(mod(bin,1)~=0) = size(decades,1)+1; % values between decades: move to end
accum = accumarray(bin,years(:,2)); % accumulate all values from each bin
result = [decades accum(1:end-1)]; % remove end bin (values between decades)
Example with two values in one decade, and zero values in some other decade:
decades = [1910 1920; 1921 1930; 1931 1940; 1951 1960];
years = [1916 35; 1918 77; 1939 28; 1941 40; 1951 32];
result =
1910 1920 112
1921 1930 0
1931 1940 28
1951 1960 32
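The same binning idea can be written with discretize (introduced in R2015a) in place of the now-deprecated histc. A sketch, under the same assumptions and with the same accumulate-and-zero-fill behavior:

edges = reshape(decades.', [], 1);   % flatten decade bounds into an edge vector
bin = discretize(years(:,1), edges); % bin among consecutive edges (NaN if out of range)
dec = (bin + 1) / 2;                 % integer => the year lies inside a decade
ok = mod(dec, 1) == 0;               % drop years falling in gaps between decades
accum = accumarray(dec(ok), years(ok,2), [size(decades,1) 1]);
result = [decades accum]

With the example above this likewise returns 112 for 1910-1920 and 0 for 1921-1930.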
decades = [1910 1920; 1921 1930; 1931 1940; 1951 1960]
years = [1916 35; 1923 77; 1939 28; 1951 32]
new_decades_1=repmat(decades(:,1),1,size(decades,1))
new_decades_2=repmat(decades(:,2),1,size(decades,1))
new_years=repmat(years(:,1),1,size(years,1))
cond=(new_decades_1<=new_years') & (new_decades_2>=new_years')
[x,y]=find(cond);
result=[decades,years(x,2)]
This however requires that years and decades have the same length.