Residuals in TraMineR's seqecmpgroup function

In the TraMineR package, the seqecmpgroup function output is a table, for example:
  Subsequence   Support   p.value statistic index  Freq.Low Freq.Medium Freq.High  Resid.Low Resid.Medium Resid.High
1         A-B 0.2685714 0.0000213 21.516234   456 0.1619048   0.2714286 0.3619048 -2.9826924   0.17258979 2.60985584
2       A-C-D 0.2714286 0.0005804 14.903681   495 0.1952381   0.2683673 0.3619048 -2.1192518   -0.1839418 2.51661148
3       C-D-F 0.2614286 0.0013223 13.256739   492 0.1619048   0.2744898 0.3       -2.8207209   0.79968768 1.09319805
Are the residuals in the table the standardized residuals?

The residuals provided by seqecmpgroup are Pearson residuals, the same ones that the chisq.test function returns as residuals. They correspond to (observed - expected)/sqrt(expected), i.e. the signed square root of each cell's contribution to the Pearson chi-squared statistic. These are NOT the standardized residuals that chisq.test returns as stdres.
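A quick way to see the distinction in R, using a small hypothetical 2x2 table (not TraMineR output):

tab <- matrix(c(30, 10, 20, 40), nrow = 2)
ct  <- chisq.test(tab)
ct$residuals  # Pearson residuals: (observed - expected)/sqrt(expected)
ct$stdres     # standardized residuals, additionally adjusted for the margins
# the Pearson residuals match the formula above:
all.equal(ct$residuals, (ct$observed - ct$expected)/sqrt(ct$expected))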

Related

mgcv: Difference between s(x, by=cat) and s(cat, bs='re')

What is the difference between adding a by= parameter to a smooth and adding a random effect smooth?
I've tried both, and get different results. E.g.:
library(mgcv)
library(tibble)  # tibble() is not in base R or mgcv
set.seed(26)
gam.df <- tibble(y=rnorm(400),
                 x1=rnorm(400),
                 cat=factor(rep(1:4, each=100)))
gam0 <- gam(y ~ s(x1, by=cat), data=gam.df)
summary(gam0)
produces:
15:15:39> summary(gam0)
Family: gaussian
Link function: identity

Formula:
y ~ s(x1, by = cat)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001275   0.049087  -0.026    0.979

Approximate significance of smooth terms:
           edf Ref.df     F p-value
s(x1):cat1   1      1 7.437 0.00667 **
s(x1):cat2   1      1 0.047 0.82935
s(x1):cat3   1      1 0.393 0.53099
s(x1):cat4   1      1 0.019 0.89015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) = 0.00968   Deviance explained = 1.96%
GCV = 0.97413  Scale est. = 0.96195  n = 400
On the other hand:
gam1 <- gam(y ~ s(x1) + s(cat, bs='re'), data=gam.df)
summary(gam1)
produces:
15:16:33> summary(gam1)
Family: gaussian
Link function: identity

Formula:
y ~ s(x1) + s(cat, bs = "re")

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0001211  0.0572271   0.002    0.998

Approximate significance of smooth terms:
          edf Ref.df     F p-value
s(x1)  1.0000      1 2.359   0.125
s(cat) 0.7883      3 0.356   0.256

R-sq.(adj) = 0.00594   Deviance explained = 1.04%
GCV = 0.97236  Scale est. = 0.96558  n = 400
I understand that by= shows the summary by each factor level, but shouldn't the overall results of the model such as R^2 be the same?
The factor-by model, gam0, contains a separate smooth of x1 for each level of cat, but doesn't include anything specifically for the means of y in each group[*] because it is misspecified. Compare this with gam1, which has a single smooth of x1 plus group means for the levels of cat.
Even though you generated random data without any smooth or group-level effects, gam0 is potentially a much more complex and flexible model: it contains 4 separate smooths of x1, each using potentially 9 degrees of freedom. gam1 has a single smooth of x1 using up to 9 degrees of freedom, plus between 0 and 4 degrees of freedom for the random-effect smooth. gam0 is simply exploiting random variation in the data that can be explained a little by those extra potential degrees of freedom. You can see this in the adjusted R-squared, which is lower for gam0 even though it explains roughly twice the deviance that gam1 does (not that either explains a respectable amount of deviance).
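To pull those two quantities out directly (a small sketch; summary.gam stores them as the components r.sq and dev.expl):

smry0 <- summary(gam0)
smry1 <- summary(gam1)
c(gam0 = smry0$r.sq, gam1 = smry1$r.sq)            # adjusted R-squared
c(gam0 = smry0$dev.expl, gam1 = smry1$dev.expl)    # deviance explained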
r$> library("gratia")
r$> smooths(gam0)
[1] "s(x1):cat1" "s(x1):cat2" "s(x1):cat3" "s(x1):cat4"
r$> smooths(gam1)
[1] "s(x1)" "s(cat)"
[*] Note that your by model should be
gam0 <- gam(y ~ cat + s(x1, by=cat), data=gam.df)
because the smooths created by s(x1, by=cat) are subject to an identifiability constraint (as there's a constant term, the intercept, in the model). This is a sum-to-zero constraint, which means the individual smooths cannot contain the group means. Without the parametric cat term, the smooths would have to model not only how y changes as a function of x1 in each group but also the mean level of y in each group, yet the sum-to-zero constraint removes exactly the constant functions from each basis that could represent such (mean-level) effects.
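For comparison, a sketch fitting this corrected factor-by model on the same gam.df and setting it against gam1 (gam0b is a name introduced here; AIC() works on fitted gam objects):

gam0b <- gam(y ~ cat + s(x1, by=cat), data=gam.df)  # group means + per-group smooths
AIC(gam0b, gam1)                                    # compare the two specifications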

Solve for 3 variables with quite a complex equation in MATLAB

syms omega j theta sigma
k=[90 94 98 102 106 110]; % Strike prices
putprices=[.1606 .2806 .9285 2.8801 6.1523 10.0147]; % Put prices at each strike
mustar = log(100)-sigma.^2/2-omega.*(exp(-theta)-1); % risk-neutral mean
Es1j=exp(mustar-theta.*j+sigma.^2./2); % Expected value of stock price at state j 1 period from now
x1=(log(k)-(mustar-theta.*j))./sigma; % the first normcdf value
x2=(log(k)-(mustar-theta.*j)-sigma.^2)./sigma; % the second normcdf value
qpkandj = k.*normcdf(x1)-Es1j.*normcdf(x2); % combination of normcdfs, strike prices, and expected stock price 1 period from now
qpk=exp(-omega).*omega.^j./factorial(j).*qpkandj; % Poisson formula
eqns = symsum(qpk,j,0,inf)== putprices % Continuation of Poisson formula
assume(omega > 0)
assume(theta, 'real')
assume(sigma, 'real')
S = solve(eqns, [omega theta sigma], 'Real', true, 'ReturnConditions', true)
S.omega
S.theta
S.sigma
I'm working on a Poisson mixture model to find the values of the three parameters (omega, theta, and sigma) that match the Poisson equation's outputs to the put prices in the third line. Whenever I run this, I get a very, very long list of conditions. I tried putting it all into the assume conditions, and it made no difference. Does anyone have an idea what's going on, or how I could make this work?
MATLAB prints the condition list because you didn't add a semicolon. To avoid printing the condition list, just terminate the statement with a semicolon, like this:
eqns = symsum(qpk,j,0,inf) == putprices; % Continuation of Poisson formula
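The same rule applies to any MATLAB statement: a trailing semicolon suppresses echoing of the result to the command window, as this minimal illustration shows:

a = 1 + 1    % echoes "a = 2" to the command window
b = 1 + 1;   % assigns b silently; disp(b) would still print its value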

Looping through column data to fit a curve in MATLAB

I have a spreadsheet ANA2009_1ag.xlsx with column 1 = wavelength lambda, column 2 = spectra1, column 3 = spectra2, column 4 = spectra3, etc.
I want to fit a curve to each of the spectra using the equation:
ag(lambda) = ag(lambda_o)*exp(-S*(lambda - lambda_o))
So I can get the slope S.
I have the code, and I am able to run it if I have a text file with 2 columns:
column 1 = wavelength lambda and column 2 = spectra1
But as I said above, I have several spectra, so I would like to make a loop so that the fitting routine moves from one column to the next and collects the slope for each. As a novice in MATLAB, loops are my biggest problem.
My code is as follows.
%//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
[num,txt,raw]=xlsread('ANA2009_1ag.xlsx');
%// Get the size of data to be plotted
[r,c]=size(num);
%//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%// using txt vector to make appropriate legend that matches the data plot
Spectra=cellstr(txt); %//C = cellstr converts array to a cell array.
%//FIND OUT HOW YOU CAN SAY TO END OF COLUMN
for i=2:7
Spectra(i)=txt(i);
end
%//%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%//takeout range of interest
I=find(num(:,1)<650 & num(:,1)>400);
wl=num(I,1);
a_g=num(I,2);
%//setting options for fminsearch
opts = optimset('fminsearch');
opts = optimset(opts,'MaxIter',4000);
opts = optimset(opts,'MaxFunEvals',2000); %// usually 100*number of params
opts = optimset(opts,'TolFun',1e-9);
%//opts = optimset('LevenbergMarquardt','on');
%//guess for paramters (amplitude at 532 and slope)
x0=[0.1, 0.03];
%//minimization routine
x1 = fminsearch(@least_squares,x0,opts,a_g,wl)
%//plot data and fit
plot(wl, a_g, '.k', wl, x1(1)*exp(-x1(2)*(wl-412)),'b')
The data run from 300 nm to 685 nm, but I chose to fit the curve over 400-650 nm.
Wavelength AN10070A-4m AN10066A-10m
300 1.561434 1.434769
300.5 1.549919 1.42786
301 1.531495 1.414042
301.5 1.506162 1.400224
302 1.483132 1.386406
302.5 1.467011 1.372588
303 1.45089 1.356467
303.5 1.443981 1.342649
304 1.42786 1.333437
304.5 1.414042 1.324225
305 1.407133 1.31271
My expected output is:
x1, which is a vector of the amplitude at 412 nm and the slope S.
I would like to post the figure that I got for the curve fit, but I don't know how to do it.
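A minimal sketch of the loop being asked for, assuming the least_squares objective function used in the original code (not shown in the question) is on the path; each spectrum column is fitted in turn and the fitted parameters are collected row by row:

%// Fit every spectrum (columns 2..c of num) and collect [amplitude, slope]
params = zeros(c-1, 2);
for col = 2:c
    a_g = num(I, col);  %// spectrum in the current column, restricted to 400-650 nm
    params(col-1, :) = fminsearch(@least_squares, x0, opts, a_g, wl);
end
%// params(:,1) holds the amplitudes, params(:,2) the slopes S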

Getting rank deficient warning when using regress function in MATLAB

I have a dataset comprising 30 independent variables, and I tried performing linear regression in MATLAB R2010b using the regress function.
I get a warning stating that my matrix X is rank deficient to within machine precision.
The coefficients I get after executing this function don't match the experimental ones.
Can anyone suggest how to perform the regression analysis for this equation with 30 variables?
Going with our discussion, the reason you are getting that warning is that you have what is known as an underdetermined system. Basically, you have more variables to solve for than equations available to constrain them. One example of an underdetermined system is:
x + y + z = 1
x + y + 2z = 3
There are an infinite number of combinations of (x,y,z) that solve the above system, for example (x, y, z) = (1, −2, 2), (2, −3, 2), and (3, −4, 2). What rank deficient means in your case is that there is more than one set of regression coefficients satisfying the governing equation that relates your input variables to your output observations. This is probably why the output of regress doesn't match your ground-truth regression coefficients; though it isn't the same answer, it is one possible answer. Running regress on your data, with your observation matrix defined as X and your output vector as Y, this is what I get:
>> format long g;
>> B = regress(Y, X);
>> B
B =
0
0
28321.7264417536
0
35241.9719076362
899.386999172398
-95491.6154990829
-2879.96318251964
-31375.7038251919
5993.52959752106
0
18312.6649115112
0
0
8031.4391233753
27923.2569044728
7716.51932560781
-13621.1638587172
36721.8387047613
80622.0849069525
-114048.707780113
-70838.6034825939
-22843.7931997405
5345.06937207617
0
106542.307940305
-14178.0346010715
-20506.8096166108
-2498.51437396558
6783.3107243113
You can see that there are seven regression coefficients equal to 0, which corresponds to 30 - 23 = 7: we have 30 variables but only 23 constraints to work with. Be advised that this is not the only possible solution. regress computes a least-squares solution, i.e. one that minimizes the sum of squared residuals Y - X*B. This essentially simplifies to:
B = X^(*)*Y
X^(*) is what is known as the pseudo-inverse of the matrix. MATLAB has this available, and it is called pinv. Therefore, if we did:
B = pinv(X)*Y
We get:
B =
44741.6923363563
32972.479220139
-31055.2846404536
-22897.9685877566
28888.7558524005
1146.70695371731
-4002.86163441217
9161.6908044046
-22704.9986509788
5526.10730457192
9161.69080479427
2607.08283489226
2591.21062004404
-31631.9969765197
-5357.85253691504
6025.47661106009
5519.89341411127
-7356.00479046122
-15411.5144034056
49827.6984426955
-26352.0537850382
-11144.2988973666
-14835.9087945295
-121.889618144655
-32355.2405829636
53712.1245333841
-1941.40823106236
-10929.3953469692
-3817.40117809984
2732.64066796307
You see that there are no zero coefficients, because pinv returns the minimum L2-norm least-squares solution, which spreads the magnitude over all of the coefficients rather than concentrating it in a few (for lack of a better term). You can verify that these are valid regression coefficients by doing:
>> Y2 = X*B
Y2 =
16.1491563400241
16.1264219600856
16.525331600049
17.3170318001845
16.7481541301999
17.3266932502295
16.5465094100486
16.5184456100487
16.8428701100165
17.0749421099829
16.7393450000517
17.2993993099419
17.3925811702017
17.3347117202356
17.3362798302375
17.3184486799219
17.1169638102517
17.2813552099096
16.8792925100727
17.2557945601102
17.501873690151
17.6490477001922
17.7733493802508
Similarly, if we used the regression coefficients from regress, so B = regress(Y,X); then doing Y2 = X*B, we get:
Y2 =
16.1491563399927
16.1264219599996
16.5253315999987
17.3170317999969
16.7481541299967
17.3266932499992
16.5465094099978
16.5184456099983
16.8428701099975
17.0749421099985
16.7393449999981
17.2993993099983
17.3925811699993
17.3347117199991
17.3362798299967
17.3184486799987
17.1169638100025
17.281355209999
16.8792925099983
17.2557945599979
17.5018736899983
17.6490476999977
17.7733493799981
There are some slight numerical differences, which is to be expected. Similarly, we can also find the answer by using mldivide:
B = X \ Y
B =
0
0
28321.726441712
0
35241.9719075889
899.386999170666
-95491.6154989513
-2879.96318251572
-31375.7038251485
5993.52959751295
0
18312.6649114859
0
0
8031.43912336425
27923.2569044349
7716.51932559712
-13621.1638586983
36721.8387047123
80622.0849068411
-114048.707779954
-70838.6034824987
-22843.7931997086
5345.06937206919
0
106542.307940158
-14178.0346010521
-20506.8096165825
-2498.51437396236
6783.31072430201
You can see that this curiously matches up with what regress gives you. That's because \ is a smarter operator: depending on how your matrix is structured, it chooses a different method to solve the system. I'd like to refer you to the post by Amro that discusses which algorithms mldivide uses depending on the properties of the input matrix:
How to implement Matlab's mldivide (a.k.a. the backslash operator "\")
What you should take away from this answer is that you can certainly use those regression coefficients, and they will more or less reproduce the expected output Y for each set of inputs in X. However, be warned that those coefficients are not unique. This is apparent from the fact that your ground-truth coefficients don't match the output of regress: regress simply produced another answer that satisfies the constraints you provided.
If you have an underdetermined system, more than one answer can describe the same relationship, as you have seen from the experiments above.
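A quick sketch of the minimum-norm property mentioned above, using the same X and Y as in the question; both solutions fit Y equally well, but the pinv solution has the smaller coefficient norm:

B1 = regress(Y, X);  %// basic solution, with zeros in the free coefficients
B2 = pinv(X)*Y;      %// minimum L2-norm solution
fprintf('coefficient norms: %g vs %g\n', norm(B1), norm(B2));
fprintf('fit errors: %g vs %g\n', norm(Y - X*B1), norm(Y - X*B2));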

Curve Fitting error for two arrays

I have two arrays:
x=[ 2180000; 6070000; 12700000; 16600000; 25000000; 29400000; 32200000; 36100000; 41600000; 43900000; 47200000; 55000000 ]
y = 1.0e-07*[ 0.0953; 0.1060; 0.1220; 0.1290; 0.1480; 0.1580; 0.1880; 0.2510; 0.3210; 0.3510; 0.4470; 0.6090 ]
I used this custom equation:
y = a*exp(-b/(4.14e-21)) * exp((3.2e-9*x)/(2*1.38e-23*300))
The problem that I have is:
NaN computed by model function, fitting cannot continue. Try using or tightening upper and lower bounds on coefficients.
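A quick magnitude check with the x values above suggests where the NaN can come from (a sketch, assuming double precision and coefficients of order 1):

x = 2180000;                          %// smallest x in the data
e2 = (3.2e-9*x)/(2*1.38e-23*300)      %// about 8.4e17, so exp(e2) = Inf
%// Meanwhile exp(-b/(4.14e-21)) underflows to 0 for any b of order 1,
%// and Inf*0 = NaN, which is what the fit reports. Rescaling the constants
%// (or fitting a single combined exponent) keeps the exponents in range.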