mgcv: Difference between s(x, by=cat) and s(cat, bs='re')

What is the difference between adding a by= parameter to a smooth and adding a random effect smooth?
I've tried both, and get different results. E.g.:
library(mgcv)
library(tibble)  # tibble() is not provided by mgcv, so load it explicitly
set.seed(26)
gam.df <- tibble(y = rnorm(400),
                 x1 = rnorm(400),
                 cat = factor(rep(1:4, each = 100)))
gam0 <- gam(y ~ s(x1, by=cat), data=gam.df)
summary(gam0)
produces:
15:15:39> summary(gam0)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1, by = cat)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001275 0.049087 -0.026 0.979
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1):cat1 1 1 7.437 0.00667 **
s(x1):cat2 1 1 0.047 0.82935
s(x1):cat3 1 1 0.393 0.53099
s(x1):cat4 1 1 0.019 0.89015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.00968 Deviance explained = 1.96%
GCV = 0.97413 Scale est. = 0.96195 n = 400
On the other hand:
gam1 <- gam(y ~ s(x1) + s(cat, bs='re'), data=gam.df)
summary(gam1)
produces:
15:16:33> summary(gam1)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1) + s(cat, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0001211 0.0572271 0.002 0.998
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1) 1.0000 1 2.359 0.125
s(cat) 0.7883 3 0.356 0.256
R-sq.(adj) = 0.00594 Deviance explained = 1.04%
GCV = 0.97236 Scale est. = 0.96558 n = 400
I understand that by= shows the summary by each factor level, but shouldn't the overall results of the model such as R^2 be the same?

The factor by model, gam0, contains a separate smooth of x1 for each level of cat, but doesn't include anything specifically for the means of y in each group[*], which makes it misspecified. Compare this with gam1, which has a single smooth of x1 plus group means for the levels of cat.
Even though you generated random data without any smooth or group-level effects, gam0 is potentially a much more complex and flexible model, as it contains 4 separate smooths, each using potentially 9 degrees of freedom. Your gam1 has a single smooth of x1 which uses up to 9 degrees of freedom, plus somewhere between 0 and 4 degrees of freedom for the random effect smooth. gam0 is simply exploiting random variation in the data that can be explained a little by those extra potential degrees of freedom. You can see this in the adjusted R-squared (R-sq.(adj)), which is lower for gam0 even though it explains roughly twice the deviance that gam1 does (not that either explains a meaningful amount of deviance).
r$> library("gratia")
r$> smooths(gam0)
[1] "s(x1):cat1" "s(x1):cat2" "s(x1):cat3" "s(x1):cat4"
r$> smooths(gam1)
[1] "s(x1)" "s(cat)"
[*] Note that your by model should be
gam0 <- gam(y ~ cat + s(x1, by=cat), data=gam.df)
because the smooths created by s(x1, by=cat) are subject to an identifiability constraint (as there's already a constant term, the intercept, in the model). This is a sum-to-zero constraint, which means the individual smooths do not contain the group means. As a result, the smooths are forced not only to model the way y changes as a function of x1 in each group but also to account for the magnitude of y in the respective groups, yet they have no functions in the span of the basis that could represent such constant (magnitude) effects.
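As a quick check (a sketch, not part of the original answer, reusing the simulated gam.df from the question), you could refit both specifications with method = "ML" so their likelihoods are comparable, and contrast them:
library(mgcv)
## factor-by model with the parametric factor term included
gam0b <- gam(y ~ cat + s(x1, by = cat), data = gam.df, method = "ML")
## single common smooth plus a random-intercept smooth for cat
gam1b <- gam(y ~ s(x1) + s(cat, bs = "re"), data = gam.df, method = "ML")
AIC(gam0b, gam1b)  # with pure-noise data, neither model should win by much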

Related

How to run an exponential decay mixed model?

I am not familiar with nonlinear regression and would appreciate some help with running an exponential decay model in R. Please see the graph below for what the data look like. My hunch is that an exponential model might be a good choice. I have one fixed effect and one random effect: y ~ x + (1|random factor). How do I get the starting values for the exponential model in R (please assume that I know nothing about nonlinear regression)? How do I subsequently run a nonlinear model with these starting values? Could anyone please help me with the logic as well as the R code?
As I am not familiar with nonlinear regression, I haven't been able to attempt it in R.
[figure: raw plot of the data]
The correct syntax will depend on your experimental design and model but I hope to give you a general idea on how to get started.
We begin by generating some data that should match the type of data you are working with. You had mentioned a fixed factor and a random one. Here, the fixed factor is represented by the variable treatment and the random factor is represented by the variable grouping_factor.
library(nlraa)
library(nlme)
library(ggplot2)
## Setting this seed should allow you to reach the same result as me
set.seed(3232333)
example_data <- expand.grid(treatment = c("A", "B"),
grouping_factor = c('1', '2', '3'),
replication = c(1, 2, 3),
xvar = 1:15)
The next step is to create some "observations". Here, we use an exponential function, y = a * exp(c * x), plus some random noise to create the data. Also, we add a constant to treatment A just to create some treatment differences.
example_data$y <- ave(example_data$xvar, example_data[, c('treatment', 'replication', 'grouping_factor')],
FUN = function(x) {expf(x = x,
a = 10,
c = -0.3) + rnorm(1, 0, 0.6)})
example_data$y[example_data$treatment == 'A'] <- example_data$y[example_data$treatment == 'A'] + 0.8
All right, now we start fitting the model.
## Create a grouped data frame
exampleG <- groupedData(y ~ xvar|grouping_factor, data = example_data)
## Fit a separate nonlinear model to each level of the grouping factor
fitL <- nlsList(y ~ SSexpf(xvar, a, c), data = exampleG)
## Promote the list of fits to a nonlinear mixed-effects model
## (random effects for a and c at the grouping_factor level)
fit1 <- nlme(fitL)
## Grab the fixed-effect coefficients of the general model
fxf <- fixed.effects(fit1)
## Add treatment as a fixed effect. Also, use the coefficients from the previous
## model as starting values.
fit2 <- update(fit1, fixed = a + c ~ treatment,
               start = c(fxf[1], 0,
                         fxf[2], 0))
Looking at the model output (summary(fit2)) will give you information like the following:
Nonlinear mixed-effects model fit by maximum likelihood
Model: y ~ SSexpf(xvar, a, c)
Data: exampleG
AIC BIC logLik
475.8632 504.6506 -229.9316
Random effects:
Formula: list(a ~ 1, c ~ 1)
Level: grouping_factor
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
a.(Intercept) 3.254827e-04 a.(In)
c.(Intercept) 1.248580e-06 0
Residual 5.670317e-01
Fixed effects: a + c ~ treatment
Value Std.Error DF t-value p-value
a.(Intercept) 9.634383 0.2189967 264 43.99329 0.0000
a.treatmentB 0.353342 0.3621573 264 0.97566 0.3301
c.(Intercept) -0.204848 0.0060642 264 -33.77976 0.0000
c.treatmentB -0.092138 0.0120463 264 -7.64867 0.0000
Correlation:
a.(In) a.trtB c.(In)
a.treatmentB -0.605
c.(Intercept) -0.785 0.475
c.treatmentB 0.395 -0.792 -0.503
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.93208903 -0.34340037 0.04767133 0.78924247 1.95516431
Number of Observations: 270
Number of Groups: 3
Then, if you wanted to visualize the model fit, you could do the following.
## Here we store the model predictions for visualization purposes
predictionsDf <- cbind(example_data,
predict_nlme(fit2, interval = 'conf'))
## Here we make a graph to check it out
ggplot()+
geom_ribbon(data = predictionsDf,
aes( x = xvar , ymin = Q2.5, ymax = Q97.5, fill = treatment),
color = NA, alpha = 0.3)+
geom_point(data = example_data, aes( x = xvar, y = y, col = treatment))+
geom_line(data = predictionsDf, aes(x = xvar, y = Estimate, col = treatment), size = 1.1)
This shows the model fit.

linear combination of curves to match a single curve with integer constraints

I have a set of vectors (curves) which I would like to match to a single curve. The issue isn't only finding a linear combination of the set of curves that most closely matches the single curve (this can be done with least squares, Ax = B). I also need to be able to add constraints, for example limiting the number of curves used in the fitting to a particular number, or requiring that the curves used lie next to each other. These constraints would be found in mixed-integer linear programming optimization.
I have started by using lsqlin, which allows constraints, and have been able to limit the variables to be > 0.0, but I am at a loss as to how to add further constraints. Is there a way to add integer constraints to least squares, or alternatively is there a way to solve this with a MILP?
Any help in the right direction is much appreciated!
Edit: Based on the suggestion by ErwinKalvelagen I am attempting to use CPLEX and its quadratic solvers, however until now I have not managed to get it working. I have created a minimal 'not working' example and have uploaded the data here and the code below. The issue is that MATLAB's LS solver lsqlin is able to solve the problem, however CPLEX's cplexlsqnonneglin returns "CPLEX Error 5002: %s is not convex" for the same problem.
function [ ] = minWorkingLSexample( )
%MINWORKINGLSEXAMPLE for LS with matlab and CPLEX
%matlab is able to solve the least squares, CPLEX returns error:
% Error using cplexlsqnonneglin
% CPLEX Error 5002: %s is not convex.
%
%
% Error in Backscatter_Transform_excel2_readMut_LINPROG_CPLEX (line 203)
% cplexlsqnonneglin (C,d);
%
load('C_n_d_2.mat')
lb = zeros(size(C,2),1);
options = optimoptions('lsqlin','Algorithm','trust-region-reflective');
[fact2,resnorm,residual,exitflag,output] = ...
lsqlin(C,d,[],[],[],[],lb,[],[],options);
%% CPLEX
ctype = cellstr(repmat('C',1,size(C,2)));
options = cplexoptimset;
options.Display = 'on';
[fact3, resnorm, residual, exitflag, output] = ...
cplexlsqnonneglin (C,d);
end
I could reproduce the Cplex problem. Here is a workaround. Instead of solving the first model, use a model that is less nonlinear:
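Concretely (a reconstruction from the description below, not quoted from the original), the first model puts the full least-squares objective in the quadratic term, while the second introduces explicit residual variables r:

Model 1:  \min_{x \ge 0} \; (Cx - d)^\top (Cx - d)

Model 2:  \min_{x \ge 0,\, r} \; r^\top r \quad \text{subject to} \quad r = Cx - d

In the second form the quadratic objective involves only r, so its Q matrix is simply a (multiple of the) identity, i.e. diagonal.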
The second model solves fine with Cplex. The problem is somewhat of a tolerance/numeric issue. For the second model we have a much more well-behaved Q matrix (a diagonal). Essentially we moved some of the complexity from the objective into linear constraints.
You should now see something like:
Tried aggregator 1 time.
QP Presolve eliminated 1 rows and 1 columns.
Reduced QP has 401 rows, 443 columns, and 17201 nonzeros.
Reduced QP objective Q matrix has 401 nonzeros.
Presolve time = 0.02 sec. (1.21 ticks)
Parallel mode: using up to 8 threads for barrier.
Number of nonzeros in lower triangle of A*A' = 80200
Using Approximate Minimum Degree ordering
Total time for automatic ordering = 0.00 sec. (3.57 ticks)
Summary statistics for Cholesky factor:
Threads = 8
Rows in Factor = 401
Integer space required = 401
Total non-zeros in factor = 80601
Total FP ops to factor = 21574201
Itn Primal Obj Dual Obj Prim Inf Upper Inf Dual Inf
0 3.3391791e-01 -3.3391791e-01 9.70e+03 0.00e+00 4.20e+04
1 9.6533667e+02 -3.0509942e+03 1.21e-12 0.00e+00 1.71e-11
2 6.4361775e+01 -3.6729243e+02 3.08e-13 0.00e+00 1.71e-11
3 2.2399862e+01 -6.8231454e+01 1.14e-13 0.00e+00 3.75e-12
4 6.8012056e+00 -2.0011575e+01 2.45e-13 0.00e+00 1.04e-12
5 3.3548410e+00 -1.9547176e+00 1.18e-13 0.00e+00 3.55e-13
6 1.9866256e+00 6.0981384e-01 5.55e-13 0.00e+00 1.86e-13
7 1.4271894e+00 1.0119284e+00 2.82e-12 0.00e+00 1.15e-13
8 1.1434804e+00 1.1081026e+00 6.93e-12 0.00e+00 1.09e-13
9 1.1163905e+00 1.1149752e+00 5.89e-12 0.00e+00 1.14e-13
10 1.1153877e+00 1.1153509e+00 2.52e-11 0.00e+00 9.71e-14
11 1.1153611e+00 1.1153602e+00 2.10e-11 0.00e+00 8.69e-14
12 1.1153604e+00 1.1153604e+00 1.10e-11 0.00e+00 8.96e-14
Barrier time = 0.17 sec. (38.31 ticks)
Total time on 8 threads = 0.17 sec. (38.31 ticks)
QP status(1): optimal
Cplex Time: 0.17sec (det. 38.31 ticks)
Optimal solution found.
Objective : 1.115360
See here for some details.
Update: the same reformulation can also be set up directly in MATLAB.

How to find and label the most frequent values with a tiny variance in MATLAB?

I have a vector d:
d = [
1.19011941712580e-06
6.39136179286748e-06
1.26442316296575e-05
1.81039120389278e-05
1.91304903300688e-05
2.19912290910362e-05
2.94113112667430e-05
3.42238417249065e-05
4.14201181268186e-05
5.76014376298924e-05
6.81337071520188e-05
0.000108396864465101
0.000130922201344182
0.000145712942644687
0.000174386494384153
0.000262758083529471
0.0003050975943883
0.000373066486719321
0.000423949134658855
0.000489079623696380
0.000548432526451254
0.000694787830192734
0.000881370593483890
0.00125516689720339
0.00145237435686831
0.00815957230852142
0.0210146005799470
0.0507995676939279
0.0541594307796186
1
]
Plotting d:
plot(d, 'x:')
In this situation, [M, F] = mode(d) gives a result that I didn't want.
Is there any function that counts the most frequent values which takes a sort of tolerance into account?
Clustering could be considered; however, in the figure above, clustering may assign d(27:29) to the left-side cluster.
My current approach is normalizing and thresholding:
d_norm = d / max(d);
v = d_norm(d_norm < 0.01); % 1 percent threshold
However, I think this is somewhat hard-coded and not a good approach.
histcounts is your friend!
Your "threshold" can easily be translated into histogram bins if you know your range. In your case the range is 0 to 1, so if you choose a threshold of 0.01 then 100 is your bin count.
counts=histcounts(d,100)
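As a small worked step (assuming, as here, the values span roughly the interval from 0 to 1):

n_{\text{bins}} = \frac{\text{range}}{\text{tolerance}} = \frac{1 - 0}{0.01} = 100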

MATLAB: How can I create autocorrelated data?

I'm looking to create a vector of autocorrelated data points in MATLAB, with the autocorrelation at lag 1 higher than at lag 2, and so on.
If I look at the lag-1 data pairs (1, 2), (3, 4), (5, 6), ..., the correlation is relatively high, but at lag 2 it's reduced.
I found a way to do this in R:
x <- filter(rnorm(1000), filter=rep(1,3), circular=TRUE)
However, I'm not sure how to do the same thing in MATLAB. Ideally I'd like to be able to fine tune exactly how autocorrelated the data is.
Math:
A group of standard models for autocorrelation in stationary time series is the so-called "autoregressive" models; e.g. an autoregressive model with one term is known as an AR(1) and is:
y_t = a + b*y_{t-1} + e_t
AR(1) sounds simplistic, but it turns out it's quite a powerful tool. E.g. an AR(p) with p autoregressive terms is actually an AR(1) on a p-dimensional vector (check the Wikipedia page), as sketched below. Note also that b = 1 gives a non-stationary random walk.
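As a sketch of that claim (a standard identity, not specific to this answer): the scalar AR(p)

y_t = a + b_1 y_{t-1} + \dots + b_p y_{t-p} + e_t

becomes a first-order recursion in the stacked vector of the last p values, with a companion matrix:

\begin{pmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{pmatrix}
=
\begin{pmatrix} a \\ 0 \\ \vdots \\ 0 \end{pmatrix}
+
\begin{pmatrix}
b_1 & b_2 & \cdots & b_{p-1} & b_p \\
1   & 0   & \cdots & 0       & 0   \\
0   & 1   & \cdots & 0       & 0   \\
\vdots &   & \ddots &        & \vdots \\
0   & 0   & \cdots & 1       & 0
\end{pmatrix}
\begin{pmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{pmatrix}
+
\begin{pmatrix} e_t \\ 0 \\ \vdots \\ 0 \end{pmatrix}

so the AR(p) is an AR(1) in the p-dimensional vector Y_t = (y_t, \dots, y_{t-p+1})'.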
A more intuitive way to write what's going on (in the stationary case, with |b| < 1) is to define u = a / (1 - b) (it turns out u is the unconditional mean of the AR(1)); then, with some algebra:
y_t - u = b * ( y_{t-1} - u) + e_t
That is, the difference from the unconditional mean u gets hit with a decay term b and then a shock term e_t gets added. (You want -1 < b < 1 for stationarity.)
Code:
Since e_t denotes the shock term, this is super easy to simulate. E.g. to simulate an AR(1):
a = 0; b = .4; sigma = 1; T = 1000;
y0 = a / (1 - b); %eg initialize to unconditional mean of stationary time series
y = zeros(T,1);
y(1) = a + b * y0 + randn() * sigma;
for t = 2:T
y(t) = a + b * y(t-1) + randn() * sigma;
end
This code isn't meant to be fast, but illustrative. An AR(1) model implies a certain type of correlation structure, but by adding AR or MA terms you can fit some pretty funky stuff (MA stands for moving average model).
You can test the sample autocorrelation with autocorr(y). For reference, the bible on time series mathematics is Hamilton's book Time Series Analysis.

How to estimate goodness-of-fit using scipy.odr?

I am fitting data with weights using scipy.odr but I don't know how to obtain a measure of goodness-of-fit or an R squared. Does anyone have suggestions for how to obtain this measure using the output stored by the function?
The res_var attribute of the Output is the so-called reduced Chi-square value for the fit, a popular choice of goodness-of-fit statistic. It is somewhat problematic for non-linear fitting, though. You can look at the residuals directly (out.delta for the X residuals and out.eps for the Y residuals). Implementing a cross-validation or bootstrap method for determining goodness-of-fit, as suggested in the linked paper, is left as an exercise for the reader.
The output of ODR gives both the estimated parameters beta as well as the standard deviation of those parameters sd_beta. Following p. 76 of the ODRPACK documentation, you can convert these values into a t-statistic with (beta - beta_0) / sd_beta, where beta_0 is the number that you're testing significance with respect to (often zero). From there, you can use the t-distribution to get the p-value.
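Written as formulas (with n observations and p fitted parameters, hence n - p degrees of freedom):

t = \frac{\hat{\beta} - \beta_0}{\operatorname{SE}(\hat{\beta})}, \qquad p\text{-value} = 2 \, \Pr\!\left( T_{n-p} \ge |t| \right)

which is exactly what the stats.t.sf(...) * 2 line in the example below computes.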
Here's a working example:
import numpy as np
from scipy import stats, odr
def linear_func(B, x):
    """
    From https://docs.scipy.org/doc/scipy/reference/odr.html
    Linear function y = m*x + b
    """
    # B is a vector of the parameters.
    # x is an array of the current x values.
    # x is in the same format as the x passed to Data or RealData.
    #
    # Return an array in the same format as y passed to Data or RealData.
    return B[0] * x + B[1]
np.random.seed(0)
sigma_x = .1
sigma_y = .15
N = 100
x_star = np.linspace(0, 10, N)
x = np.random.normal(x_star, sigma_x, N)
# the true underlying function is y = 2*x_star + 1
y = np.random.normal(2*x_star + 1, sigma_y, N)
linear = odr.Model(linear_func)
dat = odr.Data(x, y, wd=1./sigma_x**2, we=1./sigma_y**2)
this_odr = odr.ODR(dat, linear, beta0=[1., 0.])
odr_out = this_odr.run()
# degrees of freedom are n_samples - n_parameters
df = N - 2 # equivalently, df = odr_out.iwork[10]
beta_0 = 0 # test if slope is significantly different from zero
t_stat = (odr_out.beta[0] - beta_0) / odr_out.sd_beta[0] # t statistic for the slope parameter
p_val = stats.t.sf(np.abs(t_stat), df) * 2
print('Recovered equation: y={:3.2f}x + {:3.2f}, t={:3.2f}, p={:.2e}'.format(odr_out.beta[0], odr_out.beta[1], t_stat, p_val))
Recovered equation: y=2.00x + 1.01, t=239.63, p=1.76e-137
One note of caution in using this approach on nonlinear problems, from the same ODRPACK docs:
"Note that for nonlinear ordinary least squares, the linearized confidence regions and intervals are asymptotically correct as n → ∞ [Jennrich, 1969]. For the orthogonal distance regression problem, they have been shown to be asymptotically correct as σ∗ → 0 [Fuller, 1987]. The difference between the conditions of asymptotic correctness can be explained by the fact that, as the number of observations increases in the orthogonal distance regression problem one does not obtain additional information for ∆. Note also that Vˆ is dependent upon the weight matrix Ω, which must be assumed to be correct, and cannot be confirmed from the orthogonal distance regression results. Errors in the values of wǫi and wδi that form Ω will have an adverse affect on the accuracy of Vˆ and its component parts. The results of a Monte Carlo experiment examining the accuracy
of the linearized confidence intervals for four different measurement error models is presented in [Boggs and Rogers, 1990b]. Those results indicate that the confidence regions and intervals for ∆ are not as accurate as those for β.
Despite its potential inaccuracy, the covariance matrix is frequently used to construct confidence regions and intervals for both nonlinear ordinary least squares and measurement error models because the resulting regions and intervals are inexpensive to compute, often adequate, and familiar to practitioners. Caution must be exercised when using such regions and intervals, however, since the validity of the approximation will depend on the nonlinearity of the model, the variance and distribution of the errors, and the data itself. When more reliable intervals and regions are required, other more accurate methods should be used. (See, e.g., [Bates and Watts, 1988], [Donaldson and Schnabel, 1987], and [Efron, 1985].)"
As mentioned by R. Ken, chi-square or variance of the residuals is one of the more commonly used tests of goodness of fit. ODR stores the sum of squared residuals in out.sum_square, and you can verify yourself that out.res_var = out.sum_square/degrees_freedom corresponds to what is commonly called reduced chi-square: i.e. the chi-square test result divided by its expected value.
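Written out (with n data points and p fitted parameters):

\chi^2_{\mathrm{red}} = \frac{\chi^2}{\nu}, \qquad \nu = n - p,

where \chi^2 is the weighted sum of squared residuals reported in out.sum_square and \nu is the number of degrees of freedom.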
As for the other very popular estimator of goodness of fit in linear regression, R squared and its adjusted version, we can define the functions
import numpy as np

def R_squared(observed, predicted, uncertainty=1):
    """ Returns R square measure of goodness of fit for predicted model. """
    weight = 1./uncertainty
    return 1. - (np.var((observed - predicted)*weight) / np.var(observed*weight))

def adjusted_R(x, y, model, popt, unc=1):
    """
    Returns adjusted R squared test for optimal parameters popt calculated
    according to W-MN formula, other forms have different coefficients:
    Wherry/McNemar : (n - 1)/(n - p - 1)
    Wherry : (n - 1)/(n - p)
    Lord : (n + p - 1)/(n - p - 1)
    Stein : (n - 1)/(n - p - 1) * (n - 2)/(n - p - 2) * (n + 1)/n
    """
    # Assuming you have a model with ODR argument order f(beta, x)
    # otherwise if model is of the form f(x, a, b, c..) you could use
    # R = R_squared(y, model(x, *popt), uncertainty=unc)
    R = R_squared(y, model(popt, x), uncertainty=unc)
    n, p = len(y), len(popt)
    coefficient = (n - 1)/(n - p - 1)
    adj = 1 - (1 - R) * coefficient
    return adj, R
From the output of your ODR run you can find the optimal values for your model's parameters in out.beta and at this point we have everything we need for computing R squared.
from scipy import odr

def lin_model(beta, x):
    """
    Linear function y = m*x + q
    slope m, constant term/y-intercept q
    """
    return beta[0] * x + beta[1]

linear = odr.Model(lin_model)
data = odr.RealData(x, y, sx=sigma_x, sy=sigma_y)
init = odr.ODR(data, linear, beta0=[1, 1])
out = init.run()

adjusted_Rsq, Rsq = adjusted_R(x, y, lin_model, popt=out.beta)