Temporal autocorrelation and perfect fit in GLMM

I am having some trouble with temporal autocorrelation and how to implement it in a GLMM.
In the glmmTMB covariance-structures vignette (https://cran.r-project.org/web/packages/glmmTMB/vignettes/covstruct.html), an autoregressive process (AR1) is used to fit a time series with one point per time step.
However, when I do that I get a surprisingly high R², with simulated data and with empirical data. For example:
library(glmmTMB)
library(MASS)
library(data.table)
myfile <- "https://raw.githubusercontent.com/f-duchenne/data_for_test/master/data_for_test.txt"
dat <- fread(myfile)
dat$Annee <- numFactor(dat$Annee)  # numeric-factor encoding of year, as required by ar1()
model <- glmmTMB(pheno_derivs2 ~ ratio + urban + ar1(Annee + 0 | species), data = dat)
MuMIn::r.squaredGLMM(model)
#              R2m R2c
# [1,] 0.002107069   1
1 - var(residuals(model)) / var(dat$pheno_derivs2)
# [1] 1
cor(dat$pheno_derivs2, predict(model))^2
# [1] 1
How is it possible that the model fits the data perfectly while the fixed effects are far from overfitting?
Is it normal that including the temporal autocorrelation process gives such an R² and an almost perfect fit? (It is largely due to the random part; the fixed part often explains only a small share of the variance in my data.) Is the model still interpretable?
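One quick way to see where the fit comes from is to compare conditional and population-level predictions. This is a diagnostic sketch added here for illustration (it assumes a glmmTMB version whose predict() accepts the re.form argument), not part of the original question:
pred_cond <- predict(model)                # conditional: includes ar1(Annee + 0 | species)
pred_pop  <- predict(model, re.form = NA)  # population-level: fixed effects only
cor(dat$pheno_derivs2, pred_cond)^2        # near 1, as above
cor(dat$pheno_derivs2, pred_pop)^2         # small, comparable to R2m
If the second value is tiny while the first is near 1, the near-perfect fit is indeed coming from the AR1 random part.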

Related

Why do the results from the joint_tests function (emmeans package) not show one of the interactions of the model?

I fit a model with GLMMadaptive (I am building a resource selection function) and I am using the joint_tests function (emmeans package) to compute joint tests of the terms in the model. The problem is that one of the interactions does not appear in the results.
The model is:
mod.hinc <- mixed_model(fixed = Used ~ scale(ndvi) * season * vegfactor +
                            scale(ndvi^2) + scale(distance^2) + scale(distance) * season,
                        random = ~ 1 | id, data = hin.c,
                        family = binomial(link = "logit"))
After running the model I run the joint_tests function:
install.packages("emmeans")
library(emmeans)
joint_tests(mod.hinc)
And this is the result:
joint_tests(mod.hinc)
 model term             df1 df2  F.ratio p.value
 ndvi                     1 Inf   36.465  <.0001
 season                   3 Inf   22.265  <.0001
 vegfactor                4 Inf    4.548  0.0011
 distance                 1 Inf   33.939  <.0001
 ndvi:season              3 Inf   13.826  <.0001
 ndvi:vegfactor           4 Inf    8.500  <.0001
 season:vegfactor        12 Inf    6.544  <.0001
 ndvi:season:vegfactor   12 Inf    5.165  <.0001
I cannot find the reason why the interaction scale(distance)*season does not appear in the results.
Any help on this issue is welcome. I can provide more details about the model if required.
Thank you very much in advance.
Juan
The short answer is that distance:season is not shown because it came up with zero d.f. for the associated interaction contrasts. You could verify this by running joint_tests(mod.hinc, show0df = TRUE).
Why it has 0 d.f. is less clear. However, that is not the only problem here. You have to be extremely careful with numeric predictors when using joint_tests(); it does not do a model ANOVA; instead, as documented, it constructs a reference grid from the fitted model and performs joint tests of interaction contrasts related to the predictors. With numeric predictors, the results depend on the reference grid used.
In this particular instance, the model includes quadratic effects of ndvi and distance; however, the default reference grid is constructed using the range of the covariates -- only two distinct values. Thus, we can pick up the effects of the overall linear trends, but not the curvature effects implied by the quadratic terms. That's why only 1 d.f. is tested for each of those covariates' main effects, when there are really 2 d.f. in the effects of ndvi and distance. In order to capture all of those effects, we need at least three distinct values of these covariates in the reference grid. One way (not the only way) to accomplish that is to reduce the covariates to their means, plus or minus 1 SD -- which can be accomplished via this code:
meanpm1sd <- function(x)
    c(mean(x) - sd(x), mean(x), mean(x) + sd(x))
joint_tests(mod.hinc, cov.reduce = meanpm1sd)
This will yield a different set of joint tests that likely will include 2-d.f. tests of ndvi and distance. But I don't know if you will still have some interactions missing due to zero-d.f. dimensionalities.
You can look directly at the estimates being tested in detail if you have any questions about what those effects are. For example, for season:distance,
### construct the needed reference grid once and for all
RG <- ref_grid(mod.hinc, cov.reduce = meanpm1sd)
EMM <- emmeans(RG, ~ season * distance)
CON <- contrast(EMM, interaction = "consec")
EMM ### see estimates
CON ### see interaction contrasts
test(CON, joint = TRUE)
I hope this helps shed some light on what is going on.

Why is the confidence interval not consistent with the standard errors in this regression?

I am running a linear regression with fixed effects and with standard errors clustered by group.
areg ref1 ref1_l1 rf1 ew1 vol_ew1 sk_ew1, a(us_id) vce(cluster us_id)
The one-line command is as above; the output (a screenshot in the original post, not reproduced here) is what raises the question.
The t statistics and the p-values look inconsistent: how can we have a t statistic > 5 and a p-value > 11%? Similarly, the 95% confidence intervals appear to be far wider than coefficient ± 2 standard errors.
What am I missing?
There is nothing inconsistent here. You have a small sample size and a less than parsimonious model and have all but run out of degrees of freedom. Notice how areg won't post an F statistic or a P-value for the model, a strong danger sign. Your t statistics are consistent with checks by hand:
. display 2 * ttail(1, 5.54)
.11368912
. display 2 * ttail(1, 113.1)
.00562868
In short, there is no bug here and no programming issue. It's just a matter of your model over-fitting your data and the side-effects of that.
Similarly, +/- 2 SE for a 95% confidence interval is way off as a rule of thumb here. Again, a hand calculation is instructive:
. display invt(1, 0.975)
12.706205
. display invt(60, 0.975)
2.0002978
. display invt(61, 0.975)
1.9996236
. display invnormal(0.975)
1.959964
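For readers who want to reproduce the check outside Stata, the same quantities come out of R (this cross-check is mine, not part of the original answer):
2 * pt(5.54, df = 1, lower.tail = FALSE)  # 0.1136891, matches ttail(1, 5.54)
qt(0.975, df = 1)                         # 12.706205, matches invt(1, 0.975)
qt(0.975, df = 60)                        # 2.0002978
qnorm(0.975)                              # 1.959964
With 1 residual degree of freedom the t distribution is Cauchy, so even a t statistic above 5 is unremarkable, and the 97.5% critical value balloons to about 12.7.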

Understanding and coding the zero-lag cross-correlation in MATLAB

First of all, I am sorry if I am a dummy and can't understand this part of an article. I have a data set with 200 channels in which specific pairs of channels are co-dependent. In the paper it is mentioned:
"
For each channel, we filtered both signals
between 0.5 and 2.5 Hz to preserve only the cardiac component and
normalized the resulting signals to balance any difference between
their amplitude.
"
Question 1: does this mean I need to normalize both co-dependent channels to the average of their medians, or just normalize each signal to its own median?
Here is the rest of the paragraph
"
Then, we computed the cross-correlation
extracted the value at a time lag of 0 to quantify the similarity
between the filtered signals. In-phase and counter-phase identical
waveforms yielded a zero-lag cross-correlation value of 1 and -1
respectively, whereas a null value derived from totally uncorrelated
signals. "
I wrote the code below, but I get -1 or +1 everywhere; even for signals that are not co-dependent, it gives me 1 or -1. I guess I am wrong in part of the code, but I cannot see where. Here is the code:
datafile = 'data_sess_03.nirs';
ch_num = 1;
[w,src,det,mlOrg,mlo,mlm,Data,datap,acc1,acc2] = readData(datafile);
fc = [0.5 2.5];                              % band-pass limits in Hz (cardiac component)
dataf = filterData(Data,fc);
[c,lags] = xcorr(dataf(1,:),dataf(5,:),0);   % channels 1 and 5 are co-dependent
% c is -1 or +1 everywhere, even in the noisy channels
plot(c,'black')
[~,I] = max(abs(c));
lagDiff = lags(I)/fs                         % fs = sampling rate, defined elsewhere
Any help will be really appreciated. Thanks a lot for helping me.

MSE in neuralnet results and ROC curve of the results

Hi, my question is a bit long; please bear with me and read it to the end.
I am working on a project with 30 participants. We have two data sets (the first has 30 rows and 160 columns; the second has the same 30 rows and 200 columns as outputs = y, and these outputs are independent). What I want to do is use the first data set to predict the outputs in the second data set. As the first data set was wide and high-dimensional, I used factor analysis and now have 19 factors that cover up to 98% of the variance. Now I want to use these 19 factors to predict the outputs of the second data set.
I am using neuralnet with backpropagation, and everything goes well; my predictions are really close to the outputs.
My questions :
1- As my inputs are the factors (they are between -1 and 1) and my outputs are integers on a scale from 4 to 10000, should I still scale them before running the neural network?
2- I scaled the data (both inputs and outputs) and then predicted with neuralnet; when I checked the MSE it was very high, around 6000, even though my predictions and the real outputs are very close to each other. But if I rescale the predictions and outputs back and then check the MSE, it is near zero. Is it unbiased to rescale and then check the MSE?
3- I read that it is better not to scale the outputs from the beginning, but if I scale only the inputs, all my predictions are 1. Is it correct not to scale the outputs?
4- If I want to plot the ROC curve, how can I do it, given that my predictions are never exactly equal to the real outputs?
Thank you for reading my question
[edit#1]: There is a publication on how to produce ROC curves using neural network results
http://www.lcc.uma.es/~jja/recidiva/048.pdf
1) You can scale your values (using min-max scaling, for example), but fit the scaling on your training data set only. Save the parameters used in the scaling process (for min-max scaling, the min and max values by which the data is scaled). Only then can you scale your test data set WITH the min and max values you got from the training data set. Remember, with the test data set you are trying to mimic the process of classifying unseen data, and unseen data is scaled with the scaling parameters from the training data set.
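A minimal sketch of that workflow in R (the function and variable names here are mine, not from the neuralnet package; train_y and test_y are assumed numeric vectors):
minmax_fit   <- function(x) list(min = min(x), max = max(x))
minmax_apply <- function(x, p) (x - p$min) / (p$max - p$min)
p       <- minmax_fit(train_y)        # parameters come from the training data only
train_s <- minmax_apply(train_y, p)   # scaled to [0, 1]
test_s  <- minmax_apply(test_y, p)    # test data scaled with the TRAINING parameters
Predictions made on the scaled targets can be mapped back with the inverse transform, p$min + pred * (p$max - p$min).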
2) When talking about errors, do mention which data set the error was computed on. You can compute an error function (there are several; one of them is the mean squared error, or MSE) on the training data set, and another on your test data set.
4) Think about this: let's say you train a network with the training data set, and it has only 1 neuron in the output layer. Then you present it with the test data set. Depending on which transfer function (activation function) you use in the output layer, you will get a value for each exemplar. Let's assume you use a sigmoid transfer function, where the max and min values are 1 and 0. That means the predictions will be limited to values between 1 and 0.
Let's also say that your target labels ("truth") only contains discrete values of 0 and 1 (indicating which class the exemplar belongs to).
targetLabels=[0 1 0 0 0 1 0 ];
NNprediction=[0.2 0.8 0.1 0.3 0.4 0.7 0.2];
How do you interpret this?
You can apply a hard-limiting function so that the NNprediction vector only contains the discrete values 0 and 1. Let's say you use a threshold of 0.5:
NNprediction_thresh_0.5 = [0 1 0 0 0 1 0];
vs.
targetLabels =[0 1 0 0 0 1 0];
With this information you can compute your False Positives, FN, TP, and TN (and a bunch of additional derived metrics such as True Positive Rate = TP/(TP+FN) ).
If you had a ROC curve showing the False Positive Rate vs. True Positive Rate, this would be a single point in the plot. However, if you vary the threshold in the hard-limit function, you can get all the values you need for a complete curve.
Makes sense? See the dependencies of one process on the others?
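As a concrete sketch of that threshold sweep in R (the code is mine; the grid of thresholds is arbitrary):
targetLabels <- c(0, 1, 0, 0, 0, 1, 0)
NNprediction <- c(0.2, 0.8, 0.1, 0.3, 0.4, 0.7, 0.2)
thresholds <- seq(0, 1, by = 0.05)
roc <- t(sapply(thresholds, function(th) {
  pred <- as.integer(NNprediction >= th)   # hard-limit at this threshold
  TP <- sum(pred == 1 & targetLabels == 1)
  FP <- sum(pred == 1 & targetLabels == 0)
  FN <- sum(pred == 0 & targetLabels == 1)
  TN <- sum(pred == 0 & targetLabels == 0)
  c(FPR = FP / (FP + TN), TPR = TP / (TP + FN))
}))
plot(roc[, "FPR"], roc[, "TPR"], type = "b",
     xlab = "False Positive Rate", ylab = "True Positive Rate")
Each threshold contributes one (FPR, TPR) point; sweeping the threshold traces out the full ROC curve.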

Match Two Sets of Measurement Data With Different Logging Start Times and End Times

Problem
I have two arrays (Xa and Xb) that contain measurements of the same physical signal, but taken at different sample rates. Additionally, physical logging of the Xa data starts at a different time than that of Xb, and the logging also stops at different times.
i.e.
(The following is just a summary of important statements, not code.)
sampleRatea > sampleRateb % Resolution of Xa is greater than that of Xb
t0a ~= t0b % Start times are not equal
t1a ~= t1b % End times are not equal
Objective
Find the necessary shift in indices that will best line up these sets of data.
Approach
Use fmincon to find the index that minimizes the mean squared error (MSE) between versions of Xa and Xb edited to have the same sample rate (perhaps using an interpolation function).
I have tried to do this, but it always seems that I have too many degrees of freedom. Can anyone shed some light on an approach that might make this tractable?
Assuming you have two samples with constant frequencies, the problem reduces to something quite simple:
Find scale, location such that:
Xa, at timestamps corresponding to its index, makes the best match with Xb at timestamps corresponding to location + scale * its index.
If you agree with this, you can see that only two degrees of freedom are left; if you know the ratio of the sample rates, it even reduces to just one degree of freedom.
I believe that now the hard part is done, but some work still remains:
Judge how well two samples with timestamps and values match
Find the optimal combination of your location and scale parameters
Note that, assuming you complete these two steps properly, the solution should be optimal in terms of timestamps. As you are looking for a shift in (integer) indices, translating these timestamps back to indices may not result in the exact optimum, but it should be pretty close.
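For the known-ratio case, a one-degree-of-freedom grid search is enough. A rough sketch in R for illustration (ta/xa and tb/xb are assumed timestamp/value vectors; the search range and step are arbitrary):
score <- function(shift) {
  xb_hat <- approx(ta + shift, xa, xout = tb)$y  # resample A onto B's timestamps
  mean((xb_hat - xb)^2, na.rm = TRUE)            # MSE over the overlapping region
}
shifts <- seq(-5, 5, by = 0.01)                  # candidate shifts, in seconds
best_shift <- shifts[which.min(sapply(shifts, score))]
The cross-correlation solution below does essentially the same search, but over all integer lags at once.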
Here is a quick-and-dirty solution that should be enough to get you started. Given your input signals Xa and Xb sampled at sampleRatea and sampleRateb respectively:
g = gcd(sampleRatea,sampleRateb);
Ya = interp(Xa,sampleRateb/g);            % upsample both signals to a common rate
Yb = interp(Xb,sampleRatea/g);
Yfs = sampleRatea*sampleRateb/g;          % the common sample rate
[acor,lag] = xcorr(Ya,Yb);                % cross-correlate the resampled signals
time_shift = lag(acor == max(acor))/Yfs;  % lag of the correlation peak, in seconds
The variable time_shift will tell you the time elapsed between the start of A and the start of B. If B starts first, the result will be negative.
If your sampling rates are relatively prime, this will be horribly inefficient. If one is an integer multiple of the other, or they have a relatively large GCD, it will be much better.