Formal quotation of smooth random terms in mgcv::gam mixed model - mixed-models

I have a mgcv::gam mixed model of the form:
m1 <- gam(Y ~ A + s(B, bs = "re"), data = dataframe, family = gaussian,
method = "REML")
The random term s(B, bs = "re") is quoted in summary(m1) as, for example,
Approximate significance of smooth terms:
# edf Ref.df F p-value
s(B) 4.486 5 97.195 6.7e-08 ***
My question is, how would I quote this result (statistic and P value) in a formal document, for example a technical report or paper?
For example, one possibility is
F[4.486,5] = 97.195, P = 6.7e-08
However, arguing against this idea, “reverse engineering” of the result using
pf(q= 97.195, df1= 4.486, df2= 5, lower.tail=FALSE)
gives an incorrect p value:
[1] 5.931567e-05
I would be very grateful for your advice. Many thanks for your help!

The F statistic in question doesn't actually follow an F with the degrees of freedom you have identified. The Ref df one is related to the test, but you'd need to read and understand Wood (2013) to fully grep how the degrees of freedom for the test are derived.
I would simply quote the statistic and the p-value and then cite Simon's paper if anyone wants to know how they were computed. I don't think you can easily get at the degrees of freedom that actually get used. (well, not without debugging the summary.gam() code and seeing how they are computed.)
References
Wood, S. N. 2013. A simple test for random effects in regression models. Biometrika 100: 1005–1010. doi:10.1093/biomet/ast038

Related

why the results from the joint_tests function (emmeans package) do not show one of the interactions of the model?

I run a GLMM_adaptive model (I am doing a resource selection function) and I am using the joint_tests function (emmeans package) to compute joint tests of the terms in the model. The problem is that one of the interactions does not appear in the results.
The model is:
mod.hinc <- mixed_model(fixed = Used ~ scale(ndvi) * season * vegfactor +
scale(ndvi^2) + scale(distance^2) + scale(distance) * season,
random = ~ 1 | id, data = hin.c,
family = binomial(link="logit"))
After running the model I run the joint_tests function:
install.packages("emmeans")
library(emmeans)
joint_tests(mod.hinc)
And this is the result:
joint_tests(mod.hinc)
model term df1 df2 F.ratio p.value
ndvi 1 Inf 36.465 <.0001
season 3 Inf 22.265 <.0001
vegfactor 4 Inf 4.548 0.0011
distance 1 Inf 33.939 <.0001
ndvi:season 3 Inf 13.826 <.0001
ndvi:vegfactor 4 Inf 8.500 <.0001
season:vegfactor 12 Inf 6.544 <.0001
ndvi:season:vegfactor 12 Inf 5.165 <.0001
I cannot find the reason why the interaction scale(distance)*season does not appear in the results.
Any help on that issue is welcome. I can provide more details about the model if is required.
Thank you very much in advance.
Juan
The short answer is that distance:season is not shown because it came up with zero d.f. for the associated interaction contrasts. You could verify this by running joint_tests(mod.hinc, show0df = TRUE).
Why it has 0 d.f. is less clear. However, that is not the only problem here. You have to be extremely careful with numeric predictors when using joint_tests(); it does not do a model ANOVA; instead, as documented, it constructs a reference grid from the fitted model and performs joint tests of interaction contrasts related to the predictors. With numeric predictors, the results depend on the reference grid used.
In this particular instance, the model includes quadratic effects of ndvi and distance; however, the default reference grid is constructed using the range of the covariates -- only two distinct values. Thus, we can pick up the effects of the overall linear trends, but not the curvature effects implied by the quadratic terms. That's why only 1 d.f. of those factors' main effects are tested. There are really 2 d.f. in the effects of ndvi and distance. In order to capture all of those effects, we need to have at least three distinct values of these covariates in the reference grid. One way (not the only way) to accomplish that is to reduce the covariates to their means, plus or minus 1 SD -- which can be accomplished via this code:
meanpm1sd <- function(x)
c(mean(x) - sd(x), mean(x), mean(x) + sd(x))
joint_tests(mod.hinc, cov.reduce = meanpm1sd)
This will yield a different set of joint tests that likely will include 2-d.f. tests of ndvi and distance. But I don't know if you will still have some interactions missing due to zero-d.f. dimensionalities.
You can look directly at the estimates being tested in detail if you have any questions about what those effects are. For example, for season:distance,
### construct the needed reference grid once and for all
RG <- ref_grid(mod.h1nc, cov.reduce = meanpm1sd)
EMM <- emmeans(RG, ~ season * distance)
CON <- contrast(EMM, interaction = "consec")
EMM ### see estimates
CON ### see interaction contrasts
test(CON, joint = TRUE)
I hope this helps shed some light on what is going on.

Johansen test on two stocks (for pairs trading) yielding weird results

I hope you can help me with this one.
I am using cointegration to discover potential pairs trading opportunities within stocks and more precisely I am utilizing the Johansen trace test for only two stocks at a time.
I have several securities, but for each test I only test two at a time.
If two stocks are found to be cointegrated using the Johansen test, the idea is to define the spread as
beta' * p(t-1) - c
where beta'=[1 beta2] and p(t-1) is the (2x1) vector of the previous stock prices. Notice that I seek a normalized first coefficient of the cointegration vector. c is a constant which is allowed within the cointegration relationship.
I am using Matlab to run the tests (jcitest), but have also tried utilizing Eviews for comparison of results. The two programs yields the same.
When I run the test and find two stocks to be cointegrated, I usually get output like
beta_1 = 12.7290
beta_2 = -35.9655
c = 121.3422
Since I want a normalized first beta coefficient, I set beta1 = 1 and obtain
beta_2 = -35.9655/12.7290 = -2.8255
c =121.3422/12.7290 = 9.5327
I can then generate the spread as beta' * p(t-1) - c. When the spread gets sufficiently low, I buy 1 share of stock 1 and short beta_2 shares of stock 2 and vice versa when the spread gets high.
~~~~~~~~~~~~~~~~ The problem ~~~~~~~~~~~~~~~~~~~~~~~
Since I am testing an awful lot of stock pairs, I obtain a lot of output. Quite often, however, I receive output where the estimated beta_1 and beta_2 are of the same sign, e.g.
beta_1= -1.4
beta_2= -3.9
When I normalize these according to beta_1, I get:
beta_1 = 1
beta_2 = 2.728
The current pairs trading literature doesn't mention any cases where the betas are of the same sign - how should it be interpreted? Since this is pairs trading, I am supposed to long one stock and short the other when the spread deviates from its long run mean. However, when the betas are of the same sign, to me it seems that I should always go long/short in both at the same time? Is this the correct interpretation? Or should I modify the way in which I normalize the coefficients?
I could really use some help...
EXTRA QUESTION:
Under some of my tests, I reject both the hypothesis of r=0 cointegration relationships and r<=1 cointegration relationships. I find this very mysterious, as I am only considering two variables at a time, and there can, at maximum, only be r=1 cointegration relationship. Can anyone tell me what this means?

network security- cryptography

I was solving a RSA problem and facing difficulty to compute d
plz help me with this
given p-971, q-52
Ø(n) - 506340
gcd(Ø(n),e) = 1 1< e < Ø(n)
therefore gcd(506340, 83) = 1
e= 83 .
e * d mod Ø(n) = 1
i want to compute d , i have all the info
can u help me how to computer d from this.
(83 * d) mod 506340 = 1
i am a little wean in maths so i am having difficulties finding d from the above equation.
Your value for q is not prime 52=2^2 * 13. Therefore you cannot find d because the maths for calculating this relies upon the fact the both p and q are prime.
I suggest working your way through the examples given here http://en.wikipedia.org/wiki/RSA_%28cryptosystem%29
Normally, I would hesitate to suggest a wikipedia link such as that, but I found it very useful as a preliminary source when doing a project on RSA as part of my degree.
You will need to be quite competent at modular arithmetic to get to grips with how RSA works. If you want to understand how to find d you will need to learn to find the Modular multiplicative inverse - just google this, I didn't come across anything incorrect when doing so myself.
Good luck.
A worked example
Let's take p=11, q=5. In reality you would use very large primes but we are going to be doing this by hand to we want smaller numbers. Keep both of these private.
Now we need n, which is given as n=pq and so in our case n=55. This needs to be made public.
The next item we need is the totient of n. This is simply phi(n)=(p-1)(q-1) so for our example phi(n)=40. Keep this private.
Now you calculate the encryption exponent, e. Defined such that 1<e<phi(n) and gcd(e,phi(n))=1. There are nearly always many possible different values of e - just pick one (in a real application your choice would be determined by additional factors - different choices of e make the algorithm easier/harder to crack). In this example we will choose e=7. This needs to be made public.
Finally, the last item to be calculated is d, the decryption exponent. To calculate d we must solve the equation ed mod phi(n) = 1. This is most commonly calculated using the Extended Euclidean Algorithm. This algorithm solves the equation phi(n)x+ed=1 subject to 1<d<phi(n), where x is an unknown multiplicative factor - which is identical to writing the previous equation without using mod. In our particular example, solving this leads to d=23. This should be kept private.
Then your public key is: n=55, e=7
and your private key is: n=55, d=23
To see the workthrough of the Extended Euclidean Algorithm check out this youtube video https://www.youtube.com/watch?v=kYasb426Yjk. The values used in that video are the same as the ones used here.
RSA is complicated and the mathematics gets very involved. Try solving a couple of examples with small values of p and q until you are comfortable with the method before attempting a problem with large values.

Assessing performance of a zero inflated negative binomial model

I am modelling the diffusion of movies through a contact network (based on telephone data) using a zero inflated negative binomial model (package: pscl)
m1 <- zeroinfl(LENGTH_OF_DIFF ~ ., data = trainData, type = "negbin")
(variables described below.)
The next step is to evaluate the performance of the model.
My attempt has been to do multiple out-of-sample predictions and calculate the MSE.
Using
predict(m1, newdata = testData)
I received a prediction for the mean length of a diffusion chain for each datapoint, and using
predict(m1, newdata = testData, type = "prob")
I received a matrix containing the probability of each datapoint being a certain length.
Problem with the evaluation: Since I have a 0 (and 1) inflated dataset, the model would be correct most of the time if it predicted 0 for all the values. The predictions I receive are good for chains of length zero (according to the MSE), but the deviation between the predicted and the true value for chains of length 1 or larger is substantial.
My question is:
How can we assess how well our model predicts chains of non-zero length?
Is this approach the correct way to make predictions from a zero inflated negative binomial model?
If yes: how do I interpret these results?
If no: what alternative can I use?
My variables are:
Dependent variable:
length of the diffusion chain (count [0,36])
Independent variables:
movie characteristics (both dummies and continuous variables).
Thanks!
It is straightforward to evaluate RMSPE (root mean square predictive error), but is probably best to transform your counts beforehand, to ensure that the really big counts do not dominate this sum.
You may find false negative and false positive error rates (FNR and FPR) to be useful here. FNR is the chance that a chain of actual non-zero length is predicted to have zero length (i.e. absence, also known as negative). FPR is the chance that a chain of actual zero length is falsely predicted to have non-zero (i.e. positive) length. I suggest doing a Google on these terms to find a paper in your favourite quantitative journals or a chapter in a book that helps explain these simply. For ecologists I tend to go back to Fielding & Bell (1997, Environmental Conservation).
First, let's define a repeatable example, that anyone can use (not sure where your trainData comes from). This is from help on zeroinfl function in the pscl library:
# an example from help on zeroinfl function in pscl library
library(pscl)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
There are several packages in R that calculate these. But here's the by hand approach. First calculate observed and predicted values.
# store observed values, and determine how many are nonzero
obs <- bioChemists$art
obs.nonzero <- obs > 0
table(obs)
table(obs.nonzero)
# calculate predicted counts, and check their distribution
preds.count <- predict(fm_zinb2, type="response")
plot(density(preds.count))
# also the predicted probability that each item is nonzero
preds <- 1-predict(fm_zinb2, type = "prob")[,1]
preds.nonzero <- preds > 0.5
plot(density(preds))
table(preds.nonzero)
Then get the confusion matrix (basis of FNR, FPR)
# the confusion matrix is obtained by tabulating the dichotomized observations and predictions
confusion.matrix <- table(preds.nonzero, obs.nonzero)
FNR <- confusion.matrix[2,1] / sum(confusion.matrix[,1])
FNR
In terms of calibration we can do it visually or via calibration
# let's look at how well the counts are being predicted
library(ggplot2)
output <- as.data.frame(list(preds.count=preds.count, obs=obs))
ggplot(aes(x=obs, y=preds.count), data=output) + geom_point(alpha=0.3) + geom_smooth(col="aqua")
Transforming the counts to "see" what is going on:
output$log.obs <- log(output$obs)
output$log.preds.count <- log(output$preds.count)
ggplot(aes(x=log.obs, y=log.preds.count), data=output[!is.na(output$log.obs) & !is.na(output$log.preds.count),]) + geom_jitter(alpha=0.3, width=.15, size=2) + geom_smooth(col="blue") + labs(x="Observed count (non-zero, natural logarithm)", y="Predicted count (non-zero, natural logarithm)")
In your case you could also evaluate the correlations, between the predicted counts and the actual counts, either including or excluding the zeros.
So you could fit a regression as a kind of calibration to evaluate this!
However, since the predictions are not necessarily counts, we can't use a poisson
regression, so instead we can use a lognormal, by regressing the log
prediction against the log observed, assuming a Normal response.
calibrate <- lm(log(preds.count) ~ log(obs), data=output[output$obs!=0 & output$preds.count!=0,])
summary(calibrate)
sigma <- summary(calibrate)$sigma
sigma
There are more fancy ways of assessing calibration I suppose, as in any modelling exercise ... but this is a start.
For a more advanced assessment of zero-inflated models, check out the ways in which the log likelihood can be used, in the references provided for the zeroinfl function. This requires a bit of finesse.

Spearman's rank correlation significance

I'm calculating Spearman's rank correlation in matlab with the following code:
[RHO,PVAL] = corr(x,y,'Type','Spearman');
RHO =
0.7211
PVAL =
4.9473e-04
and then with different variables
[RHO,PVAL] = corr(x2,y2,'Type','Spearman');
RHO =
0.3277
PVAL =
0.0060
How do you categorize these as p < 0.05, p < 0.01, p < 0.001 etc. Commonly in scientific journals these pvalues are represented as the examples I've shown and not as one number. Would both of these be p < 0.01? When defining whether a correlation is significant to a specific value do you always look for the smallest error i.e if its PVAL = 0.0005, both p > 0.05 and p > 0.001 would be correct here, do we simply write the lowest i.e. p > 0.001?
As Martin Dinov wrote, this is at least partially a matter of journal policy. But, as long as there is no explicit journal convention against it, I would recommend to always report the actual p-value, in this case in the form p = 4.9·10-4 and p = 0.006, respectively. You can then proceed to say that the effect you found is statistically significant, usually based on comparison with a previously chosen significance level, typically 0.05, unless you need to correct for multiple comparisons.
The reason is that the commonly used significance levels are purely a matter of convention. By only saying that p is below one conventional threshold means to withhold valuable information from the reader, which she might use to make up her own mind about the result – and this truncation is not even justified by relevant saving of print space.
You should also, of course, report the value of the correlation coefficient itself (which in this case doubles as a test statistic and an effect size) as well as the sample size.
At least for the field of psychology, these are official recommendations:
Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval.
…
Effect sizes. Always present effect sizes for primary outcomes. If the units of measurement are meaningful on a practical level (e.g., number of cigarettes smoked per day), then we usually prefer an unstandardized measure (regression coefficient or mean difference) to a standardized measure (r or d).
L. Wilkinson and the Task Force on Statistical Inference, "Statistical Methods in Psychology Journals. Guidelines and Explanations"
You mean pval is < 0.05 and also < 0.001 and not >. In general, you do want to show that it is smaller than the smallest significance level (alpha) threshold that you can. So yes, it is best to say for the second example that the p-value is < 0.001. Depending on the journal convention, it may be preferable to put the actual p-value in (so, for the first example, 4.9473e-04) or just that it's < some good alpha (0.0001 for the first case).