How to do subset selection for a linear regression model?

I am working with the mtcars dataset and using linear regression:
data(mtcars)
fit <- lm(mpg ~ ., data = mtcars); summary(fit)
When I fit the model with lm, it shows the following result:
Call:
lm(formula = mpg ~ ., data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.5087 -1.3584 -0.0948 0.7745 4.6251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.87913 20.06582 1.190 0.2525
cyl6 -2.64870 3.04089 -0.871 0.3975
cyl8 -0.33616 7.15954 -0.047 0.9632
disp 0.03555 0.03190 1.114 0.2827
hp -0.07051 0.03943 -1.788 0.0939 .
drat 1.18283 2.48348 0.476 0.6407
wt -4.52978 2.53875 -1.784 0.0946 .
qsec 0.36784 0.93540 0.393 0.6997
vs1 1.93085 2.87126 0.672 0.5115
amManual 1.21212 3.21355 0.377 0.7113
gear4 1.11435 3.79952 0.293 0.7733
gear5 2.52840 3.73636 0.677 0.5089
carb2 -0.97935 2.31797 -0.423 0.6787
carb3 2.99964 4.29355 0.699 0.4955
carb4 1.09142 4.44962 0.245 0.8096
carb6 4.47757 6.38406 0.701 0.4938
carb8 7.25041 8.36057 0.867 0.3995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.833 on 15 degrees of freedom
Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
I found that none of the variables are marked as significant at the 0.05 significance level.
To find significant variables, I want to do subset selection to find the best subset of variables as predictors of the response mpg.

The function regsubsets in the package leaps does best subset regression (see ?leaps). Adapting your code:
library(leaps)
regfit <- regsubsets(mpg ~ ., data = mtcars)
summary(regfit)
# or for a more visual display
plot(regfit, scale = "Cp")
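To then choose among the candidate models, you can inspect the components returned by summary(); a minimal sketch (the object names here are just illustrative):
reg.summary <- summary(regfit)
reg.summary$cp                       # Mallows' Cp for the best model of each size
reg.summary$bic                      # BIC for the best model of each size
best <- which.min(reg.summary$cp)    # model size with the lowest Cp
coef(regfit, best)                   # coefficients of that model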

Related

Adonis result does not match the NMDS plot?

I'm not a statistician and I'm kind of working through this blindly, but why does my NMDS plot of Bray-Curtis distances show clear groupings for Status (blue and pink), while adonis says that Time.Point is significantly different?
distance_calc <- phyloseq::distance(rarified2, "bray")
sampledf <- data.frame(sample_data(rarified2))
adonisresult <- adonis2(formula = distance_calc ~ Status * Time.Point, data = sampledf, type = "bray")
> adonisresult
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = adonisf, data = sampledf, type = "bray")
Df SumOfSqs R2 F Pr(>F)
Status 1 0.2534 0.05397 1.2969 0.198
Time.Point 1 0.4277 0.09111 2.1893 0.016 *
Status:Time.Point 1 0.1061 0.02259 0.5429 0.901
Residual 20 3.9072 0.83232
Total 23 4.6943 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
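For context, the NMDS plot referred to above could be produced along these lines (a sketch only, assuming the same rarified2 phyloseq object and the Status / Time.Point sample variables):
# sketch: NMDS ordination on the same Bray-Curtis distances
nmds <- ordinate(rarified2, method = "NMDS", distance = "bray")
plot_ordination(rarified2, nmds, color = "Status", shape = "Time.Point")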

mgcv: Difference between s(x, by=cat) and s(cat, bs='re')

What is the difference between adding a by= parameter to a smooth and adding a random effect smooth?
I've tried both, and get different results. E.g.:
library(mgcv)
library(tibble)   # tibble() needs this (or dplyr/tidyverse)
set.seed(26)
gam.df <- tibble(y = rnorm(400),
                 x1 = rnorm(400),
                 cat = factor(rep(1:4, each = 100)))
gam0 <- gam(y ~ s(x1, by=cat), data=gam.df)
summary(gam0)
produces:
15:15:39> summary(gam0)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1, by = cat)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001275 0.049087 -0.026 0.979
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1):cat1 1 1 7.437 0.00667 **
s(x1):cat2 1 1 0.047 0.82935
s(x1):cat3 1 1 0.393 0.53099
s(x1):cat4 1 1 0.019 0.89015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.00968 Deviance explained = 1.96%
GCV = 0.97413 Scale est. = 0.96195 n = 400
On the other hand:
gam1 <- gam(y ~ s(x1) + s(cat, bs='re'), data=gam.df)
summary(gam1)
produces:
15:16:33> summary(gam1)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1) + s(cat, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0001211 0.0572271 0.002 0.998
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1) 1.0000 1 2.359 0.125
s(cat) 0.7883 3 0.356 0.256
R-sq.(adj) = 0.00594 Deviance explained = 1.04%
GCV = 0.97236 Scale est. = 0.96558 n = 400
I understand that by= shows the summary by each factor level, but shouldn't the overall results of the model such as R^2 be the same?
The factor-by model, gam0, contains a separate smooth of x1 for each level of cat, but doesn't include anything specifically for the means of y in each group[*], so it is misspecified. Compare this with gam1, which has a single smooth of x1 plus group means for the levels of cat.
Even though you generated random data without any smooth or group-level effects, gam0 is a potentially much more complex and flexible model, as it contains 4 separate smooths, each using up to 9 degrees of freedom. gam1 has a single smooth of x1 which uses up to 9 degrees of freedom, plus somewhere between 0 and 4 degrees of freedom for the random-effect smooth. gam0 is simply exploiting random variation in the data that can be explained a little by those extra potential degrees of freedom. You can see this in the R-sq.(adj), which remains essentially zero for gam0 even though it explains roughly twice the deviance of gam1 (not that either explains a worthwhile amount of deviance).
r$> library("gratia")
r$> smooths(gam0)
[1] "s(x1):cat1" "s(x1):cat2" "s(x1):cat3" "s(x1):cat4"
r$> smooths(gam1)
[1] "s(x1)" "s(cat)"
[*] Note that your by model should be
gam0 <- gam(y ~ cat + s(x1, by=cat), data=gam.df)
because the smooths created by s(x1, by=cat) are subject to an identifiability constraint (there is a constant term, the intercept, in the model). This is a sum-to-zero constraint, which means the individual smooths do not contain the group means. Without the parametric cat term, the smooths would be forced to model not only how y changes as a function of x1 in each group but also the mean level of y in each group, yet the basis contains no constant functions that could represent such constant (magnitude) effects.
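If it helps to see the difference, one could refit the factor-by model with the parametric term included and compare it with gam1, e.g. by AIC (a sketch; gam0b is just an illustrative name):
# refit with group means included, then compare the two specifications
gam0b <- gam(y ~ cat + s(x1, by = cat), data = gam.df)
AIC(gam0b, gam1)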

stan number of effective sample size

I reproduced the results of a hierarchical model fitted with the rethinking package using just rstan(), and I am curious why the n_eff values are not closer.
Here is the model with random intercepts for 2 groups (intercept_x2) using the rethinking package:
Code:
response = c(rnorm(500, 0, 1), rnorm(500, 200, 10))
predictor1_continuous = rnorm(1000)
predictor2_categorical = factor(c(rep("A", 500), rep("B", 500)))
data = data.frame(y = response, x1 = predictor1_continuous, x2 = predictor2_categorical)
head(data)
library(rethinking)
m22 <- map2stan(
    alist(
        y ~ dnorm(mu, sigma),
        mu <- intercept + intercept_x2[x2] + beta*x1,
        intercept ~ dnorm(0, 10),
        intercept_x2[x2] ~ dnorm(0, sigma_2),
        beta ~ dnorm(0, 10),
        sigma ~ dnorm(0, 10),
        sigma_2 ~ dnorm(0, 10)
    ),
    data = data, chains = 1, iter = 5000, warmup = 500)
precis(m22, depth = 2)
Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
intercept 9.96 9.59 -5.14 25.84 1368 1
intercept_x2[1] -9.94 9.59 -25.55 5.43 1371 1
intercept_x2[2] 189.68 9.59 173.28 204.26 1368 1
beta 0.06 0.22 -0.27 0.42 3458 1
sigma 6.94 0.16 6.70 7.20 2927 1
sigma_2 43.16 5.01 35.33 51.19 2757 1
Now here is the same model in rstan():
# create a numeric vector to indicate the categorical groups
data$GROUP_ID = match( data$x2, levels( data$x2 ) )
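# (aside, for illustration: since x2 is a factor, as.integer(data$x2) gives the same
#  1..K integer index and scales unchanged to any number of groups)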
library(rstan)
standat <- list(
N = nrow(data),
y = data$y,
x1 = data$x1,
GROUP_ID = data$GROUP_ID,
nGROUPS = 2
)
stanmodelcode = '
data {
  int<lower=1> N;
  int nGROUPS;
  real y[N];
  real x1[N];
  int<lower=1, upper=nGROUPS> GROUP_ID[N];
}
transformed data {
}
parameters {
  real intercept;
  vector[nGROUPS] intercept_x2;
  real beta;
  real<lower=0> sigma;
  real<lower=0> sigma_2;
}
transformed parameters { // none needed
}
model {
  real mu;
  // priors
  intercept ~ normal(0, 10);
  intercept_x2 ~ normal(0, sigma_2);
  beta ~ normal(0, 10);
  sigma ~ normal(0, 10);
  sigma_2 ~ normal(0, 10);
  // likelihood
  for(i in 1:N){
    mu = intercept + intercept_x2[ GROUP_ID[i] ] + beta*x1[i];
    y[i] ~ normal(mu, sigma);
  }
}
'
fit22 = stan(model_code=stanmodelcode, data=standat, iter=5000, warmup=500, chains = 1)
fit22
Inference for Stan model: b212ebc67c08c77926c59693aa719288.
1 chains, each with iter=5000; warmup=500; thin=1;
post-warmup draws per chain=4500, total post-warmup draws=4500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
intercept 10.14 0.30 9.72 -8.42 3.56 10.21 16.71 29.19 1060 1
intercept_x2[1] -10.12 0.30 9.73 -29.09 -16.70 -10.25 -3.50 8.36 1059 1
intercept_x2[2] 189.50 0.30 9.72 170.40 182.98 189.42 196.09 208.05 1063 1
beta 0.05 0.00 0.21 -0.37 -0.10 0.05 0.20 0.47 3114 1
sigma 6.94 0.00 0.15 6.65 6.84 6.94 7.05 7.25 3432 1
sigma_2 43.14 0.09 4.88 34.38 39.71 42.84 46.36 53.26 3248 1
lp__ -2459.75 0.05 1.71 -2463.99 -2460.68 -2459.45 -2458.49 -2457.40 1334 1
Samples were drawn using NUTS(diag_e) at Thu Aug 31 15:53:09 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
My Questions:
The n_eff is larger using rethinking. There are simulation differences, but do you think something else is going on here?
Besides the n_eff being different, the percentiles of the posterior distributions are different. I was thinking rethinking and rstan() should return similar results with 5000 iterations, since rethinking is just calling rstan. Are differences like that normal, or is something different between the 2 implementations?
I created data$GROUP_ID to indicate the categorical groupings. Is this the correct way to incorporate categorical variables into a hierarchical model in rstan()? I have 2 groups here; if I had 50 groups I would use the same kind of data$GROUP_ID vector, but is that the standard way?
Thank you.

Decide best 'k' in k-means algorithm in weka

I am using the k-means algorithm for clustering, but I am not sure how to decide the best (optimal) value of k based on the results.
For example, I have applied k-means to a dataset with k=10:
kMeans
======
Number of iterations: 16
Within cluster sum of squared errors: 38.47923197081721
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1 2 3 4 5 6 7 8 9
(214) (16) (9) (13) (23) (46) (12) (11) (40) (15) (29)
==============================================================================================================================================================================================================================================================
RI 1.5184 1.5181 1.5175 1.5189 1.5178 1.5172 1.519 1.5255 1.5175 1.5222 1.5171
Na 13.4079 12.9988 14.6467 12.8277 13.2148 13.1896 13.63 12.6318 13.0518 13.9107 14.4421
Mg 2.6845 3.4894 1.3056 0.7738 3.4261 3.4987 3.4917 0.2145 3.4958 3.8273 0.5383
Al 1.4449 1.1844 1.3667 2.0338 1.3552 1.4898 1.3308 1.1891 1.2617 0.716 2.1228
Si 72.6509 72.785 73.2067 72.3662 72.6526 72.6989 72.07 72.0709 72.9532 71.7467 72.9659
K 0.4971 0.4794 0 1.47 0.527 0.59 0.4108 0.2345 0.547 0.1007 0.3252
Ca 8.957 8.8069 9.3567 10.1238 8.5648 8.3041 8.87 13.1291 8.5035 9.5887 8.4914
Ba 0.175 0.015 0 0.1877 0.023 0.003 0.0667 0.2864 0 0 1.04
Fe 0.057 0.2238 0 0.0608 0.2013 0.0104 0.0167 0.1109 0.011 0.0313 0.0134
Type build wind non-float build wind float tableware containers build wind non-float build wind non-float build wind float build wind non-float build wind float build wind float headlamps
There are various methods for deciding the optimal value of "k" in the k-means algorithm: rule of thumb, the elbow method, the silhouette method, etc. In my work I followed the results obtained from the elbow method and had success with it; I did all the analysis in R.
Here is the link to the description of those methods: link
Follow the sub-links of that page, implement one of the methods, and apply it to your data.
I hope this helps; if not, I am sorry.
All the best with your work.
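For illustration, a minimal elbow-method sketch in R (assuming a hypothetical data frame df that holds only the numeric attributes):
# total within-cluster sum of squares for k = 1..10 on a hypothetical df
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")
# choose k near the "elbow", where adding more clusters stops reducing the WSS much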

Adaboost weka True positive vs False positive recognition issue

I am using the AdaBoost M1 algorithm in the Weka Experiment Environment with the default setup:
Runs (1-10) -> 10 runs to provide more statistically significant results
Random Split Result Producer
I use a train percentage split to divide training from evaluation data
Now, the problem is with the Weighted average TP and FP results.
I get this:
TP:0.8
FP:0.47
But as far as I am aware, if the TP rate is 0.8, shouldn't the FP rate be around 0.2?
I assume this has something to do with the 10 runs, but even if the values are averaged over the runs, shouldn't the FP rate still be much lower?
Sorry if this is too simple a question, but from my logic this seems like an error in the Weka toolkit, or am I wrong? Thanks
EDIT:
In order to avoid asking a new question, and because this is related to the same problem: can anyone explain what the Weighted Avg. values displayed by Weka are?
I have included Atilla's example below: it can be seen that the weighted averages are not simple averages, e.g. AVG(0.933, 0.422) != 0.77, etc.
Can someone explain what these values actually are?
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.933 0.578 0.776 0.933 0.847 0.429 0.844 0.917 tested_negative
0.422 0.067 0.745 0.422 0.538 0.429 0.844 0.696 tested_positive
Weighted Avg. 0.77 0.416 0.766 0.77 0.749 0.429 0.844 0.847
I ran AdaBoostM1 with default parameters on the diabetes data set that comes with Weka and got the following results.
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.933 0.578 0.776 0.933 0.847 0.429 0.844 0.917 tested_negative
0.422 0.067 0.745 0.422 0.538 0.429 0.844 0.696 tested_positive
Weighted Avg. 0.77 0.416 0.766 0.77 0.749 0.429 0.844 0.847
Notice that the TP rate and FP rate are given for each of your class values. Since I have two (2) values for the class feature in this data set, there are two (2) lines.
Also notice that:
0.933 + 0.067 = 1
0.578 + 0.422 = 1
As you correctly pointed out, the TP rate + FP rate should equal one (1). So in your example, I assume that you have the following class variable:
target {A,B}
TP Rate FP Rate
0.8 0.47 ..... for A
0.53 0.2 ..... for B
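As an aside on the Weighted Avg. question above: Weka weights each per-class metric by the number of instances of that class in the evaluation data. A rough sketch in R with hypothetical counts (the counts in the actual split are not shown above, so this will not reproduce 0.77 exactly):
# hypothetical class counts; Weka's Weighted Avg. weights each per-class value by class size
n  <- c(tested_negative = 500, tested_positive = 268)  # counts in the full diabetes data, assumed for illustration
tp <- c(0.933, 0.422)
sum(tp * n / sum(n))  # support-weighted TP rate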