a function for bigglm model selection like dredge working for glm - linear-regression

I was using glm with dredge from the MuMIn package. Since my data is now large, I have switched to bigglm from the biglm package. How do I do model selection, given that dredge does not work with bigglm? Is there another package I can use to achieve this?
Applying dredge to a bigglm object gives the following error:
Error in nobs.default(global.model) : no 'nobs' method is available

dredge relies on the availability of a logLik method for the given model class. big[g]lm objects do not provide one, and there seems to be a long-known bug in the AIC method for the big[g]lm class that makes it impossible to calculate LL from it (it uses deviance rather than LL to calculate AIC, so AIC values are not comparable to those of other model types; see here: AIC different between biglm and lm).
You could try adding the missing methods (using deviance instead of LL, which may be slippery):
library(biglm)
library(MuMIn)
# incorrect if any prior weights are 0
nobs.biglm <- function (object, ...) object$n
logLik.bigglm <- function(object, ...) {
    dev <- deviance(object, ...)
    df <- object$n - object$df.resid
    structure(dev, df = df, nobs = object$n)
}
coefTable.biglm <- function (model, data, ...) {
    ct <- summary(model)$mat[, c(1L,4L,5L), drop = FALSE]
    .makeCoefTable(ct[, 1L], se = ct[, 2L], df = model$df.resid, coefNames = rownames(ct))
}
environment(coefTable.biglm) <- asNamespace("MuMIn")
# from example(bigglm)
fm <- bigglm(log(Volume)~log(Girth)+log(Height),data=trees, chunksize=10, sandwich=TRUE)
dredge(fm, rank = AIC)
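To illustrate why an AIC built on deviance cannot be compared with one built on the log-likelihood, here is a minimal sketch in Python for a Gaussian linear model, where the deviance equals the residual sum of squares. The numbers are made up, the function names are hypothetical, and the deviance-based formula is only my reading of the behaviour described above, not code taken from biglm.

```python
import math

def gaussian_loglik(rss, n):
    """Maximised Gaussian log-likelihood of a linear model with residual sum of squares rss."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)

def aic_from_loglik(rss, n, n_coef):
    # One extra parameter for the residual variance, as logLik does for lm in R
    return -2 * gaussian_loglik(rss, n) + 2 * (n_coef + 1)

def aic_from_deviance(rss, n_coef):
    # Assumption: the deviance (here the RSS) is used in place of -2 * log-likelihood
    return rss + 2 * n_coef

# Illustrative numbers only, not taken from any data set
n, n_coef, rss = 100, 3, 25.0
aic_ll = aic_from_loglik(rss, n, n_coef)
aic_dev = aic_from_deviance(rss, n_coef)
print(round(aic_ll, 2), aic_dev)
```

The two values live on different scales, so ranking bigglm fits against each other with the deviance-based value may be workable, but mixing them with AIC values from lm or glm is not.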


Problem in Implementing a Graphical Model Using Pyro

I am trying to implement this graphical model using Pyro:
My implementation is:
def model(data):
    p = pyro.sample('p', dist.Beta(1, 1))
    label_axis = pyro.plate("label_axis", data.shape[0], dim=-3)
    f_axis = pyro.plate("f_axis", data.shape[1], dim=-2)
    with label_axis:
        l = pyro.sample('l', dist.Bernoulli(p))
    with f_axis:
        e = pyro.sample('e', dist.Beta(1, 10))
    with label_axis, f_axis:
        f = pyro.sample('f', dist.Bernoulli(1-e), obs=data)
        f = l*f + (1-l)*(1-f)
    return f
However, this doesn't seem right to me. The problem is "f": its distribution is not simply Bernoulli(1-e). To sample f, I drew a sample from a Bernoulli distribution and then flipped the sampled value if l=0. But I don't think this changes the value that Pyro stores behind the scenes for "f". That would be a problem during inference, right?
I wanted to use iterative plates instead of vectorized ones, to be able to use control statements inside my plate. But apparently this is not possible, since I am reusing plates.
How can I correctly implement this PGM? Do I need to write a custom distribution? Or can I hack Pyro and change the stored value for "f" myself? Any type of help is appreciated! Cheers!
Here is the correct implementation:
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def model(data):
    p = pyro.sample('p', dist.Beta(1, 1))
    label_axis = pyro.plate("label_axis", data.shape[0], dim=-2)
    f_axis = pyro.plate("f_axis", data.shape[1], dim=-1)
    with label_axis:
        l = pyro.sample('l', dist.Bernoulli(p))
    with f_axis:
        e = pyro.sample('e', dist.Beta(1, 10))
    with label_axis, f_axis:
        prob = l * (1 - e) + (1 - l) * e
        return pyro.sample('f', dist.Bernoulli(prob), obs=data)

mcmc = MCMC(NUTS(model), 500, 500)
data = dist.Bernoulli(0.5).sample((20, 4))
mcmc.run(data)
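The reason a single Bernoulli with probability l * (1 - e) + (1 - l) * e replaces the "sample, then flip when l = 0" construction from the question can be checked by exact enumeration. The sketch below uses only the standard library; the function names and the value of e are illustrative and not part of Pyro.

```python
from fractions import Fraction

def prob_f_is_one_by_flipping(l, e):
    """P(f = 1) when f' ~ Bernoulli(1 - e) and f = l*f' + (1 - l)*(1 - f')."""
    total = Fraction(0)
    # Enumerate both outcomes of f' together with their probabilities
    for f_prime, p_f_prime in ((1, 1 - e), (0, e)):
        f = l * f_prime + (1 - l) * (1 - f_prime)
        if f == 1:
            total += p_f_prime
    return total

def prob_f_is_one_closed_form(l, e):
    """The probability passed to dist.Bernoulli in the model above."""
    return l * (1 - e) + (1 - l) * e

e = Fraction(1, 10)  # illustrative error rate
for l in (0, 1):
    assert prob_f_is_one_by_flipping(l, e) == prob_f_is_one_closed_form(l, e)
```

Writing the probability directly means the observed data is scored against the right distribution, which post-processing the return value of pyro.sample cannot achieve.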

MCMCglmm questions: multiple species and ultrametric trees

These questions are related to my other question at Phylogenetic model using multiple entries for each species
Thanks to @thomas-guillerme, I was able to start running an MCMCglmm model.
Although I had no problem running some of my example files in which I had a single entry for each of the species in my tree, I found an error message when trying to run my original dataset, which consists of thousands of entries for each of the species in my tree. When running:
comp_data <- comparative.data(phy = my_tree, data = my_data, names.col = species, vcv = TRUE)
I got an error:
Error in row.names<-.data.frame(*tmp*, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘Species1’, ‘Species2’, ‘Species3’, ‘Species4’,...
I was surprised because I am using MCMCglmm and not PGLS because of the chance of using multiple entries for each species.
I tried the workaround of making the species names unique, but in that case only the first entry of each species is recognized later in the model (because it is the only one that matches a name in my_tree).
Moreover, I had problems with having my tree recognized as ultrametric. I checked it using
is.ultrametric(my_tree)
Got:
FALSE
I tried:
function (phy) {
    if(any(is.ultrametric(my_tree)) == FALSE) {
        my_tree <- lapply(my_tree, chronoMPL)
        class(my_tree) <- "Phylo"
    }
}
But these lines apparently do not solve the problem. Thanks in advance for your help.
Hard to tell without a running example, but for the second question at least, it seems that the bug comes from the phy argument not being used inside the function at all (the body refers to my_tree instead), and from the function never returning the modified tree. Something like this should work:
check.fun <- function(my_tree) {
    if(any(is.ultrametric(my_tree)) == FALSE) {
        my_tree <- lapply(my_tree, chronoMPL)
        ## A list of trees is a "multiPhylo" object in ape
        class(my_tree) <- "multiPhylo"
    }
    return(my_tree)
}
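A common reason for a dated tree to fail an ultrametricity check is rounding in the branch lengths, so it can help to look at what such a check amounts to: every root-to-tip distance must be equal, up to some tolerance. The sketch below is a plain Python illustration of that idea, not ape's implementation; the tuple-based tree representation and the function names are hypothetical.

```python
import math

def tip_depths(node, depth=0.0):
    """Yield the root-to-tip distance of every tip below node.

    A node is a (branch_length, children) pair; tips have no children.
    """
    branch_length, children = node
    depth += branch_length
    if not children:
        yield depth
    for child in children:
        yield from tip_depths(child, depth)

def is_ultrametric(root, rel_tol=1e-8):
    """True if all root-to-tip distances agree within the relative tolerance."""
    depths = list(tip_depths(root))
    return all(math.isclose(d, depths[0], rel_tol=rel_tol) for d in depths)

# ((A:1, B:1):2, C:3.0000001): tip C is off by a rounding-sized amount
tree = (0.0, [(2.0, [(1.0, []), (1.0, [])]), (3.0000001, [])])
print(is_ultrametric(tree))                # strict tolerance
print(is_ultrametric(tree, rel_tol=1e-6))  # looser tolerance
```

If the tree only fails under a strict tolerance, the discrepancy is likely numerical rather than a real property of the tree.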
For the first point, you might want to try to run it through the mulTree package that does a lot of housekeeping:
## Loading/installing the package
library(devtools)
install_github("TGuillerme/mulTree")
library(mulTree)
## Loading the example data
data(lifespan)
## Randomly combining trees
combined_trees <- tree.bind(x = trees_mammalia, y = trees_aves, sample = 2,
                            root.age = 250)
We can then generate an example with multiple specimens per species:
## Subset of the data
data <- lifespan_volant[sample(nrow(lifespan_volant), 30),]
## Create a dataset with two specimens per species
data <- rbind(cbind(data, specimen = rep("spec1", 30)),
              cbind(data, specimen = rep("spec2", 30)))
Note that the first column contains the list of species with multiple specimens per species (specified in column $specimen)
head(data[order(data$species),])
#                 species    class longevity      mass    volant specimen
#16   Addax_nasomaculatus Mammalia 0.8413927 1.8227058 nonvolant    spec1
#161  Addax_nasomaculatus Mammalia 0.8413927 1.8227058 nonvolant    spec2
#140          Anser_anser     Aves 0.9929849 0.5993055    volant    spec1
#1401         Anser_anser     Aves 0.9929849 0.5993055    volant    spec2
#21   Antilope_cervicapra Mammalia 0.6055864 1.4910746 nonvolant    spec1
#211  Antilope_cervicapra Mammalia 0.6055864 1.4910746 nonvolant    spec2
You can then use the clean.data function to make sure the trees match the dataset (specifying which column contains the species names)
## Making sure both the trees and the data match
cleaned_data <- clean.data(data, combined_trees, data.col = "species")
## Only using the cleaned version
trees <- cleaned_data$tree
data <- cleaned_data$data
Any dropped tips or rows are listed in cleaned_data$dropped_tips and cleaned_data$dropped_rows. You can then build the mulTree object and run the models:
## Creates a mulTree object specifying species AND specimens as random terms
mulTree_data <- as.mulTree(data, trees, taxa = "species",
                           rand.terms = ~species+specimen)
## formula to test
test_formula <- longevity ~ mass + volant
## MCMC parameters (number of generations, thin/sampling, burnin)
mcmc_parameters <- c(101000, 10, 1000)
## priors
mcmc_priors <- list(R = list(V = 1/2, nu = 0.002),
                    G = list(G1 = list(V = 1/2, nu = 0.002)))
## Running MCMCglmm on multiple trees
mulTree(mulTree_data, formula = test_formula, parameters = mcmc_parameters,
        priors = mcmc_priors, output = "longevity.example", ESS = 50)
To analyse the resulting files, you can use read.mulTree and subsequent functions (see the mulTree manual).

Phylogenetic model using multiple entries for each species

I am relatively new to phylogenetic regression models. In the past I used PGLS when I had only 1 entry for each species in my tree. Now I have a dataset with thousands of records for a total of 9 species and I would like to run a phylogenetic model. I read the tutorial of the most common packages (e.g. caper) but I am unsure how to build the model.
When I try to create the object for caper, i.e. using:
obj <- comparative.data(phy = Tree, data = Data, names.col = species, vcv = TRUE, na.omit = FALSE, warn.dropped = TRUE)
I get the message:
Error in row.names<-.data.frame(*tmp*, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘Species1’, ‘Species2’, ‘Species3’, ‘Species4’, ‘Species5’, ‘Species6’, ‘Species7’, ‘Species8’, ‘Species9’
I understood that I may solve this by applying a MCMCglmm model but I am unfamiliar with Bayesian models.
Thanks in advance for your help.
This is indeed not going to work with a simple PGLS from caper because it cannot deal with individuals as a random effect. I suggest you use MCMCglmm, which is not much more complex to understand and will allow you to have individuals as a random effect. You can find excellent documentation from the package's author here or here, or alternative documentation that deals with some specific aspects of the package (namely tree uncertainty) here.
Really briefly to get you going:
## Your comparative data
comp_data <- comparative.data(phy = my_tree, data = my_data,
                              names.col = species, vcv = TRUE)
Note that you can have a specimen column that can look like this:
   taxa        var1 var2 specimen
1     A  0.08730689    a    spec1
2     B  0.47092692    a    spec1
3     C -0.26302706    b    spec1
4     D  0.95807782    b    spec1
5     E  2.71590217    b    spec1
6     A -0.40752058    a    spec2
7     B -1.37192856    a    spec2
8     C  0.30634567    b    spec2
9     D -0.49828379    b    spec2
10    E  1.42722363    b    spec2
You can then set up your formula (similar to a simple lm formula):
## Your formula
my_formula <- variable1 ~ variable2
And your MCMC settings:
## Setting the prior list (see the MCMCglmm course notes for details)
prior <- list(R = list(V=1, nu=0.002),
              G = list(G1 = list(V=1, nu=0.002)))
## Setting the MCMC parameters
## Number of iterations
nitt <- 12000
## Length of burnin
burnin <- 2000
## Amount of thinning
thin <- 5
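As a quick sanity check on these settings, the number of posterior samples that end up stored is (nitt - burnin) / thin. A minimal sketch of that arithmetic, written in Python with a hypothetical helper name:

```python
def stored_samples(nitt, burnin, thin):
    """Number of posterior samples kept after discarding burn-in and thinning."""
    return (nitt - burnin) // thin

# Settings used above: 12000 iterations, 2000 burn-in, thinning of 5
print(stored_samples(12000, 2000, 5))  # 2000
```

A couple of thousand stored samples is a reasonable starting point for a first run; whether it is enough depends on the effective sample sizes of the fitted model.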
And you should then be able to run a default MCMCglmm:
## Extracting the comparative data
mcmc_data <- comp_data$data
## As MCMCglmm requires a column named animal for it to identify it as a phylo
## model we include an extra column with the species names in it.
mcmc_data <- cbind(animal = rownames(mcmc_data), mcmc_data)
mcmc_tree <- comp_data$phy
## The MCMCglmm
mod_mcmc <- MCMCglmm(fixed = my_formula,
                     random = ~ animal + specimen,
                     family = "gaussian",
                     pedigree = mcmc_tree,
                     data = mcmc_data,
                     nitt = nitt,
                     burnin = burnin,
                     thin = thin,
                     prior = prior)

Prediction in R doesn't work following rpart

I am working on a tree regression. Everything works fine with my code but I don't get the predicted values at all. Instead I get all values for my y variable (response variable). Here's the code:
Splitting the data into train and test sets:
sample = sample.split(Data, SplitRatio = .80)
train = subset(Data, sample == TRUE)
test = subset(Data, sample == FALSE)
varYTrain <- train[c(3)]
varYTest <- test[c(3)]
varXTrain <- train[c(5:27)]
varXTest <- test[c(5:27)]
Model:
x <- cbind(varXTrain,varYTrain)
fit <- rpart(as.matrix(varYTrain) ~ ., data = x, method="class")
summary(fit)
This part doesn't work: for some reason I don't get predictions based on the test data set:
predicted <- predict(fit, data=varXTest)
summary(predicted)
I would also like in the end for the output to compare predicted values to real values in my dataset, can I do that?
Thank you very much and don't hesitate to ask me a question if I am not clear enough it's my first time posting.
Cheers

Trying to balance my dataset through sample_weight in scikit-learn

I'm using RandomForest for classification, and I have an unbalanced dataset: 5830 "no" and 1006 "yes". I have tried to balance my dataset with class_weight and sample_weight, but it doesn't work.
My code is:
X_train,X_test,y_train,y_test = train_test_split(arrX,y,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
But I don't get any improvement in my TPR, FPR, or ROC metrics when using class_weight and sample_weight.
Why? Am I doing anything wrong?
Nevertheless, if I use the function called balanced_subsample, my metrics improve greatly:
def balanced_subsample(x,y,subsample_size):
    class_xs = []
    min_elems = None
    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems*subsample_size)
    xs = []
    ys = []
    for ci,this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)
        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)
        xs.append(x_)
        ys.append(y_)
    xs = np.concatenate(xs)
    ys = np.concatenate(ys)
    return xs,ys
My new code is:
X_train_subsampled,y_train_subsampled=balanced_subsample(arrX,y,0.5)
X_train,X_test,y_train,y_test = train_test_split(X_train_subsampled,y_train_subsampled,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
This is not a full answer yet, but hopefully it'll help get there.
First some general remarks:
To debug this kind of issue it is often useful to have deterministic behavior. You can pass the random_state parameter to RandomForestClassifier and the various scikit-learn objects that have inherent randomness to get the same result on every run. You'll also need to seed the global generators with a fixed value (any integer works):
import numpy as np
np.random.seed(0)
import random
random.seed(0)
for your balanced_subsample function to behave the same way on every run.
Don't grid search on n_estimators: in a random forest, adding trees generally does not hurt accuracy; it only costs more computation.
Note that sample_weight and class_weight have a similar objective: actual sample weights will be sample_weight * weights inferred from class_weight.
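To make the interaction concrete, here is a small sketch of how the two weightings multiply, using the class counts from the question. It assumes the "balanced" heuristic n_samples / (n_classes * count); the 'auto' setting used in the question may compute class weights differently depending on the scikit-learn version, and the helper name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """n_samples / (n_classes * count) per class, the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Class counts from the question: 5830 "no" (0) and 1006 "yes" (1)
y = [0] * 5830 + [1] * 1006
class_w = balanced_class_weights(y)

# Same hand-made sample weights as in the question: 1 for class 0, 8 for class 1
sample_w = [1 if label == 0 else 8 for label in y]

# Effective per-sample weight is the product of the two
effective = [class_w[label] * w for label, w in zip(y, sample_w)]
print(class_w, effective[0], effective[-1])
```

Under these assumptions a minority sample ends up weighted roughly 46 times as much as a majority sample, far more than the roughly 6:1 class ratio calls for, so using both mechanisms at once over-corrects.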
Could you try:
Using subsample_size=1 in your balanced_subsample function. Unless there's a particular reason not to do so, we're better off comparing the results on a similar number of samples.
Using your subsampling strategy with class_weight and sample_weight both set to None.
EDIT: Reading your comment again I realize your results are not so surprising!
You get a better (higher) TPR but a worse (higher) FPR.
It just means your classifier tries hard to get the samples from class 1 right, and thus makes more false positives (while also getting more of those right of course!).
You will see this trend continue if you keep increasing the class/sample weights in the same direction.
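The trade-off can be illustrated with a few made-up scores. In the sketch below, lowering the decision threshold stands in for pushing the classifier towards class 1; that is only an analogy for increasing the class or sample weights, and the labels, scores, and function name are all invented for illustration.

```python
def tpr_fpr(y_true, scores, threshold):
    """True and false positive rates when predicting 1 for scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

# Made-up labels and classifier scores, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.45, 0.55, 0.7, 0.4, 0.6, 0.8, 0.9]

print(tpr_fpr(y_true, scores, 0.5))   # neutral threshold
print(tpr_fpr(y_true, scores, 0.35))  # favouring class 1
```

Both rates go up together when class 1 is favoured, which is why a single TPR or FPR value says little on its own; comparing full ROC curves or AUC is more informative.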
There is an imbalanced-learn API that helps with oversampling/undersampling data and might be useful in this situation. You can pass your training set into one of its methods and it will output the oversampled data for you. See the simple example below:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)
# older imbalanced-learn releases called this method fit_sample
x_oversampled, y_oversampled = ros.fit_resample(orig_x_data, orig_y_data)
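For intuition about what random oversampling does, here is a dependency-free sketch: rows of the smaller classes are duplicated at random until every class matches the largest one. This is not imbalanced-learn's implementation; the function name and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=1):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw the missing rows with replacement from the class's own rows
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out

# Toy data: 6 rows of class 0, 2 rows of class 1
X = [[i] for i in range(8)]
y = [0] * 6 + [1] * 2
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Whichever implementation is used, resample only the training split; duplicating rows before the train/test split leaks copies of test samples into training.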
Here is the link to the API: http://contrib.scikit-learn.org/imbalanced-learn/api.html
Hope this helps!