MCMCglmm questions: multiple species and ultrametric trees - mixed-models

These questions are related to my other question at Phylogenetic model using multiple entries for each species
Thanks to @thomas-guillerme, I was able to start running an MCMCglmm model.
Although I had no problem running some example files in which there was a single entry for each species in my tree, I get an error when trying to run my original dataset, which consists of thousands of entries for each of the species in my tree. When running:
comp_data <- comparative.data(phy = my_tree, data = my_data, names.col = species, vcv = TRUE)
I got an error:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': 'Species1', 'Species2',
'Species3', 'Species4', ...
I was surprised because I chose MCMCglmm over PGLS precisely for the possibility of using multiple entries per species.
I tried the workaround of making the species names unique, but then only the first entry of each species is recognised later in the model (because it is the only one whose name matches my_tree).
Moreover, I had problems getting my tree recognised as ultrametric. I checked it using:
is.ultrametric(my_tree)
and got:
FALSE
I tried:
function(phy) {
    if(any(is.ultrametric(my_tree)) == FALSE) {
        my_tree <- lapply(my_tree, chronoMPL)
        class(my_tree) <- "Phylo"
    }
}
But these lines apparently do not solve the problem. Thanks in advance for your help.

Hard to tell without a running example, but for the second question at least the bug comes from the phy argument not being passed to the function at all: the body uses the global my_tree, never its own argument. Two smaller problems: any(is.ultrametric(my_tree)) == FALSE only triggers when no tree at all is ultrametric, and ape's class name is spelled "multiPhylo", not "Phylo". A corrected version:
check.fun <- function(my_tree) {
    ## Only transform if at least one tree is not ultrametric
    if(!all(is.ultrametric(my_tree))) {
        my_tree <- lapply(my_tree, chronoMPL)
        ## ape's class names are case-sensitive ("multiPhylo" for a list of trees)
        class(my_tree) <- "multiPhylo"
    }
    return(my_tree)
}
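You can then run it on your trees and overwrite them with the result (assuming my_tree is a multiPhylo object, i.e. a list of trees, since the function lapply's over it):
## Apply the fix and check that it worked
my_tree <- check.fun(my_tree)
is.ultrametric(my_tree)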
For the first point, you might want to run your analysis through the mulTree package, which does a lot of the housekeeping for you:
## Loading/installing the package
library(devtools)
install_github("TGuillerme/mulTree")
library(mulTree)
## Loading the example data
data(lifespan)
## Randomly combining trees
combined_trees <- tree.bind(x = trees_mammalia, y = trees_aves, sample = 2,
                            root.age = 250)
We can then generate an example with multiple specimens per species:
## Subset of the data
data <- lifespan_volant[sample(nrow(lifespan_volant), 30),]
## Create a dataset with two specimen per species
data <- rbind(cbind(data, specimen = rep("spec1", 30)),
              cbind(data, specimen = rep("spec2", 30)))
Note that the first column contains the species names, with multiple specimens per species (identified in the $specimen column):
head(data[order(data$species),])
# species class longevity mass volant specimen
#16 Addax_nasomaculatus Mammalia 0.8413927 1.8227058 nonvolant spec1
#161 Addax_nasomaculatus Mammalia 0.8413927 1.8227058 nonvolant spec2
#140 Anser_anser Aves 0.9929849 0.5993055 volant spec1
#1401 Anser_anser Aves 0.9929849 0.5993055 volant spec2
#21 Antilope_cervicapra Mammalia 0.6055864 1.4910746 nonvolant spec1
#211 Antilope_cervicapra Mammalia 0.6055864 1.4910746 nonvolant spec2
You can then use the clean.data function to make sure the trees match the dataset (specifying which column contains the species names):
## Making sure both the trees and the data match
cleaned_data <- clean.data(data, combined_trees, data.col = "species")
## Only using the cleaned version
trees <- cleaned_data$tree
data <- cleaned_data$data
You can find any dropped tips or rows in cleaned_data$dropped_tips and cleaned_data$dropped_rows.
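For example (a quick check on the objects created above):
## Inspect what (if anything) was removed during cleaning
cleaned_data$dropped_tips
cleaned_data$dropped_rows
The cleaned data and trees can then be combined into a mulTree object: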
## Creates a mulTree object specifying species AND specimens as random terms
mulTree_data <- as.mulTree(data, trees, taxa = "species",
                           rand.terms = ~ species + specimen)
## formula to test
test_formula <- longevity ~ mass + volant
## MCMC parameters (number of generations, thin/sampling, burnin)
mcmc_parameters <- c(101000, 10, 1000)
## priors
mcmc_priors <- list(R = list(V = 1/2, nu = 0.002),
                    G = list(G1 = list(V = 1/2, nu = 0.002)))
## Running MCMCglmm on multiple trees
mulTree(mulTree_data, formula = test_formula, parameters = mcmc_parameters,
        priors = mcmc_priors, output = "longevity.example", ESS = 50)
To analyse the resulting files, you can use read.mulTree and subsequent functions (see the mulTree manual).
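For example (a sketch following the workflow in the mulTree manual; the chain name must match the output argument used above):
## Read all the models back in
all_models <- read.mulTree("longevity.example")
## Summarise the combined posteriors
summarised <- summary(all_models)
## Plot the estimates
plot(summarised)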

Related

Issues with ordiplot3d NMDS in vegan3d package

I am looking for some help with this 3D NMDS code. I have three issues:
1. The layout of the plot moves significantly each time I execute the code.
2. The sites and species are sometimes far off the plot.
3. The species text is often overlapping.
How can I fix these? I am unsure how to change the plotting environment to ggplot, so that might be out of the question.
library(vegan)
library(vegan3d)
library(tidyverse)
data("dune")
SiteID <- 1:20
NMDS = metaMDS(dune,distance="bray", try=500, wascores = TRUE, k=3)
NMDS1 = NMDS$points[,1]
NMDS2 = NMDS$points[,2]
NMDS3 = NMDS$points[,3]
NMDS = data.frame(NMDS1 = NMDS1, NMDS2 = NMDS2, NMDS3 = NMDS3, SiteID=SiteID)
NMDS_input <- metaMDS(dune,distance="bray",try=500,k=3,wascores = T)
pl4 <- with(NMDS, ordiplot3d(NMDS_input, pch=16, angle=50, main="Fish ion level 3", cex.lab=1.7,cex.symbols=1.5, tick.marks=FALSE))
sp <- scores(NMDS_input, choices=1:3, display="species", scaling="symmetric")
si <- scores(NMDS_input, choices=1:3, display="sites", scaling="symmetric")
text(pl4$xyz.convert(sp), rownames(sp), cex=0.7, xpd=TRUE)
sii <- as.data.frame(cbind(NMDS$SiteID,si))
with(NMDS, orditorp(pl4, labels = sii$V1, air=1, cex = 1))
labels must be character variables in orditorp. We always assumed so, but this was not checked in vegan::orditorp. The latest vegan version on github takes care of this and will also work with numeric labels.
ordiplot3d returns the projected (2D) coordinates, and if you want to plot those you can just use the pl4 object that you saved; there is no need for pl4$xyz.convert. This object will also be accepted by orditorp.
If you want to plot points that were not used in the original mock-3D plot, you must use pl4$xyz.convert to get their 2D projection. This function returns the projected coordinates in a form that is directly accepted by the standard R functions text and points (and some others), but they will not be accepted by orditorp (and I won't change this). For orditorp you must turn them into a two-column matrix-like object; data.frame() will work.
Your example code contains a lot of unneeded code. The following keeps only the necessary lines, with fixes that make the example work with the current vegan release.
library(vegan)
library(vegan3d)
data(dune)
SiteID <- as.character(1:20) # must be character
NMDS_input <- metaMDS(dune,distance="bray",try=500,k=3,wascores = T)
pl4 <- ordiplot3d(NMDS_input, pch=16, angle=50, main="Fish ion level 3", cex.lab=1.7,cex.symbols=1.5, tick.marks=FALSE) # no with(NMDS,...)
sp <- scores(NMDS_input, choices=1:3, display="species") # no arg scaling in scores.metaMDS
text(pl4$xyz.convert(sp), rownames(sp), cex=0.7, xpd=TRUE)
orditorp(pl4, labels = SiteID, air=1, cex = 1) # character labels w/points in the same location
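One of the three issues is not vegan-specific: the layout moves between runs because metaMDS starts from random configurations. Fixing the RNG seed before the call makes the configuration reproducible (a minimal sketch):
## Fix the random starts so metaMDS returns the same configuration every run
set.seed(123) # any fixed value works
NMDS_input <- metaMDS(dune, distance = "bray", try = 500, k = 3, wascores = TRUE)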

Trying to use Distributed data parallel on GANs but getting runtime error about an inplace operation

I am trying to train a GAN on a machine with 3 GPUs using DistributedDataParallel (DDP).
Before wrapping my model in DDP everything works fine, but when I wrap it, it gives me the following RuntimeError:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [128]] is at version 5; expected version 4 instead.
I cloned every tensor involved in the gradient computation to try to remove the in-place operation (if there is one), but I could not find it.
The part of the code with the problem is as follows:
import numpy as np
import torch
import torch.distributed as dist
from torch.autograd import Variable
from torch.nn.parallel import DistributedDataParallel as DDP

# (setup, get_dataloader, Generator, Discriminator and weights_init_normal
# are defined elsewhere in my script)

Tensor = torch.cuda.FloatTensor

# ----------
#  Training
# ----------

def train_gan(rank, world_size, opt):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)
    if rank == 0:
        get_dataloader(rank, opt)
    dist.barrier()
    print(f"Rank {rank}/{world_size} training process passed data download barrier.\n")
    dataloader = get_dataloader(rank, opt)

    # Loss function
    adversarial_loss = torch.nn.BCELoss()

    # Initialize generator and discriminator
    generator = Generator()
    discriminator = Discriminator()

    # Initialize weights
    generator.apply(weights_init_normal)
    discriminator.apply(weights_init_normal)

    generator.to(rank)
    discriminator.to(rank)
    generator_d = DDP(generator, device_ids=[rank])
    discriminator_d = DDP(discriminator, device_ids=[rank])

    # Optimizers
    # Since we are computing the average of several batches at once (an effective
    # batch size of world_size * batch_size) we scale the learning rate to match.
    optimizer_G = torch.optim.Adam(generator_d.parameters(), lr=opt.lr * opt.world_size, betas=(opt.b1, opt.b2))
    optimizer_D = torch.optim.Adam(discriminator_d.parameters(), lr=opt.lr * opt.world_size, betas=(opt.b1, opt.b2))

    losses = []
    for epoch in range(opt.n_epochs):
        for i, (imgs, _) in enumerate(dataloader):
            # Adversarial ground truths
            valid = Variable(Tensor(imgs.shape[0], 1).fill_(1.0), requires_grad=False).to(rank)
            fake = Variable(Tensor(imgs.shape[0], 1).fill_(0.0), requires_grad=False).to(rank)

            # Configure input
            real_imgs = Variable(imgs.type(Tensor)).to(rank)

            # -----------------
            #  Train Generator
            # -----------------
            optimizer_G.zero_grad()

            # Sample noise as generator input
            z = Variable(Tensor(np.random.normal(0, 1, (imgs.shape[0], opt.latent_dim)))).to(rank)

            # Generate a batch of images
            gen_imgs = generator_d(z)

            # Loss measures generator's ability to fool the discriminator
            g_loss = adversarial_loss(discriminator_d(gen_imgs), valid)
            g_loss.backward()
            optimizer_G.step()

            # ---------------------
            #  Train Discriminator
            # ---------------------
            optimizer_D.zero_grad()

            # Measure discriminator's ability to classify real from generated samples
            real_loss = adversarial_loss(discriminator_d(real_imgs), valid)
            fake_loss = adversarial_loss(discriminator_d(gen_imgs.detach()), fake)
            d_loss = ((real_loss + fake_loss) / 2).to(rank)
            d_loss.backward()
            optimizer_D.step()
I encountered a similar error when trying to train a GAN with DistributedDataParallel.
I noticed the problem was coming from the BatchNorm layers in my discriminator.
Indeed, DistributedDataParallel synchronizes the batch-norm buffers (the running statistics) at each forward pass (see the docs), thereby modifying the variable in place, which causes problems if you have multiple forward passes in a row.
Converting my BatchNorm layers to SyncBatchNorm did the trick for me:
discriminator = torch.nn.SyncBatchNorm.convert_sync_batchnorm(discriminator)
discriminator = DDP(discriminator, device_ids=[rank])
You probably want to do it anyway when using DistributedDataParallel.
Alternatively, if you don't want to use SyncBatchNorm, you can set the broadcast_buffers parameter to False, but I don't think you really want to do that, as it means your batch-norm stats will not be synchronized among processes.
discriminator = DDP(discriminator, device_ids=[rank], broadcast_buffers=False)

Phylogenetic model using multiple entries for each species

I am relatively new to phylogenetic regression models. In the past I used PGLS when I had only one entry for each species in my tree. Now I have a dataset with thousands of records for a total of 9 species, and I would like to run a phylogenetic model. I read the tutorials of the most common packages (e.g. caper), but I am unsure how to build the model.
When I try to create the object for caper, i.e. using:
obj <- comparative.data(phy = Tree, data = Data, names.col = species,
                        vcv = TRUE, na.omit = FALSE, warn.dropped = TRUE)
I get the message:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': 'Species1', 'Species2',
'Species3', 'Species4', 'Species5', 'Species6', 'Species7', 'Species8',
'Species9'
I understand that I may be able to solve this with an MCMCglmm model, but I am unfamiliar with Bayesian models.
Thanks in advance for your help.
This is indeed not going to work with a simple PGLS from caper, because it cannot deal with individuals as a random effect. I suggest you use MCMCglmm, which is not much more complex to understand and will allow you to have individuals as a random effect. You can find excellent documentation by the package's author here or here, or alternative documentation dealing with some more specific aspects of the package (namely tree uncertainty) here.
Really briefly to get you going:
## Your comparative data
comp_data <- comparative.data(phy = my_tree, data = my_data,
                              names.col = species, vcv = TRUE)
Note that you can have a specimen column, which can look like this:
taxa var1 var2 specimen
1 A 0.08730689 a spec1
2 B 0.47092692 a spec1
3 C -0.26302706 b spec1
4 D 0.95807782 b spec1
5 E 2.71590217 b spec1
6 A -0.40752058 a spec2
7 B -1.37192856 a spec2
8 C 0.30634567 b spec2
9 D -0.49828379 b spec2
10 E 1.42722363 b spec2
You can then set up your formula (similar to a simple lm formula):
## Your formula
my_formula <- variable1 ~ variable2
And your MCMC settings:
## Setting the prior list (see the MCMCglmm course notes for details)
prior <- list(R = list(V = 1, nu = 0.002),
              G = list(G1 = list(V = 1, nu = 0.002)))
## Setting the MCMC parameters
## Number of iterations
nitt <- 12000
## Length of burnin
burnin <- 2000
## Amount of thinning
thin <- 5
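With these settings the chain will keep (nitt - burnin) / thin posterior samples, which you can verify directly:
## Number of posterior samples that will be stored
(nitt - burnin) / thin
# [1] 2000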
And you should then be able to run a default MCMCglmm:
## Extracting the comparative data
mcmc_data <- comp_data$data
## MCMCglmm requires a column named "animal" to identify the model as
## phylogenetic, so we add an extra column containing the species names.
mcmc_data <- cbind(animal = rownames(mcmc_data), mcmc_data)
mcmc_tree <- comp_data$phy
## The MCMCglmm model
mod_mcmc <- MCMCglmm(fixed = my_formula,
                     random = ~ animal + specimen,
                     family = "gaussian",
                     pedigree = mcmc_tree,
                     data = mcmc_data,
                     nitt = nitt,
                     burnin = burnin,
                     thin = thin,
                     prior = prior)
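Once the model has run, the usual MCMCglmm checks apply (a brief sketch; the fixed-effect samples are stored in $Sol and the variance components in $VCV):
## Summary of the fixed and random effects
summary(mod_mcmc)
## Trace and density plots for the fixed effects (check mixing)
plot(mod_mcmc$Sol)
## Effective sample sizes of the variance components
coda::effectiveSize(mod_mcmc$VCV)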

a function for bigglm model selection like dredge working for glm

I was using glm with dredge in the MuMIn package. But now, since my data is large, I am using bigglm from the biglm package. How do I do model selection, since dredge does not work with bigglm? Is there another package I can use to achieve this?
On applying dredge to the bigglm model I receive the following error:
Error in nobs.default(global.model) : no 'nobs' method is available
dredge relies on the availability of a logLik method for the given model class. big[g]lm objects do not provide one, and there seems to be a long-known bug in the AIC method for the big[g]lm class that makes it impossible to calculate the log-likelihood from it (it uses deviance rather than log-likelihood to calculate AIC, so the AIC values are not comparable to other model types; see here: AIC different between biglm and lm).
You could try adding the missing methods (using deviance instead of log-likelihood, which may be slippery):
# incorrect if any prior weights are 0
nobs.biglm <- function(object, ...) object$n

logLik.bigglm <- function(object, ...) {
    dev <- deviance(object, ...)
    df <- object$n - object$df.resid
    structure(dev, df = df, nobs = object$n)
}

coefTable.biglm <- function(model, data, ...) {
    ct <- summary(model)$mat[, c(1L, 4L, 5L), drop = FALSE]
    .makeCoefTable(ct[, 1L], se = ct[, 2L], df = model$df.resid,
                   coefNames = rownames(ct))
}
environment(coefTable.biglm) <- asNamespace("MuMIn")
# from example(bigglm)
fm <- bigglm(log(Volume) ~ log(Girth) + log(Height), data = trees,
             chunksize = 10, sandwich = TRUE)
dredge(fm, rank = AIC)
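If these methods work for your model, dredge returns a regular MuMIn model-selection table and the usual downstream steps apply (a sketch; keep in mind the AIC values here are deviance-based, so they are only comparable among these bigglm fits):
## Keep the table and inspect the best-supported models
dd <- dredge(fm, rank = AIC)
subset(dd, delta < 2)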

partial Distance Based RDA - Centroids vanished from Plot

I am trying to fit a partial db-RDA with field.ID as a condition, to correct for the repeated-measures character of the samples. However, including Condition(field.ID) leads to the disappearance of the centroids of the main factor of interest from the plot (left plot below).
The design: 12 fields have been sampled for species data, repeatedly, in two consecutive years. Additionally, every year 3 samples from reference fields have been taken. These three fields were changed in the second year, due to unavailability of the former fields.
Additionally, some environmental variables have been sampled (nitrogen, soil moisture, temperature). Every field has an identifier (field.ID).
Using field.ID as Condition seems to erroneously remove the F1 factor. However, using the sampling campaign (SC) as Condition does not. Is the latter the right way to correct for repeated measurements in a partial db-RDA?
set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12, 13, 14, 15, 1:12, 16, 17, 18)),
                     SC = factor(rep(c(1, 2), each = 15)),
                     F1 = factor(rep(rep(c("A", "B", "C", "D", "E"), each = 3), 2)),
                     Nitrogen = rnorm(30, mean = 0.16, sd = 0.07),
                     Temp = rnorm(30, mean = 13.5, sd = 3.9),
                     Moist = rnorm(30, mean = 19.4, sd = 5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
                     Spec2 = rpois(30, 1),
                     Spec3 = rpois(30, 4.5),
                     Spec4 = rpois(30, 3),
                     Spec5 = rpois(30, 7),
                     Spec6 = rpois(30, 7),
                     Spec7 = rpois(30, 5))
data <- cbind(df.exp, df.rsp)

dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp)
ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp)
ordiplot(dbRDA)
You partial out the variation due to ID and then try to explain a variable aliased to this ID, but that variation was already partialled out. The key line in the printed output is this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get:
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1 dummy variables were constant within ID, which had already been partialled out, so there was nothing left for them to explain.
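You can see this nesting directly in the simulated data: every field always carries the same F1 level, so once field.ID is partialled out no F1 variation remains (a quick check):
## Cross-tabulation: each field.ID row has counts in only one F1 column
with(df.exp, table(field.ID, F1))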