gtsummary::tbl_regression() - Obtain Random Effects from GLMM Zero-Inflated Model - gtsummary

When trying to create a table with the conditional random effects in r using the gtsummary function tbl_regression from a glmmTMB mixed effects negative-binomial zero-inflated model, I get duplicate random effects rows.
Example (using Mollie Brooks' Zero-Inflated GLMMs on Salamanders Dataset):
data(Salamanders)
head(Salamanders)
library(glmmTMB)
zinbm2 = glmmTMB(count~spp + mined +(1|site), zi=~spp + mined + (1|site), Salamanders, family=nbinom2)
zinbm2_table_cond <- tbl_regression(
zinbm2,
tidy_fun = function(...) broom.mixed::tidy(..., component = "cond"),
exponentiate = TRUE,
estimate_fun = purrr::partial(style_ratio, digits = 3),
pvalue_fun = purrr::partial(style_sigfig, digits = 3))
zinbm2_table_cond
Output:
Random Effects Output (cond)
When extracting the random effects from de zero-inflated part of the model I get the same problem.
Example:
zinbm2_table_zi <- tbl_regression(
zinbm2,
tidy_fun = function(...) broom.mixed::tidy(..., component = "zi"),
exponentiate = TRUE,
estimate_fun = purrr::partial(style_ratio, digits = 3),
pvalue_fun = purrr::partial(style_sigfig, digits = 3))
zinbm2_table_zi
Output:
Random Effects Output (zi)
The problem persists if I specify the effects argument in broom.mixed.
tidy_fun = function(...) broom.mixed::tidy(..., effects = "ran_pars", component = "cond"),
Looking at confidence intervals in both outputs it seems that somehow it is extracting random effects from both parts of the model and changing the estimate of the zero-inflated random effects (in 1st image; opposite in the 2nd image) to match the conditional part estimate while keeping the CI.
I am not knowledgeable enough to understand why this is happening. Since both rows have the same label I am having difficulty removing the wrong one.
Any tips on how to avoid this problem or a workaround to remove the undesired rows?
If you need more info, let me know.
Thank you in advance.
PS: Output images were changed to link due to insufficient reputation.

Related

Issues with ordiplot3d NMDS in 3dvegan package

I am looking for some help here with this 3d NMDS code. I have 3 issues.
The layout of the plot moves significantly each time I execute the code.
The sites and species are sometimes far off of the plot.
The species text is often overlapping. How can I fix this?
I am unsure how to change the plotting environment to ggplot, so that might be out of the question.
library(vegan)
library(vegan3d)
library(tidyverse)
data("dune")
SiteID <- 1:20
NMDS = metaMDS(dune,distance="bray", try=500, wascores = TRUE, k=3)
NMDS1 = NMDS$points[,1]
NMDS2 = NMDS$points[,2]
NMDS3 = NMDS$points[,3]
NMDS = data.frame(NMDS1 = NMDS1, NMDS2 = NMDS2, NMDS3 = NMDS3, SiteID=SiteID)
NMDS_input <- metaMDS(dune,distance="bray",try=500,k=3,wascores = T)
pl4 <- with(NMDS, ordiplot3d(NMDS_input, pch=16, angle=50, main="Fish ion level 3", cex.lab=1.7,cex.symbols=1.5, tick.marks=FALSE))
sp <- scores(NMDS_input, choices=1:3, display="species", scaling="symmetric")
si <- scores(NMDS_input, choices=1:3, display="sites", scaling="symmetric")
text(pl4$xyz.convert(sp), rownames(sp), cex=0.7, xpd=TRUE)
sii <- as.data.frame(cbind(NMDS$SiteID,si))
with(NMDS, orditorp(pl4, labels = sii$V1, air=1, cex = 1))
labels must be character variables in orditorp. We always assumed so, but this was not checked in vegan::orditorp. Latest vegan version in github will take care of this and will also work with numeric labels.
ordiplot3d returns projected coordinates (in 2D) and if you want to plot those, you can just use the pl4 object that you saved and you do not need to use pl4$xyz.convert. This object will also be accepted in orditorp.
If you want to plot points that were not used in the original mock-3D plot, you must use pl4$xyz.convert for their 2D projection. This function will return the projected coordinates in a form that is directly accepted by standard R functions text, points (and some others), but they will not be accepted by orditorp (and I won't change this). You must make these into two-column matrix-like object; data.frame() will work.
Your example code contains a lot of un-needed code. The following is an edit with only necessary lines and fixes that make this example work with current vegan release.
library(vegan)
library(vegan3d)
data(dune)
SiteID <- as.character(1:20) # must be character
NMDS_input <- metaMDS(dune,distance="bray",try=500,k=3,wascores = T)
pl4 <- ordiplot3d(NMDS_input, pch=16, angle=50, main="Fish ion level 3", cex.lab=1.7,cex.symbols=1.5, tick.marks=FALSE) # no with(NMDS,...)
sp <- scores(NMDS_input, choices=1:3, display="species") # no arg scaling in scores.metaMDS
text(pl4$xyz.convert(sp), rownames(sp), cex=0.7, xpd=TRUE)
orditorp(pl4, labels = SiteID, air=1, cex = 1) # character labels w/points in the same location

Pytorch minibatching keeps model from training

I am trying to classify sequences by a binary feature. I have a dataset of sequence/label pairs and am using a simple one-layer LSTM to classify each sequence. Before I implemented minibatching, I was getting reasonable accuracy on a test set (80%), and the training loss would go from 0.6 to 0.3 (averaged).
I implemented minibatching, using parts of this tutorial: https://pytorch.org/tutorials/beginner/chatbot_tutorial.html
However, now my model won’t do better than 70-72% (70% of the data has one label) with batch size set to 1 and all other parameters exactly the same. Additionally, the loss starts out at 0.0106 and quickly gets really really small, with no significant change in results. I feel like the results between no batching and batching with size 1 should be the same, so I probably have a bug, but for the life of me I can’t find it. My code is below.
Training code (one epoch):
for i in t:
model.zero_grad()
# prep inputs
last = i+self.params['batch_size']
last = last if last < len(train_data) else len(train_data)
batch_in, lengths, batch_targets = self.batch2TrainData(train_data[shuffled][i:last], word_to_ix, label_to_ix)
iters += 1
# forward pass.
tag_scores = model(batch_in, lengths)
# compute loss, then do backward pass, then update gradients
loss = loss_function(tag_scores, batch_targets)
loss.backward()
# Clip gradients: gradients are modified in place
nn.utils.clip_grad_norm_(model.parameters(), 50.0)
optimizer.step()
Functions:
def prep_sequence(self, seq, to_ix):
idxs = [to_ix[w] for w in seq]
return torch.tensor(idxs, dtype=torch.long)
# transposes batch_in
def zeroPadding(self, l, fillvalue=0):
return list(itertools.zip_longest(*l, fillvalue=fillvalue))
# Returns padded input sequence tensor and lengths
def inputVar(self, batch_in, word_to_ix):
idx_batch = [self.prep_sequence(seq, word_to_ix) for seq in batch_in]
lengths = torch.tensor([len(idxs) for idxs in idx_batch])
padList = self.zeroPadding(idx_batch)
padVar = torch.LongTensor(padList)
return padVar, lengths
# Returns all items for a given batch of pairs
def batch2TrainData(self, batch, word_to_ix, label_to_ix):
# sort by dec length
batch = batch[np.argsort([len(x['turn']) for x in batch])[::-1]]
input_batch, output_batch = [], []
for pair in batch:
input_batch.append(pair['turn'])
output_batch.append(pair['label'])
inp, lengths = self.inputVar(input_batch, word_to_ix)
output = self.prep_sequence(output_batch, label_to_ix)
return inp, lengths, output
Model:
class LSTMClassifier(nn.Module):
def __init__(self, params, vocab_size, tagset_size, weights_matrix=None):
super(LSTMClassifier, self).__init__()
self.hidden_dim = params['hidden_dim']
if weights_matrix is not None:
self.word_embeddings = nn.Embedding.from_pretrained(weights_matrix)
else:
self.word_embeddings = nn.Embedding(vocab_size, params['embedding_dim'])
self.lstm = nn.LSTM(params['embedding_dim'], self.hidden_dim, bidirectional=False)
# The linear layer that maps from hidden state space to tag space
self.hidden2tag = nn.Linear(self.hidden_dim, tagset_size)
def forward(self, batch_in, lengths):
embeds = self.word_embeddings(batch_in)
packed = nn.utils.rnn.pack_padded_sequence(embeds, lengths)
lstm_out, _ = self.lstm(packed)
outputs, _ = nn.utils.rnn.pad_packed_sequence(lstm_out)
tag_space = self.hidden2tag(outputs)
tag_scores = F.log_softmax(tag_space, dim=0)
return tag_scores[-1]
For anyone else with a similar issue, I got it to work. I removed the log_softmax calculation, so this:
tag_space = self.hidden2tag(outputs)
tag_scores = F.log_softmax(tag_space, dim=0)
return tag_scores[-1]
becomes this:
tag_space = self.hidden2tag(outputs)
return tag_space[-1]
I also changed NLLLoss to CrossEntropyLoss, (not shown above), and initialized CrossEntropyLoss with no parameters (aka no ignore_index).
I am not certain why these changes were necessary (the docs even say that NLLLoss should be run after a log_softmax layer), but they got my model working and brought my loss back to a reasonable range (~0.5).

Testing a tensorflow network: in_top_k() replacement for multilabel classification

I've created a neural network in tensorflow. This network is multilabel. Ergo: it tries to predict multiple output labels for one input set, in this case three. Currently I use this code to test how accurate my network is at predicting the three labels:
_, indices_1 = tf.nn.top_k(prediction, 3)
_, indices_2 = tf.nn.top_k(item_data, 3)
correct = tf.equal(indices_1, indices_2)
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
percentage = accuracy.eval({champion_data:input_data, item_data:output_data})
That code works fine. The problem is now that I'm trying to create code that tests if the top 3 items it finds in indices_1 are amongst the top 5 images in indices_2. I know tensorflow has an in_top_k() method, but as far as I know that doesn't accept multilabel. Currently I've been trying to compare them using a for loop:
_, indices_1 = tf.nn.top_k(prediction, 5)
_, indices_2 = tf.nn.top_k(item_data, 3)
indices_1 = tf.unpack(tf.transpose(indices_1, (1, 0)))
indices_2 = tf.unpack(tf.transpose(indices_2, (1, 0)))
correct = []
for element in indices_1:
for element_2 in indices_2:
if element == element_2:
correct.append(True)
else:
correct.append(False)
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
percentage = accuracy.eval({champion_data:input_data, item_data:output_data})
However, that doesn't work. The code runs but my accuracy is always 0.0.
So I have one of two questions:
1) Is there an easy replacement for in_top_k() that accepts multilabel classification that I can use instead of custom writing code?
2) If not 1: what am I doing wrong that results in me getting an accuracy of 0.0?
When you do
correct = tf.equal(indices_1, indices_2)
you are checking not just whether those two indices contain the same elements but whether they contain the same elements in the same positions. This doesn't sound like what you want.
The setdiff1d op will tell you which indices are in indices_1 but not in indices_2, which you can then use to count errors.
I think being too strict with the correctness check might be what is causing you to get a wrong result.

partial Distance Based RDA - Centroids vanished from Plot

I am trying to fir a partial db-RDA with field.ID to correct for the repeated measurements character of the samples. However including Condition(field.ID) leads to Disappearance of the centroids of the main factor of interest from the plot (left plot below).
The Design: 12 fields have been sampled for species data in two consecutive years, repeatedly. Additionally every year 3 samples from reference fields have been sampled. These three fields have been changed in the second year, due to unavailability of the former fields.
Additionally some environmental variables have been sampled (Nitrogen, Soil moisture, Temperature). Every field has an identifier (field.ID).
Using field.ID as Condition seem to erroneously remove the F1 factor. However using Sampling campaign (SC) as Condition does not. Is the latter the rigth way to correct for repeated measurments in partial db-RDA??
set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12,13,14,15,1:12,16,17,18)),
SC = factor(rep(c(1,2), each=15)),
F1 = factor(rep(rep(c("A","B","C","D","E"),each=3),2)),
Nitrogen = rnorm(30,mean=0.16, sd=0.07),
Temp = rnorm(30,mean=13.5, sd=3.9),
Moist = rnorm(30,mean=19.4, sd=5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
Spec2 = rpois(30,1),
Spec3 = rpois(30,4.5),
Spec4 = rpois(30,3),
Spec5 = rpois(30,7),
Spec6 = rpois(30,7),
Spec7 = rpois(30,5))
data=cbind(df.exp, df.rsp)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp); ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp); ordiplot(dbRDA)
You partial out variation due to ID and then you try to explain variable aliased to this ID, but it was already partialled out. The key line in the printed output was this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1? variables were constant within ID which already was partialled out, and nothing was left to explain.

Trying to balance my dataset through sample_weight in scikit-learn

I'm using RandomForest for classification, and I got an unbalanced dataset, as: 5830-no, 1006-yes. I try to balance my dataset with class_weight and sample_weight, but I can`t.
My code is:
X_train,X_test,y_train,y_test = train_test_split(arrX,y,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
But I don't get any improvement on my ratios TPR, FPR, ROC when using class_weight and sample_weight.
Why? Am I doing anything wrong?
Nevertheless, if I use the function called balanced_subsample, my ratios obtain a great improvement:
def balanced_subsample(x,y,subsample_size):
class_xs = []
min_elems = None
for yi in np.unique(y):
elems = x[(y == yi)]
class_xs.append((yi, elems))
if min_elems == None or elems.shape[0] < min_elems:
min_elems = elems.shape[0]
use_elems = min_elems
if subsample_size < 1:
use_elems = int(min_elems*subsample_size)
xs = []
ys = []
for ci,this_xs in class_xs:
if len(this_xs) > use_elems:
np.random.shuffle(this_xs)
x_ = this_xs[:use_elems]
y_ = np.empty(use_elems)
y_.fill(ci)
xs.append(x_)
ys.append(y_)
xs = np.concatenate(xs)
ys = np.concatenate(ys)
return xs,ys
My new code is:
X_train_subsampled,y_train_subsampled=balanced_subsample(arrX,y,0.5)
X_train,X_test,y_train,y_test = train_test_split(X_train_subsampled,y_train_subsampled,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw)
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})
This is not a full answer yet, but hopefully it'll help get there.
First some general remarks:
To debug this kind of issue it is often useful to have a deterministic behavior. You can pass the random_state attribute to RandomForestClassifier and various scikit-learn objects that have inherent randomness to get the same result on every run. You'll also need:
import numpy as np
np.random.seed()
import random
random.seed()
for your balanced_subsample function to behave the same way on every run.
Don't grid search on n_estimators: more trees is always better in a random forest.
Note that sample_weight and class_weight have a similar objective: actual sample weights will be sample_weight * weights inferred from class_weight.
Could you try:
Using subsample=1 in your balanced_subsample function. Unless there's a particular reason not to do so we're better off comparing the results on similar number of samples.
Using your subsampling strategy with class_weight and sample_weight both set to None.
EDIT: Reading your comment again I realize your results are not so surprising!
You get a better (higher) TPR but a worse (higher) FPR.
It just means your classifier tries hard to get the samples from class 1 right, and thus makes more false positives (while also getting more of those right of course!).
You will see this trend continue if you keep increasing the class/sample weights in the same direction.
There is a imbalanced-learn API that helps with oversampling/undersampling data that might be useful in this situation. You can pass your training set into one of the methods and it will output the oversampled data for you. See simple example below
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=1)
x_oversampled, y_oversampled = ros.fit_sample(orig_x_data, orig_y_data)
Here it the link to the API: http://contrib.scikit-learn.org/imbalanced-learn/api.html
Hope this helps!