Set specified factor level as reference in GT regression? - gtsummary

I am using the gtsummary package to generate tables from logistic regressions.
I would like, for example, to use the stage level "T3" in the trial data as the reference level instead of the default "T1". How can I do that within this example code?
I aim to do this for both univariate and multivariable logistic regression, so I am presuming the answer will work in both scenarios.
library(gtsummary)
library(dplyr)

trial %>%
  dplyr::select(age, trt, marker, stage, response, death, ttdeath) %>%
  tbl_uvregression(
    method = glm,
    y = death,
    method.args = list(family = binomial),
    exponentiate = TRUE,
    pvalue_fun = function(x) style_pvalue(x, digits = 2)
  ) %>%
  # overrides the default that shows p-values for each level
  add_global_p() %>%
  # adjusts global p-values for multiple testing (default method: FDR)
  add_q() %>%
  # bold p-values under a given threshold (default 0.05)
  bold_p() %>%
  # now bold q-values under the threshold of 0.10
  bold_p(t = 0.10, q = TRUE) %>%
  bold_labels() %>%
  as_gt()
Sincerely,
nelly

I managed to solve my own problem by using the forcats function "fct_relevel" to set the desired level for a categorical variable as the reference.
trial$stage <- forcats::fct_relevel(trial$stage, "T3")
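If you would rather not overwrite the column in trial, the releveling can also be done inside the pipeline. A minimal sketch (same variables as in the question, with the add-on functions omitted for brevity):

library(gtsummary)
library(dplyr)

trial %>%
  dplyr::select(age, trt, marker, stage, response, death, ttdeath) %>%
  dplyr::mutate(stage = forcats::fct_relevel(stage, "T3")) %>%
  tbl_uvregression(
    method = glm,
    y = death,
    method.args = list(family = binomial),
    exponentiate = TRUE
  )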

Related

How to run an exponential decay mixed model?

I am not familiar with nonlinear regression and would appreciate some help with running an exponential decay model in R. Please see the graph for what the data look like. My hunch is that an exponential model might be a good choice. I have one fixed effect and one random effect: y ~ x + (1|random factor). How do I get the starting values for the exponential model (please assume that I know nothing about nonlinear regression) in R? How do I subsequently run a nonlinear model with these starting values? Could anyone please help me with the logic as well as the R code?
As I am not familiar with nonlinear regression, I haven't been able to attempt it in R.
[raw plot of the data]
The correct syntax will depend on your experimental design and model, but I hope to give you a general idea of how to get started.
We begin by generating some data that should match the type of data you are working with. You mentioned one fixed factor and one random factor; here, the fixed factor is represented by the variable treatment and the random factor by the variable grouping_factor.
library(nlraa)
library(nlme)
library(ggplot2)

## Setting this seed should allow you to reach the same result as me
set.seed(3232333)

example_data <- expand.grid(treatment = c("A", "B"),
                            grouping_factor = c('1', '2', '3'),
                            replication = c(1, 2, 3),
                            xvar = 1:15)
The next step is to create some "observations". Here, we use an exponential function y = a * exp(c * x) plus some random noise to create the data. We also add a constant to treatment A just to create a treatment difference.
example_data$y <- ave(example_data$xvar,
                      example_data[, c('treatment', 'replication', 'grouping_factor')],
                      FUN = function(x) expf(x = x, a = 10, c = -0.3) + rnorm(1, 0, 0.6))

example_data$y[example_data$treatment == 'A'] <- example_data$y[example_data$treatment == 'A'] + 0.8
All right, now we start fitting the model.
## Create a grouped data frame
exampleG <- groupedData(y ~ xvar | grouping_factor, data = example_data)

## Fit a separate model to each grouping level
fitL <- nlsList(y ~ SSexpf(xvar, a, c), data = exampleG)

## Use those fits to build the general nonlinear mixed-effects model
fit1 <- nlme(fitL)

## Grab the fixed-effect coefficients of the general model
fxf <- fixed.effects(fit1)

## Add treatment as a fixed effect. Also, use the coefficients from the
## previous model as starting values.
fit2 <- update(fit1, fixed = a + c ~ treatment,
               start = c(fxf[1], 0,
                         fxf[2], 0))
The model output will show you information like the following:
Nonlinear mixed-effects model fit by maximum likelihood
Model: y ~ SSexpf(xvar, a, c)
Data: exampleG
AIC BIC logLik
475.8632 504.6506 -229.9316
Random effects:
Formula: list(a ~ 1, c ~ 1)
Level: grouping_factor
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
a.(Intercept) 3.254827e-04 a.(In)
c.(Intercept) 1.248580e-06 0
Residual 5.670317e-01
Fixed effects: a + c ~ treatment
Value Std.Error DF t-value p-value
a.(Intercept) 9.634383 0.2189967 264 43.99329 0.0000
a.treatmentB 0.353342 0.3621573 264 0.97566 0.3301
c.(Intercept) -0.204848 0.0060642 264 -33.77976 0.0000
c.treatmentB -0.092138 0.0120463 264 -7.64867 0.0000
Correlation:
a.(In) a.trtB c.(In)
a.treatmentB -0.605
c.(Intercept) -0.785 0.475
c.treatmentB 0.395 -0.792 -0.503
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.93208903 -0.34340037 0.04767133 0.78924247 1.95516431
Number of Observations: 270
Number of Groups: 3
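If you also want a formal test of whether the treatment terms improve the model, one option is a likelihood-ratio comparison of fit1 and fit2 (a brief sketch; it assumes both models were fit by maximum likelihood, as the output above indicates):

## Compare the model without and with the treatment fixed effects
anova(fit1, fit2)

## Confidence intervals for the fixed effects of the final model
intervals(fit2, which = "fixed")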
Then, if you wanted to visualize the model fit, you could do the following.
## Here we store the model predictions for visualization purposes
predictionsDf <- cbind(example_data,
                       predict_nlme(fit2, interval = 'conf'))

## Here we make a graph to check it out
ggplot() +
  geom_ribbon(data = predictionsDf,
              aes(x = xvar, ymin = Q2.5, ymax = Q97.5, fill = treatment),
              color = NA, alpha = 0.3) +
  geom_point(data = example_data, aes(x = xvar, y = y, col = treatment)) +
  geom_line(data = predictionsDf, aes(x = xvar, y = Estimate, col = treatment),
            size = 1.1)
This shows the model fit.

Alternative to Silhouette scores to avoid incorrect conclusions

I am using silhouette scores as a post-hoc measure of cluster validity for clusters derived from DBSCAN, but the metric fails to accurately capture what is happening in a particular situation that occurs in my data, and I am looking for alternatives.
The issue occurs when there are two nearby but clearly separated clusters, which the silhouette score marks as not well distinguished. I present an example below. Is there an alternative clustering metric that might resolve this issue with silhouette scores?
library(dplyr)
library(ggplot2)

n <- 1000
x <- rnorm(n, mean = 0, sd = 0.5)
y <- rnorm(n, mean = 0, sd = 0.5)
split <- 0.05

df <- data.frame(x = x, y = y)
df <- df %>%
  dplyr::filter(x < -1 * split | x > split) %>%
  mutate(group = ifelse(x < split, 1, 2))

plot(df$x, df$y, col = factor(df$group))

## note: dist() needs the coordinate columns, not two separate vectors
silhouette_score <- cluster::silhouette(df$group, dist = dist(df[, c("x", "y")])) %>%
  as.data.frame()

silhouette_score %>%
  group_by(cluster) %>%
  summarise(mean(sil_width))

ggplot(silhouette_score, aes(x = sil_width)) +
  geom_histogram() +
  facet_grid(~cluster)
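There is no single drop-in replacement, but for comparison you could look at a few other internal indices, for example via fpc::cluster.stats (a sketch reusing df from above and assuming the fpc package is installed). Density-based criteria such as DBCV are also often suggested for validating DBSCAN results, since they score clusters by density separation rather than by average distances.

library(fpc)

d  <- dist(df[, c("x", "y")])          # distances on both coordinates
cs <- fpc::cluster.stats(d, df$group)

cs$avg.silwidth   # average silhouette width (the metric in question)
cs$dunn           # Dunn index: minimum between-cluster separation / maximum diameter
cs$separation     # per cluster: minimum distance to a point in another cluster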

Exponentiate linear regression coefficient using the gtsummary package

I want to exponentiate my natural-log estimates with the gtsummary package, but I am getting an error when I set exponentiate = TRUE.
Can someone kindly help me?

mu <- lm(log(iron) ~ treatment + log(erythroferrone) + log(epo) + log(crp) + log(hepciden),
         data = endline1) %>%
  tbl_regression(exponentiate = TRUE) %>%
  as_gt()

theme_gtsummary_journal(journal = "nejm")
Error in tidy_and_attach(., tidy_fun = tidy_fun, conf.int = conf.int, : exponentiate = TRUE is not valid for this type of model.
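For what it's worth, this error appears because exponentiate = TRUE is only accepted when the coefficients are on a log or logit scale by virtue of the model family itself (e.g. logistic or Poisson regression), not for an lm() fit to a log-transformed outcome. One possible workaround, sketched below and not tested on the endline1 data, is to exponentiate only the displayed estimates and confidence limits through estimate_fun:

library(gtsummary)

theme_gtsummary_journal(journal = "nejm")

lm(log(iron) ~ treatment + log(erythroferrone) + log(epo) + log(crp) + log(hepciden),
   data = endline1) %>%
  tbl_regression(
    # exponentiate the estimates and confidence limits for display only
    estimate_fun = function(x) gtsummary::style_ratio(exp(x), digits = 2)
  ) %>%
  as_gt()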

MNIST dataset boosting

I am trying to apply Gradient Boosting to the MNIST dataset. This is my code:
library(dplyr)
library(caret)

mnist <- snedata::download_mnist()

mnist_num <- as.data.frame(lapply(mnist[1:10000, ], as.numeric)) %>%
  mutate(id = row_number())

## keep only the columns that are not constant (zero variance)
mnist_num <- mnist_num[, sapply(mnist_num, function(x) max(x) - min(x) > 0)]

mnist_train <- sample_frac(mnist_num, .70)
mnist_test  <- anti_join(mnist_num, mnist_train, by = 'id')

set.seed(5000)
library(gbm)
boost_mnist <- gbm(Label ~ ., data = mnist_train, distribution = "bernoulli",
                   n.trees = 70, interaction.depth = 4, shrinkage = 0.3)
It shows the following error:
"Error in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, : Bernoulli requires the response to be in {0,1}"
What is wrong here? Can anyone show me the code to correctly do it?
The error

Error in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, : Bernoulli requires the response to be in {0,1}

is due to the choice of distribution: you should use "multinomial" instead of "bernoulli", because the Bernoulli distribution only works with a dichotomous (0/1) response, while the MNIST label takes ten values (coded 1 to 10 here).
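A sketch of what the corrected call could look like: for a multinomial fit the response should be a factor, and the id column that was only added for the train/test split should stay out of the model. (Recent gbm releases warn that multinomial support is deprecated, so a package such as xgboost may be a more robust choice for multiclass problems.)

library(gbm)
library(dplyr)

## Label back to a factor and drop the helper id column
train <- mnist_train %>% mutate(Label = factor(Label)) %>% select(-id)
test  <- mnist_test  %>% mutate(Label = factor(Label)) %>% select(-id)

set.seed(5000)
boost_mnist <- gbm(Label ~ ., data = train, distribution = "multinomial",
                   n.trees = 70, interaction.depth = 4, shrinkage = 0.3)

## predicted class = column with the largest predicted probability
probs <- predict(boost_mnist, newdata = test, n.trees = 70, type = "response")
pred  <- apply(probs[, , 1], 1, which.max)
table(pred, test$Label)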

Why CLARA clustering method does not give the same classes as when I do clustering manually?

I am using CLARA (in the 'cluster' package). This method is supposed to assign each observation to the closest medoid. But when I calculate the distances between medoids and observations manually and assign observations myself, the results are slightly different (roughly 1-2 percent of the observations end up in a different cluster). Does anyone know how clara calculates dissimilarities and why I get different clustering results?
This is the function I use to do clustering manually:
Manual.Clustering <- function(Data, Clusters, Weights = NULL) {
  if (is.null(Weights)) Weights <- rep(1, length(Data))
  if (length(Weights) == 1) Weights <- rep(Weights, length(Data))
  Data2 <- Data[, rownames(Clusters)]
  Data2 <- Weights * Data2
  dist <- matrix(NA, nrow = nrow(Data), ncol = ncol(Clusters))
  for (i in 1:ncol(Clusters)) {
    dist[, i] <- Dist2Center(Data2, Clusters[, i], Weights = NULL)
  }
  classes <- apply(dist, 1, which.min)
  Out <- cbind(Data, classes)
  colnames(Out) <- c(colnames(Data), "Class")
  Freq <- FreqTable(Out[, "Class"])
  Freq <- as.data.frame(Freq)
  return(list(Data = Out, Freq = Freq))
}

Dist2Center <- function(Data, Center, Weights = NULL) {
  if (is.null(Weights)) Weights <- matrix(rep(1, nrow(Data)), ncol = 1)
  if (length(Weights) == 1) Weights <- rep(Weights, nrow(Data))
  if (ncol(Data) != length(Center)) stop()
  Dist <- Weights * apply(Data, 1, function(x) sqrt(sum((x - Center)^2, na.rm = TRUE)))
  return(Dist)
}
Data: the original data.
Clusters: t(Medoids).
Medoids: the medoids picked by clara.
Dist2Center: a function that calculates the Euclidean distance of each observation from a given medoid.
Behnam.
I found that this happens only when the input data has NA values. For inputs without NAs, the results of my algorithm and clara are identical. I think this is related to how clara handles NA values when calculating the distances of observations to medoids. Any comments? Any suggestions for replacing clara with a better algorithm that is compatible with large datasets and NA values?
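One quick way to convince yourself of that is to compare the two assignments on complete cases only; without NAs, a hand-rolled nearest-medoid rule and clara's own clustering should agree (a sketch with made-up data, since the original Data is not shown):

library(cluster)

set.seed(1)
toy <- matrix(rnorm(400), ncol = 4, dimnames = list(NULL, paste0("V", 1:4)))

cl <- clara(toy, k = 3)

## nearest-medoid assignment computed by hand (Euclidean distance)
d2m    <- apply(cl$medoids, 1, function(m) sqrt(colSums((t(toy) - m)^2)))
manual <- apply(d2m, 1, which.min)

table(clara = cl$clustering, manual = manual)  # off-diagonal counts should be zero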
Having a look at the clara C code, I found that clara adjusts the distances if there are any missing values. The line "dsum *= (nobs / pp)" in the code shows that it counts the number of non-missing values in each pair of observations (nobs), divides that by the number of variables (pp), and multiplies the sum of squares by this factor. That is why it does not give the same results as my algorithm.
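If you want the manual function to reproduce this behaviour in the presence of NAs, a sketch of a distance that mirrors the quoted adjustment could look like the following (this follows my reading of that C line, so treat the scaling as an assumption rather than an official formula):

Dist2CenterNA <- function(Data, Center) {
  pp <- ncol(Data)
  apply(Data, 1, function(x) {
    ok   <- !is.na(x) & !is.na(Center)   # variables observed in both vectors
    nobs <- sum(ok)
    dsum <- sum((x[ok] - Center[ok])^2)  # sum of squares over those variables
    sqrt(dsum * (nobs / pp))             # scaling as in the quoted C line
  })
}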