Combine Confidence Intervals and Odds Ratios + Adding Stars for P-Values in gtsummary - confidence-interval

Two simple questions here. First, is there a way to combine the confidence intervals and the odds ratios into a single column with gtsummary for the tbl_regression()/tbl_uvregression() functions? Second, is it possible to include stars to indicate the significance level for the p-values (i.e., * < .05, ** < .01, *** < .005) so I do not have to have a p-value column whenever I make a table? Thanks

As of gtsummary v1.4.0, this is possible with the add_significance_stars() function.
library(gtsummary)
#> #Uighur
packageVersion("gtsummary")
#> [1] '1.4.0'
tbl <-
  lm(marker ~ age + grade, trial) %>%
  tbl_regression() %>%
  add_significance_stars(
    pattern = "{estimate} ({conf.low}, {conf.high}){stars}",
    hide_se = TRUE
  ) %>%
  modify_header(estimate ~ "**Beta (95% CI)**")
Created on 2021-04-14 by the reprex package (v2.0.0)
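The question also asks about tbl_uvregression(); the same pattern should carry over to univariable tables. Below is a minimal, untested sketch using the trial data that ships with gtsummary (the logistic model and the covariates included here are only placeholders):
library(gtsummary)

tbl_uv <-
  trial %>%
  tbl_uvregression(
    method = glm,
    y = response,
    method.args = list(family = binomial),
    exponentiate = TRUE,
    include = c(age, grade)
  ) %>%
  add_significance_stars(
    pattern = "{estimate} ({conf.low}, {conf.high}){stars}",
    hide_se = TRUE
  ) %>%
  modify_header(estimate ~ "**OR (95% CI)**")
If I recall correctly, add_significance_stars() also hides the p-value column by default, which covers the second part of the question; see its hide_p argument.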

Related

na.approx and na.locf not behaving properly

I'm trying to calculate imputed values for a time series for different countries. This piece of code worked fine before, but now the imputed values are all wrong. I can't figure out the problem; I've tried everything I could think of.
Our rules are:
Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
If values are missing in the middle of a time series, linear interpolation is used.
# load libraries for imputation and data manipulation
library(zoo)
library(dplyr)
library(tidyr)

# expand table to show NAs
output_table_imp <- expand(output_table, transport_mode, year, country_code)
output_table_imp <- full_join(output_table_imp, output_table)

# add imputed values
output_table_imp <- output_table_imp %>%
  group_by(transport_mode, country_code) %>%
  mutate(fatalities_imp = na.approx(fatalities, na.rm = FALSE)) %>%                 # linear interpolation
  mutate(fatalities_imp = na.locf(fatalities_imp, na.rm = FALSE)) %>%               # missing values at the end of a time series (copy last non-NA value)
  mutate(fatalities_imp = na.locf(fatalities_imp, fromLast = TRUE, na.rm = FALSE))  # missing values at the start of a time series (copy first non-NA value)
My data frame consists of a couple of columns: transport_mode, country_code, year, fatalities. I'm not sure how I can share my data here; it's a large table with 3600 observations.
These are the original numbers:
And these are the imputed values. You can see straight away that there is a problem for CY, IE and LT.
The data frame looks like this:
Your code looks somewhat overly complicated. I don't know the zoo details, but I'm pretty sure you could also get it to work.
With the imputeTS package you can just take your whole data.frame (it assumes each column is a separate time series) and the package performs the imputation for each of these series.
(Unfortunately your post has no data, but I guess this would be your output_table_imp data.frame after expansion.)
Just like this:
library("imputeTS")
na_interpolation(output_table_imp, option = "linear")
We also don't have to change anything for the NA treatment at the beginning and at the end, since your requirements are the defaults of the na_interpolation function.
These were your requirements:
Values missing at the end of a time series are given the last known value in the series.
Values missing at the beginning of a time series are given the first known value in the series.
Here a toy example:
# Test time series with NAs at start, middle, end
test <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
# Perform linear interpolation
na_interpolation(test, option = "linear")
# Results
#> [1] 1 1 1 2 3 4 5 6 7 8 8 8
As you can see, this works perfectly fine.
It also works with a data.frame (as I said, each column is interpreted as a separate time series):
# Create three time series and combine them into 1 data.frame
ts1 <- c(NA,NA,1,2,3,NA,NA,6,7,8,NA,NA)
ts2 <- c(NA,1,1,2,3,NA,3,6,7,8,NA,NA)
ts3 <- c(NA,3,1,2,3,NA,3,6,7,8,NA,NA)
df <- data.frame(ts1,ts2,ts3)
na_interpolation(df, option = "linear")
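Since the question's data is in long format (one fatalities column grouped by transport_mode and country_code), the same function can also be applied per group. A rough sketch, reusing the column names from the question (note that na_interpolation() needs at least two non-NA values per group):
library(dplyr)
library(imputeTS)

output_table_imp <- output_table_imp %>%
  group_by(transport_mode, country_code) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(fatalities_imp = na_interpolation(fatalities, option = "linear")) %>%
  ungroup()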

calculate time between first and last var separated by subgroup in Tableau

I have a dataset which has From-To timestamps and an Id. Within one Id there are multiple From-To rows in sequence, and I want to calculate the time difference from the first observation to the last observation for each Id.
In R it looks like:
tmp2 <- tmp %>% group_by(Id) %>% slice(c(1,n())) %>% ungroup()
Any help would be highly appreciated.
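For reference, completing that calculation on the R side could look like the sketch below (a rough sketch only; the From and To timestamp column names are assumptions about the data, not taken from it):
library(dplyr)

# time span per Id: difference between the earliest From and the latest To
tmp2 <- tmp %>%
  group_by(Id) %>%
  summarise(duration = difftime(max(To), min(From), units = "mins")) %>%
  ungroup()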

How do I jointly test a multi-level categorical effect in IPython using statsmodels?

I am using the ordinary least squares (ols) function in statsmodels in IPython to fit a linear model where one covariate (City) is a multi-level categorical effect:
result = smf.ols(formula="Y ~ C(City) + X*C(Group)", data=s).fit()
(X is continuous, Group is a binary categorical variable.)
When I do result.summary(), I get one row per level of City. However, what I would like to know is the overall significance of the 'City' covariate (i.e., compare Y ~ C(City) + X*C(Group) with the partial model Y ~ X*C(Group)).
Is there a way of doing it?
Thanks in advance
Thank you user333700!
Here's an elaboration of your hint. I generate data with a 3-level categorical variable, use statsmodels to fit a model, and then test all levels of the categorical variable jointly:
# required imports
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# 1. generate data
def rnorm(n, u, s):
    return np.random.standard_normal(n) * s + u

a = rnorm(100, -1, 1)
b = rnorm(100, 0, 1)
c = rnorm(100, +1, 1)
n = rnorm(300, 0, 1)   # some noise
y = np.concatenate((a, b, c)) + n
g = np.zeros(300)
g[0:100] = 1
g[100:200] = 2
g[200:300] = 3
df = pd.DataFrame({'Y': y, 'G': g, 'N': n})

# 2. fit model
r = smf.ols(formula="Y ~ N + C(G)", data=df).fit()
r.summary()

# 3. joint test
print(r.params)
A = np.identity(len(r.params))  # identity matrix with size = number of params
GroupTest = A[1:3, :]           # for the categorical var., keep the corresponding rows of A
CovTest = A[3, :]               # row for the continuous var.
print("Group effect test", r.f_test(GroupTest).fvalue)
print("Covariate effect test", r.f_test(CovTest).fvalue)
The result should be something like this:
Intercept -1.188975
C(G)[T.2.0] 1.315898
C(G)[T.3.0] 2.137431
N 0.922038
dtype: float64
Group effect test [[ 120.86097747]]
Covariate effect test [[ 259.34155851]]
Brief answer:
You can use anova_lm (with type 3) directly, or use f_test or wald_test and either construct the constraint matrix yourself or provide the constraints of the hypothesis as a sequence of formulas.
http://statsmodels.sourceforge.net/devel/anova.html
http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.f_test.html

Software solutions for visualizing similarity/dissimilarity between pairs of people

I have calculations of the similarity/dissimilarity between every pair of ~1200 people, on a scale of 0-1.
I would like to visualize these relationships on an X-Y plane. Are there any software tools that can take these relationships and put people close to each other when they have high similarity and far away from each other when they have high dissimilarity?
You need to use multidimensional scaling (MDS). The objective is to find a transformation of your data that expresses the relative differences in similarity as linear distances in two dimensions. You will want to use classical (metric) scaling for this.
Here is an example with R:
The code is fairly straightforward. The magic is all in cmdscale, plus the use of vegdist to create the distance matrix for cmdscale. Then you can use R, or export the data somewhere else, for visualization.
## load libraries
library(ggplot2)  # for charting
library(vegan)    # for jaccard

## simulate some data - 1200 rows, 5 features/cols/fields
features <- matrix(abs(rnorm(1200 * 5)), ncol = 5)
rownames(features) <- paste("P", 1:1200, sep = "")

## calculate jaccard distances
d <- vegdist(features, method = "jaccard")

## Multidimensional Scaling
fit <- cmdscale(d, eig = TRUE, k = 2)

# prepare the data for plotting
mds <- data.frame(
  x = fit$points[, 1],
  y = fit$points[, 2],
  name = rownames(features))

# plot
ggplot(mds, aes(x = x, y = y, label = name)) + geom_text(size = 1)

## bonus visualization! - a dendrogram
fit <- hclust(d, method = "ward.D")
plot(fit)
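Since the question starts from precomputed pairwise similarities rather than raw features, here is a small sketch of going straight from a 0-1 similarity matrix to the MDS coordinates (the sim matrix below is simulated as a stand-in for the real data):
## sketch: from a symmetric similarity matrix (values in 0-1) to 2-D MDS coordinates
n <- 1200
sim <- matrix(runif(n * n), ncol = n)
sim <- (sim + t(sim)) / 2       # make it symmetric
diag(sim) <- 1                  # self-similarity

d <- as.dist(1 - sim)           # dissimilarity = 1 - similarity
fit <- cmdscale(d, eig = TRUE, k = 2)

plot(fit$points, xlab = "Dim 1", ylab = "Dim 2", pch = ".")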

How can I find low regions in a graph using Perl/R?

I'm examining some biological data which is basically a long list (a few million values) of integers, each saying how well this position in the genome is covered. Here is a graphical example for a data set:
I would like to look for "valleys" in this data, that is, regions which are significantly lower than their surrounding environment.
Note that the size of the valleys I'm looking for is not really known: it may range from 50 bases to a few thousand. Defining what counts as a valley is of course one of the questions I'm struggling with, but the examples shown above are relatively easy for me:
What kind of paradigms would you recommend using to find those valleys? I mainly program using Perl and R.
Thanks!
We do peak detection (and valley detection) using running medians and the median absolute deviation. You can specify how much deviation from the running median counts as a peak.
In a next step, we use a binomial model to check which regions contain more "extreme" values than can be expected. This model (basically a score test) results in "peak regions" instead of single peaks. Turning it around to get "valley regions" is trivial.
The running median is calculated using the function weightedMedian from the package aroma.light. We use the embed() function to make a list of "windows" and apply a kernel function on it.
The application of the weighted median:
center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)
Here tmp is the temporary data vector and wdw the window size (which has to be odd). tmp is constructed by adding (wdw-1)/2 NA values at each side of the data vector. The weights are constructed using a customized function. For the MAD we use the same procedure, but on diff(data) instead of the data itself.
Running sample code:
require(aroma.light)

# make.weights : function to make weights on the basis of a kernel (normal by default)
# n is the window size (must be odd)
make.weights <- function(n,
                         type = c("gaussian", "epanechnikov", "biweight",
                                  "triweight", "cosinus")) {
  type <- match.arg(type)
  x <- seq(-1, 1, length.out = n)
  out <- switch(type,
    gaussian     = (1 / sqrt(2 * pi) * exp(-0.5 * (3 * x)^2)),
    epanechnikov = 0.75 * (1 - x^2),
    biweight     = 15 / 16 * (1 - x^2)^2,
    triweight    = 35 / 32 * (1 - x^2)^3,
    cosinus      = pi / 4 * cos(x * pi / 2)
  )
  out <- out / sum(out) * n
  return(out)
}

# score.test : function to obtain a p-value based on the score test.
# Uses a normal approximation, but is still quite accurate when p0 is
# pretty small.
# This test is one-sided, and tests whether the observed proportion
# is bigger than the hypothesized proportion.
score.test <- function(x, p0, w) {
  n <- length(x)
  if (missing(w)) w <- rep(1, n)
  w <- w[!is.na(x)]
  x <- x[!is.na(x)]
  if (sum(w) != n) w <- w / sum(w) * n
  phat <- sum(x * w) / n
  z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
  p <- 1 - pnorm(z)
  return(p)
}

# embed.na is a modification of embed(), adding NA values
# to the beginning and end of x. Window size = 2n+1.
embed.na <- function(x, n) {
  extra <- rep(NA, n)
  x <- c(extra, x, extra)
  out <- embed(x, 2 * n + 1)
  return(out)
}

# running.score : function to calculate the weighted p-value for the chance
# of being in a run of peaks. This chance is based on the weighted proportion
# of the neighbourhood.
# The null hypothesis is calculated by taking the weighted proportion
# of detected peaks in the whole dataset.
# This lessens the need for adjusting parameters and makes the
# method more automatic.
# For a correct calculation, the weights have to sum to n.
running.score <- function(sel, n = 20, w, p0) {
  if (missing(w)) w <- rep(1, 2 * n + 1)
  if (missing(p0)) p0 <- sum(sel, na.rm = TRUE) / length(sel[!is.na(sel)]) # null hypothesis
  out <- apply(embed.na(sel, n), 1, score.test, p0 = p0, w = w)
  return(out)
}

# running.med : function to calculate the running median and MAD
# for a dataset. Window size = 2n+1.
running.med <- function(x, w, n, cte = 1.4826) {
  wdw <- 2 * n + 1
  if (missing(w)) w <- rep(1, wdw)
  center <- apply(embed.na(x, n), 1, weightedMedian, w = w, na.rm = TRUE)
  mad <- median(abs(x - center)) * cte
  return(list(med = center, mad = mad))
}

##############################################
#
# Create series
set.seed(100)
n <- 1000
series <- diffinv(rnorm(20000), lag = 1)

# flag points lying in the lowest 5% of their window
peaks <- apply(embed.na(series, n), 1,
               function(x) x[n + 1] < quantile(x, probs = 0.05, na.rm = TRUE))

pweight <- make.weights(0.2 * n + 1)
p.val <- running.score(peaks, n = n / 10, w = pweight)

plot(series, type = "l")
points((1:length(series))[p.val < 0.05], series[p.val < 0.05], col = "red")
points((1:length(series))[peaks], series[peaks], col = "blue")
The sample code above was developed to find regions with big fluctuations rather than valleys. I adapted it a bit, but it's not optimal. On top of that, for series of more than 20,000 values you need a whole lot of memory; I can't run it on my computer any more.
Alternatively, you could work with an approximation of the numerical first and second derivatives to define valleys. In your case, this might even work better. A pragmatic way of calculating the derivatives and the minima/maxima of the first derivative:
# first derivative
f.deriv <- diff(lowess(series, f = n / length(series), delta = 1)$y)
# second derivative
f.sec.deriv <- diff(f.deriv)
# minima and maxima are defined by where f.sec.deriv changes sign
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)

op <- par(mfrow = c(2, 1))
plot(series, type = "l")
plot(f.deriv, type = "l")
points((1:length(f.deriv))[minmax], f.deriv[minmax], col = "red")
par(op)
You can define a valley by different criteria:
depth
width
volume (depth*width)
You might also have a valley inside a big mountain; do you want those too?
For example, there is a valley here: 1 2 3 4 1000 1000 800 800 800 1000 1000 500 200 3
Try to explain in more detail how YOU (or any expert in your field) would choose the valleys given the data.
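To make those criteria concrete, here is a rough sketch (mine, not part of the original answer) that treats valleys as runs of values below a baseline and scores each run by depth, width and volume; the plain median baseline and the toy vector are only placeholders:
# toy data: the example sequence from above
x <- c(1, 2, 3, 4, 1000, 1000, 800, 800, 800, 1000, 1000, 500, 200, 3)

baseline <- median(x)   # crude baseline; a running median would be better
below <- x < baseline

# contiguous runs below the baseline = candidate valleys
r <- rle(below)
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

valleys <- data.frame(start = starts[r$values], end = ends[r$values])
valleys$width  <- valleys$end - valleys$start + 1
valleys$depth  <- mapply(function(s, e) baseline - min(x[s:e]),
                         valleys$start, valleys$end)
valleys$volume <- mapply(function(s, e) sum(baseline - x[s:e]),
                         valleys$start, valleys$end)
valleys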
You might want to look at the watershed algorithm.
You might want to try a peak detection function to identify the regions of interest. The desired minimum width of the valleys can be specified with a span parameter.
It might be a good idea to smooth the data first, to get rid of the noise peaks like the one in the right "valley" of the blue graph. A simple stats::filter should be enough.
The final step would be to check the depth of the found "valleys". This really depends on your requirements. As a first approximation, you can simply compare the peak value with the median level of the data.
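A tiny sketch of that smoothing step, assuming coverage is the numeric vector of per-position coverage values (the window width and the 0.5 threshold are arbitrary placeholders):
# moving-average smoothing before looking for valleys
k <- 51                                              # odd window width
smoothed <- stats::filter(coverage, rep(1 / k, k), sides = 2)

# first-pass depth check: positions well below the overall median level
candidates <- which(smoothed < 0.5 * median(coverage, na.rm = TRUE))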