Create average over range of values (Income range) in factor except one value - range

I want to create an average over the range of values in order to receive numerical values so that I can run a regression. However, when I try my code, the value "100000" gets excluded so that I don't have it in my data anymore.
Can anyone tell me how to create an average over the ranges except from "100000"?
df1$Income[df1$Income %in% c("Prefer not to say")] <- NA
df1$Income <- gsub("[,€]","", df1$Income)
df1$Income <- gsub("[[:alpha:]]","", df1$Income)
table(df1$Income)
df1$Inc <- factor(df1$Income, levels = c("0 – 9999", "10000 – 19999", "20000 – 29999", "30000 – 39999",
"40000 – 59999", "60000 – 79999", "80000 – 99999", "100000"), ordered = TRUE)
df1$Inc <- as.character(df1$Inc)
df1$av.Inc <- sapply(strsplit(df1$Inc, " – "), function(i) mean(as.numeric(i)))
table(df1$av.Inc)

Related

Gtsummary table values less than -1 outputting as 1.00?

Since installing the newest version of R, all my gtsummary table values less than -1 have been outputting to 1.00. Does anyone have insight on how to fix this very weird issue?
Here is example code:
library(tidyverse)
library(gtsummary)
library(haven)
library(mice)
library(googlesheets4)
data <- read_sheet("https://docs.google.com/spreadsheets/d/1yyw-0xseZSLjD4jc8sw7IksN-S0M3vcKWHy4ksMPL4c/edit?usp=sharing")
datami <- mice(data, m = 23, seed=10)
datareg <- with(datami,
lm(SUD ~ NUM + MIND +
AGE + SEX + CRAVE) )
table <- tbl_regression(datareg,
estimate_fun = purrr::partial(style_ratio, digits = 2),
pvalue_fun = ~style_pvalue(.x, digits = 2),
add_estimate_to_reference_rows = TRUE
) %>% modify_header(label="**Predictor**",estimate="**Unstandardized Coefficient**") %>%
modify_footnote(update = c(p.value, ci, estimate) ~ "Reference group")%>%
modify_caption("Table: Multiple Imputation Predicting Variable")
table
Have re-installed R & gtsummary multiple times to no avail.
You're using style_ratio() to round/format the estimates: this is meant to round odds ratios, risk ratios, etc., which are all positive numbers. Update this to use style_number().
I should update the ratio function to have better behavior when negative values are passed.

MCMCglmm questions: multiple species and ultrametric trees

These questions are related to my other question at Phylogenetic model using multiple entries for each species
Thanks to #thomas-guillerme, I was able to start running an MCMCglmm model.
Although I had no problem running some of my example files in which I had a single entry for each of the species in my tree, I found an error message when trying to run my original dataset, which consists of thousands of entries for each of the species in my tree. When running:
comp_data <- comparative.data(phy = my_tree, data =my_data, names.col = species, vcv = TRUE)’
I got an error:
'Error in row.names<-.data.frame(tmp, value = value) : duplicate
'row.names' are not allowed In addition: Warning message: non-unique
values when setting 'row.names': ‘Species1’, ‘Species2’,
‘Species3’, ‘Species4’,...
I was surprised because I am using MCMCglmm and not PGLS because of the chance of using multiple entries for each species.
I tried the workaround of make the species name unique but in that case only the first entry of each species is recognized later in the model (because it corresponds with the name in my_tree).
Moreover, I had problems with having my tree recognized as ultrametric. I checked it using
'is.ultrametric(my_tree)'
Got:
FALSE
I tried:
function (phy) { if(any(is.ultrametric(my_tree)) == FALSE) { my_tree <- lapply(my_tree, chronoMPL) class(my_tree) <- "Phylo"
}
}
But these lines apparently do not solve the problem. Thanks in advance for your help.
Hard to tell without a running example but for the second question at least, it seems that the bug comes from the phy argument not being passed to the function at all (it's using my_tree
check.fun <- function(my_tree) {
if(any(is.ultrametric(my_tree)) == FALSE) {
my_tree <- lapply(my_tree, chronoMPL)
class(my_tree) <- "Phylo"
}
return(my_tree)
}
For the first point, you might want to try to run it through the mulTree package that does a lot of housekeeping:
## Loading/installing the package
library(devtools)
install_github("TGuillerme/mulTree")
library(mulTree)
## Loading the example data
data(lifespan)
## Randomly combining trees
combined_trees <- tree.bind(x = trees_mammalia, y = trees_aves, sample = 2,
root.age = 250)
We can then generate an example with multiple specimens per species:
## Subset of the data
data <- lifespan_volant[sample(nrow(lifespan_volant), 30),]
## Create a dataset with two specimen per species
data <- rbind(cbind(data, specimen = rep("spec1", 30)), cbind(data,
specimen = rep("spec2", 30)))
Note that the first column contains the list of species with multiple specimens per species (specified in column $specimen)
head(data[order(data$species),])
# species class longevity mass volant specimen
#16 Addax_nasomaculatus Mammalia 0.8413927 1.8227058 nonvolant spec1
#161 Addax_nasomaculatus Mammalia 0.8413927 1.8227058 nonvolant spec2
#140 Anser_anser Aves 0.9929849 0.5993055 volant spec1
#1401 Anser_anser Aves 0.9929849 0.5993055 volant spec2
#21 Antilope_cervicapra Mammalia 0.6055864 1.4910746 nonvolant spec1
#211 Antilope_cervicapra Mammalia 0.6055864 1.4910746 nonvolant spec2
You can then use the clean.data function to make sure the trees match the dataset (specifying which column contains the species names)
## Making sure both the trees and the data match
cleaned_data <- clean.data(data, combined_trees, data.col = "species")
## Only using the cleaned version
trees <- cleaned_data$tree
data <- cleaned_data$data
You can find the eventual dropped tips/rows in cleaned_data$dropped_tips and cleaned_data$dropped_rows:
## Creates a mulTree object specifying species AND specimens as random terms
mulTree_data <- as.mulTree(data, trees, taxa = "species",
rand.terms = ~species+specimen)
## formula to test
test_formula <- longevity ~ mass + volant
## MCMC parameters (number of generations, thin/sampling, burnin)
mcmc_parameters <- c(101000, 10, 1000)
## priors
mcmc_priors <- list(R = list(V = 1/2, nu = 0.002),
G = list(G1 = list(V = 1/2, nu = 0.002)))
## Running MCMCglmm on multiple trees
mulTree(mulTree_data, formula = test_formula, parameters = mcmc_parameters,
priors = mcmc_priors, output = "longevity.example", ESS = 50)
To analyse the resulting files, you can use read.mulTree and subsequent functions (see the mulTree manual).

Prediction in R doesn't work following rpart

I am working on a tree regression. Everything works fine with my code but I don't get the predicted values at all. Instead I get all values for my y variable (response variable). Here's the code:
Separating in train and test set for data
`sample = sample.split(Data, SplitRatio = .80)
train = subset(Data, sample == TRUE)
test = subset(Data, sample == FALSE)
varYTrain <- train[c(3)]
varYTest <- test[c(3)]
varXTrain <- train[c(5:27)]
varXTest <- test[c(5:27)]`
Model
`x <- cbind(varXTrain,varYTrain)
fit <- rpart(as.matrix(varYTrain) ~ ., data = x, method="class")
summary(fit)`
This one doesn't work as I don't get predictions based on test data set for an unknown reason
`predicted <- predict(fit, data=varXTest)
summary(predicted)`
I would also like in the end for the output to compare predicted values to real values in my dataset, can I do that?
Thank you very much and don't hesitate to ask me a question if I am not clear enough it's my first time posting.
Cheers

partial Distance Based RDA - Centroids vanished from Plot

I am trying to fir a partial db-RDA with field.ID to correct for the repeated measurements character of the samples. However including Condition(field.ID) leads to Disappearance of the centroids of the main factor of interest from the plot (left plot below).
The Design: 12 fields have been sampled for species data in two consecutive years, repeatedly. Additionally every year 3 samples from reference fields have been sampled. These three fields have been changed in the second year, due to unavailability of the former fields.
Additionally some environmental variables have been sampled (Nitrogen, Soil moisture, Temperature). Every field has an identifier (field.ID).
Using field.ID as Condition seem to erroneously remove the F1 factor. However using Sampling campaign (SC) as Condition does not. Is the latter the rigth way to correct for repeated measurments in partial db-RDA??
set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12,13,14,15,1:12,16,17,18)),
SC = factor(rep(c(1,2), each=15)),
F1 = factor(rep(rep(c("A","B","C","D","E"),each=3),2)),
Nitrogen = rnorm(30,mean=0.16, sd=0.07),
Temp = rnorm(30,mean=13.5, sd=3.9),
Moist = rnorm(30,mean=19.4, sd=5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
Spec2 = rpois(30,1),
Spec3 = rpois(30,4.5),
Spec4 = rpois(30,3),
Spec5 = rpois(30,7),
Spec6 = rpois(30,7),
Spec7 = rpois(30,5))
data=cbind(df.exp, df.rsp)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp); ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp); ordiplot(dbRDA)
You partial out variation due to ID and then you try to explain variable aliased to this ID, but it was already partialled out. The key line in the printed output was this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1? variables were constant within ID which already was partialled out, and nothing was left to explain.

Consolidating a data table in Scala

I am working on a small data analysis tool, and practicing/learning Scala in the process. However I got stuck at a small problem.
Assume data of type:
X Gr1 x_11 ... x_1n
X Gr2 x_21 ... x_2n
..
X GrK x_k1 ... x_kn
Y Gr1 y_11 ... y_1n
Y Gr3 y_31 ... y_3n
..
Y Gr(K-1) ...
Here I have entries (X,Y...) that may or may not exist in up to K groups, with a series of values for each group. What I want to do is pretty simple (in theory), I would like to consolidate the rows that belong to the same "entity" in different groups. so instead of multiple lines that start with X, I want to have one row with all values from x_11 to x_kn in columns.
What makes things complicated however is that not all entities exist in all groups. So wherever there's "missing data" I would like to pad with for instance zeroes, or some string that denotes a missing value. So if I have (X,Y,Z) in up to 3 groups, the type I table I want to have is as follows:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 N/A N/A y_31 y_32
Z N/A N/A z_21 z_22 N/A N/A
I have been stuck trying to figure this out, is there a smart way to use List functions to solve this?
I wrote this simple loop:
for {
(id, hitlist) <- hits.groupBy(_.acc)
h <- hitlist
} println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
to able to generate the tables that look like the example above. Note that, my original data is of a different format and layout,but that has little to do with the problem at hand, thus I have skipped all steps regarding parsing. I should be able to use groupBy in a better way that actually solves this for me, but I can't seem to get there.
Then I modified my loop mapping the hits to ratios and appending them to one another:
for ((id, hitlist) <- hits.groupBy(_.acc)){
val l = hitlist.map(_.ratios).foldRight(List[Double]()){
(l1: List[Double], l2: List[Double]) => l1 ::: l2
}
println(id + "\t" + l.mkString("\t"))
//println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
}
That gets me one step closer but still no cigar! Instead of a fully padded "matrix" I get a jagged table. Taking the example above:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 y_31 y_32
Z z_21 z_22
Any ideas as to how I can pad the table so that values from respective groups are aligned with one another? I should be able to use _.sampleId, which holds the "group membersip" for each "hit", but I am not sure how exactly. ´hits´ is a List of type Hit which is practically a wrapper for each row, giving convenience methods for getting individual values, so essentially a tuple which have "named indices" (such as .acc, .sampleId..)
(I would like to solve this problem without hardcoding the number of groups, as it might change from case to case)
Thanks!
This is a bit of a contrived example, but I think you can see where this is going:
case class Hit(acc:String, subAcc:String, value:Int)
val hits = List(Hit("X", "x_11", 1), Hit("X", "x_21", 2), Hit("X", "x_31", 3))
val kMax = 4
val nMax = 2
for {
(id, hitlist) <- hits.groupBy(_.acc)
k <- 1 to kMax
n <- 1 to nMax
} yield {
val subId = "x_%s%s".format(k, n)
val row = hitlist.find(h => h.subAcc == subId).getOrElse(Hit(id, subId, 0))
println(row)
}
//Prints
Hit(X,x_11,1)
Hit(X,x_12,0)
Hit(X,x_21,2)
Hit(X,x_22,0)
Hit(X,x_31,3)
Hit(X,x_32,0)
Hit(X,x_41,0)
Hit(X,x_42,0)
If you provide more information on your hits lists then we could probably come with something a little more accurate.
I have managed to solve this problem with the following code, I am putting it here as an answer in case someone else runs into a similar problem and requires some help. The use of find() from Noah's answer was definitely very useful, so do give him a +1 in case this code snippet helps you out.
val samples = hits.groupBy(_.sampleId).keys.toList.sorted
for ((id, hitlist) <- hits.groupBy(_.acc)) {
val ratios =
for (sample <- samples)
yield hitlist.find(h => h.sampleId == sample).map(_.ratios)
.getOrElse(List(Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN))
println(id + "\t" + ratios.flatten.mkString("\t"))
}
I figure it's not a very elegant or efficient solution, as I have two calls to groupBy and I would be interested to see better solutions to this problem.