Why do I get NA values after filtering a dataset and running a regression based on that set?

I am using the gpa1 dataset in R and am trying to filter down to all instances where a student has worked less than or equal to 19 hours and does not volunteer or participate in any clubs. I can run the filter and save the data fine, and I get the results I want (job19=1, clubs=0, and voluntr=0), but when I try to run a regression based on the new dataset the job19 values just come up as "NA".
This is the code that I used:
library(wooldridge)  # source of the gpa1 dataset
library(dplyr)       # filter()

PTjob19 <- filter(gpa1, job19 == 1, voluntr == 0, clubs == 0)
View(PTjob19)
olsreg9 <- lm(colGPA ~ job19 + age + male + hsGPA + ACT + siblings + skipped + fathcoll + mothcoll, data = PTjob19)
I expected that job19 would have some kind of value, be it negative or positive, but the values were just listed as "NA." Can someone please help?

So I figured this out. The reason I was getting an "NA" value was that I wasn't giving R anything to compare: after the filter, every remaining observation has job19 == 1. A predictor that is constant is perfectly collinear with the intercept, so lm() drops it and reports its coefficient as NA.
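A minimal sketch (made-up data, not gpa1) of the same behaviour: as soon as a dummy variable is constant within the subset, lm() has no variation left to estimate its coefficient from and returns NA.

set.seed(1)
# hypothetical data: a dummy predictor plus one continuous control
d <- data.frame(job19 = rbinom(50, 1, 0.5), x = rnorm(50))
d$y <- 2 + 0.3 * d$job19 + 0.5 * d$x + rnorm(50)
coef(lm(y ~ job19 + x, data = d))    # job19 gets an estimate
sub <- subset(d, job19 == 1)         # job19 is now constant (all 1)
coef(lm(y ~ job19 + x, data = sub))  # job19 comes back as NA (aliased)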

Related

gtsummary's tbl_regression() returning odd result

I'm using gtsummary::tbl_regression to format the output of the following code:
library(rio)        # import()
library(here)       # here()
library(mice)       # complete()
library(gtsummary)  # tbl_regression()

dat.mi <- import(here("temp", "adTurn_mi.rds"))
fit1 <- glm(quit ~ sex + age + race + education + experience + tenure_sr +
              salary + anymc + catcap + medi + nonpro + rural,
            family = 'binomial',
            weights = svy_weight,
            data = complete(dat.mi)
) |>
  tbl_regression(exponentiate = TRUE)
dat.mi is MICE data (unfortunately I can't provide a truly reproducible example as the data is confidential), but it is completed in the call to the glm() function. tbl_regression formats everything beautifully, with the exception of one line, and I cannot figure out what that one line is doing or how to fix it. Instead of returning an exponentiated coefficient, it returns a string of numbers separated by commas (please see the Masters degree line in the attached image). Is this a known issue with a similarly known and easy fix? Or am I just really good at breaking stuff?
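Not an answer, just a hedged debugging sketch: gtsummary builds its rows from broom-style tidiers, so comparing the unformatted coefficients with the finished table usually shows whether the odd string comes from the model or from the formatting step. fit1_glm below is hypothetical, standing for the bare glm() object saved before piping into tbl_regression().

library(broom)
# hypothetical: the glm() fit saved separately, before tbl_regression()
tidy(fit1_glm, exponentiate = TRUE, conf.int = TRUE)
# if the Masters-degree row already looks odd here, the problem is in the
# data/model (e.g. how education is coded), not in tbl_regression()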

Translating glmer (binomial) into jags to include a correlated random effect (time)

Context:
I have a 12-item risk assessment where individuals are given a rating from 0-4 (4 being the highest risk). The risk assessment can be done multiple times for each individual (max = 19, but most have fewer than 5 measurements).
The baseline level of risk varies by individual, so I am looking for a random-intercepts model, but I also need to reflect the dynamic nature of the risk, i.e. adding 'time' as a random coefficient.
The outcome is binary:
further offending (FO.bin), which occurs at the measurement level. This means I am essentially looking at what dynamic changes have occurred within one or more of the 12 items and how they have contributed to the individual committing a further offence in the period between the measurements.
Ultimately, what I am looking to do is predict whether an individual will offend in the future, based on the assessment histories of others who share the same characteristics, contextual factors and factors which may change over time.
Goal:
I wish to extend my 'basic' model by adding time-varying (level 1) and time-invariant (level 2) predictors:
Time-varying predictors include dummy variables around the criminal justice process, such as non-compliance, going to court and spending time in custody. These are reflected as an 'event' which has occurred in the period between assessments.
Time-invariant predictors include dummy variables such as being female or being non-White, and continuous predictors such as age at time of first offence.
I've managed to set this up OK using lme4 and have some potentially interesting results from adding the level 1 and level 2 predictors, including where there are interactions and cross-level interactions. However, the complexity of the enhanced models is throwing up all kinds of warning messages, including ones about failing to converge. I therefore feel it would be appropriate to switch to a Bayesian framework using rjags so that I can feel more confident about my findings.
The Problem:
Basically it is one of translation.  This is my 'basic' model based on time and the 12 items in the risk assessment using lme4:
Basic_Model1 <- glmer(BinaryResponse ~ item1 + item2 + item3 + ... + item12 + time + (1+time|individual), data=data, family=binomial)
This is my attempt to translate this into a BUGS model:
# the number of Risk Assessments = 552
N <- nrow(data)
# number of Individuals (individual previously specified) = 88
J <- length(unique(Individual))
# the 12 items (previously specified)
Z <- cbind(item1, item2, item3, item4, ... item12)
# number of columns = number of predictors, will increase as model enhanced
K <- ncol(Z)
## Store all data needed for the model in a list
jags.data1 <- list(y = FO.bin, Individual = Individual, time = time, Z = Z, N = N, J = J, K = K)
model1 <- function() {
  for (i in 1:N) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- a[Individual[i]] + b*time[i]
  }
  for (j in 1:J) {
    a[j] ~ dnorm(a.hat[j], tau.a)
    a.hat[j] <- mu.a + inprod(g[], Z[j,])
  }
  b ~ dnorm(0, .0001)
  tau.a <- pow(sigma.a, -2)
  sigma.a ~ dunif(0, 100)
  mu.a ~ dnorm(0, .0001)
  for (k in 1:K) {
    g[k] ~ dnorm(0, .0001)
  }
}
write.model(model1, "Model_1.bug")
Looking at the output, my gut feeling is that I've not added the varying coefficient for time and that what I have done so far is only the equivalent of
Basic_Model2 <- glmer(BinaryResponse ~ item1 + item2 + item3 + ... + item12 + time + (1|individual), data=data, family=binomial)
How do I tweak my BUGS code to reflect time as a varying coefficient, i.e. Basic_Model1?
Based on the examples I have managed to find, I know that I need to make an additional specification in the J loop so that I can monitor the U[j], and there is a need to change the second part of the logit statement involving time, but it's got to the point where I can't see the wood for the trees!
I'm hoping that someone with a lot more expertise than me can point me in the right direction. Ultimately I am looking to expand the model by adding additional level 1 and level 2 predictors. Having looked at these using lme4, I anticipate having to specify interactions and cross-level interactions, so I am looking for an approach which is flexible enough to expand in this way. I'm very new to coding so please be gentle with me!
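Not an authoritative translation, but a minimal sketch of the usual next step: give each individual their own slope on time, in the same write.model style as above. Note that this treats the intercepts and slopes as independent, whereas glmer's (1+time|individual) also estimates their correlation; capturing that would mean drawing each (a[j], b[j]) pair jointly from dmnorm with, for example, a Wishart prior on the precision matrix.

model1b <- function() {
  for (i in 1:N) {
    y[i] ~ dbern(p[i])
    # individual-specific intercept AND individual-specific slope on time
    logit(p[i]) <- a[Individual[i]] + b[Individual[i]]*time[i]
  }
  for (j in 1:J) {
    a[j] ~ dnorm(a.hat[j], tau.a)
    b[j] ~ dnorm(mu.b, tau.b)
    a.hat[j] <- mu.a + inprod(g[], Z[j,])
  }
  mu.a ~ dnorm(0, .0001)
  mu.b ~ dnorm(0, .0001)
  tau.a <- pow(sigma.a, -2)
  tau.b <- pow(sigma.b, -2)
  sigma.a ~ dunif(0, 100)
  sigma.b ~ dunif(0, 100)
  for (k in 1:K) {
    g[k] ~ dnorm(0, .0001)
  }
}
write.model(model1b, "Model_1b.bug")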
For that kind of case you can use a conditional autoregressive (CAR) Gaussian model for time. As your tag is winbugs (or openbugs), you can use the function car.normal as follows. This code needs to be adapted to your dataset!
Data
y should be a matrix with observations in rows and time in columns. If you do not have the same number of time points for each individual, just pad with NA values.
You also need to define the parameters of the temporal process, i.e. the neighbourhood matrix with its weights. I am sorry, but I do not totally remember how to create it... For an autoregressive process of order one, it should be something like:
jags.data1 <- list(
  # total length of the adjacency vector (sum of num.tm)
  sumNumNeigh.tm = 14,
  # adjacency vector: the neighbours of each time point, concatenated
  # (time 1 has neighbour 2; time 2 has neighbours 1 and 3; and so on)
  adj.tm = c(2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7),
  # number of neighbours of each of the 8 time points, for order 1
  num.tm = c(1, 2, 2, 2, 2, 2, 2, 1),
  # matrix of data: individuals in rows, time in columns
  y = FO.bin,
  # your other parameters
  Individual = Individual, Z = Z, N = N, J = J, K = K)
Model
model1 <- function() {
  # T (the number of time points) must also be supplied in the data list
  for (i in 1:N) {
    for (t in 1:T) {
      y[i,t] ~ dbern(p[i,t])
      # previously: logit(p[i]) <- a[Individual[i]] + b*time[i]
      logit(p[i,t]) <- a[Individual[i]] + b*U[t]
    }
  }
  # intrinsic CAR prior on temporal random effects
  U[1:T] ~ car.normal(adj.tm[], weights.tm[], num.tm[], prec.nu)
  for (k in 1:sumNumNeigh.tm) { weights.tm[k] <- 1 }
  # prior on precision of temporal random effects
  prec.nu ~ dgamma(0.5, 0.0005)
  # conditional variance of temporal random effects
  sigma2.nu <- 1/prec.nu
  for (j in 1:J) {
    a[j] ~ dnorm(a.hat[j], tau.a)
    a.hat[j] <- mu.a + inprod(g[], Z[j,])
  }
  b ~ dnorm(0, .0001)
  tau.a <- pow(sigma.a, -2)
  sigma.a ~ dunif(0, 100)
  mu.a ~ dnorm(0, .0001)
  for (k in 1:K) {
    g[k] ~ dnorm(0, .0001)
  }
}
For your information, with JAGS you would need to code the CAR model yourself using dmnorm.
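A rough sketch of that JAGS route, under the assumption of a proper order-1 CAR prior (the intrinsic version has a singular precision matrix, so dmnorm needs the proper form with an extra rho parameter):

# build the neighbourhood structure in R for 8 time points, order 1
T <- 8
W <- matrix(0, T, T)
for (t in 1:(T - 1)) { W[t, t + 1] <- 1; W[t + 1, t] <- 1 }
D <- diag(rowSums(W))
zeros <- rep(0, T)
# pass W, D, T and zeros in the data list, then inside the JAGS model:
#   for (r in 1:T) { for (s in 1:T) {
#     Omega[r, s] <- prec.nu * (D[r, s] - rho * W[r, s])
#   }}
#   U[1:T] ~ dmnorm(zeros[], Omega[,])
#   rho ~ dunif(0, 0.99)   # keeps the precision matrix positive definite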

partial Distance Based RDA - Centroids vanished from Plot

I am trying to fit a partial db-RDA with field.ID as a Condition to correct for the repeated-measurements character of the samples. However, including Condition(field.ID) leads to the disappearance of the centroids of the main factor of interest from the plot (left plot below).
The Design: 12 fields have been sampled for species data in two consecutive years, repeatedly. Additionally, every year 3 samples from reference fields have been taken. These three fields were changed in the second year because the former fields were unavailable.
Some environmental variables have also been sampled (Nitrogen, Soil moisture, Temperature). Every field has an identifier (field.ID).
Using field.ID as Condition seems to erroneously remove the F1 factor. However, using sampling campaign (SC) as Condition does not. Is the latter the right way to correct for repeated measurements in partial db-RDA?
library(vegan)  # capscale() and ordiplot()

set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12, 13, 14, 15, 1:12, 16, 17, 18)),
                     SC = factor(rep(c(1, 2), each = 15)),
                     F1 = factor(rep(rep(c("A", "B", "C", "D", "E"), each = 3), 2)),
                     Nitrogen = rnorm(30, mean = 0.16, sd = 0.07),
                     Temp = rnorm(30, mean = 13.5, sd = 3.9),
                     Moist = rnorm(30, mean = 19.4, sd = 5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
                     Spec2 = rpois(30, 1),
                     Spec3 = rpois(30, 4.5),
                     Spec4 = rpois(30, 3),
                     Spec5 = rpois(30, 7),
                     Spec6 = rpois(30, 7),
                     Spec7 = rpois(30, 5))
data <- cbind(df.exp, df.rsp)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp); ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp); ordiplot(dbRDA)
You partial out the variation due to field.ID and then try to explain a variable that is aliased to this field.ID, but that variation was already partialled out. The key line in the printed output was this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1 dummy variables were constant within field.ID, which was already partialled out, so nothing was left to explain.
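One quick way to verify this with the simulated data from the question: every field.ID carries exactly one F1 level, so once field.ID is partialled out there is no F1 variation left.

# TRUE: F1 never varies within a field.ID
all(rowSums(table(df.exp$field.ID, df.exp$F1) > 0) == 1)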

Rows without repetitions - MATLAB

I have a matrix (4096x4) containing all possible combinations of four values taken from a pool of 8 numbers.
...
3 63 39 3
3 63 39 19
3 63 39 23
3 63 39 39
...
I am only interested in the rows of the matrix that contain four unique values. In the above section, for example, the first and last row should be removed, giving us -
...
3 63 39 19
3 63 39 23
...
My current solution feels inelegant: basically, I iterate across every row and add it to a result matrix if it contains four unique values:
result = [];
for row = 1:size(matrix,1)
    if length(unique(matrix(row,:))) == 4
        result = cat(1, result, matrix(row,:));
    end
end
Is there a better way?
Approach #1
diff and sort based approach that must be pretty efficient -
sortedmatrix = sort(matrix,2)
result = matrix(all(diff(sortedmatrix,[],2)~=0,2),:)
Breaking it down into a few steps for explanation
Sort along the columns, so that the duplicate values in each row end up next to each other. We used sort for this task.
Find the difference between consecutive elements, which will catch those duplicates after sorting. diff was the tool for this purpose.
A zero anywhere in a row of these differences indicates a duplicate value in that row. To put it the other way, any row with no zeros has no duplicates, which is what we are looking to have in the output. all got the job done here, giving us a logical array of such matches.
Finally, we have used matrix indexing to select those rows from matrix to get the expected output.
Approach #2
This could be an experimental bsxfun based approach as it won't be memory-efficient -
matches = bsxfun(@eq,matrix,permute(matrix,[1 3 2]))
result = matrix(all(all(sum(matches,2)==1,2),3),:)
Breaking it down into a few steps for explanation
Find a logical array of matches for every element against all others in the same row with bsxfun.
Look for "non-duplicity" by summing those matches along dim-2 of matches and then finding all ones elements along dim-2 and dim-3 getting us the same indexing array as had with our previous diff + sort based approach.
Use the binary indexing array to select the appropriate rows from matrix for the final output.
Approach #3
Taking help from MATLAB File-exchange's post combinator
and assuming you have the pool of 8 values in an array named pool8, you can directly get result like so -
result = pool8(combinator(8,4,'p'))
combinator(8,4,'p') basically gets us the indices for permutations of 8 elements taken 4 at a time, without repetitions. We use these indices to index into the pool and get the expected output.
For a finite pool this will work. Create an IsUnique array, go through each number in the pool, count the number of times it comes up in each row, and keep IsUnique at 1 only if the number is found once or not at all. Next, find the positions where IsUnique is still 1, extract those rows, and we are finished.
matrix = [3,63,39,3; 3,63,39,19; 3,63,39,23; 3,63,39,39; 3,63,39,39; 3,63,39,39];
IsUnique = ones(size(matrix,1), 1);
pool = [3,63,39,19,23,6,7,8];
for NumberInPool = 1:8
    % count how often this pool value appears in each row
    Temp = sum(matrix == pool(NumberInPool), 2);
    IsUnique = IsUnique .* (Temp < 2);
end
UniquePositions = find(IsUnique == 1);
result = matrix(UniquePositions, :)

Consolidating a data table in Scala

I am working on a small data analysis tool, and practicing/learning Scala in the process. However I got stuck at a small problem.
Assume data of type:
X Gr1 x_11 ... x_1n
X Gr2 x_21 ... x_2n
..
X GrK x_k1 ... x_kn
Y Gr1 y_11 ... y_1n
Y Gr3 y_31 ... y_3n
..
Y Gr(K-1) ...
Here I have entities (X, Y, ...) that may or may not exist in up to K groups, with a series of values for each group. What I want to do is pretty simple (in theory): I would like to consolidate the rows that belong to the same "entity" across different groups, so instead of multiple lines that start with X, I want to have one row with all values from x_11 to x_kn in columns.
What makes things complicated, however, is that not all entities exist in all groups. So wherever there's "missing data" I would like to pad with, for instance, zeroes or some string that denotes a missing value. So if I have (X, Y, Z) in up to 3 groups, the type of table I want to have is as follows:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 N/A N/A y_31 y_32
Z N/A N/A z_21 z_22 N/A N/A
I have been stuck trying to figure this out, is there a smart way to use List functions to solve this?
I wrote this simple loop:
for {
  (id, hitlist) <- hits.groupBy(_.acc)
  h <- hitlist
} println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
to be able to generate tables that look like the example above. Note that my original data is of a different format and layout, but that has little to do with the problem at hand, so I have skipped all steps regarding parsing. I should be able to use groupBy in a better way that actually solves this for me, but I can't seem to get there.
Then I modified my loop mapping the hits to ratios and appending them to one another:
for ((id, hitlist) <- hits.groupBy(_.acc)) {
  val l = hitlist.map(_.ratios).foldRight(List[Double]()) {
    (l1: List[Double], l2: List[Double]) => l1 ::: l2
  }
  println(id + "\t" + l.mkString("\t"))
  //println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
}
That gets me one step closer but still no cigar! Instead of a fully padded "matrix" I get a jagged table. Taking the example above:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 y_31 y_32
Z z_21 z_22
Any ideas as to how I can pad the table so that values from the respective groups are aligned with one another? I should be able to use _.sampleId, which holds the "group membership" for each "hit", but I am not sure how exactly. hits is a List of type Hit, which is practically a wrapper for each row, giving convenience methods for getting individual values; essentially a tuple with "named indices" (such as .acc, .sampleId, ...).
(I would like to solve this problem without hardcoding the number of groups, as it might change from case to case.)
Thanks!
This is a bit of a contrived example, but I think you can see where this is going:
case class Hit(acc: String, subAcc: String, value: Int)
val hits = List(Hit("X", "x_11", 1), Hit("X", "x_21", 2), Hit("X", "x_31", 3))
val kMax = 4
val nMax = 2
for {
  (id, hitlist) <- hits.groupBy(_.acc)
  k <- 1 to kMax
  n <- 1 to nMax
} yield {
  val subId = "x_%s%s".format(k, n)
  val row = hitlist.find(h => h.subAcc == subId).getOrElse(Hit(id, subId, 0))
  println(row)
}
//Prints
Hit(X,x_11,1)
Hit(X,x_12,0)
Hit(X,x_21,2)
Hit(X,x_22,0)
Hit(X,x_31,3)
Hit(X,x_32,0)
Hit(X,x_41,0)
Hit(X,x_42,0)
If you provide more information on your hits list then we could probably come up with something a little more accurate.
I have managed to solve this problem with the following code. I am putting it here as an answer in case someone else runs into a similar problem and requires some help. The use of find() from Noah's answer was definitely very useful, so do give him a +1 if this code snippet helps you out.
val samples = hits.groupBy(_.sampleId).keys.toList.sorted
for ((id, hitlist) <- hits.groupBy(_.acc)) {
  val ratios =
    for (sample <- samples)
      yield hitlist.find(h => h.sampleId == sample).map(_.ratios)
        .getOrElse(List.fill(6)(Double.NaN))
  println(id + "\t" + ratios.flatten.mkString("\t"))
}
I figure it's not a very elegant or efficient solution, as I have two calls to groupBy, and I would be interested to see better solutions to this problem.