Translating glmer (binomial) into jags to include a correlated random effect (time) - multi-level

Context:
I have a 12-item risk assessment where individuals are given a rating from 0-4 (4 being the highest risk). The risk assessment can be done multiple times for each individual (max = 19, but most have fewer than 5 measurements).
The baseline level of risk varies by individual, so I am looking for a random-intercepts model, but I also need to reflect the dynamic nature of the risk, i.e. adding 'time' as a random coefficient.
The outcome is binary: further offending (FO.bin), which occurs at the measurement level. I am essentially looking at what dynamic changes have occurred within one or more of the 12 items, and how they have contributed to the individual committing a further offence in the period between the measurements.
Ultimately, I am looking to predict whether an individual will offend in the future, based on the assessment histories of others who share the same characteristics, contextual factors, and factors which may change over time.
Goal:
I wish to enhance my 'basic' model by adding time-varying (level 1) and time-invariant (level 2) predictors:
Time-varying predictors include dummy variables around the criminal justice process, such as non-compliance, going to court and spending time in custody. These are reflected as 'events' which have occurred in the period between assessments.
Time-invariant predictors include dummy variables such as being female or being non-White, and continuous predictors such as age at the time of the first offence.
I've managed to set this up OK using lme4 and have some potentially interesting results from adding the level 1 and level 2 predictors, including where there are interactions and cross-level interactions. However, the complexity of the enhanced models is throwing up all kinds of warning messages, including ones about failing to converge. I therefore feel it would be appropriate to switch to a Bayesian framework using rjags, so that I can feel more confident about my findings.
The Problem:
Basically it is one of translation.  This is my 'basic' model based on time and the 12 items in the risk assessment using lme4:
Basic_Model1 <- glmer(BinaryResponse ~ item1 + item2 + item3 + ... + item12 + time + (1+time|individual), data=data, family=binomial)
This is my attempt to translate this into a BUGS model:
# the number of risk assessments = 552
N <- nrow(data)
# number of individuals (Individual previously specified) = 88
J <- length(unique(Individual))
# the 12 items (previously specified)
Z <- cbind(item1, item2, item3, item4, ... item12)
# number of columns = number of predictors; will increase as the model is enhanced
K <- ncol(Z)
## Store all data needed for the model in a list
jags.data1 <- list(y = FO.bin, Individual = Individual, time = time, Z = Z, N = N, J = J, K = K)
model1 <- function() {
  for (i in 1:N) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- a[Individual[i]] + b*time[i]
  }
  for (j in 1:J) {
    a[j] ~ dnorm(a.hat[j], tau.a)
    a.hat[j] <- mu.a + inprod(g[], Z[j,])
  }
  b ~ dnorm(0, .0001)
  tau.a <- pow(sigma.a, -2)
  sigma.a ~ dunif(0, 100)
  mu.a ~ dnorm(0, .0001)
  for (k in 1:K) {
    g[k] ~ dnorm(0, .0001)
  }
}
write.model(model1, "Model_1.bug")
Looking at the output, my gut feeling is that I've not added the varying coefficient for time and that what I have done so far is only the equivalent of
Basic_Model2 <- glmer(BinaryResponse ~ item1 + item2 + item3 + ... + item12 + time + (1|individual), data=data, family=binomial)
How do I tweak my BUGS code to reflect time as a varying coefficient, i.e. Basic_Model1?
Based on the examples I have managed to find, I know that I need to make an additional specification in the J loop so that I can monitor the U[j], and that the second part of the logit statement involving time needs to change, but it's got to the point where I can't see the wood for the trees!
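For concreteness, a minimal sketch of that change (an assumption on my part: independent normal priors for the intercepts and the time slopes; glmer's (1+time|individual) additionally estimates their correlation, which would need e.g. a bivariate dmnorm prior) might look like this:
model1b <- function() {
  for (i in 1:N) {
    y[i] ~ dbern(p[i])
    # individual-specific intercept a[] and individual-specific time slope b[]
    logit(p[i]) <- a[Individual[i]] + b[Individual[i]]*time[i]
  }
  for (j in 1:J) {
    a[j] ~ dnorm(a.hat[j], tau.a)
    a.hat[j] <- mu.a + inprod(g[], Z[j,])
    # varying coefficient for time
    b[j] ~ dnorm(mu.b, tau.b)
  }
  mu.a ~ dnorm(0, .0001)
  mu.b ~ dnorm(0, .0001)
  tau.a <- pow(sigma.a, -2)
  tau.b <- pow(sigma.b, -2)
  sigma.a ~ dunif(0, 100)
  sigma.b ~ dunif(0, 100)
  for (k in 1:K) {
    g[k] ~ dnorm(0, .0001)
  }
}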
I'm hoping that someone with a lot more expertise than me can point me in the right direction. Ultimately I am looking to expand the model by adding additional level 1 and level 2 predictors. Having looked at these using lme4, I anticipate having to specify interactions and cross-level interactions, so I am looking for an approach which is flexible enough to expand in this way. I'm very new to coding, so please be gentle with me!

For that kind of case you can use a conditional autoregressive (CAR) model for the time effect. As your tag is winbugs (or openbugs), you can use the function car.normal as follows. This code will need to be adapted to your dataset!
Data
y should be a matrix with individuals in rows and time points in columns. If you do not have the same number of time points for each individual, just pad with NA values.
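One way to build such a matrix (a sketch, assuming Individual is coded 1..J and time as consecutive integers 1..T):
# Sketch: reshape the assessment-level vector FO.bin into an
# individual-by-time matrix, padding with NA where no assessment exists
T <- max(time)
y <- matrix(NA, nrow = J, ncol = T)
for (r in 1:N) y[Individual[r], time[r]] <- FO.bin[r]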
You also need to define the parameters of the temporal process, i.e. the neighbourhood matrix with its weights. For an autoregressive process of order one, this should be something like:
jags.data1 <- list(
  # total number of neighbour entries (the length of adj.tm)
  sumNumNeigh.tm = 14,
  # adjacency vector: the neighbours of each time point, concatenated
  # (time 1 -> 2; time 2 -> 1, 3; ...; time 8 -> 7)
  adj.tm = c(2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7),
  # number of neighbours of each time point (order 1)
  num.tm = c(1, 2, 2, 2, 2, 2, 2, 1),
  # number of time points (the model loops over 1:T)
  T = 8,
  # matrix of data: individuals in rows, time in columns
  y = FO.bin,
  # your other parameters
  Individual = Individual, Z = Z, N = N, J = J, K = K)
Model
model1 <- function() {
  for (i in 1:N) {
    for (t in 1:T) {
      y[i,t] ~ dbern(p[i,t])
      # logit(p[i]) <- a[Individual[i]] + b*time[i]
      logit(p[i,t]) <- a[Individual[i]] + b*U[t]
    }
  }
  # intrinsic CAR prior on the temporal random effects
  U[1:T] ~ car.normal(adj.tm[], weights.tm[], num.tm[], prec.nu)
  for (k in 1:sumNumNeigh.tm) { weights.tm[k] <- 1 }
  # prior on the precision of the temporal random effects
  prec.nu ~ dgamma(0.5, 0.0005)
  # conditional variance of the temporal random effects
  sigma2.nu <- 1/prec.nu
  for (j in 1:J) {
    a[j] ~ dnorm(a.hat[j], tau.a)
    a.hat[j] <- mu.a + inprod(g[], Z[j,])
  }
  b ~ dnorm(0, .0001)
  tau.a <- pow(sigma.a, -2)
  sigma.a ~ dunif(0, 100)
  mu.a ~ dnorm(0, .0001)
  for (k in 1:K) {
    g[k] ~ dnorm(0, .0001)
  }
}
For your information, with JAGS there is no car.normal, so you would need to code the CAR structure yourself, e.g. using dmnorm or an equivalent random-walk formulation.
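A minimal JAGS substitute on an equally spaced time grid (a sketch, not the only option) is a first-order random-walk prior, which sidesteps dmnorm entirely:
# Sketch: replace the car.normal line with a first-order random walk,
# so each temporal effect is centred on the previous one
U[1] ~ dnorm(0, 0.0001)
for (t in 2:T) {
  U[t] ~ dnorm(U[t-1], prec.nu)
}
# keep the same precision prior as above
prec.nu ~ dgamma(0.5, 0.0005)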

Looking for advice on improving a custom function in AnyLogic

I'm estimating last-mile delivery costs in a large urban network using by-route distances. I have over 8000 customer agents and over 100 retail store agents plotted in a GIS map using lat/long coordinates. Each customer receives deliveries from its nearest store (by route). The goal is to get two distance measures in this network for each store:
d0_bar: the average distance from a store to all of its assigned customers
d1_bar: the average distance between all customers common to a single store
I've written a startup function with a simple foreach loop to assign each customer to a store based on by-route distance (customers have a parameter, "customer.pStore" of Store type). This function also adds, in turn, each customer to the store agent's collection of customers ("store.colCusts"; it's an array list with Customer type elements).
Next, I have a function that iterates through the store agent population and calculates the two average distance measures above (d0_bar & d1_bar) and writes the results to a txt file (see code below). The code works, fortunately. However, the problem is that with such a massive dataset, the process of iterating through all customers/stores and retrieving distances via the openstreetmap.org API takes forever. It's been initializing ("Please wait...") for about 12 hours. What can I do to make this code more efficient? Or, is there a better way in AnyLogic of getting these two distance measures for each store in my network?
Thanks in advance.
//for each store, record all customers assigned to it
for (Store store : stores)
{
    distancesStore.print(store.storeCode + "," + store.colCusts.size() + "," + store.colCusts.size()*(store.colCusts.size()-1)/2 + ",");

    //calculates average distance from store j to customer nodes that belong to store j
    double sumFirstDistByStore = 0.0;
    int h = 0;
    while (h < store.colCusts.size())
    {
        sumFirstDistByStore += store.distanceByRoute(store.colCusts.get(h));
        h++;
    }
    distancesStore.print((sumFirstDistByStore/store.colCusts.size())/1609.34 + ",");

    //calculates average of distances between all customer nodes belonging to store j
    double custDistSumPerStore = 0.0;
    int loopLimit = store.colCusts.size();
    int i = 0;
    while (i < loopLimit - 1)
    {
        int j = 1;
        while (j < loopLimit)
        {
            custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
            j++;
        }
        i++;
    }
    distancesStore.print((custDistSumPerStore/(loopLimit*(loopLimit-1)/2))/1609.34);
    distancesStore.println();
}
Firstly a few simple comments:
Have you tried timing a single distanceByRoute call? E.g. can you try running store.distanceByRoute(store.colCusts.get(0)); just to see how long a single call takes on your system. Routing is generally pretty slow, but it would be good to know what the speed limit is.
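For example (a sketch using AnyLogic's traceln and only the constructs already in your code; it times one call and stops):
// Sketch: time a single routing call to establish a baseline
for (Store s : stores) {
    long start = System.currentTimeMillis();
    s.distanceByRoute(s.colCusts.get(0));
    traceln("one distanceByRoute call took " + (System.currentTimeMillis() - start) + " ms");
    break; // one call is enough for a baseline
}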
The first simple change is to use Java parallelism. Instead of using this:
for (Store store : stores)
{ ...
use this:
stores.parallelStream().forEach(store -> {
    ...
});
This will process the stores entries in parallel using the standard Java Streams API.
It also looks like the second loop, where the average distance between customers is calculated, doesn't take account of mirroring; that is to say, the distance a->b is equal to b->a. Hence, for example, 4 customers require only 6 calculations: 1->2, 1->3, 1->4, 2->3, 2->4, 3->4. Whereas in the case of 4 customers, your second while loop will perform 9 calculations: i=0, j in {1,2,3}; i=1, j in {1,2,3}; i=2, j in {1,2,3}, which seems wrong unless I am misunderstanding your intention.
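For reference, a sketch of the corrected pairwise loop (reusing the variable names from your question); starting j at i + 1 visits each unordered pair exactly once:
for (int i = 0; i < loopLimit - 1; i++) {
    // j starts at i + 1, so i == j is skipped and each pair is computed once
    for (int j = i + 1; j < loopLimit; j++) {
        custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
    }
}
// giving loopLimit * (loopLimit - 1) / 2 distance calls in total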
Generally, for long running operations it is a good idea to include some traceln to show progress with associated timing.
Please have a look at the above and post your results. With more information, additional performance improvements may be possible.

Why do I get NA values after filtering a dataset and running a regression based on that set?

I am using the gpa1 dataset in R and am trying to filter the data down to all instances where a student worked 19 or fewer hours and does not volunteer or participate in any clubs. I can run the filter, save the data, and get the results I want (job19 == 1, clubs == 0, and voluntr == 0), but when I try to run a regression based on the new dataset, the job19 values just come up as NA.
This is the code that I used:
PTjob19 <- filter(gpa1, job19 == 1, voluntr == 0, clubs == 0)
View(PTjob19)
olsreg9 <- lm(colGPA~job19 + age + male + hsGPA + ACT + siblings + skipped + fathcoll + mothcoll, data=PTjob19)
I expected that job19 would have some kind of value, be it negative or positive, but the values were just listed as NA. Can someone please help?
So I figured this out. The reason I was getting an NA value was that I wasn't giving R anything to compare: after the filter, job19 equals 1 for every observation, so the predictor has no variation left and lm() cannot estimate a coefficient for it.
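A minimal sketch of the same effect with made-up data: a predictor that is constant in the estimation sample is collinear with the intercept, so lm() reports NA for its coefficient.
# x is constant (all 1s), as job19 is after the filter, so lm() cannot
# separate it from the intercept and drops it
df <- data.frame(y = rnorm(10), x = rep(1, 10), z = rnorm(10))
coef(lm(y ~ x + z, data = df))
# the coefficient for x comes back as NA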

MCMCglmm questions: multiple species and ultrametric trees

These questions are related to my other question at Phylogenetic model using multiple entries for each species
Thanks to @thomas-guillerme, I was able to start running an MCMCglmm model.
Although I had no problem running some of my example files in which I had a single entry for each of the species in my tree, I found an error message when trying to run my original dataset, which consists of thousands of entries for each of the species in my tree. When running:
comp_data <- comparative.data(phy = my_tree, data = my_data, names.col = species, vcv = TRUE)
I got an error:
Error in `row.names<-.data.frame`(tmp, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
  non-unique values when setting 'row.names': 'Species1', 'Species2', 'Species3', 'Species4', ...
I was surprised because I am using MCMCglmm rather than PGLS precisely because it should handle multiple entries for each species.
I tried the workaround of making the species names unique, but in that case only the first entry of each species is recognized later in the model (because it is the only one corresponding to a name in my_tree).
Moreover, I had problems with having my tree recognized as ultrametric. I checked it using
is.ultrametric(my_tree)
and got:
FALSE
I tried:
function(phy) {
  if (any(is.ultrametric(my_tree)) == FALSE) {
    my_tree <- lapply(my_tree, chronoMPL)
    class(my_tree) <- "Phylo"
  }
}
But these lines apparently do not solve the problem. Thanks in advance for your help.
Hard to tell without a running example, but for the second question at least, it seems that the bug comes from the phy argument not being passed to the function at all (it's using my_tree from the enclosing environment instead). Something like this should work:
check.fun <- function(my_tree) {
  if (any(is.ultrametric(my_tree)) == FALSE) {
    my_tree <- lapply(my_tree, chronoMPL)
    # ape's class for a list of trees is "multiPhylo" ("phylo", lowercase, for a single tree)
    class(my_tree) <- "multiPhylo"
  }
  return(my_tree)
}
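Assuming my_tree is a list of trees (a multiPhylo object), the corrected function can then be applied explicitly:
my_tree <- check.fun(my_tree)
is.ultrametric(my_tree)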
For the first point, you might want to try to run it through the mulTree package that does a lot of housekeeping:
## Loading/installing the package
library(devtools)
install_github("TGuillerme/mulTree")
library(mulTree)
## Loading the example data
data(lifespan)
## Randomly combining trees
combined_trees <- tree.bind(x = trees_mammalia, y = trees_aves, sample = 2,
                            root.age = 250)
We can then generate an example with multiple specimens per species:
## Subset of the data
data <- lifespan_volant[sample(nrow(lifespan_volant), 30),]
## Create a dataset with two specimen per species
data <- rbind(cbind(data, specimen = rep("spec1", 30)),
              cbind(data, specimen = rep("spec2", 30)))
Note that the first column contains the list of species with multiple specimens per species (specified in column $specimen)
head(data[order(data$species),])
# species class longevity mass volant specimen
#16 Addax_nasomaculatus Mammalia 0.8413927 1.8227058 nonvolant spec1
#161 Addax_nasomaculatus Mammalia 0.8413927 1.8227058 nonvolant spec2
#140 Anser_anser Aves 0.9929849 0.5993055 volant spec1
#1401 Anser_anser Aves 0.9929849 0.5993055 volant spec2
#21 Antilope_cervicapra Mammalia 0.6055864 1.4910746 nonvolant spec1
#211 Antilope_cervicapra Mammalia 0.6055864 1.4910746 nonvolant spec2
You can then use the clean.data function to make sure the trees match the dataset (specifying which column contains the species names)
## Making sure both the trees and the data match
cleaned_data <- clean.data(data, combined_trees, data.col = "species")
## Only using the cleaned version
trees <- cleaned_data$tree
data <- cleaned_data$data
You can find any dropped tips/rows in cleaned_data$dropped_tips and cleaned_data$dropped_rows.
## Creates a mulTree object specifying species AND specimens as random terms
mulTree_data <- as.mulTree(data, trees, taxa = "species",
                           rand.terms = ~species+specimen)
## formula to test
test_formula <- longevity ~ mass + volant
## MCMC parameters (number of generations, thin/sampling, burnin)
mcmc_parameters <- c(101000, 10, 1000)
## priors
mcmc_priors <- list(R = list(V = 1/2, nu = 0.002),
                    G = list(G1 = list(V = 1/2, nu = 0.002)))
## Running MCMCglmm on multiple trees
mulTree(mulTree_data, formula = test_formula, parameters = mcmc_parameters,
        priors = mcmc_priors, output = "longevity.example", ESS = 50)
To analyse the resulting files, you can use read.mulTree and subsequent functions (see the mulTree manual).
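For example (a sketch, assuming the longevity.example files written by the run above are in the working directory):
## Reading all the chains back in and summarising them
longevity_results <- read.mulTree("longevity.example")
summary(longevity_results)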

partial Distance Based RDA - Centroids vanished from Plot

I am trying to fit a partial db-RDA with field.ID as a Condition, to correct for the repeated-measurements character of the samples. However, including Condition(field.ID) leads to the disappearance of the centroids of the main factor of interest from the plot (left plot below).
The Design: 12 fields have been sampled for species data in two consecutive years, repeatedly. Additionally every year 3 samples from reference fields have been sampled. These three fields have been changed in the second year, due to unavailability of the former fields.
Additionally some environmental variables have been sampled (Nitrogen, Soil moisture, Temperature). Every field has an identifier (field.ID).
Using field.ID as the Condition seems to erroneously remove the F1 factor. However, using the sampling campaign (SC) as the Condition does not. Is the latter the right way to correct for repeated measurements in a partial db-RDA?
library(vegan)  # provides capscale() and ordiplot()

set.seed(1234)
df.exp <- data.frame(field.ID = factor(c(1:12, 13, 14, 15, 1:12, 16, 17, 18)),
                     SC = factor(rep(c(1, 2), each = 15)),
                     F1 = factor(rep(rep(c("A", "B", "C", "D", "E"), each = 3), 2)),
                     Nitrogen = rnorm(30, mean = 0.16, sd = 0.07),
                     Temp = rnorm(30, mean = 13.5, sd = 3.9),
                     Moist = rnorm(30, mean = 19.4, sd = 5.8))
df.rsp <- data.frame(Spec1 = rpois(30, 5),
                     Spec2 = rpois(30, 1),
                     Spec3 = rpois(30, 4.5),
                     Spec4 = rpois(30, 3),
                     Spec5 = rpois(30, 7),
                     Spec6 = rpois(30, 7),
                     Spec7 = rpois(30, 5))
data <- cbind(df.exp, df.rsp)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(SC), df.exp); ordiplot(dbRDA)
dbRDA <- capscale(df.rsp ~ F1 + Nitrogen + Temp + Moist + Condition(field.ID), df.exp); ordiplot(dbRDA)
You partial out the variation due to field.ID and then try to explain a variable aliased to this ID, but that variation was already partialled out. The key line in the printed output was this:
Some constraints were aliased because they were collinear (redundant)
And indeed, when you ask for details, you get
> alias(dbRDA, names=TRUE)
[1] "F1B" "F1C" "F1D" "F1E"
The F1 dummy variables (F1B to F1E) were constant within field.ID, which was already partialled out, so nothing was left to explain.
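You can verify this on the simulated data; each field.ID occurs within exactly one level of F1:
with(df.exp, table(field.ID, F1))
## each row of the table has counts in a single F1 column only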

Consolidating a data table in Scala

I am working on a small data analysis tool, and practicing/learning Scala in the process. However I got stuck at a small problem.
Assume data of type:
X Gr1 x_11 ... x_1n
X Gr2 x_21 ... x_2n
..
X GrK x_k1 ... x_kn
Y Gr1 y_11 ... y_1n
Y Gr3 y_31 ... y_3n
..
Y Gr(K-1) ...
Here I have entries (X, Y, ...) that may or may not exist in up to K groups, with a series of values for each group. What I want to do is pretty simple (in theory): I would like to consolidate the rows that belong to the same "entity" across the different groups, so instead of multiple lines that start with X, I want one row with all values from x_11 to x_kn in columns.
What makes things complicated, however, is that not all entities exist in all groups. So wherever there's "missing data" I would like to pad with, for instance, zeroes or some string that denotes a missing value. So if I have (X, Y, Z) in up to 3 groups, the type of table I want to have is as follows:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 N/A N/A y_31 y_32
Z N/A N/A z_21 z_22 N/A N/A
I have been stuck trying to figure this out; is there a smart way to use List functions to solve this?
I wrote this simple loop:
for {
  (id, hitlist) <- hits.groupBy(_.acc)
  h <- hitlist
} println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
to be able to generate tables that look like the example above. Note that my original data is of a different format and layout, but that has little to do with the problem at hand, so I have skipped all steps regarding parsing. I should be able to use groupBy in a better way that actually solves this for me, but I can't seem to get there.
Then I modified my loop mapping the hits to ratios and appending them to one another:
for ((id, hitlist) <- hits.groupBy(_.acc)) {
  val l = hitlist.map(_.ratios).foldRight(List[Double]()) {
    (l1: List[Double], l2: List[Double]) => l1 ::: l2
  }
  println(id + "\t" + l.mkString("\t"))
  //println(id + "\t" + h.sampleId + "\t" + h.ratios.mkString("\t"))
}
That gets me one step closer but still no cigar! Instead of a fully padded "matrix" I get a jagged table. Taking the example above:
X x_11 x_12 x_21 x_22 x_31 x_32
Y y_11 y_12 y_31 y_32
Z z_21 z_22
Any ideas as to how I can pad the table so that values from the respective groups are aligned with one another? I should be able to use _.sampleId, which holds the "group membership" for each "hit", but I am not sure how exactly. hits is a List of type Hit, which is practically a wrapper for each row, providing convenience methods for getting individual values, so essentially a tuple with "named indices" (such as .acc and .sampleId).
(I would like to solve this problem without hardcoding the number of groups, as it might change from case to case)
Thanks!
This is a bit of a contrived example, but I think you can see where this is going:
case class Hit(acc:String, subAcc:String, value:Int)
val hits = List(Hit("X", "x_11", 1), Hit("X", "x_21", 2), Hit("X", "x_31", 3))
val kMax = 4
val nMax = 2
for {
  (id, hitlist) <- hits.groupBy(_.acc)
  k <- 1 to kMax
  n <- 1 to nMax
} yield {
  val subId = "x_%s%s".format(k, n)
  val row = hitlist.find(h => h.subAcc == subId).getOrElse(Hit(id, subId, 0))
  println(row)
}
//Prints
Hit(X,x_11,1)
Hit(X,x_12,0)
Hit(X,x_21,2)
Hit(X,x_22,0)
Hit(X,x_31,3)
Hit(X,x_32,0)
Hit(X,x_41,0)
Hit(X,x_42,0)
If you provide more information on your hits lists then we could probably come up with something a little more accurate.
I have managed to solve this problem with the following code, I am putting it here as an answer in case someone else runs into a similar problem and requires some help. The use of find() from Noah's answer was definitely very useful, so do give him a +1 in case this code snippet helps you out.
val samples = hits.groupBy(_.sampleId).keys.toList.sorted
for ((id, hitlist) <- hits.groupBy(_.acc)) {
  val ratios =
    for (sample <- samples)
      yield hitlist.find(h => h.sampleId == sample).map(_.ratios)
        .getOrElse(List(Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN))
  println(id + "\t" + ratios.flatten.mkString("\t"))
}
I figure it's not a very elegant or efficient solution, as I have two calls to groupBy and I would be interested to see better solutions to this problem.
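One possible refinement along those lines (a sketch, assuming a Hit class with acc, sampleId and ratios fields and six ratios per sample, as above): build a Map from sampleId to ratios once per accession, so the repeated linear find() scans disappear and only a single groupBy remains.
case class Hit(acc: String, sampleId: String, ratios: List[Double])

def printTable(hits: List[Hit]): Unit = {
  val samples = hits.map(_.sampleId).distinct.sorted
  val padding = List.fill(6)(Double.NaN) // placeholder for missing groups
  for ((id, hitlist) <- hits.groupBy(_.acc)) {
    // one Map lookup per sample instead of a linear find() scan
    val bySample = hitlist.map(h => h.sampleId -> h.ratios).toMap
    val row = samples.flatMap(s => bySample.getOrElse(s, padding))
    println(id + "\t" + row.mkString("\t"))
  }
}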