How to pool results of AUROC curves and Hosmer-Lemeshow test on multiply imputed data (mice)?

I want to pool the results of my validation analysis (AUROC curve/C-statistic with the 95% CI, calibration Hosmer-Lemeshow test and plots) across m imputed datasets. I followed these steps (https://stefvanbuuren.name/fimd/sec-knowledge.html):
Imputed the data
mids <- mice(data, m=10)
I transformed the data to long format
imp_long <- complete(mids, action="long", include=T)
I predicted the CVD risk for each individual (greyline model) and added the result as a column to the dataframe
imp_long$Pred.Risk
I changed the long format to a mids object
mids2 <- as.mids(imp_long)
The above all went well. Now I want to validate the risk prediction, by testing discrimination and calibration on the imputed data mids object, using the function with():
The area under the ROC curve
`roc <- with(imp_long, roc(Obs.Risk,Pred.Risk, plot=T, print.auc=T))` # works (gives 10x ROC)
`est1 <- summary(pool(roc))`
Gives error:
Error:
! All columns in a tibble must be vectors.
X Column fun.sesp is a function.
X Column call is a call.
Hosmer-Lemeshow test
`hl.test <- with(imp_long, hoslem.test(Obs.Risk,Pred.Risk, g=10))` # works, calculates 10x test result.
`est2 <- summary(pool(hl.test))`
which gives the error:
Error in 'summarize()':
! Problem while computing qbar=mean(.data$estimate)
i The error occurred in group 1: parameter=18.
Caused by error in .data$estimate:
! Column 'estimate' not found in .data
I think this issue has been reported before, but none of the reports has a clear answer that I can apply to my situation: (1) https://github.com/amices/mice/issues/218 and (2) Combining ROC estimates from multiple imputation data and (3) https://stats.stackexchange.com/questions/254283/how-to-pool-c-statistic-auroc-or-any-bounded-variable-after-using-multiple-imp
Any help is highly appreciated.
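For reference, here is a minimal sketch of one approach that is often suggested for bounded statistics such as the AUC: move each per-imputation estimate to the logit scale, combine with Rubin's rules, and back-transform. It is written in plain Python only to show the arithmetic. The function name `pool_auc_logit`, the example numbers, and the normal-approximation interval (rather than a t reference distribution) are assumptions made for illustration, not part of mice or pROC.

```python
import math
from statistics import mean, variance


def pool_auc_logit(aucs, ses, z=1.96):
    """Pool per-imputation AUCs with Rubin's rules on the logit scale.

    aucs: AUC from each imputed data set (each strictly between 0 and 1)
    ses:  standard error of each AUC on the original scale
    Returns (pooled_auc, lower, upper).
    """
    m = len(aucs)
    # Logit transform; the delta method gives the SE on the logit scale.
    logits = [math.log(a / (1 - a)) for a in aucs]
    logit_ses = [s / (a * (1 - a)) for a, s in zip(aucs, ses)]

    q_bar = mean(logits)                        # pooled estimate
    u_bar = mean([s ** 2 for s in logit_ses])   # within-imputation variance
    b = variance(logits) if m > 1 else 0.0      # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b         # Rubin's total variance

    half_width = z * math.sqrt(total_var)

    def expit(x):
        return 1 / (1 + math.exp(-x))

    return expit(q_bar), expit(q_bar - half_width), expit(q_bar + half_width)


# Hypothetical per-imputation results (m = 5), for illustration only.
aucs = [0.78, 0.80, 0.81, 0.79, 0.80]
ses = [0.020, 0.021, 0.019, 0.020, 0.022]
pooled, lower, upper = pool_auc_logit(aucs, ses)
print(f"pooled AUC = {pooled:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

The per-imputation AUCs and their standard errors would have to be extracted from each fitted ROC object first; an AUC of exactly 0 or 1 cannot be logit-transformed and would need separate handling.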

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so I use the titanic dataset to train a GLM for predicting the 'Fare' in that dataset.
I'm following closely the Spark documentation. I do get a working model (which I call glm_fare) but when I try to assess the trained model using summary I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as such:
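from pyspark.ml.regression import GeneralizedLinearRegression

# Gamma GLM with a log link, weighted by the 'wght' column
glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol="prediction",
    family="gamma",
    link="log",
    weightCol="wght",
    maxIter=20,
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary  # raises the RuntimeError shown above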
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of its eigenvalues is 0, which occurs when there is perfect multicollinearity in your variables. This means that one of the variables is an exact linear combination of the other variables. Consequently, the individual effects of those variables cannot be identified.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
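To illustrate the collinearity point, here is a small self-contained sketch in plain Python (no Spark). It uses the unweighted matrix X^T X as a stand-in for the Hessian, which is a simplification, and the helper names `gram` and `determinant` as well as the toy matrices are made up for this example.

```python
def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def determinant(A, tol=1e-12):
    """Determinant via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]  # work on a copy
    n = len(A)
    det = 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(A[r][k]))
        if abs(A[pivot][k]) < tol:
            return 0.0  # no usable pivot: the matrix is singular
        if pivot != k:
            A[k], A[pivot] = A[pivot], A[k]
            det = -det
        det *= A[k][k]
        for r in range(k + 1, n):
            factor = A[r][k] / A[k][k]
            for c in range(k, n):
                A[r][c] -= factor * A[k][c]
    return det


# Third column equals first + second: perfectly collinear.
X_collinear = [[1.0, 2.0, 3.0], [2.0, 1.0, 3.0], [3.0, 5.0, 8.0], [4.0, 0.0, 4.0]]
# Same first two columns, but an unrelated third column.
X_ok = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 5.0, 2.0], [4.0, 0.0, 7.0]]

print("det(X^T X), collinear features:", determinant(gram(X_collinear)))
print("det(X^T X), independent features:", determinant(gram(X_ok)))
```

A determinant of (numerically) zero means the matrix cannot be inverted. On real data a rank or condition-number check on the assembled feature matrix would serve the same purpose.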
The GeneralizedLinearRegressionModel docs note that a training summary may not be available for a model.
However, you can do an initial check to avoid the error. In PySpark, hasSummary is a public boolean property (not a method call), so use it like this:
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the PySpark source code, the GeneralizedLinearRegressionTrainingSummary class source code, and where the error is thrown.
Make sure the input values for your one-hot encoder start from 0.
One mistake I made that prevented the summary from being created: I passed quarter (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
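A toy sketch of why the 1-based codes produce a dead column, in plain Python rather than Spark. It assumes the encoder infers the number of categories as max index + 1 and drops the last category; `one_hot` and `all_zero_columns` are hypothetical helpers written for this example, not Spark APIs.

```python
def one_hot(indices, drop_last=True):
    """One-hot encode non-negative integer category indices.

    Assumption: the category count is inferred as max(index) + 1 and,
    when drop_last is True, the last category maps to an all-zero row.
    """
    size = max(indices) + 1
    width = size - 1 if drop_last else size
    rows = []
    for i in indices:
        row = [0] * width
        if i < width:
            row[i] = 1
        rows.append(row)
    return rows


def all_zero_columns(rows):
    """Return the positions of columns that are zero in every row."""
    return [j for j in range(len(rows[0])) if all(r[j] == 0 for r in rows)]


quarters = [1, 2, 3, 4, 1, 2, 3, 4]

raw = one_hot(quarters)                        # codes 1..4 used directly
shifted = one_hot([q - 1 for q in quarters])   # codes shifted to 0..3

print(len(raw[0]), all_zero_columns(raw))          # width 4, column 0 never set
print(len(shifted[0]), all_zero_columns(shifted))  # width 3, no dead column
```

A column that is zero for every row carries no information and makes the design matrix rank deficient, which would fit the non-invertible Hessian explanation given above.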

Netlab - How are the errors calculated?

I'm trying to optimise and validate a neural network using Netlab on Matlab.
I'd like to find the error value for each iteration, so I can see convergence on a plot. This can be done by storing the errors displayed in the command window (display is enabled by setting options(1) to 1) in errlog, which is an output of netopt.
However these errors are not the same as mlperr which gives an error value of 0.5*(sum of squares error) for the last iteration. I can't really validly use them if I don't know how they're calculated.
Does anybody know what the errors displayed in the command window represent (I'm using scaled conjugate gradient as my optimisation algorithm)?
Is there a way of storing the mlperr for each iteration that the network runs?
Any help is greatly appreciated, many thanks!
NB:
I have tried doing something similar to this :
ftp://ftp.dcs.shef.ac.uk/home/spc/com336/neural-lab-wk6.html
However it gives different results to running the network with the number of iterations specified under options(14) rather than k for some reason.
Yes, certainly.
The ERRLOG vector is created as an output of the network optimisation function netopt, with the following syntax:
[NET, OPTIONS, ERRLOG] = netopt(NET, OPTIONS, X, T, ALG)
Each row of ERRLOG gives 0.5*SSE (sum of squares error) for the corresponding iteration of network optimisation. This error is calculated between the predicted outputs (y) and the target outputs (t).
The MLPERR function has the following syntax:
E = mlperr(NET, X, T)
It also gives 0.5*SSE between predicted outputs (y) and target outputs (t), but as the network parameters are constant (NET should be pre-trained), E is a single scalar value.
If netopt was run with an ERRLOG output, and then MLPERR was run with the same network and variables, E should equal the value in the final row of ERRLOG (the error after the final iteration of network optimisation).
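The relationship between a per-iteration error log and a one-off error evaluation can be sketched with a toy example. This is plain Python with ordinary gradient descent on a one-input linear model, not Netlab and not scaled conjugate gradient; `half_sse` and `train` are made-up names standing in for the roles of mlperr and netopt.

```python
def half_sse(w, b, xs, ts):
    """0.5 * sum of squared errors for the linear model y = w*x + b."""
    return 0.5 * sum((w * x + b - t) ** 2 for x, t in zip(xs, ts))


def train(xs, ts, n_iter=50, lr=0.01):
    """Plain gradient descent; returns final parameters and an error log."""
    w, b = 0.0, 0.0
    errlog = []
    for _ in range(n_iter):
        grad_w = sum((w * x + b - t) * x for x, t in zip(xs, ts))
        grad_b = sum((w * x + b - t) for x, t in zip(xs, ts))
        w -= lr * grad_w
        b -= lr * grad_b
        # Log the error of the parameters *after* this iteration's update.
        errlog.append(half_sse(w, b, xs, ts))
    return w, b, errlog


xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]  # generated by t = 2*x + 1

w, b, errlog = train(xs, ts)
final_error = half_sse(w, b, xs, ts)  # one-off evaluation on the trained model

print(len(errlog), errlog[-1], final_error)
```

Evaluating the error function on the trained parameters reproduces the last entry of the log, and the full log is what you would plot to inspect convergence.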
Hope this is of some use to someone!

Multiple comparison for repeated measures ANOVA in matlab

I want to find possible differences between different conditions. I have n subjects for which I have a mean value for every condition for every subject respectively. The values between subjects vary a lot, that's why I wanted to perform a repeated measures anova to control for that.
My within subject factor would be the condition then and I don't have any between subjects factor.
So far I have the following code:
%% create simulated numbers
meanPerf = randn(20,3);
%% create a table array with the mean performance for every condition
tableData = table(meanPerf(:,1),meanPerf(:,2),meanPerf(:,3),'VariableNames',{'meanPerf1','meanPerf2','meanPerf3'})
tableInfo = table([1,2,3]','VariableNames',{'Conditions'})
%% fit repeated measures model to the table data
repMeasModel = fitrm(tableData,'meanPerf1-meanPerf3~1','WithinDesign',tableInfo);
%% perform repeated measures anova to check for differences
ranovaTable = ranova(repMeasModel)
My first question is: Am I doing this correctly?
The second question is: How can I perform a post hoc analysis to find out which of the conditions are significantly different from each other?
I tried using:
multcompare(ranovaTable,'Conditions');
but that produced the following error:
Error using internal.stats.parseArgs (line 42)
Wrong number of arguments.
I am using Matlab 2015b.
Would be great if you could help me out. I think I'm losing my mind over this.
Best,
Phill
I was trying the same thing using Matlab R2016a, and I get the following multcompare error message: "STATS must be a stats output structure from ANOVA1, ANOVA2, ANOVAN, AOCTOOL, KRUSKALWALLIS, or FRIEDMAN.".
However, this discussion was helpful for me:
https://www.mathworks.com/matlabcentral/answers/140799-3-way-repeated-measures-anova-pairwise-comparisons-using-multcompare
You might try something like:
multcompare(repMeasModel,'Factor1','By','Factor2')
I believe you'll need to create factors in the within structure of your model too.

Kevin Murphy's HMM Matlab toolbox assertion error

I am working on a project that needs to use hidden Markov models. I downloaded Kevin Murphy's toolbox, and I have some problems with its usage. On the toolbox webpage, he says that the first input of dhmm_em and dhmm_logprob is the symbol sequence data. In their examples, they give row vectors as data. So, when I give my symbol sequence as a row vector, I get this error:
??? Error using ==> assert at 9
assertion violated:
Error in ==> fwdback at 105
assert(approxeq(sum(alpha(:,t)),1))
Error in ==> dhmm_logprob at 17
[alpha, beta, gamma, ll] = fwdback(prior, transmat, obslik, 'fwd_only', 1);
Error in ==> mainCourseProject at 110
loglik(train_act) = dhmm_logprob(orderedSymbols, hmm{train_act}.prior, hmm{train_act}.trans, hmm{act}.emiss);
However, before giving this error, code works for some symbol vectors. When I give my data as column vector, functions work fine, no errors. So why exactly am I getting this error?
You might say that I should be giving not single vectors, but vector sets, I also tried to collect my feature vectors in a struct and give row vectors as such, but nothing changed, I still get assertion error.
By the way, my symbol sequence does not have any zeros, and I am doing everything almost the same as they showed in their examples, so I would be grateful if anyone could help me.
I'm not sure, but judging from the function call stack shown above, shouldn't the last argument be hmm{train_act}.emiss instead of hmm{act}.emiss?
In other words, when you compute the log-probability of a sequence, you should pass components that belong to the same HMM model (transition matrix, emission matrix, and prior probabilities).
By the way, the ASSERT in the code is a sanity check that a vector of probabilities should sum to 1. Oftentimes, when working with very small values (log-probabilities), numerical stability issues can creep in... You could edit the APPROXEQ function to relax the comparison a bit, by giving it a bigger margin of error.
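To make the sanity check concrete, here is a minimal sketch of a scaled forward pass for a discrete HMM in plain Python. It is not the toolbox's code: `forward_loglik` is a hypothetical function, symbols are zero-based here (unlike Matlab's one-based indexing), and the small model at the bottom is invented for illustration.

```python
import math


def forward_loglik(prior, trans, emiss, obs):
    """Scaled forward pass; returns (log-likelihood, list of normalised alphas).

    prior[i]    : P(state_1 = i)
    trans[i][j] : P(state_{t+1} = j | state_t = i)
    emiss[i][k] : P(symbol = k | state = i), symbols are zero-based
    """
    n = len(prior)
    loglik = 0.0
    alphas = []
    alpha = None
    for t, o in enumerate(obs):
        if t == 0:
            unnorm = [prior[i] * emiss[i][o] for i in range(n)]
        else:
            unnorm = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emiss[j][o]
                for j in range(n)
            ]
        scale = sum(unnorm)
        if scale == 0.0:
            # Nothing to normalise: the model gives this symbol zero probability.
            raise ValueError(f"symbol {o!r} at t={t} has zero probability under this model")
        alpha = [a / scale for a in unnorm]  # now sums to 1
        alphas.append(alpha)
        loglik += math.log(scale)
    return loglik, alphas


prior = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emiss = [[0.9, 0.1], [0.2, 0.8]]
loglik, alphas = forward_loglik(prior, trans, emiss, [0, 1])
print(loglik, [sum(a) for a in alphas])
```

After normalisation each alpha vector sums to 1, which is the kind of condition the ASSERT verifies. A zero sum is one way such a check could fail, for example when a symbol has zero emission probability in every state of the model being scored, which can happen if the emission matrix comes from a different model than the other parameters.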
This error message and the code it refers to are human-readable. An assertion is a guard put in by the programmer to ensure that certain conditions are met. In this case, what is the condition? It is approxeq(sum(alpha(:,t)),1). I'd venture to say that approxeq wants the values to be approximately equal, so this boils down to: sum(alpha(:,t)) must be approximately equal to 1 (this is not Matlab's ~= operator, which means "not equal").
Without knowing anything about the code, I'd also guess that these refer to probabilities. The probabilities of a node's edges must sum to one. Hopefully this starts you down a productive debugging path. If you can't figure out what's wrong with your input that produces this condition, start wading into the code a bit to see where this alpha vector comes from, and how it ended up invalid.

Matlab Weka Interface AdaBoost Issues: Out of Bounds Exception

I'm doing some cross-validation using a Matlab Weka Interface that I got from file exchange. My loop structure seems to work fine for Weka's Logistic classifier. However, when I try to do the exact same thing for AdaBoostM1, it throws the following error:
??? Java exception occurred: java.lang.ArrayIndexOutOfBoundsException
Error in ==> wekaClassify at 24
classProbs(t+1,:) = (classifier.distributionForInstance(testData.instance(t)))';
Error in ==> classifier_search at 225
[pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I have determined through some testing that this only occurs when the number of instances in the training set is greater than the number of instances in the test set. I am sure you can see why that is a problem for me, since in most situations the training set is greater than the test set in size.
Is there something different about how I should format my inputs when using Adaboost rather than Logistic? Any information you can give regarding this problem would be so helpful.
I downloaded this code from this page: http://www.mathworks.com/matlabcentral/fileexchange/21204-matlab-weka-interface
Emails bounce from the account of the guy who made it, and he doesn't seem to respond to comments on the page - I'm hoping that maybe someone here has used this.
EDIT: Here is the code that I use to train and test the classifier:
classifier = trainWekaClassifier(matlab2weka('training', featurelabels, train), 'meta.AdaBoostM1', { strcat('-P 100 -S 1 -I ', num2str(r), '-W weka.classifiers.trees.DecisionStump')});
[pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I haven't used this combination of software, so I can only take a guess at what could cause this.
Are your training/testing data matrices the right way round? They should be N-by-D (N instances, D features).
If you were passing in a D-by-N training matrix and a D-by-M testing matrix, then I would expect it to work only when M < N - which is what you describe - and even then, it wouldn't give a meaningful result.
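A quick way to rule this out is to check the shapes before handing the matrices to the classifier. The sketch below is plain Python on lists of rows, not the Matlab Weka interface; `check_orientation` and the toy data are hypothetical and only illustrate the N-by-D convention.

```python
def check_orientation(train, test, n_features):
    """Raise ValueError unless both matrices are N-by-D (one row per instance).

    Returns (number of training instances, number of test instances).
    """
    for name, matrix in (("train", train), ("test", test)):
        widths = {len(row) for row in matrix}
        if widths != {n_features}:
            raise ValueError(
                f"{name} matrix has row widths {sorted(widths)}, "
                f"expected {n_features} features per row"
            )
    return len(train), len(test)


def transpose(matrix):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]


# Toy data: 5 training instances and 2 test instances, 3 features each.
train_rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]
test_rows = [[0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]

print(check_orientation(train_rows, test_rows, n_features=3))
```

Passing `transpose(train_rows)` instead raises a ValueError, because each row would then hold 5 values (one per instance) rather than 3 features.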