How to generate confidence intervals for predictions of hurdle models

I'm using the 'pscl' package to build hurdle models, in order to model the abundance of birds as a function of environmental variables.
I am interested in predicting the probability of occurrence and abundance in non-sampled areas, in order to generate population estimates for the species.
I was able to generate the models and predictions, but I am having difficulty producing confidence intervals for these estimates.
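A minimal bootstrap sketch of what such intervals could look like. The data frame birds, the response count and the covariates elev and forest are all hypothetical placeholders; adapt them to your own data:

```r
library(pscl)

## fit a negative-binomial hurdle model (formula and data are placeholders)
fit <- hurdle(count ~ elev + forest, data = birds, dist = "negbin")

## hypothetical prediction grid standing in for the non-sampled areas
newdat <- data.frame(elev   = seq(100, 1000, length.out = 50),
                     forest = mean(birds$forest))

set.seed(1)
B <- 500
boot_pred <- replicate(B, {
  idx   <- sample(nrow(birds), replace = TRUE)
  refit <- hurdle(count ~ elev + forest, data = birds[idx, ], dist = "negbin")
  predict(refit, newdata = newdat, type = "response")  # expected abundance
})

## percentile intervals for the expected abundance at each prediction point
ci <- apply(boot_pred, 1, quantile, probs = c(0.025, 0.975))
```

This resamples the sampled sites, refits the hurdle model on each bootstrap sample, and takes percentile intervals of the resulting predictions; it ignores spatial structure, which you may need to account for.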

Related

Statistical testing of classifier performance on different population subsets

I have a dataset of people's medical records which contains people of different races, income levels, genders, etc.
I have a binary classification problem, and want to compare the accuracy of the two models:
Model 1: I have trained and tested a classifier on a random sample from the population.
Model 2: I have trained and tested a classifier on a sample of the population where the income level is above say 100k.
The test sets are of different sizes (both contain over 50,000 people). The test sets may contain some of the same people, of course - I have a lot of data and will be doing the above experiment for many different conditions, so I am not mentioning how many people overlap between the sets, as it will change depending on the condition.
I believe I can't use a standard or modified t-test to compare the performance on the separate test sets, since the demographics are different in the two test sets - is this correct?
Instead, I think the only option is to compare the accuracy of model 1 on the same test set as model 2, to figure out whether model 2 performs better?
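If you do score both models on the same test set, one hedged sketch of a paired comparison is McNemar's test on the cases the two models disagree on. The vectors pred1, pred2 and truth below are hypothetical predictions and labels on that shared test set:

```r
## was each shared test-set case classified correctly?
correct1 <- pred1 == truth   # model 1 on the shared test set
correct2 <- pred2 == truth   # model 2 on the same test set

## McNemar's test uses only the discordant pairs, i.e. cases that exactly
## one of the two models classifies correctly
tab <- table(model1 = correct1, model2 = correct2)
mcnemar.test(tab)
```

A paired test like this avoids treating the two sets of predictions as independent samples, which they are not when they come from the same people.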

Pooling sensitivity, specificity and other classification measures for different cut-off points after MICE

I am assessing the performance of a logistic regression model in a test dataset with incomplete data. I have imputed values for both the predictors and outcome using MICE.
I am now trying to figure out what the best way is to pool classification measures (i.e., sensitivity, specificity, PPV, NPV) across the imputed datasets. There are many posts on this website asking about how to pool AUC values across the different datasets, but I could not find any guidance on here on how to pool classification measures.
One approach would be to calculate the classification measures for all relevant cut-off points in each of the m imputed datasets separately and to then simply average the estimates for each cut-off point. Would this approach be okay?
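A minimal sketch of that "compute per imputed dataset, then average" approach, assuming a mids object imp from mice(), an already-fitted logistic model fit, and a binary outcome column outcome (all names illustrative):

```r
library(mice)

cutoffs <- seq(0.05, 0.95, by = 0.05)

## sensitivity and specificity at every cut-off, in each imputed dataset
per_m <- lapply(seq_len(imp$m), function(i) {
  dat <- complete(imp, i)
  p   <- predict(fit, newdata = dat, type = "response")
  sapply(cutoffs, function(cut) {
    pred <- as.integer(p >= cut)
    c(sens = sum(pred == 1 & dat$outcome == 1) / sum(dat$outcome == 1),
      spec = sum(pred == 0 & dat$outcome == 0) / sum(dat$outcome == 0))
  })
})

## simple average of the m estimates for every cut-off point
pooled <- Reduce(`+`, per_m) / length(per_m)
```

Note that this simple averaging pools the point estimates only; measures that are bounded or skewed are sometimes transformed (e.g. logit) before averaging.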

Prediction limit in normal regression and survival regression

I am trying to predict the duration it takes for gas pipes to leak. I used 15 features, of which the most important one is "pipe installation year". The latest leak data I have is for a leak that happened in 2017, and that pipe was installed in 2009. I know that the normal ML models I built will not be able to do a good job of predicting the leak duration for pipes installed after 2009. The reason I say this is that I first sorted the data by "installation year" and then did a train/test split to see how the model performs on the test dataset: I got 93% R squared. But when I turned the shuffle option off in the train/test split (meaning that, unlike a normal train/test split where the subsets are chosen randomly, the data stay in order, with the first 80% used for training and the last 20% for testing), to see whether the model can predict pipes whose "year installed" was not in the training data, I only got 30% R squared. I know that "installation year" is a pretty important feature and that the ML model cannot predict pipes whose "installation year" was not seen during training.
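For concreteness, a small sketch of the two splits described above, with a placeholder data frame pipes and column install_year:

```r
## chronological split: sort by installation year, hold out the newest 20%
pipes    <- pipes[order(pipes$install_year), ]
n        <- nrow(pipes)
split_at <- floor(0.8 * n)
train_chron <- pipes[1:split_at, ]
test_chron  <- pipes[(split_at + 1):n, ]

## random (shuffled) 80/20 split for comparison
idx <- sample(n, size = split_at)
train_rand <- pipes[idx, ]
test_rand  <- pipes[-idx, ]
```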
I am also using survival regressions on top of the normal ML models. I am not sure whether I will have the same problem with the Cox PH model and other multivariate survival models. Will the Cox PH model be able to predict the hazard ratio and survival function for the pipes that were installed after 2009?
Will coxph be able to predict the hazard ratio and survival function for the pipes that were installed after 2009?
coxph should be able to calculate the hazard ratio and survival function for a given period (that's what it is supposed to do). You can run it and plot a KM curve to see whether the results make sense and whether you can use them.
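A rough sketch of what that could look like, with placeholder variable names (pipes, time_to_leak, leaked, install_year, diameter, material); note that predicting for pipes installed after 2009 is still an extrapolation in install_year:

```r
library(survival)

## Cox model on the historical pipes (formula and names are placeholders)
fit <- coxph(Surv(time_to_leak, leaked) ~ install_year + diameter + material,
             data = pipes)

## pipes installed after 2009 (likely all censored in the training data)
new_pipes <- subset(pipes, install_year > 2009)

hr   <- predict(fit, newdata = new_pipes, type = "risk")  # relative hazard
surv <- survfit(fit, newdata = new_pipes)                 # predicted survival curves
plot(surv, xlab = "Years in service", ylab = "Predicted survival")
```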

Shouldn't we take average of n models in cross validation in linear regression?

I have a question regarding cross validation in Linear regression model.
From my understanding, in cross validation we split the data into (say) 10 folds, train on 9 folds, and use the remaining fold for testing. We repeat this process until all folds have been used for testing, so that every fold is tested exactly once.
When we train the model on 9 folds, should we not get a different model each time (maybe slightly different from the model we would have created using the whole dataset)? I know that we take an average of all the "n" performances.
But what about the model? Shouldn't the resulting model also be taken as the average of all the "n" models? I see that the resulting model is the same as the model created using the whole dataset before cross-validation. If we keep the overall model even after cross-validation (and do not take the average of all the models), then what's the point of calculating the average performance of n different models (since they are trained on different folds of the data and are supposed to be different, right?)
I apologize if my question is not clear or sounds silly.
Thanks for reading, though!
I think there is some confusion in some of the proposed answers because of the use of the word "model" in the question. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K different predictors (or decision functions), which you call "models" (this is unfortunate terminology, because in machine learning we also do model selection, which is choosing between families of predictors, and that is something which can also be done using cross-validation). Cross-validation is typically used for hyperparameter selection or to choose between different algorithms or different families of predictors. Once these are chosen, the most common approach is to relearn a predictor, with the selected hyperparameters and algorithm, from all the data.
However, if the loss function that is optimized is convex with respect to the predictor, then it is possible to simply average the different predictors obtained from each fold.
This is because, for a convex risk, the risk of the average of the predictors is always smaller than or equal to the average of their individual risks (by Jensen's inequality).
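For illustration, a minimal sketch (with made-up names dat, newdat, y, x1, x2) of averaging the K fold predictors for a linear model, next to the retrain-on-everything alternative:

```r
K <- 10
folds <- sample(rep(1:K, length.out = nrow(dat)))

## one predictor per fold, each trained on the other K-1 folds
fold_fits <- lapply(1:K, function(k) lm(y ~ x1 + x2, data = dat[folds != k, ]))

## averaged predictor: mean of the K fold models' predictions on new data
pred_avg <- rowMeans(sapply(fold_fits, predict, newdata = newdat))

## retraining alternative: a single model refit on all the data
pred_full <- predict(lm(y ~ x1 + x2, data = dat), newdata = newdat)
```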
The PROs and CONs of averaging (vs. retraining) are as follows:
PROs: (1) In each fold, the evaluation that you made on the held-out set gives you an unbiased estimate of the risk of the very predictors that you obtained, and for these estimates the only source of uncertainty comes from estimating the empirical risk (the average of the loss function) on the held-out data.
This should be contrasted with the logic used when you retrain, which is that the cross-validation risk is an estimate of the "expected risk of a given learning algorithm" (and not of a given predictor), so that if you relearn from data drawn from the same distribution, you should on average obtain the same level of performance. But note that this holds only on average, and when retraining on the whole data the performance could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected for exactly the number of data points that you used in each fold for learning. If you relearn from the whole dataset, the optimal value of the hyperparameters is, in theory and in practice, not the same anymore; so with the retraining approach you really cross your fingers and hope that the hyperparameters you have chosen are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern, and if the number of data points is large, with 10-fold CV you should be fine. But if you are learning from 25 data points with 5-fold CV, the optimal hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training on all the data at once.
There is unfortunately very little thorough theory on this, but the following two papers (especially the second) consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model, rather than to improve it.
Think of trying out different model types and parametrizations, which are differently well suited to your problem. Using CV, you obtain many different estimates of how each model type and parametrization would perform on unseen data. From those results you usually choose one well-suited model type + parametrization, which you then train again on all of the (training) data. The reason for doing this many times (different partitions with repeats, each using different splits) is to get a stable estimate of the performance, which enables you to e.g. look at the mean/median performance and its spread (this tells you how well the model usually performs and how likely it is to get better/worse results just by being lucky/unlucky).
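As a toy illustration of that workflow (placeholder data frame dat with response y and predictors x1, x2): use repeated CV to compare candidates, then refit the chosen one on all the data.

```r
K <- 10; R <- 5   # 10-fold CV, repeated 5 times
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

## repeated-CV error of one candidate formula
cv_rmse <- function(form) {
  replicate(R, {
    folds <- sample(rep(1:K, length.out = nrow(dat)))
    mean(sapply(1:K, function(k) {
      fit <- lm(form, data = dat[folds != k, ])
      rmse(dat$y[folds == k], predict(fit, newdata = dat[folds == k, ]))
    }))
  })
}

## compare candidates by mean CV performance (and look at the spread)
scores <- list(small = cv_rmse(y ~ x1), large = cv_rmse(y ~ x1 + x2))
sapply(scores, mean)

## refit the chosen candidate on all of the (training) data for final use
final <- lm(y ~ x1 + x2, data = dat)
```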
Two more things:
Usually, using CV will improve your results in the end - simply because you take a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained, models to obtain a single result. It's one way to use an ensemble of models instead of a single one. But even for those, you want to use CV in the end to choose a reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated CV, your seemingly best model is likely to be only a mediocre model when you score it on new data...

How to quantify similarity of tree models? (XGB, Random Forest, Gradient Boosting, etc.)

Are there any algorithms that quantify the similarity of tree-based models such as XGB? For example, I train two XGB models on different datasets (e.g., in cross validation) and want to estimate the robustness or consistency of the predictions, and maybe how the features are used.
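One hedged way to get at this is to compare the two models' predictions on a common reference set and to compare their feature-importance profiles. A sketch with placeholder objects bst1, bst2 (fitted xgboost models) and Xref (a shared feature matrix):

```r
library(xgboost)

## agreement of predictions on a shared reference matrix
p1 <- predict(bst1, Xref)
p2 <- predict(bst2, Xref)
pred_sim <- cor(p1, p2, method = "spearman")

## agreement of feature usage via gain-based importance
imp1 <- xgb.importance(model = bst1)
imp2 <- xgb.importance(model = bst2)
common   <- merge(imp1, imp2, by = "Feature")          # Gain.x / Gain.y, etc.
gain_sim <- cor(common$Gain.x, common$Gain.y, method = "spearman")
```

Neither number is a standard "model similarity" metric; they are just two interpretable summaries (prediction agreement and importance-ranking agreement) that can be tracked across folds.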