Internal validation with small sample in R - prediction

I am doing a prediction task with a small sample size (300 points with 18 predictors) using R.
Following Frank Harrell's suggestion, I would like to go with bootstrapping.
Below is my understanding of the procedure.
The bootstrap is used only once per repetition, to generate the training data:
1. Create a bootstrapped sample that has the same size as the original sample.
2. Use the bootstrapped sample as the training set and validate on the original sample.
3. Average performance over 1000+ repetitions.
So I am using a simple call, sample(data, replace = TRUE), to accomplish the first step.
Is there anything I am missing in the procedure (for example, does another level of bootstrapping need to be done, or a particular kind of bootstrap)?
Thanks.
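The three steps above can be sketched in Python (illustrative only: the model here is a trivial mean predictor and the metric is mean squared error, both stand-ins for whatever model and score you actually use):

```python
import random
import statistics

def bootstrap_validate(data, fit, metric, n_reps=1000, seed=0):
    """Train on a bootstrap resample, evaluate on the original sample, average."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_reps):
        # 1. bootstrap sample with replacement, same size as the original
        boot = [rng.choice(data) for _ in data]
        # 2. fit on the bootstrap sample, validate on the original sample
        model = fit(boot)
        scores.append(metric(model, data))
    # 3. average performance over the repetitions
    return statistics.mean(scores)

# toy example: the "model" is just the training mean
fit_mean = lambda xs: statistics.mean(xs)
mse = lambda m, xs: statistics.mean([(x - m) ** 2 for x in xs])
est = bootstrap_validate([1.0, 2.0, 3.0, 4.0, 5.0], fit_mean, mse)
```

Note that in R, sample(data, replace = TRUE) resamples elements of a vector; for a data frame you would sample row indices instead, e.g. data[sample(nrow(data), replace = TRUE), ].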


Am I using too much training data in GEE?

I am running a classification script in GEE and I have about 2100 training points, since my AOI is a region in Italy with many classes. I receive the following error when I try to save my script:
Script error File too large (larger than 512KB).
I tried removing some of the training data and then it saves. I thought there was no limit in GEE on the number of training points. How can I find out what the limit is so I can adjust my training points, or is there a way to save the script without deleting any points?
Here is the link to my code
The Earth Engine Code Editor “drawing tools” are a convenient, but not very scalable, way to create geometry. The error you're getting is because “under the covers” they actually create additional code that is part of your script file. Not only is this fairly verbose (hence the error you received), it's not very efficient to run, either.
In order to use large training data sets, you will need to create your point data in another tool and upload it (using CSV or SHP files) to become one or more Earth Engine “table” assets, and use those from your script.

Is it necessary to initialize the weights every time when retraining the same model in MATLAB with nntool?

I know that for an ANN model the initial weights are random. If I train a model and repeat training 10 times with nntool, are the weights reinitialized every time I click the training button, or does training continue from the weights just adjusted?
I am not sure whether the nntool you refer to uses the train method (see https://de.mathworks.com/help/nnet/ref/train.html).
I have used this method quite extensively, and it works in a similar way to TensorFlow: you store a number of checkpoints and load the latest one to continue training from that point. The code would look something like this.
[feat,target] = iris_dataset;
my_nn = patternnet(20);
my_nn = train(my_nn,feat,target,'CheckpointFile','MyCheckpoint','CheckpointDelay',30);
Here we have requested that checkpoints are stored no more than once every 30 seconds. When you want to continue training, the net must be loaded from the checkpoint file as:
[feat,target] = iris_dataset;
load MyCheckpoint
my_nn = checkpoint.my_nn;
my_nn = train(my_nn,feat,target,'CheckpointFile','MyCheckpoint');
This solution involves training the network from the command line or via a script rather than using the GUI supplied by MathWorks. I honestly think the GUI is great for beginners, but if you want to do anything interestingly clever, start using the command line, or even better, switch to libraries like Torch or TensorFlow!
Hope it helps!
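The save/resume pattern above can be sketched framework-agnostically in Python (all names here are illustrative, not MATLAB's actual API): resuming from a checkpoint keeps the adjusted weights, whereas calling the initializer again gives a fresh random start.

```python
import pickle
import random

def init_weights(n, seed=None):
    """Fresh random initialization, as happens on a brand-new training run."""
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(n)]

def train_step(weights):
    """Stand-in for one weight update."""
    return [w * 0.9 for w in weights]

# first training run: random init, one update, then save a checkpoint
weights = init_weights(3, seed=42)
weights = train_step(weights)
checkpoint = pickle.dumps({"my_nn": weights})

# resuming: load the checkpoint instead of calling init_weights again,
# so training continues from the adjusted weights
state = pickle.loads(checkpoint)
resumed = state["my_nn"]
resumed = train_step(resumed)
```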

xgboost R package early_stop_rounds does not trigger

I am using xgb.train() in the xgboost R package to fit a classification model. I am trying to figure out the best iteration at which to stop adding trees. I set early_stop_rounds=6 and, by watching each iteration's metrics, I can clearly see that the AUC on the validation data reaches its maximum and then decreases. However, the model does not stop and keeps going until the specified nrounds is reached.
Question 1: Is the best model (for the given parameters) the one at the iteration where validation performance starts to decrease?
Question 2: Why does the model not stop when the AUC on the validation set starts to decrease?
Question 3: What does maximize=FALSE mean? What will make it stop if it is set to FALSE? Does it have to be FALSE when early_stop_rounds is set?
Question 4: How does the model know which entry is the validation data in the watchlist? I've seen people use test=, eval=, validation1=, etc.
Thank you!
param <- list(
  objective = "binary:logistic",
  booster = "gbtree",
  eta = 0.02,             # control the learning rate
  max.depth = 3,          # maximum depth of a tree
  subsample = 0.8,        # subsample ratio of the training instances
  colsample_bytree = 0.5  # subsample ratio of columns when constructing each tree
)
watchlist <- list(train = mtrain, validation = mtest)
sgb_model <- xgb.train(params = param,  # the modeling parameters set above
  data = mtrain,
  scale_pos_weight = 1,
  max_delta_step = 1,
  missing = NA,
  nthread = 2,
  nrounds = 500,          # run up to 500 rounds in total
  verbose = 2,
  early_stop_rounds = 6,  # if performance does not improve for 6 rounds, iteration stops
  watchlist = watchlist,
  maximize = FALSE,
  eval.metric = "auc"     # evaluate the model by AUC
  # metric_name = 'validation-auc'
)
Answer 1: No, not the best, but good enough from a bias-variance tradeoff point of view.
Answer 2: It works; there may be some problem with your code. Could you please share the progress output of the train and test set AUCs at each boosting step to show this? One thing to check is the argument name: recent versions of the R package spell it early_stopping_rounds, and a misspelled argument is silently ignored. If you are 100% sure it is not working, you can submit an error ticket in the XGBoost git project.
Answer 3: maximize=FALSE is for custom optimization functions (say, a custom merror type of thing). You always want to maximize/increase AUC, so maximize=TRUE is better for you.
Answer 4: It is mostly position based: the train part comes first, and the next entry goes into validation/evaluation.
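To make the stopping rule concrete, here is a small Python sketch of the logic early stopping applies (illustrative only, not the package's actual code): training stops once the metric has failed to improve for the given number of consecutive rounds.

```python
def early_stop_round(metric_history, stopping_rounds, maximize=True):
    """Return the 1-based round at which training would stop, or None.

    maximize=True means higher is better (e.g. AUC);
    maximize=False means lower is better (e.g. logloss, error)."""
    best = None
    since_best = 0
    for i, m in enumerate(metric_history, start=1):
        improved = best is None or (m > best if maximize else m < best)
        if improved:
            best, since_best = m, 0
        else:
            since_best += 1
            if since_best >= stopping_rounds:
                return i
    return None  # ran all rounds without triggering early stopping

# validation AUC rises, peaks at round 4, then declines
aucs = [0.60, 0.65, 0.70, 0.72, 0.71, 0.70, 0.69, 0.68, 0.67, 0.66]
stop = early_stop_round(aucs, stopping_rounds=6, maximize=True)  # round 10
```

With maximize=TRUE and 6 stopping rounds, the run above halts at round 10, six rounds after the peak at round 4; the best iteration is still round 4.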

Gaussian Mixture Modelling Matlab

I'm using a Gaussian Mixture Model to estimate the log-likelihood function (the parameters are estimated by the EM algorithm). I'm using MATLAB. My data is of size 17991402×1, i.e. 17,991,402 data points in one dimension.
When I run gmdistribution.fit(X,2) I get the desired output.
But when I run gmdistribution.fit(X,k) for k>2, the code crashes with the error "OUT OF MEMORY". I have also tried an open-source code which gives me the same problem. Can someone help me out here? I'm basically looking for code which will allow me to use different numbers of components on such a large dataset.
Thanks!
Is it possible for you to decrease the number of iterations? The default MaxIter is 100.
OPTIONS = statset('MaxIter',50,'Display','final','TolFun',1e-6)
gmdistribution.fit(X,3,OPTIONS)
Or you may consider under-sampling the original data.
A general solution to out of memory problem is described in this document.
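Under-sampling could look something like the following Python sketch (the data size is taken from the question; the sampling itself is generic, and in MATLAB the equivalent would be indexing with randperm):

```python
import random

def undersample(data, n, seed=0):
    """Draw a random subset of n points (without replacement) to fit on."""
    rng = random.Random(seed)
    return rng.sample(data, n)

# e.g. fit the mixture on 100k points instead of all ~18 million;
# range() stands in for indices into the 17991402x1 data vector
full = range(17_991_402)
subset_idx = undersample(full, 100_000)
```

Fitting on a random 100k subset usually gives component means and variances close to the full-data fit, at a fraction of the memory cost; the subset size is an assumption you would tune.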

Sample data for testing binary linear classification code

I am looking for some sample binary data for testing my linear classification code. I need a data set where the data is 2D and belongs to either one of two classes. If anyone has such data or any reference for the same, kindly reply. Any help is appreciated.
I have my own dataset which contains 2 categories of data with 2 features each.
http://dl.dropbox.com/u/28068989/segmentation_mi_kit.zip
Extract this archive and go to 'segmentation_mi_kit/mango_banyan_dataset/'
Alternatively, if you want something standard to test your algorithm on, have a look at the UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/
I guess that's the kind of data you need.
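If the links above go stale, a 2D two-class test set is also easy to generate yourself. A minimal Python sketch (the class centers and spread are arbitrary choices that make the classes roughly linearly separable):

```python
import random

def make_two_class_2d(n_per_class=100, seed=0):
    """Two Gaussian blobs in 2D, labelled 0 and 1."""
    rng = random.Random(seed)
    X, y = [], []
    for label, (cx, cy) in enumerate([(-2.0, -2.0), (2.0, 2.0)]):
        for _ in range(n_per_class):
            # sample a point around this class center
            X.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            y.append(label)
    return X, y

X, y = make_two_class_2d()
```

Moving the centers closer together (or widening the standard deviation) gives overlapping classes, useful for checking how the classifier behaves when the data is not perfectly separable.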