xgboost R package early_stop_rounds does not trigger - classification

I am using xgb.train() in the xgboost R package to fit a classification model, and I am trying to figure out the best iteration at which to stop adding trees. I set early_stop_rounds=6 and, by watching each iteration's metrics, I can clearly see that the AUC on the validation data reaches its maximum and then decreases. However, the model does not stop and keeps going until the specified nrounds is reached.
Question 1: Is the best model (for the given parameters) the one obtained at the iteration where the validation performance starts to decrease?
Question 2: Why does the model not stop when the AUC on the validation data starts to decrease?
Question 3: What does maximize=FALSE mean? What will make training stop if it is set to FALSE? Does it have to be FALSE when early_stop_rounds is set?
Question 4: How does the model know which entry in the watchlist is the validation data? I've seen people use test=, eval=, validation1=, etc.
Thank you!
param <- list(
  objective = "binary:logistic",
  booster = "gbtree",
  eta = 0.02,             # control the learning rate
  max.depth = 3,          # maximum depth of a tree
  subsample = 0.8,        # subsample ratio of the training instances
  colsample_bytree = 0.5  # subsample ratio of columns when constructing each tree
)
watchlist <- list(train = mtrain, validation = mtest)
sgb_model <- xgb.train(params = param,        # the modeling parameters set above
                       data = mtrain,
                       scale_pos_weight = 1,
                       max_delta_step = 1,
                       missing = NA,
                       nthread = 2,
                       nrounds = 500,          # run up to 500 rounds in total
                       verbose = 2,
                       early_stop_rounds = 6,  # stop if performance does not improve for 6 rounds
                       watchlist = watchlist,
                       maximize = FALSE,
                       eval.metric = "auc"     # evaluate the model by AUC
                       # metric_name = 'validation-auc'
)

Answer 1: No, it is not necessarily the best model, but it is good enough from a bias-variance tradeoff point of view.
Answer 2: Early stopping does work, so there may be a problem with your code. Could you share the progress output of the train and validation AUCs at each boosting step to demonstrate the issue? If you are 100% sure it is not working, you can open an issue in the XGBoost GitHub project.
Answer 3: maximize=FALSE is meant for custom evaluation functions (say, a custom merror-type metric) where a lower value is better. You always want AUC to increase, so maximize=TRUE is the right setting for you.
Answer 4: It is mostly position-based: the training set goes first, and the next entry is treated as the validation/evaluation set.
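For comparison, here is a minimal sketch of the same setup through xgboost's Python interface, whose xgb.train mirrors the R one; the synthetic data and the names dtrain/dvalid are hypothetical and only illustrate that the last entry of the watchlist/evals is the set monitored for early stopping.

import numpy as np
import xgboost as xgb

# Hypothetical synthetic data standing in for the question's mtrain/mtest.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

dtrain = xgb.DMatrix(X[:800, :], label=y[:800])
dvalid = xgb.DMatrix(X[800:, :], label=y[800:])

params = {
    "objective": "binary:logistic",
    "eta": 0.02,
    "max_depth": 3,
    "subsample": 0.8,
    "colsample_bytree": 0.5,
    "eval_metric": "auc",
}

# The last entry of evals (the counterpart of the R watchlist) is the set
# monitored for early stopping; for "auc" the library maximizes the metric.
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dtrain, "train"), (dvalid, "validation")],
    early_stopping_rounds=6,
)
print(bst.best_iteration)  # best boosting round found by early stopping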

Related

Is H2O.ai's mini_batch_size really used?

The H2O documentation says:
mini_batch_size: Specify a value for the mini-batch size. (Smaller values lead to a better fit; larger values can speed up and generalize better.)
but when I run a model from the Flow UI (with mini_batch_size > 1), the log file contains:
WARN: _mini_batch_size Only mini-batch size = 1 is supported right now.
So the question is: is mini_batch_size really used?
It appears to be a leftover from preparation for a DeepWater integration that never happened. E.g. https://github.com/h2oai/h2o-3/search?l=Java&p=2&q=mini_batch_size
That makes sense, because the Hogwild! algorithm, which H2O's deep learning uses, does away with the need for batching the training data.
To sum up, I don't think it is used.

Is it necessary to initialize the weights every time you retrain the same model in MATLAB with nntool?

I know that for an ANN model the initial weights are random. If I train a model and repeat the training 10 times in nntool, are the weights re-initialized every time I click the training button, or does training continue from the weights that were just adjusted?
I am not sure whether the nntool you refer to uses the train method (see https://de.mathworks.com/help/nnet/ref/train.html).
I have used this method quite extensively and it works in a similar way to TensorFlow: you store a number of checkpoints and load the latest one to continue training from that point. The code would look something like this:
[feat,target] = iris_dataset;
my_nn = patternnet(20);
my_nn = train(my_nn,feat,target,'CheckpointFile','MyCheckpoint','CheckpointDelay',30);
Here we have requested that checkpoints be stored no more often than once every 30 seconds. When you want to continue training, the net must be loaded from the checkpoint file:
[feat,target] = iris_dataset;
load MyCheckpoint
my_nn = checkpoint.my_nn;
my_nn = train(my_nn,feat,target,'CheckpointFile','MyCheckpoint');
This solution involves training the network from the command line or via a script rather than using the GUI supplied by MathWorks. I honestly think the GUI is great for beginners, but if you want to do anything cleverer, start using the command line, or, even better, switch to libraries like Torch or TensorFlow!
Hope it helps!

How to use KNN to classify data in MATLAB?

I'm having problems understanding how K-NN classification works in MATLAB.
Here's the problem: I have a large dataset (65 features for over 1500 subjects) and its respective class labels (0 or 1).
According to what's been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data, and classify it via K-NN.
First of all, what's the best ratio for dividing the data into the 3 subgroups (1/3 of the dataset each?)?
I've looked into the ClassificationKNN/fitcknn functions, as well as the crossval function (ideally to divide the data), but I'm really not sure how to use them.
To sum up, I wanted to
- divide data into 3 groups
- "train" the KNN (I know it's not a method that requires training, but the equivalent to training) with the training subset
- classify the test subset and get its classification error/performance
- what's the point of having a validation test?
I hope you can help me, thank you in advance
EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:
nfeats = 60; ninds = 1000;
trainRatio = 0.8; valRatio = 0.1; testRatio = 0.1;
kmax = 100; % for instance...
data = randi(100, nfeats, ninds);
class = randi(2, 1, ninds);
[trainInd, valInd, testInd] = dividerand(ninds, trainRatio, valRatio, testRatio);
train = data(:, trainInd);
test = data(:, testInd);
val = data(:, valInd);
train_class = class(:, trainInd);
test_class = class(:, testInd);
val_class = class(:, valInd);
precisionmax = 0;
koptimal = 0;
for know = 1:kmax
    % is it the same thing to use knnclassify or fitcknn+predict??
    predicted_class = knnclassify(val', train', train_class', know);
    mdl = fitcknn(train', train_class', 'NumNeighbors', know);
    label = predict(mdl, val');
    consistency = sum(label == val_class') / length(val_class);
    if consistency > precisionmax
        precisionmax = consistency;
        koptimal = know;
    end
end
mdl_final = fitcknn(train', train_class', 'NumNeighbors', koptimal);
label_final = predict(mdl_final, test');
consistency_final = sum(label_final == test_class') / length(test_class);
Thank you very much for all your help
For your 1st question, "what's the best ratio to divide the 3 subgroups", there are only rules of thumb:
The amount of training data is the most important factor: the more, the better. So make the training set as big as possible, and definitely bigger than the test or validation sets.
Test and validation data serve a similar function, so it is convenient to assign them the same amount of data. It is important, though, that each is large enough to reveal over-adaptation, and both should be drawn from the full dataset completely at random.
Consequently, a 50/25/25 or 60/20/20 partitioning is quite common. But if your total amount of data is small in relation to the total number of weights of your chosen topology (e.g. 10 weights in your net and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be a better choice.
Concerning your 2nd question "what's the point of having a validation test?":
Typically, you train the chosen model on your training data and then estimate the "success" by applying the trained model to unseen data - the validation set.
If you were to completely stop your efforts to improve accuracy at this point, you indeed would not need three partitions of your data. But typically you feel that you can improve the model, e.g. by changing the number of weights or hidden layers, and now a big loop starts to run with many iterations:
1) change weights and topology, 2) train, 3) validate; not satisfied, go to 1)
The long-term effect of this loop is that you increasingly adapt your model to the validation data, so the results get better not because you are so intelligently improving your topology, but because you are unconsciously learning the properties of the validation set and how to cope with them.
Now, the final and only valid accuracy of your neural net is estimated on truly unseen data: the test set. This is done only once, and it is also useful for revealing over-adaptation. You are not allowed to start a second, even bigger loop now, so that any adaptation to the test set is ruled out!
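As a language-agnostic illustration of this discipline (the question itself is about MATLAB), here is a hedged sketch in Python with scikit-learn: k is chosen using the validation set only, and the test set is touched exactly once at the end. The data and all names are hypothetical.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Hypothetical random data, mirroring the question's randi() example.
rng = np.random.default_rng(0)
X = rng.integers(1, 101, size=(1000, 60))
y = rng.integers(0, 2, size=1000)

# Roughly 80/10/10: carve off the training set, then split the rest in half.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Model-selection loop: only the validation set is used to pick k.
best_k, best_acc = 1, 0.0
for k in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Final, one-time estimate on the held-out test set with the chosen k.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final_knn.score(X_test, y_test))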

Mahout: why does setProbes() have this effect?

I'm using Mahout 0.7 to do some classification. I have an encoder for a continuous variable:
ContinuousValueEncoder durationPlanEncoder = new ContinuousValueEncoder("duration_plan");
The feature associated with this encoder is a number of days and can range from about 6 to 16.
I'm using an OnlineLogisticRegression model and I use the encoder to train it:
durationPlanEncoder.addToVector(null, <duration_plan double val>, trainDataVector);
For simplicity (since I'm trying to understand this whole classification thing while also learning Mahout), I am using 2 variables: 1) a categorical variable with 6 categories -- one of which ("dev") always predicts the =1 category; and 2) this "duration_plan" variable.
What I expect to find is that, when I give the classifier test data consisting of the category "dev" and a "duration_plan" value, the accuracy of the classifier will increase as the "duration_plan" value gets closer to its average value across the training data. That is not what I'm seeing, however. Instead, the accuracy of the classifier improves as the value of "duration_plan" goes to 0.0. However -- there are no training vectors with duration_plan=0.0! Why would this be the case?
Then I modified my durationPlanEncoder as follows:
durationPlanEncoder.setProbes(2);
and the accuracy improved. It got even better when I made the number of probes 20, and then 200. Why? What is setProbes() doing, and is this an anomaly or is this actually how I should be doing it?
The final part of my question is to mention that, even after setting setProbes(20), changing the value of "duration_plan" in the test data has no effect on the accuracy of the classifier -- which I don't think is how it should be. If I give a value for duration_plan that doesn't even exist in any of the training data, and thus is never correlated with the =1 class, I would expect the classifier to classify the test sample as =0. Right? Which makes me think I must be coding something just plain wrong. Any suggestions are appreciated.
The Mahout documentation is woefully sparse.
Thanks.

What is the meaning of "drop" and "sgd" when training a custom NER model with spaCy?

I am training a custom NER model to identify organization names in addresses.
My training loop looks like this:
import random
from spacy.util import minibatch, compounding

for itn in range(100):
    random.shuffle(TRAIN_DATA)
    losses = {}
    batches = minibatch(TRAIN_DATA, size=compounding(15., 32., 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer,
                   drop=0.25, losses=losses)
    print('Losses', losses)
Can someone explain the parameters "drop", "sgd" and "size", and give some ideas on how I should change these values so that my model performs better?
You can find details and tips in the spaCy documentation:
https://spacy.io/usage/training#tips-batch-size:
The trick of increasing the batch size is starting to become quite popular ... In training the various spaCy models, we haven’t found much advantage from decaying the learning rate – but starting with a low batch size has definitely helped
batch_size = compounding(1, max_batch_size, 1.001)
This will set the batch size to start at 1, and increase each batch until it reaches a maximum size.
https://spacy.io/usage/training#tips-dropout:
For small datasets, it’s useful to set a high dropout rate at first, and decay it down towards a more reasonable value. This helps avoid the network immediately overfitting, while still encouraging it to learn some of the more interesting things in your data. spaCy comes with a decaying utility function to facilitate this. You might try setting:
dropout = decaying(0.6, 0.2, 1e-4)
https://spacy.io/usage/training#annotations:
sgd: An optimizer, i.e. a callable to update the model’s weights. If not set, spaCy will create a new one and save it for further use.
drop, sgd and size are some of the parameters you can customize to optimize your training.
drop is used to change the dropout rate.
size is used to change the batch size.
sgd is used to change various optimizer hyperparameters such as the learning rate, Adam's beta1 and beta2 parameters, gradient clipping and L2 regularisation.
I consider sgd to be a very important argument to experiment with.
To help you, I wrote a short blog post showing how to customize any spaCy parameters from your Python interpreter (e.g. a Jupyter notebook). No command-line interface required.
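Putting the quoted tips together, here is a minimal sketch (against the spaCy v2-style API that the question's loop uses) of a training loop with a decaying dropout rate and a compounding batch size. The tiny TRAIN_DATA, the blank "en" pipeline and the entity offsets are hypothetical stand-ins, not the asker's data.

import random
import spacy
from spacy.util import minibatch, compounding, decaying

# A tiny hypothetical training set in spaCy's (text, annotations) format.
TRAIN_DATA = [
    ("Acme Corp, 12 Main Street, Springfield", {"entities": [(0, 9, "ORG")]}),
    ("45 Oak Avenue, c/o Globex Inc", {"entities": [(19, 29, "ORG")]}),
]

nlp = spacy.blank("en")        # start from a blank English pipeline
ner = nlp.create_pipe("ner")   # add an empty NER component
nlp.add_pipe(ner)
ner.add_label("ORG")

# Decay dropout from 0.6 towards 0.2 and grow the batch size from 1 to 32,
# as suggested in the spaCy training tips quoted above.
dropout = decaying(0.6, 0.2, 1e-4)
optimizer = nlp.begin_training()

for itn in range(100):
    random.shuffle(TRAIN_DATA)
    losses = {}
    batches = minibatch(TRAIN_DATA, size=compounding(1., 32., 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer,
                   drop=next(dropout), losses=losses)
    print("Losses", losses)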