How to apply the learned model in Matlab after cross-validation

Once a classifier has been trained and tested using cross-validation, how does one use the result to make predictions on unseen data, especially during the free-running / deployment stage? How does one use the learned model? The following code trains and tests on the data X using cross-validation. How am I supposed to use the learned model after the line pred = predict(svmModel, X(istest,:)); has been computed?
part = cvpartition(Y,'Holdout',0.5);
istrain = training(part); % Data for fitting
istest = test(part); % Data for quality assessment
balance_Train=tabulate(Y(istrain))
NumbTrain = sum(istrain); % Number of observations in the training sample
NumbTest = sum(istest);
svmModel = fitcsvm(X(istrain,:),Y(istrain), 'KernelFunction','rbf');
pred = predict(svmModel, X(istest,:));
% compute the confusion matrix
cmat = confusionmat(Y(istest),pred);
acc = 100*sum(diag(cmat))./sum(cmat(:))

The clue's in the name:
predict
Predict labels using support vector machine (SVM) classifier
Syntax
label = predict(SVMModel,X)
[label,score] = predict(SVMModel,X)
Description
label = predict(SVMModel,X) returns a vector of predicted class labels
for the predictor data in the table or matrix X, based on the trained
support vector machine (SVM) classification model SVMModel. The
trained SVM model can either be full or compact.
In the code in your question, the code from pred = ... onwards is there to evaluate the predictions made by your svmModel object. However you can take the same object and use it to make predictions with further input dataset(s) - or, better, train a second model using all the data, and use that model for making actual predictions on new, unknown inputs.
You seem to be unclear on the role of (cross-)validation in model building. You should build your deployment model using the whole dataset (X, as per your comment), because as a rule more data gives you a better model. To estimate how good this deployment model will be, you build one or more models from subsets of X and test each model against the rest of X that wasn't in that model's training subset. If you only do this once, it is called holdout validation; if you use multiple subsets and average the outcomes, it is cross-validation.
If it's important to you for some reason that the deployed model is exactly the same one that you used to obtain your validation results, then you can deploy the model that was trained on the training partition of your holdout. But as I said, more training data usually results in a better model.
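The workflow described above can be sketched as follows. This is only an illustration under stated assumptions: it uses the ionosphere example data shipped with the Statistics and Machine Learning Toolbox in place of your X and Y, and Xnew is a hypothetical stand-in for data that arrives at deployment time.

```matlab
% Assumption: the ionosphere example data stands in for the asker's X and Y.
load ionosphere                     % X: 351x34 predictors, Y: cell array of 'b'/'g'
rng(1);                             % make the random folds reproducible

% 1) Estimate performance: 10-fold cross-validation builds 10 temporary models
cvModel = fitcsvm(X, Y, 'KernelFunction', 'rbf', 'KFold', 10);
cvAcc   = 100 * (1 - kfoldLoss(cvModel));   % estimated accuracy in percent

% 2) Build the deployment model from ALL of the data
finalModel = fitcsvm(X, Y, 'KernelFunction', 'rbf');

% 3) Use the deployment model on new, unseen observations
Xnew = X(1:5, :);                   % hypothetical placeholder for new data
[labelNew, scoreNew] = predict(finalModel, Xnew);
```

Here cvAcc is what you would report as the expected performance, while finalModel is the object you would keep (for example with save) and call predict on later.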

Related

What is the difference between cvpartition and crossvalind

load fisheriris;
y = species; %label
X = meas;
%Create a random partition for a stratified 10-fold cross-validation.
c = cvpartition(y,'KFold',10);
% split training/testing sets
[trainIdx testIdx] = crossvalind('HoldOut', y, 0.6);
crossvalind is used to perform cross-validation by randomly splitting the entire feature set X into training and testing data by returning the indices. Using the indices, we can create train and test data as X(trainIdx,:) and X(testIdx,:) respectively. cvpartition also splits the data using methods such as stratified and non-stratified but it does not return the indices. I have not seen examples where crossvalind is a stratified or non-stratified technique.
Question: Can crossvalind and cvpartition be used together?
I want to do stratified cross-validation. But I don't understand how to divide the data sets into train and test and get the indices.
Cross-validation and train/test partitioning are two different ways of estimating the performance of a model, not different ways of building the model itself. Usually you should build a model using all the data that you have, but also use one of these techniques (which build and score one or more additional models using subsets of that data) to estimate how good the main model is likely to be.
Cross-validation averages the outcome of multiple train/test splits, so it is usually expected to give a more realistic (i.e. more pessimistic) estimate of model performance.
Of the two functions you mention, crossvalind appears to be specific to the Bioinformatics Toolbox and is rather old. The help for cvpartition gives an example of how to do a stratified cross-validation:
Examples
Use a 10-fold stratified cross validation to compute the
misclassification error for classify on iris data.
load('fisheriris');
CVO = cvpartition(species,'k',10);
err = zeros(CVO.NumTestSets,1);
for i = 1:CVO.NumTestSets
    trIdx = CVO.training(i);
    teIdx = CVO.test(i);
    ytest = classify(meas(teIdx,:),meas(trIdx,:),...
        species(trIdx,:));
    err(i) = sum(~strcmp(ytest,species(teIdx)));
end
cvErr = sum(err)/sum(CVO.TestSize);
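To address the part of the question about getting the indices: a cvpartition object does give them back, as logical vectors, through its training and test methods. A minimal sketch on the iris data, where the variable names (Xtrain, hTrainIdx and so on) are chosen here for illustration only:

```matlab
load fisheriris
rng(0);                                   % reproducible partition

% Passing the class labels makes the K-fold partition stratified
c = cvpartition(species, 'KFold', 10);
for i = 1:c.NumTestSets
    trainIdx = training(c, i);            % logical index of training rows, fold i
    testIdx  = test(c, i);                % logical index of test rows, fold i
    Xtrain = meas(trainIdx, :);  ytrain = species(trainIdx);
    Xtest  = meas(testIdx, :);   ytest  = species(testIdx);
    % ... fit on (Xtrain, ytrain) and score on (Xtest, ytest) here
end

% A stratified holdout split works the same way (40% held out for testing)
h = cvpartition(species, 'HoldOut', 0.4);
hTrainIdx = training(h);
hTestIdx  = test(h);
```

With this, crossvalind is not needed for the splitting at all.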

matlab predict function error with fitrtree model

I am trying to do regression with fitrtree model. It works fine without the validation but with the validation the predict function returns an error.
%works fine
tree = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','off');
y_hat = predict(tree, xNew);
%Returns error
tree = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','on');
y_hat = predict(tree, xNew);
Error: Systems of classreg.learning.partition.RegressionPartitionedModel class cannot be used with the "predict"
command. Convert the system to an identified model first, such as by using the "idss" command.
Update: I figured out that when we use cross-validation of any sort, the model is in the Trained attribute of tree rather than in tree itself. What is this Trained attribute (tree.Trained{1}), and what information do we get from it?
If you choose a cross-validation method when calling fitrtree(), the output of the function is a RegressionPartitionedModel instead of a RegressionTree.
As you said, you can access objects of type RegressionTree stored in tree.Trained in your case. The number and meaning of the trees you find under this attribute depend on the cross-validation method. In your case, using leave-one-out cross-validation (LOOCV), the Trained attribute contains N RegressionTree objects, where N is the number of data points in your training set. Each of these regression trees is obtained by training on all but one of your data points. The left-out data point is used for testing.
For example, if you want to access the first and last trees obtained from cross-validation, and use them for separate predictions, you can do:
%Returns RegressionPartitionedModel
cv_trees = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','on');
%This is the number of regression trees stored in cv_trees for LOOCV
[N, ~] = size(trainingData);
%Use one of the models from the cross-validation as a predictor
y_hat = predict(cv_trees.Trained{1}, xNew);
y_hat_2 = predict(cv_trees.Trained{N}, xNew);
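A cross-validated object is mostly useful for estimating error, so a common pattern is to read the error estimate from it and then fit a separate, ordinary tree for prediction. The sketch below assumes the carsmall example data shipped with the toolbox in place of trainingData and target, and the xNew values are made up for illustration.

```matlab
% Assumption: carsmall example data replaces trainingData/target.
load carsmall
ok = ~any(isnan([Weight Horsepower MPG]), 2);   % drop rows with missing values
X  = [Weight(ok) Horsepower(ok)];
y  = MPG(ok);

% Leave-one-out CV returns a RegressionPartitionedModel holding N trees
cvTree   = fitrtree(X, y, 'MinLeafSize', 2, 'Leaveout', 'on');
cvMSE    = kfoldLoss(cvTree);       % mean squared error estimated by LOOCV
yHeldOut = kfoldPredict(cvTree);    % prediction for each point from the tree that did not see it

% Fit the tree that is actually used for prediction on all of the data
finalTree = fitrtree(X, y, 'MinLeafSize', 2);
xNew  = [3000 120];                 % hypothetical new car: weight, horsepower
y_hat = predict(finalTree, xNew);
```

kfoldPredict never takes new data; it only returns the out-of-fold predictions for the observations used in training.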

Predict labels for new dataset (Test data) using cross validated Knn classifier model in matlab

I have a training dataset (50000 x 16) and a test dataset (5000 x 16). The 16th column in both datasets is the decision label (response); the decision label in the test dataset is used for checking the classification accuracy of the trained classifier. I am using my training data for training and validating my cross-validated knn classifier. I have created a cross-validated knn classifier model using the following code:
X = Dataset2(1:50000,:); % Use some data for fitting
Y = Training_Label(1:50000,:); % Response of training data
%Create a KNN Classifier model
rng(10); % For reproducibility
Mdl = fitcknn(X,Y,'Distance', 'Cosine', 'Exponent', '', 'NumNeighbors', 10,'DistanceWeight', 'Equal', 'StandardizeData', 1);
%Construct a cross-validated classifier from the model.
CVMdl = crossval(Mdl,'KFold', 10);
%Examine the cross-validation loss, which is the average loss of each cross-validation model when predicting on data that is not used for training.
kloss = kfoldLoss(CVMdl, 'LossFun', 'ClassifError')
% Compute validation accuracy
validationAccuracy = 1 - kloss;
Now I want to classify my test data using this cross-validated knn classifier, but I can't figure out how to do that. I have gone through the available examples in Matlab but couldn't find any suitable function or example for doing this.
I know I can use the "predict" function for predicting the classlabels of my test data if my classifier is not cross validated. The code is as following :
X = Dataset2(1:50000,:); % Use some data for fitting
Y = Training_Label(1:50000,:); % Response of training data
%Create a KNN Classifier model
rng(10); % For reproducibility
Mdl = fitcknn(X,Y,'Distance', 'Cosine', 'Exponent', '', 'NumNeighbors', 10,'DistanceWeight', 'Equal', 'StandardizeData', 1);
%Classification using Test Data
Classifier_Output_Labels = predict(Mdl,TestDataset2(1:5000,:));
But I could not find any similar function (like "predict") for cross validated trained knn classifier. I found out the "kfoldPredict" function in Matlab documentation, but it says the function is used to evaluate the trained model.
http://www.mathworks.com/help/stats/classificationpartitionedmodel.kfoldpredict.html But I did not find any input of the new data through this function.
So could anyone please advise me how to use the cross validated knn classifier model to predict labels of new data? Any help is appreciated and badly needed. :( :(
Let's say you are doing 10-fold cross-validation while learning the model. You can then use the kfoldLoss function on the cross-validated object (CVMdl in your code) to get the CV loss for each fold, and then choose the trained model that gives you the least CV loss in the following way:
modelLosses = kfoldLoss(CVMdl,'mode','individual');
The above code will give you a vector of length 10 (10 CV error values) if you have done 10-fold cross-validation while learning. Assuming the trained model with least CV error is the 'k'th one, you would then use:
testSetPredictions = predict(CVMdl.Trained{k}, testSetFeatures);
You seem to be confusing things here. Cross-validation is a tool for model selection and evaluation. It is not a training procedure per se. Consequently, you cannot "use" a cross-validated object; you predict using a trained object. Cross-validation is a way of estimating the generalization capability of a given model. It has nothing to do with the actual training; rather, it is a small statistical experiment to assess a particular property.
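In the question's code this means that Mdl, which is already trained on all of the training data, is the object to pass to predict, while CVMdl only supplies the validation estimate. A small sketch of that split of roles, assuming the iris data in place of Dataset2 and using 'Standardize' as the standardization option:

```matlab
% Assumption: fisheriris stands in for the asker's training and test datasets.
load fisheriris
rng(10);                                        % for reproducibility
part = cvpartition(species, 'HoldOut', 0.2);    % keep 20% aside as the test set
Xtr = meas(training(part), :);  Ytr = species(training(part));
Xte = meas(test(part), :);      Yte = species(test(part));

% Ordinary model trained on all of the training data
Mdl = fitcknn(Xtr, Ytr, 'NumNeighbors', 10, 'Distance', 'cosine', ...
    'Standardize', true);

% Cross-validation is used only to estimate how well such a model generalizes
CVMdl = crossval(Mdl, 'KFold', 10);
validationAccuracy = 1 - kfoldLoss(CVMdl);

% Prediction on the test data uses Mdl, not CVMdl
predictedLabels = predict(Mdl, Xte);
testAccuracy = mean(strcmp(predictedLabels, Yte));
```

If validationAccuracy and testAccuracy differ a lot, that is a hint that the validation estimate was too optimistic, for example because of tuning against it.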

Matlab Machine Learning Train, Validate, Test Partitions

I'm using Matlab's Statistics and Machine Learning Toolbox to create decision trees, ensembles, Knn models, etc. I would like to separate my data into training/testing partitions, then have the models train and cross validate using the training data (essentially splitting the training data into training and validation data) while preserving my testing data for error metrics. It is important that the models not be trained in any way using the testing data. For my decision tree, I have something like the following code:
chess = csvread(filename);
predictors = chess(:,1:6);
class = chess(:,7);
cvpart = cvpartition(class,'holdout', 0.3);
Xtrain = predictors(training(cvpart),:);
Ytrain = class(training(cvpart),:);
Xtest = predictors(test(cvpart),:);
Ytest = class(test(cvpart),:);
% Fit the decision tree
tree = fitctree(Xtrain, Ytrain, 'CrossVal', 'on');
% Error Metrics
testingLoss = loss(tree,Xtest,Ytest,'Subtrees','all'); % Testing
resubcost = resubLoss(tree,'Subtrees','all'); % Training
[cost,secost,ntermnodes,bestlevel] = cvloss(tree,'Subtrees','all'); % Cross Val
However, this returns
Undefined function 'loss' for input arguments of
type 'classreg.learning.partition.ClassificationPartitionedModel'.
when attempting to find the testing error. I have tried several combinations of similar methods using different types of classification algorithms, but keep coming back to not being able to apply test data to a cross validated model due to partitioned data. How am I supposed to apply test data to a cross validated model?
When you use cross-validation in the call to fitctree, by default 10 model folds are constructed within the 70% of data used to train the model. You can find the cross-validation loss (averaged over the model folds) via:
modelLoss = kfoldLoss(tree);
Since the original call to fitctree constructed 10 model folds, there are 10 separate trained models. Each of the 10 models is contained within a cell array, located at tree.Trained. For example, you could use the first trained model to test the loss on your held-out data via:
testingLoss = loss(tree.Trained{1},Xtest,Ytest,'Subtrees','all'); % Testing
You can use the kfoldLoss function to also get the CV loss for each fold and then choose the trained model that gives you the least CV loss in the following way:
modelLosses = kfoldLoss(tree,'mode','individual');
The above code will give you a vector of length 10 if you have done 10-fold cross-validation while learning. Assuming the trained model with least CV error is the 'k'th one, you would then use:
testSetPredictions = predict(tree.Trained{k}, testSetFeatures);
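An alternative that keeps the test data untouched is to fit an ordinary tree on the training partition only, use cross-validation inside that partition (here through cvloss) to choose a pruning level, and apply loss to the test partition once at the end. This is a sketch, not the asker's exact setup: it assumes the iris data in place of the chess file.

```matlab
% Assumption: fisheriris replaces the chess data read from file.
load fisheriris
rng(1);
cvpart = cvpartition(species, 'HoldOut', 0.3);
Xtrain = meas(training(cvpart), :);  Ytrain = species(training(cvpart));
Xtest  = meas(test(cvpart), :);      Ytest  = species(test(cvpart));

% Ordinary ClassificationTree fitted on the training partition only
tree = fitctree(Xtrain, Ytrain);

% Cross-validate within the training data to pick the best pruning level
[cost, secost, ntermnodes, bestLevel] = cvloss(tree, 'Subtrees', 'all', 'KFold', 10);
prunedTree = prune(tree, 'Level', bestLevel);

% Error metrics: training error and error on the untouched test partition
trainingLoss = resubLoss(prunedTree);
testingLoss  = loss(prunedTree, Xtest, Ytest);
```

Because tree is a ClassificationTree rather than a ClassificationPartitionedModel, loss, resubLoss and cvloss are all defined for it.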

Implementing Logistic Regression in MATLAB

I have a data set of 13 attributes where some are categorical and some are continuous (can be converted to categorical). I need to use logistic regression to create a model that predicts the responses of a row and find the prediction's accuracy, sensitivity, and specificity.
Can/Should I use cross validation to divide my data set and get the results?
Is there any code sample on how to go about doing this? (I'm new to all of this)
Should I be using mnrfit/mnrval or glmfit/glmval? What's the difference and how do I choose?
Thanks!
If you want to determine how well the model can predict unseen data you can use cross validation. In Matlab, you can use glmfit to fit the logistic regression model and glmval to test it.
Here is a sample of Matlab code that illustrates how to do it, where X is the feature matrix and Labels is the class label for each case, num_shuffles is the number of repetitions of the cross-validation while num_folds is the number of folds:
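Fit = cell(num_shuffles,num_folds); % one vector of test-fold probabilities per cell
for j = 1:num_shuffles
    indices = crossvalind('Kfold',Labels,num_folds); % random fold assignment
    for i = 1:num_folds
        test = (indices == i); train = ~test;
        [b,dev,stats] = glmfit(X(train,:),Labels(train),'binomial','link','logit'); % Logistic regression
        Fit{j,i} = glmval(b,X(test,:),'logit')'; % predicted probabilities for fold i
    end
end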
Fit is then the fitted logistic regression estimate for each test fold. Thresholding this will yield an estimate of the predicted class for each test case. Performance measures are then calculated by comparing the predicted class label against the actual class label. Averaging the performance measures across all folds and repetitions gives an estimate of the model performance on unseen data.
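The thresholding and scoring steps described above might look like the following. This is a sketch on synthetic data: the data generation, the 0.5 threshold, and the use of cvpartition (instead of crossvalind) for the folds are all assumptions made here for illustration.

```matlab
% Assumption: synthetic two-feature data with 0/1 labels replaces the real data set.
rng(0);
n = 200;
X = randn(n, 2);
Labels = double(X(:,1) + 0.5*X(:,2) + 0.5*randn(n,1) > 0);

numFolds  = 5;
c = cvpartition(Labels, 'KFold', numFolds);     % stratified folds
predicted = zeros(n, 1);
for i = 1:numFolds
    trIdx = training(c, i);
    teIdx = test(c, i);
    b = glmfit(X(trIdx,:), Labels(trIdx), 'binomial', 'link', 'logit');
    p = glmval(b, X(teIdx,:), 'logit');         % predicted probability of class 1
    predicted(teIdx) = p >= 0.5;                % threshold into a class label
end

% Rows of the confusion matrix are true classes, columns are predicted classes
cm = confusionmat(Labels, predicted, 'Order', [0 1]);
TN = cm(1,1);  FP = cm(1,2);  FN = cm(2,1);  TP = cm(2,2);
accuracy    = (TP + TN) / n;
sensitivity = TP / (TP + FN);                   % true positive rate
specificity = TN / (TN + FP);                   % true negative rate
```

Each observation is predicted exactly once, by the model that did not see it, so the three measures are computed from out-of-fold predictions only.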
originally answered by BGreene on #Stats.SE.