I am trying to do regression with the fitrtree model. It works fine without cross-validation, but with cross-validation enabled the predict function returns an error.
%works fine
tree = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','off');
y_hat = predict(tree, xNew);
%Returns error
tree = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','on');
y_hat = predict(tree, xNew);
Error: Systems of classreg.learning.partition.RegressionPartitionedModel class cannot be used with the "predict"
command. Convert the system to an identified model first, such as by using the "idss" command.
Update: I figured out that when we use cross-validation of any sort, the trained models are in the Trained attribute of tree rather than in tree itself. What is this Trained attribute (tree.Trained{1}), and what information do we get from it?
If you choose a cross-validation method when calling fitrtree(), the output of the function is a RegressionPartitionedModel instead of a RegressionTree.
As you said, you can access objects of type RegressionTree stored in tree.Trained in your case. The number and meaning of the trees you find under this attribute depend on the cross-validation method. In your case, with leave-one-out cross-validation (LOOCV), the Trained attribute contains N RegressionTree objects, where N is the number of data points in your training set. Each of these regression trees is obtained by training on all but one of your data points; the left-out data point is used for testing.
For example, if you want to access the first and last trees obtained from cross-validation, and use them for separate predictions, you can do:
%Returns RegressionPartitionedModel
cv_trees = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','on');
%This is the number of regression trees stored in cv_trees for LOOCV
[N, ~] = size(trainingData);
%Use one of the models from the cross-validation as a predictor
y_hat = predict(cv_trees.Trained{1}, xNew);
y_hat_2 = predict(cv_trees.Trained{N}, xNew);
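If your goal is only to estimate how well the tree generalizes, you do not even need predict: kfoldLoss on the partitioned model returns the cross-validated mean squared error. A minimal sketch, reusing cv_trees, N, and xNew from above (averaging the fold predictions is just one possible way to combine them, not something fitrtree does for you):
% Cross-validated mean squared error of the partitioned model
cv_mse = kfoldLoss(cv_trees);
% One possible way to combine the N leave-one-out trees: average their predictions
preds = zeros(size(xNew,1), N);
for i = 1:N
    preds(:,i) = predict(cv_trees.Trained{i}, xNew);
end
y_hat_avg = mean(preds, 2);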
I am trying to use a Support Vector Machine to classify my data into 3 classes. I used this MATLAB function to train and cross-validate the SVM:
Mdl = fitcecoc(XTrain, yTrain, 'Learners', 'svm', 'ObservationsIn', 'rows', ...
'ScoreTransform', 'invlogit','Crossval','on', 'Holdout', 0.2);
where XTrain contains all of my data and yTrain is a cell array containing the class names assigned to the input data in XTrain.
The function above returns to me:
Mdl --> 1x1 ClassificationPartitionedECOC
My question is: what function do I have to use to make predictions on new data? In the case of binary classification, I built the SVM with 'fitcsvm' and then predicted the labels with:
[label, score] = predict(Mdl, XTest);
However, if I feed the ClassificationPartitionedECOC to the 'predict' function, it gives me this error:
No valid system or dataset was specified.
I haven't been able to find a function that allows me to perform prediction starting from the model format that I have, ClassificationPartitionedECOC.
Thanks for any help you may provide!
You can access learner i through:
Mdl.BinaryLearners{i}
This works because fitcecoc trains binary classifiers, just as you would with fitcsvm, in a one-versus-one fashion.
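Alternatively, and in line with the other answers here, since Mdl was cross-validated with a holdout partition it is a ClassificationPartitionedECOC, so the trained compact ECOC model for that fold should be available in Mdl.Trained{1} and can be passed to predict directly. A sketch, assuming XTest has the same predictors as XTrain:
% The holdout run produces a single trained fold (a CompactClassificationECOC);
% that object works with predict on new data
labels = predict(Mdl.Trained{1}, XTest);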
Once the classifier is trained and tested using a cross-validation approach, how does one use the results to validate it on unseen data, especially during the free-running / deployment stage? How does one use the learned model? The following code trains and tests the data X using cross-validation. How am I supposed to use the learned model after the line pred = predict(svmModel, X(istest,:)); is computed?
part = cvpartition(Y,'Holdout',0.5);
istrain = training(part); % Data for fitting
istest = test(part); % Data for quality assessment
balance_Train=tabulate(Y(istrain))
NumbTrain = sum(istrain); % Number of observations in the training sample
NumbTest = sum(istest);
svmModel = fitcsvm(X(istrain,:),Y(istrain), 'KernelFunction','rbf');
pred = predict(svmModel, X(istest,:));
% compute the confusion matrix
cmat = confusionmat(Y(istest),pred);
acc = 100*sum(diag(cmat))./sum(cmat(:))
The clue's in the name:
predict
Predict labels using support vector machine (SVM) classifier
Syntax
label = predict(SVMModel,X)
[label,score] = predict(SVMModel,X)
Description
label = predict(SVMModel,X) returns a vector of predicted class labels
for the predictor data in the table or matrix X, based on the trained
support vector machine (SVM) classification model SVMModel. The
trained SVM model can either be full or compact.
In the code in your question, everything from pred = ... onwards is there to evaluate the predictions made by your svmModel object. However, you can take the same object and use it to make predictions on further input datasets, or, better, train a second model using all the data and use that model for making actual predictions on new, unknown inputs.
You seem to be unclear on the role of (cross-)validation in model building. You should build your deployment model using the whole dataset (X, as per your comment), because as a rule more data always gives you a better model. To estimate how good this deployment model will be, you build one or more models from subsets of X and test each model against the rest of X that wasn't in that model's training subset. If you only do this once, this is called holdout validation; if you use multiple subsets and average the outcomes it's cross-validation.
If it's important to you for some reason that the deployed model is exactly the same one that you used to obtain your validation results, then you can deploy the model that was trained on the training partition of your holdout. But as I said, more training data usually results in a better model.
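A minimal sketch of that second option, reusing X, Y and the fitcsvm settings from the question (Xnew is a placeholder for whatever unseen data arrives at deployment time):
% Train the deployment model on ALL of the data (training + holdout partitions)
finalModel = fitcsvm(X, Y, 'KernelFunction','rbf');
% Use it on genuinely new inputs
predictedLabels = predict(finalModel, Xnew);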
I have a training dataset (50000 x 16) and a test dataset (5000 x 16); the 16th column in both datasets is the decision label, or response. The decision label in the test dataset is used for checking the classification accuracy of the trained classifier. I am using my training data for training and validating my cross-validated KNN classifier. I have created a cross-validated KNN classifier model using the following code:
X = Dataset2(1:50000,:); % Use some data for fitting
Y = Training_Label(1:50000,:); % Response of training data
%Create a KNN Classifier model
rng(10); % For reproducibility
Mdl = fitcknn(X,Y, 'Distance','cosine', 'NumNeighbors',10, 'DistanceWeight','equal', 'Standardize',1);
%Construct a cross-validated classifier from the model.
CVMdl = crossval(Mdl,'KFold', 10);
%Examine the cross-validation loss, which is the average loss of each cross-validation model when predicting on data that is not used for training.
kloss = kfoldLoss(CVMdl, 'LossFun', 'ClassifError')
% Compute validation accuracy
validationAccuracy = 1 - kloss;
Now I want to classify my test data using this cross-validated KNN classifier, but I can't really figure out how to do that. I have gone through the available examples in MATLAB but couldn't find a suitable function or example for doing this.
I know I can use the "predict" function for predicting the class labels of my test data if my classifier is not cross-validated. The code is as follows:
X = Dataset2(1:50000,:); % Use some data for fitting
Y = Training_Label(1:50000,:); % Response of training data
%Create a KNN Classifier model
rng(10); % For reproducibility
Mdl = fitcknn(X,Y, 'Distance','cosine', 'NumNeighbors',10, 'DistanceWeight','equal', 'Standardize',1);
%Classification using Test Data
Classifier_Output_Labels = predict(Mdl,TestDataset2(1:5000,:));
But I could not find any similar function (like "predict") for a cross-validated trained KNN classifier. I found the "kfoldPredict" function in the MATLAB documentation, but it says the function is used to evaluate the trained model:
http://www.mathworks.com/help/stats/classificationpartitionedmodel.kfoldpredict.html
However, I did not find any way to pass new data to this function.
So could anyone please advise me how to use the cross-validated KNN classifier model to predict labels of new data? Any help is appreciated and badly needed. :( :(
Let's say you are doing 10-fold cross-validation while learning the model. You can then use the kfoldLoss function to get the CV loss for each individual fold and choose the trained model that gives you the least CV loss, in the following way (using CVMdl, the cross-validated model from your code):
modelLosses = kfoldLoss(CVMdl, 'Mode', 'individual');
The above code will give you a vector of length 10 (10 CV error values) if you have done 10-fold cross-validation while learning. Assuming the trained model with the least CV error is the kth one, you would then use:
testSetPredictions = predict(CVMdl.Trained{k}, testSetFeatures);
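To find k without inspecting the losses by hand, take the index of the smallest one:
[~, k] = min(modelLosses);   % fold with the lowest cross-validation loss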
You seem to be confusing things here. Cross-validation is a tool for model selection and evaluation. It is not a training procedure per se. Consequently, you cannot "use" a cross-validated object; you predict using a trained object. Cross-validation is a way of estimating the generalization capabilities of a given model; it has nothing to do with the actual training. It is rather a small statistical experiment to assess a particular property.
I'm using MATLAB's Statistics and Machine Learning Toolbox to create decision trees, ensembles, KNN models, etc. I would like to separate my data into training/testing partitions, then have the models train and cross-validate using the training data (essentially splitting the training data into training and validation data) while preserving my testing data for error metrics. It is important that the models not be trained in any way using the testing data. For my decision tree, I have something like the following code:
chess = csvread(filename);
predictors = chess(:,1:6);
class = chess(:,7);
cvpart = cvpartition(class,'holdout', 0.3);
Xtrain = predictors(training(cvpart),:);
Ytrain = class(training(cvpart),:);
Xtest = predictors(test(cvpart),:);
Ytest = class(test(cvpart),:);
% Fit the decision tree
tree = fitctree(Xtrain, Ytrain, 'CrossVal', 'on');
% Error Metrics
testingLoss = loss(tree,Xtest,Ytest,'Subtrees','all'); % Testing
resubcost = resubLoss(tree,'Subtrees','all'); % Training
[cost,secost,ntermnodes,bestlevel] = cvloss(tree,'Subtrees','all'); % Cross Val
However, this returns
Undefined function 'loss' for input arguments of
type 'classreg.learning.partition.ClassificationPartitionedModel'.
when attempting to find the testing error. I have tried several combinations of similar methods using different types of classification algorithms, but keep coming back to not being able to apply test data to a cross-validated model because of the partitioned data. How am I supposed to apply test data to a cross-validated model?
When you use cross-validation in the call to fitctree, by default 10 model folds are constructed within the 70% of the data used to train the model. You can find the kfoldLoss (averaged over the model folds) via:
modelLoss = kfoldLoss(tree);
Since the original call to fitctree constructed 10 model folds, there are 10 separate trained models. Each of the 10 models is contained within a cell array located at tree.Trained. For example, you could use the first trained model to test the loss on your held-out data via:
testingLoss = loss(tree.Trained{1},Xtest,Ytest,'Subtrees','all'); % Testing
You can also use the kfoldLoss function to get the CV loss for each individual fold, and then choose the trained model that gives you the least CV loss, in the following way:
modelLosses = kfoldLoss(tree,'mode','individual');
The above code will give you a vector of length 10 if you have done 10-fold cross-validation while learning. Assuming the trained model with the least CV error is the kth one, you would then use:
testSetPredictions = predict(tree.Trained{k}, testSetFeatures);
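To pick k automatically and check that fold against the held-out test data (reusing Xtest and Ytest from your partition):
[~, k] = min(modelLosses);                            % fold with the lowest CV loss
testSetLoss = loss(tree.Trained{k}, Xtest, Ytest);    % loss on the untouched test set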
I cannot follow the crossval() and cvpartition() functions given in the MATLAB documentation for crossval(). What goes in the parameters, and how does it help to compare the performance and accuracy of different classifiers? I would be obliged if a simpler version were provided here.
Let's work on Example 2 from the CROSSVAL documentation.
load('fisheriris');
y = species;
X = meas;
Here we loaded the data from the example mat-file and assigned variables X and y. The meas matrix contains different measurements of iris flowers, and species holds the three classes of iris, which is what we are trying to predict from the data.
Cross-validation is used to train a classifier on the same data set many times. Basically, at each iteration you split the data set into training and test data. The proportion is determined by k, the number of folds. For example, if k is 10, 90% of the data will be used for training and the remaining 10% for testing, and you will have 10 iterations. This is done by the CVPARTITION function.
cp = cvpartition(y,'k',10); % Stratified cross-validation
You can explore the cp object if you type cp. and press Tab; you will see its different properties and methods. For example, find(cp.test(1)) will show the indices of the test set for the 1st iteration.
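A few of the things you will find there (standard cvpartition properties and methods):
cp.NumTestSets              % number of folds (10 here)
cp.TestSize                 % number of observations in each test fold
trainIdx = training(cp,1);  % logical index of the training set for fold 1
testIdx  = test(cp,1);      % logical index of the test set for fold 1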
The next step is to prepare the prediction function. This is probably where you had the main problem. The following statement creates a function handle using an anonymous function. The @(XTRAIN,ytrain,XTEST) part declares that this function has 3 input arguments. The next part, (classify(XTEST,XTRAIN,ytrain)), defines the function body, which takes the training data XTRAIN with the known classes ytrain and predicts the classes of the XTEST data with the generated model. (Those data come from cp, remember?)
classf = @(XTRAIN, ytrain, XTEST)(classify(XTEST, XTRAIN, ytrain));
Then we run the CROSSVAL function to estimate the misclassification rate (mcr), passing it the complete data set, the prediction function handle, and the partitioning object cp.
cvMCR = crossval('mcr',X,y,'predfun',classf,'partition',cp)
cvMCR =
0.0200
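To compare classifiers, evaluate a second prediction function on exactly the same partition cp and compare the misclassification rates. A sketch using fitcknn as the second classifier (any other fit function would work the same way):
% Same partition, different classifier: k-nearest neighbours
classfKNN = @(XTRAIN, ytrain, XTEST) predict(fitcknn(XTRAIN, ytrain, 'NumNeighbors', 5), XTEST);
cvMCR_knn = crossval('mcr', X, y, 'predfun', classfKNN, 'partition', cp);
% The classifier with the lower misclassification rate does better on this data
[cvMCR, cvMCR_knn]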