I do not understand what the MATLAB function "crossval" takes as its first parameter. I understand that it is a function for performing a regression, but I don't get what is meant by "some criterion testval". I need to use it on a k-NN regressor, but the examples are not making everything clear to me.
vals = crossval(fun,X)
Each time it is called, fun should use XTRAIN to fit a model, then
return some criterion testval computed on XTEST using that fitted
model.
Here is where I am reading: the MATLAB reference for crossval.
It should be similar to optimization functions, where the value returned by your fitting function fun indicates how well the model fits the data. As the documentation states, fun takes two arguments: a training data set XTRAIN and a testing data set XTEST.
If your data X comprises a column of known results, X(:,1), and further columns of features, X(:, 2:end), and you fit your model using XTRAIN, then your return value could be as simple as the sum-squared error of the fitted model:
testval = sum( (model(XTEST(:, 2:end)) - XTEST(:, 1)).^2 );
where model(XTEST(:, 2:end)) is the result of your fitted model on the features of the testing data set, XTEST, and XTEST(:, 1) are the known results for those feature sets.
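For a k-NN regressor, a minimal sketch could look like the following, assuming (as above, and purely for illustration) that the response is in X(:,1), the features are in X(:,2:end), and knnsearch from the Statistics and Machine Learning Toolbox is available:
k = 5;  % number of neighbours, chosen here for illustration
fun = @(XTRAIN, XTEST) knnSSE(XTRAIN, XTEST, k);
vals = crossval(fun, X);

% In its own file, or as a local function:
function testval = knnSSE(XTRAIN, XTEST, k)
    % Find the k nearest training points for each test point
    idx = knnsearch(XTRAIN(:,2:end), XTEST(:,2:end), 'K', k);
    % k-NN regression: predict the mean response of the neighbours
    y = XTRAIN(:,1);
    yhat = mean(y(idx), 2);
    % The criterion testval: sum-squared error on this test fold
    testval = sum((yhat - XTEST(:,1)).^2);
end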
I am trying to get the 5-fold cross-validation error of a model created with TreeBagger using the function crossval, but I keep getting an error:
Error using crossval>evalFun
The function 'regrTree' generated the following error:
Too many input arguments.
My code is below. Can anyone point me in the right direction? Thanks
%Random Forest
%% XX is the training data matrix, Y is the training labels vector
XX=X_Tbl(:,2:end);
Forest_Mdl = TreeBagger(1000,XX,Y,'Method','regression');
err_std = crossval('mse',XX,Y,'Predfun',@regrTree, 'kFold',5);
function yfit_std = regrTree(Forest_Mdl,XX)
yfit_std = predict(Forest_Mdl,XX);
end
Reading the documentation helps a lot! The function has to be defined as follows (note that it takes 3 arguments, not 2):
function yfit = myfunction(Xtrain,ytrain,Xtest)
% Calculate predicted response
...
end
Xtrain — Subset of the observations in X used as training predictor
data. The function uses Xtrain and ytrain to construct a
classification or regression model.
ytrain — Subset of the responses in y used as training response data.
The rows of ytrain correspond to the same observations in the rows of
Xtrain. The function uses Xtrain and ytrain to construct a
classification or regression model.
Xtest — Subset of the observations in X used as test predictor data.
The function uses Xtest and the model trained on Xtrain and ytrain to
compute the predicted values yfit.
yfit — Set of predicted values for observations in Xtest. The yfit
values form a column vector with the same number of rows as Xtest.
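Applied to the code in the question, a corrected version might look like the sketch below. Note that the forest is retrained inside the function on each training fold (so 1000 trees per fold can be slow), and the crossval call handles the partitioning:
% XX is the training data matrix, Y is the training labels vector
XX = X_Tbl(:,2:end);
err_std = crossval('mse', XX, Y, 'Predfun', @regrTree, 'kFold', 5);

function yfit_std = regrTree(Xtrain, ytrain, Xtest)
    % Train a fresh forest on the training fold...
    Forest_Mdl = TreeBagger(1000, Xtrain, ytrain, 'Method', 'regression');
    % ...and predict responses for the held-out fold
    yfit_std = predict(Forest_Mdl, Xtest);
end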
I am trying to use a Support Vector Machine to classify my data in 3 classes. I used this Matlab function to train and cross-validate the SVM:
Mdl = fitcecoc(XTrain, yTrain, 'Learners', 'svm', 'ObservationsIn', 'rows', ...
'ScoreTransform', 'invlogit','Crossval','on', 'Holdout', 0.2);
where XTrain contains all of my data and yTrain is a cell array containing the class names to be assigned to the input data in XTrain.
The function above returns to me:
Mdl --> 1x1 ClassificationPartitionedECOC
My question is, what function do I have to use in order to make predictions on new data? In the case of binary classification, I built the SVM with 'fitcsvm' and then predicted the labels with:
[label, score] = predict(Mdl, XTest);
However, if I feed the ClassificationPartitionedECOC to the 'predict' function, it gives me this error:
No valid system or dataset was specified.
I haven't been able to find a function that allows me to perform prediction starting from the model format that I have, ClassificationPartitionedECOC.
Thanks for any help you may provide!
Because you trained with 'Crossval','on', Mdl is a cross-validated (partitioned) model, and predict cannot be applied to it directly. The trained classifier is stored in the Trained property: Mdl.Trained{1} is a CompactClassificationECOC that predict accepts. Internally, fitcecoc just trains binary classifiers, as you would with fitcsvm, in a one-versus-one fashion by default; you can access learner i of the trained model through Mdl.Trained{1}.BinaryLearners{i}.
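A minimal sketch, assuming Mdl comes from the fitcecoc call above (with 'Holdout' there is a single trained model in Trained{1}):
% Extract the compact ECOC model trained on the non-holdout data
trainedMdl = Mdl.Trained{1};            % CompactClassificationECOC
% Predict labels for new data, exactly as with a binary SVM
[label, score] = predict(trainedMdl, XTest);
% The underlying binary SVM learners remain accessible, e.g. the first one:
svm1 = trainedMdl.BinaryLearners{1};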
I am trying to do regression with a fitrtree model. It works fine without cross-validation, but with cross-validation enabled the predict function returns an error.
%works fine
tree = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','off');
y_hat = predict(tree, xNew);
%Returns error
tree = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','on');
y_hat = predict(tree, xNew);
Error: Systems of classreg.learning.partition.RegressionPartitionedModel class cannot be used with the "predict"
command. Convert the system to an identified model first, such as by using the "idss" command.
Update: I figured out that when we use cross-validation of any sort, the model is in the Trained attribute of tree rather than in tree itself. What is this Trained attribute (tree.Trained{1}), and what information do we get from it?
If you choose a cross-validation method when calling fitrtree(), the output of the function is a RegressionPartitionedModel instead of a RegressionTree.
As you said, you can access objects of type RegressionTree stored in tree.Trained in your case. The number and meaning of the trees you find under this attribute depends on the cross-validation model. In your case, using Leave-one-out cross-validation (LOOCV), the Trained attribute contains N RegressionTree objects, where N is the number of data points in your training set. Each of these regression trees is obtained by training on all but one of your data points. The left out data point is used for testing.
For example, if you want to access the first and last trees obtained from cross-validation, and use them for separate predictions, you can do:
%Returns RegressionPartitionedModel
cv_trees = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','on');
%This is the number of regression trees stored in cv_trees for LOOCV
[N, ~] = size(trainingData);
%Use individual models from the cross-validation as predictors
y_hat = predict(cv_trees.Trained{1}, xNew);
y_hat_2 = predict(cv_trees.Trained{N}, xNew);
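If you only need the cross-validated error rather than the individual trees, the partitioned model also provides it directly; a short sketch using the same cv_trees as above:
% Out-of-fold prediction for every observation in trainingData
y_oof = kfoldPredict(cv_trees);
% Mean squared error estimated by leave-one-out cross-validation
cv_mse = kfoldLoss(cv_trees);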
I use the sparse Gaussian process for regression from Rasmussen.
http://www.tsc.uc3m.es/~miguel/downloads.php
The syntax for predicting the mean is:
[~, mu_1, ~, ~, loghyper] = ssgpr_ui(Xtrain, Ytrain, Xtest, Ytest, m);
My question is: the author states that the initial hyperparameter search condition differs between iterations, hence the results of the model differ from iteration to iteration. Is there any way to ensure that the best initialization or seed condition is set, so as to obtain good-quality hyperparameters for the best predictions and reproducible results?
In order to obtain the same predictions every time, it is possible to set the seed by
stream = RandStream('mt19937ar','Seed',123456);
RandStream.setGlobalStream(stream);
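In recent MATLAB releases, the rng convenience function achieves the same thing (it selects the Mersenne Twister generator by default):
rng(123456); % equivalent seeding via the recommended interface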
However, there is no standard procedure for choosing the best seed. Doing so would lead to overfitting the model, since we would be handing it ideal conditions for fitting the training data, as noted by @mikkola.
I cannot follow the crossval() and cvpartition() functions given in the MATLAB documentation for crossval(). What goes in the parameters, and how would they help compare the performance and accuracy of different classifiers? I would be obliged if a simpler version were provided here.
Let's work on Example 2 from the CROSSVAL documentation.
load('fisheriris');
y = species;
X = meas;
Here we loaded the data from the example MAT-file and assigned the variables X and y. The meas matrix contains different measurements of iris flowers, and species holds the three iris classes, which we are trying to predict from the data.
Cross-validation is used to train a classifier on the same data set many times. Basically, at each iteration you split the data set into training and test data. The proportion is determined by k-fold: for example, if k is 10, 90% of the data will be used for training and the remaining 10% for testing, and you will have 10 iterations. This is done by the CVPARTITION function.
cp = cvpartition(y,'k',10); % Stratified cross-validation
You can explore the cp object if you type cp. and press Tab. You will see different properties and methods. For example, find(cp.test(1)) shows the indices of the test set for the 1st iteration.
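A few properties worth inspecting, as a quick sketch:
cp.NumTestSets                     % number of folds, here 10
trainIdx = find(cp.training(1));   % training indices for the 1st fold
testIdx  = find(cp.test(1));       % test indices for the 1st fold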
The next step is to prepare the prediction function. This is probably where you had the main problem. The statement below creates a function handle using an anonymous function. The @(XTRAIN,ytrain,XTEST) part declares that this function has 3 input arguments. The next part, (classify(XTEST,XTRAIN,ytrain)), defines the function, which takes the training data XTRAIN with known classes ytrain and predicts the classes for the XTEST data with the generated model. (Those data come from cp, remember?)
classf = @(XTRAIN,ytrain,XTEST)(classify(XTEST,XTRAIN,ytrain));
Then we run the CROSSVAL function to estimate the misclassification rate (mcr), passing the complete data set, the prediction function handle, and the partitioning object cp.
cvMCR = crossval('mcr',X,y,'predfun',classf,'partition',cp)
cvMCR =
0.0200
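Because cp fixes the folds, you can reuse it to compare classifiers on identical splits. For example, swapping in a k-NN classifier (fitcknn is assumed available; 5 neighbours is an arbitrary illustrative choice):
classf_knn = @(XTRAIN,ytrain,XTEST) predict(fitcknn(XTRAIN,ytrain,'NumNeighbors',5), XTEST);
cvMCR_knn = crossval('mcr',X,y,'predfun',classf_knn,'partition',cp)
Since both classifiers were evaluated on exactly the same folds, the two misclassification rates are directly comparable.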
Still have questions?