What is the difference between cvpartition and crossvalind - matlab

load fisheriris;
y = species; %label
X = meas;
%Create a random partition for a stratified 10-fold cross-validation.
c = cvpartition(y,'KFold',10);
% split training/testing sets
[trainIdx testIdx] = crossvalind('HoldOut', y, 0.6);
crossvalind is used to perform cross-validation by randomly splitting the entire feature set X into training and testing data by returning the indices. Using the indices, we can create train and test data as X(trainIdx,:) and X(testIdx,:) respectively. cvpartition also splits the data using methods such as stratified and non-stratified but it does not return the indices. I have not seen examples where crossvalind is a stratified or non-stratified technique.
Question: Can crossvalind and cvpartition be used together?
I want to do stratified cross-validation. But I don't understand how to divide the data sets into train and test and get the indices.

Cross-validation and train/test partitioning are two different ways of estimating the performance of a model, not different ways of building the model itself. Usually you should build a model using all the data that you have, but also use one of these techniques (which build and score one or more additional models using subsets of that data) to estimate how good the main model is likely to be.
Cross-validation averages the outcome of multiple train/test splits so is usually expected to give a more realistic i.e. more pessimistic estimate of model performance.
Of the two functions you mention,crossvalind appears to be specific to the Bioinformatics Toolbox and is rather old. The help for cvpartition gives an example of how to do a stratified cross-validation:
Examples
Use a 10-fold stratified cross validation to compute the
misclassification error for classify on iris data.
load('fisheriris');
CVO = cvpartition(species,'k',10);
err = zeros(CVO.NumTestSets,1);
for i = 1:CVO.NumTestSets
trIdx = CVO.training(i);
teIdx = CVO.test(i);
ytest = classify(meas(teIdx,:),meas(trIdx,:),...
species(trIdx,:));
err(i) = sum(~strcmp(ytest,species(teIdx)));
end
cvErr = sum(err)/sum(CVO.TestSize);

Related

How to apply the learned model in Matlab after cross-validation

Once the classifier is trained and tested using cross-validation approach, how does one use the results to validate on an unseen data especially during free running stage / deployment stage? How does one use the learned model? the following code trains and tests the data X using cross-validation. How am I supposed to use the learned model after the line pred = predict(svmModel, X(istest,:)); is computed?
part = cvpartition(Y,'Holdout',0.5);
istrain = training(part); % Data for fitting
istest = test(part); % Data for quality assessment
balance_Train=tabulate(Y(istrain))
NumbTrain = sum(istrain); % Number of observations in the training sample
NumbTest = sum(istest);
svmModel = fitcsvm(X(istrain,:),Y(istrain), 'KernelFunction','rbf');
pred = predict(svmModel, X(istest,:));
% compute the confusion matrix
cmat = confusionmat(Y(istest),pred);
acc = 100*sum(diag(cmat))./sum(cmat(:))
The clue's in the name:
predict
Predict labels using support vector machine (SVM) classifier
Syntax
label = predict(SVMModel,X)
[label,score] = predict(SVMModel,X)
Description
label = predict(SVMModel,X) returns a vector of predicted class labels
for the predictor data in the table or matrix X, based on the trained
support vector machine (SVM) classification model SVMModel. The
trained SVM model can either be full or compact.
In the code in your question, the code from pred = ... onwards is there to evaluate the predictions made by your svmModel object. However you can take the same object and use it to make predictions with further input dataset(s) - or, better, train a second model using all the data, and use that model for making actual predictions on new, unknown inputs.
You seem to be unclear on the role of (cross-)validation in model building. You should build your deployment model using the whole dataset (X, as per your comment), because as a rule more data always gives you a better model. To estimate how good this deployment model will be, you build one or more models from subsets of X and test each model against the rest of X that wasn't in that model's training subset. If you only do this once, this is called holdout validation; if you use multiple subsets and average the outcomes it's cross-validation.
If it's important to you for some reason that the deployed model is exactly the same one that you used to obtain your validation results, then you can deploy the model that was trained on the training partition of your holdout. But as I said, more training data usually results in a better model.

Calculating Mutual Information between features in Matlab

I need to calculate the mutual information between various features for designing a classification model using logistic regression. I am facing following problems:
I need to divide my data into n bins having approximately equal number of samples. How can I achieve this in Matlab?
Should I perform the above discretization on raw data or normalized data?
Thanks.
I guess what you want to do is similar to cross validation, in Matlab you can use the function crossvalind that allow you to split your dataset.
I added the example shown on the page for splitting your data into 10 bins (called 10-fold cross validation).
load fisheriris
indices = crossvalind('Kfold',species,10);
cp = classperf(species);
for i = 1:10
test = (indices == i); train = ~test;
class = classify(meas(test,:),meas(train,:),species(train,:));
classperf(cp,class,test)
end
cp.ErrorRate
ans =
0.0200
You should do this operation after doing the pre-processing of your data (normalisation / standardisation).

Matlab Machine Learning Train, Validate, Test Partitions

I'm using Matlab's Statistics and Machine Learning Toolbox to create decision trees, ensembles, Knn models, etc. I would like to separate my data into training/testing partitions, then have the models train and cross validate using the training data (essentially splitting the training data into training and validation data) while preserving my testing data for error metrics. It is important that the models not be trained in any way using the testing data. For my decision tree, I have something like the following code:
chess = csvread(filename);
predictors = chess(:,1:6);
class = chess(:,7);
cvpart = cvpartition(class,'holdout', 0.3);
Xtrain = predictors(training(cvpart),:);
Ytrain = class(training(cvpart),:);
Xtest = predictors(test(cvpart),:);
Ytest = class(test(cvpart),:);
% Fit the decision tree
tree = fitctree(Xtrain, Ytrain, 'CrossVal', 'on');
% Error Metrics
testingLoss = loss(tree,Xtest,Ytest,'Subtrees','all'); % Testing
resubcost = resubLoss(tree,'Subtrees','all'); % Training
[cost,secost,ntermnodes,bestlevel] = cvloss(tree,'Subtrees','all'); % Cross Val
However, this returns
Undefined function 'loss' for input arguments of
type 'classreg.learning.partition.ClassificationPartitionedModel'.
when attempting to find the testing error. I have tried several combinations of similar methods using different types of classification algorithms, but keep coming back to not being able to apply test data to a cross validated model due to partitioned data. How am I supposed to apply test data to a cross validated model?
When you use cross validation in the call to fitctree, by default 10 model folds are constructed within the 70% of data used to train the model. You can find the kFoldLoss (within each model fold) via:
modelLoss = kfoldLoss(tree);
Since the original call to fitctree constructed 10 model folds, there are 10 separate trained models. Each of the 10 models is contained within a cell array, located at tree.Trained . For for example you could use the first trained model to test the loss on your held out data via:
testingLoss = loss(tree.Trained{1},Xtest,Ytest,'Subtrees','all'); % Testing
You can use the kfoldLoss function to also get the CV loss for each fold and then choose the trained model that gives you the least CV loss in the following way:
modelLosses = kfoldLoss(tree,'mode','individual');
The above code will give you a vector of length 10 if you have done 10-fold cross-validation while learning. Assuming the trained model with least CV error is the 'k'th one, you would then use:
testSetPredictions = predict(tree.Trained{k}, testSetFeatures);

How to use SVM in Matlab?

I am new to Matlab. Is there any sample code for classifying some data (with 41 features) with a SVM and then visualize the result? I want to classify a data set (which has five classes) using the SVM method.
I read the "A Practical Guide to Support Vector Classication" article and I saw some examples. My dataset is kdd99. I wrote the following code:
%% Load Data
[data,colNames] = xlsread('TarainingDataset.xls');
groups = ismember(colNames(:,42),'normal.');
TrainInputs = data;
TrainTargets = groups;
%% Design SVM
C = 100;
svmstruct = svmtrain(TrainInputs,TrainTargets,...
'boxconstraint',C,...
'kernel_function','rbf',...
'rbf_sigma',0.5,...
'showplot','false');
%% Test SVM
[dataTset,colNamesTest] = xlsread('TestDataset.xls');
TestInputs = dataTset;
groups = ismember(colNamesTest(:,42),'normal.');
TestOutputs = svmclassify(svmstruct,TestInputs,'showplot','false');
but I don't know that how to get accuracy or mse of my classification, and I use showplot in my svmclassify but when is true, I get this warning:
The display option can only plot 2D training data
Could anyone please help me?
I recommend you to use another SVM toolbox,libsvm. The link is as follow:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
After adding it to the path of matlab, you can train and use you model like this:
model=svmtrain(train_label,train_feature,'-c 1 -g 0.07 -h 0');
% the parameters can be modified
[label, accuracy, probablity]=svmpredict(test_label,test_feaure,model);
train_label must be a vector,if there are more than two kinds of input(0/1),it will be an nSVM automatically.
train_feature is n*L matrix for n samples. You'd better preprocess the feature before using it. In the test part, they should be preprocess in the same way.
The accuracy you want will be showed when test is finished, but it's only for the whole dataset.
If you need the accuracy for positive and negative samples separately, you still should calculate by yourself using the label predicted.
Hope this will help you!
Your feature space has 41 dimensions, plotting more that 3 dimensions is impossible.
In order to better understand your data and the way SVM works is to begin with a linear SVM. This tybe of SVM is interpretable, which means that each of your 41 features has a weight (or 'importance') associated with it after training. You can then use plot3() with your data on 3 of the 'best' features from the linear svm. Note how well your data is separated with those features and choose a basis function and other parameters accordingly.

Implementing Logistic Regression in MATLAB

I have a data set of 13 attributes where some are categorical and some are continuous (can be converted to categorical). I need to use logistic regression to create a model that predicts the responses of a row and find the prediction's accuracy, sensitivity, and specificity.
Can/Should I use cross validation to divide my data set and get the results?
Is there any code sample on how to go about doing this? (I'm new to all of this)
Should I be using mnrfit/mnrval or glmfit/glmval? What's the difference and how do I choose?
Thanks!
If you want to determine how well the model can predict unseen data you can use cross validation. In Matlab, you can use glmfit to fit the logistic regression model and glmval to test it.
Here is a sample of Matlab code that illustrates how to do it, where X is the feature matrix and Labels is the class label for each case, num_shuffles is the number of repetitions of the cross-validation while num_folds is the number of folds:
for j = 1:num_shuffles
indices = crossvalind('Kfold',Labels,num_folds);
for i = 1:num_folds
test = (indices == i); train = ~test;
[b,dev,stats] = glmfit(X(train,:),Labels(train),'binomial','logit'); % Logistic regression
Fit(j,i) = glmval(b,X(test,:),'logit')';
end
end
Fit is then the fitted logistic regression estimate for each test fold. Thresholding this will yield an estimate of the predicted class for each test case. Performance measures are then calculated by comparing the predicted class label against the actual class label. Averaging the performance measures across all folds and repetitions gives an estimate of the model performance on unseen data.
originally answered by BGreene on #Stats.SE.