Cross-Validation Using libsvm in MATLAB

I am currently performing 5-fold cross-validation using this code:
%# read some training data
[labels, data] = libsvmread('Training_Data_libsvmFormat.txt');

%# grid of parameters
folds = 5;
[C, gamma] = meshgrid(-5:2:15, -15:2:3);    %# coarse grid search: best C = 8, best gamma = 2
%[C, gamma] = meshgrid(1:0.5:4, -1:0.25:3); %# fine grid search: best C = 4, best gamma = 2

%# grid search with cross-validation
cv_acc = zeros(numel(C), 1);
for i = 1:numel(C)
    cv_acc(i) = svmtrain(labels, data, ...
        sprintf('-c %f -g %f -v %d', 2^C(i), 2^gamma(i), folds));
end

%# pair (C,gamma) with best accuracy
[~, idx] = max(cv_acc);

%# contour plot of parameter selection
contour(C, gamma, reshape(cv_acc, size(C))), colorbar
hold on
plot(C(idx), gamma(idx), 'rx')
text(C(idx), gamma(idx), sprintf('Acc = %.2f %%', cv_acc(idx)), ...
    'HorizontalAlign','left', 'VerticalAlign','top')
hold off
xlabel('log_2(C)'), ylabel('log_2(\gamma)'), title('Cross-Validation Accuracy')

%# now you can train your model using best_C and best_gamma
best_C = 2^C(idx);
best_gamma = 2^gamma(idx);
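%# (a minimal sketch of that final training step; 'model' is a hypothetical
%# variable name, and the option string follows the LIBSVM training options)
model = svmtrain(labels, data, sprintf('-c %f -g %f', best_C, best_gamma));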
Now, I know that in 5-fold cross-validation, 4/5 of the dataset is used for training and 1/5 for testing, each time rotating which part is held out, in order to find the best C and gamma for the RBF kernel. However, in my dataset the first 1000 examples are positive while the last 3000 are all negative. Does cross-validation with svmtrain() shuffle the data, or could the 1/5 held out for testing end up containing only negative examples? I am asking because if it does not shuffle the data, the reported accuracy is not realistic.
I appreciate your assistance.

Related

Cross-Validation with libsvm to find best parameters

In order to find the best parameters to use with libsvm, I used the code below. Instead of './heart_scale' I had a file containing positive and negative examples, each with a HOG vector in libsvm format. I had 1000 positive examples and 4000 negative ones. However, these were stored in order, i.e. the first 1000 examples were positive and the rest were negative.
Question: I now doubt whether the accuracy returned by this code is the actual accuracy. From what I read about 5-fold cross-validation, it takes the first 4/5 of the data for training and the remaining 1/5 for testing. Does this mean the testing set could be all negative? Or does it pick the examples randomly?
%# read some training data
[labels, data] = libsvmread('./heart_scale');

%# grid of parameters
folds = 5;
[C, gamma] = meshgrid(-5:2:15, -15:2:3);

%# grid search with cross-validation
cv_acc = zeros(numel(C), 1);
for i = 1:numel(C)
    cv_acc(i) = svmtrain(labels, data, ...
        sprintf('-c %f -g %f -v %d', 2^C(i), 2^gamma(i), folds));
end

%# pair (C,gamma) with best accuracy
[~, idx] = max(cv_acc);

%# contour plot of parameter selection
contour(C, gamma, reshape(cv_acc, size(C))), colorbar
hold on
plot(C(idx), gamma(idx), 'rx')
text(C(idx), gamma(idx), sprintf('Acc = %.2f %%', cv_acc(idx)), ...
    'HorizontalAlign','left', 'VerticalAlign','top')
hold off
xlabel('log_2(C)'), ylabel('log_2(\gamma)'), title('Cross-Validation Accuracy')

%# now you can train your model using best_C and best_gamma
best_C = 2^C(idx);
best_gamma = 2^gamma(idx);
%# ...
You can find the answer to your question in the LIBSVM source code: see the function svm_cross_validation in svm.cpp.
As you can see, for classification problems LIBSVM first groups the data by class and then shuffles within each class, so every fold receives examples from both classes.
So the answer to your question is: yes, the accuracy returned by this code is the actual accuracy.
Note: the accuracy estimate also depends on the nature of the data and on the number of cross-validation folds, and it is itself a random value with some distribution.
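If you want to be safe regardless of the LIBSVM version, you can also shuffle the rows yourself before running the grid search. A minimal sketch (randperm is a standard MATLAB function; the file name is taken from the question above):
%# defensive shuffle before cross-validation
[labels, data] = libsvmread('Training_Data_libsvmFormat.txt');
perm = randperm(numel(labels));  %# random permutation of the row indices
labels = labels(perm);
data = data(perm, :);            %# keep the rows of data aligned with labels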

How to get the error vs. epochs (iterations) plot in matlab when using svm classification?

I use svmtrain to train my data set and svmclassify to predict the test set, and I want to look at the optimization process: a plot of error vs. epochs (iterations). I looked into the usage and the code and found no information on this; the only thing available is control of the maximum number of iterations.
How can I get the error vs. epochs (iterations) plot in MATLAB when using SVM classification?
Here is the code I modified, but it is not what I want: I want the error at each epoch. Has anybody done such an analysis before? Thank you.
Best regards!
%# load dataset
load fisheriris                          %# load iris dataset
Groups = ismember(species,'setosa');     %# create a two-class problem
MaxIterValue = 210;                      %# maximum iterations
ErrVsIter = zeros(MaxIterValue, 2);      %# store error data

%# control maximum iterations
for N = 200:MaxIterValue
    % options.MaxIter = N;
    option = statset('MaxIter', N);

    %# 5-fold cross-validation
    k = 5;
    cvFolds = crossvalind('Kfold', Groups, k);  %# get indices of 5-fold CV
    cp = classperf(Groups);                     %# init performance tracker
    for i = 1:k                                 %# for each fold
        testIdx = (cvFolds == i);   %# indices of test instances
        trainIdx = ~testIdx;        %# indices of training instances

        %# train an SVM model over training instances
        svmModel = svmtrain(meas(trainIdx,:), Groups(trainIdx), ...
            'options',option, 'Autoscale',true, 'Showplot',false, 'Method','QP', ...
            'BoxConstraint',2e-1, 'kernel_function','linear');
        %#plotperform(svmModel);

        %# test using test instances
        pred = svmclassify(svmModel, meas(testIdx,:), 'Showplot',false);

        %# evaluate and update performance object
        cp = classperf(cp, pred, testIdx);
    end

    %# get error rate
    ErrVsIter(N, 1) = N;
    ErrVsIter(N, 2) = cp.ErrorRate;
end
plot(ErrVsIter(1:MaxIterValue,1), ErrVsIter(1:MaxIterValue,2));
You are doing everything correctly; the problem is that the SVM finds a perfect solution every time, so each epoch has CorrectRate = 1. Try typing cp.CorrectRate in your code to see it.
The problem is this line:
Groups = ismember(species,'setosa');
The data is too simple for the SVM: setosa is linearly separable from the other two classes.
Also, plot it like this:
plot(ErrVsIter(200:MaxIterValue,1), ErrVsIter(200:MaxIterValue,2));
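If you want the error rate to actually vary with the iteration limit, one option is to make the problem harder. A minimal sketch (versicolor is not linearly separable from virginica, so misclassifications should appear; the rest of your loop stays unchanged):
%# a harder two-class problem than setosa vs. the rest
load fisheriris
Groups = ismember(species,'versicolor');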

Calculate cross validation for Generalized Linear Model in Matlab

I am doing regression using a Generalized Linear Model. I was caught off guard by the crossval function. My implementation so far:
x = 'Some dataset, containing the input and the output'
X = x(:,1:7);
Y = x(:,8);
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
mdl = GeneralizedLinearModel.fit(Xtrain,Ytrain,'linear','distr','poisson');
Ypred = predict(mdl,Xtest);
res = (Ypred - Ytest);
RMSE_test = sqrt(mean(res.^2));
The code below calculates cross-validation for multiple regression, as obtained from this link. I want something similar for a Generalized Linear Model.
c = cvpartition(Y,'k',10);
regf = @(Xtrain,Ytrain,Xtest) (Xtest*regress(Ytrain,Xtrain));
cvMse = crossval('mse',X,Y,'predfun',regf)
You can either perform the cross-validation process manually (training a model for each fold, predicting the outcome, computing the error, then reporting the average across all folds), or you can use the CROSSVAL function, which wraps this whole procedure in a single call.
To give an example, I will first load and prepare a dataset (a subset of the cars dataset which ships with the Statistics Toolbox):
% load regression dataset
load carsmall
X = [Acceleration Cylinders Displacement Horsepower Weight];
Y = MPG;
% remove instances with missing values
missIdx = isnan(Y) | any(isnan(X),2);
X(missIdx,:) = [];
Y(missIdx) = [];
clearvars -except X Y
Option 1
Here we manually partition the data into K folds using cvpartition (non-stratified). For each fold, we train a GLM on the training data, then use the model to predict the output for the testing data. Next we compute and store the mean squared error for that fold. At the end, we report the average RMSE across all partitions.
% partition data into 10 folds
K = 10;
cv = cvpartition(numel(Y), 'kfold',K);

mse = zeros(K,1);
for k=1:K
    % training/testing indices for this fold
    trainIdx = cv.training(k);
    testIdx = cv.test(k);

    % train GLM model
    mdl = GeneralizedLinearModel.fit(X(trainIdx,:), Y(trainIdx), ...
        'linear', 'Distribution','poisson');

    % predict regression output
    Y_hat = predict(mdl, X(testIdx,:));

    % compute mean squared error
    mse(k) = mean((Y(testIdx) - Y_hat).^2);
end

% average RMSE across k-folds
avrg_rmse = mean(sqrt(mse))
Option 2
Here we can simply call CROSSVAL with an appropriate function handle which computes the regression output given a set of train/test instances. See the doc page to understand the parameters.
% prediction function given training/testing instances
fcn = @(Xtr, Ytr, Xte) predict( ...
    GeneralizedLinearModel.fit(Xtr,Ytr,'linear','distr','poisson'), ...
    Xte);
% perform cross-validation, and return average MSE across folds
mse = crossval('mse', X, Y, 'Predfun',fcn, 'kfold',10);
% compute root mean squared error
avrg_rmse = sqrt(mse)
You should get a similar result to before (slightly different, of course, due to the randomness involved in the cross-validation).
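If you also want the runs to be repeatable, you can seed the random number generator before partitioning. A minimal sketch (rng is the standard MATLAB seeding function; the seed value 0 is an arbitrary choice):
rng(0);  %# fix the seed so cvpartition/crossval draw reproducible folds
mse = crossval('mse', X, Y, 'Predfun',fcn, 'kfold',10);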

How can I get predicted values in SVM using MATLAB?

I am trying to get a column matrix of predictions in MATLAB but I don't quite know how to code it. My current code is:
load DataWorkspace.mat
groups = ismember(Num,'Yes');

k = 10;
%# number of cross-validation folds:
%# if you have 50 samples, divide them into 10 groups of 5 samples each,
%# then train with 9 groups (45 samples) and test with 1 group (5 samples).
%# This is repeated ten times, with each group used exactly once as a test set.
%# Finally the 10 results from the folds are averaged to produce a single
%# performance estimation.
cvFolds = crossvalind('Kfold', groups, k);
cp = classperf(groups);
for i = 1:k
    testIdx = (cvFolds == i);
    trainIdx = ~testIdx;
    svmModel = svmtrain(Data(trainIdx,:), groups(trainIdx), ...
        'Autoscale',true, 'Showplot',false, 'Method','SMO', ...
        'Kernel_Function','rbf');
    pred = svmclassify(svmModel, Data(testIdx,:), 'Showplot',false);
    %# evaluate and update performance object
    cp = classperf(cp, pred, testIdx);
end
cp.CorrectRate
cp.CountingMatrix
The issue is that it actually calculates the accuracy 11 times in total: once for each of the 10 folds and one final time as an average. But if I take the individual predictions of each fold and print pred for each loop, the accuracy understandably drops greatly.
However, I need a column matrix of the predicted values for each row of the data. Any ideas on how I can modify the code?
The whole idea of cross-validation is to get an unbiased estimate of a classifier's performance.
Once that is done, you usually train a final model over the entire data. That model is then used to predict future instances.
So just do:
svmModel = svmtrain(Data, groups, ...);
pred = svmclassify(svmModel, otherData, ...);
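That said, if what you actually want is one cross-validated prediction for every row of Data, you can collect the per-fold predictions inside your existing loop. A minimal sketch (pred_all is a hypothetical variable name; the svmtrain options are copied from your code):
pred_all = false(size(groups));  %# one predicted label per row of Data
for i = 1:k
    testIdx = (cvFolds == i);
    trainIdx = ~testIdx;
    svmModel = svmtrain(Data(trainIdx,:), groups(trainIdx), ...
        'Autoscale',true, 'Showplot',false, 'Method','SMO', ...
        'Kernel_Function','rbf');
    %# store this fold's predictions in the rows it was tested on
    pred_all(testIdx) = svmclassify(svmModel, Data(testIdx,:));
end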

How can I modify my code to show training and testing graphs in MATLAB?

I have this code for a neural network. How can I modify it so that it shows the training and testing graphs?
%~~~~~~~~~~~[L1 L2 1]; first hidden layer, second & output layer~~~~~
layer = [11 15 1];
myepochs = 30;
attemption = 1; %i;
mytfn = {'tansig' 'tansig' 'purelin'};

%~~~~~~load data~~~~~~~~~~~~~~~~~~~~~~~
m = xlsread('C:\Documents and Settings\winxp\My Documents\MATLAB\MATLAB_DATA\datatrain.csv');

%~~~~~~convert the data to matrix form~~~~
[row,col] = size(m);
P = m(1:row,1:10)';
T1 = m(1:row, col)';  % target data for training... last column
net = newff(minmax(P), layer, mytfn, 'trainlm');  % create the network
net.trainParam.epochs = myepochs;  % how many times newff will repeat the training
net.trainParam.showWindow = true;
net.trainParam.showCommandLine = true;
net = train(net, P, T1);  % start training with input P and target T1
Y = sim(net, P);          % simulate on the training data
save 'net114' net;
Also, is this code correct? I want to calculate the area and the perimeter of an object in an image, but the calculated values show that the perimeter is bigger than the area, which does not seem to make sense. Or maybe there is an explanation for that?
BW = ~c;            % binary image (complement of c)
area = bwarea(BW)   % object area in pixels (displayed)
imshow(BW);
bw2 = ~c;
pm = bwperim(bw2);  % perimeter pixels of the object
perimeter = bwarea(pm);
You might want to try something like net.trainParam.show = 30 to show the training progress every 30 epochs.
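For the training and testing graphs themselves, a minimal sketch (this assumes the Neural Network Toolbox: train returns a training record as its second output, plotperform plots the performance curves, and the dividerand split ratios below are arbitrary choices of mine):
net = newff(minmax(P), layer, mytfn, 'trainlm');
net.divideFcn = 'dividerand';    % split the samples into train/val/test sets
net.divideParam.trainRatio = 0.70;
net.divideParam.valRatio = 0.15;
net.divideParam.testRatio = 0.15;
[net, tr] = train(net, P, T1);   % tr records per-epoch performance
plotperform(tr);                 % plot training/validation/test error vs. epochs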