Cross-Validation with libsvm to find best parameters - matlab

In order to find the best parameters to use with libsvm, I used the code below. Instead of './heart_scale' I had a file containing positive and negative examples, each with a HOG vector in libsvm format. I had 1000 positive examples and 4000 negative. However, these were put in order, i.e. the first 1000 examples were positive and the rest were negative.
Question: Now I am in doubt whether the accuracy returned by this code is the actual accuracy. This is because, from what I have read about 5-fold cross-validation, it takes the first 4/5 of the data for training and the remaining 1/5 for testing. Does this mean that the test set could be all negative? Or does it pick the examples randomly?
%# read some training data
[labels,data] = libsvmread('./heart_scale');
%# grid of parameters
folds = 5;
[C,gamma] = meshgrid(-5:2:15, -15:2:3);
%# grid search, and cross-validation
cv_acc = zeros(numel(C),1);
for i=1:numel(C)
    cv_acc(i) = svmtrain(labels, data, ...
        sprintf('-c %f -g %f -v %d', 2^C(i), 2^gamma(i), folds));
end
%# pair (C,gamma) with best accuracy
[~,idx] = max(cv_acc);
%# contour plot of parameter selection
contour(C, gamma, reshape(cv_acc,size(C))), colorbar
hold on
plot(C(idx), gamma(idx), 'rx')
text(C(idx), gamma(idx), sprintf('Acc = %.2f %%',cv_acc(idx)), ...
    'HorizontalAlignment','left', 'VerticalAlignment','top')
hold off
xlabel('log_2(C)'), ylabel('log_2(\gamma)'), title('Cross-Validation Accuracy')
%# now you can train your model using best_C and best_gamma
best_C = 2^C(idx);
best_gamma = 2^gamma(idx);
%# ...

You can find the answer to your question in the LIBSVM source code: see the function svm_cross_validation in svm.cpp.
As you can see, for classification problems LIBSVM first groups the data by class and then shuffles within each class, so every fold receives examples from both classes (stratified cross-validation).
So the answer to your question is: yes, the accuracy returned by this code is the actual accuracy.
Note: the accuracy estimate also depends on the nature of the data and on the number of cross-validation folds, and it is itself a random variable with some distribution.
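To see that distribution for yourself, here is a small sketch of my own (not part of the original answer): simply repeat the cross-validation several times, since LIBSVM reshuffles on every call, and look at the spread:
%# repeat 5-fold CV a few times to gauge the spread of the estimate
reps = 10;
accs = zeros(reps, 1);
for r = 1:reps
    accs(r) = svmtrain(labels, data, ...
        sprintf('-c %f -g %f -v %d', best_C, best_gamma, folds));
end
fprintf('CV accuracy: %.2f%% +/- %.2f%%\n', mean(accs), std(accs))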

Related

Cross Validation Using libsvm

I am currently performing 5-fold cross-validation using this code:
%# read some training data
[labels,data] = libsvmread('Training_Data_libsvmFormat.txt');
%# grid of parameters
folds = 5;
[C,gamma] = meshgrid(-5:2:15, -15:2:3); %Coarse Grid Search: bestC = 8 bestGamma = 2
%[C,gamma] = meshgrid(1:0.5:4, -1:0.25:3); %Fine Grid Search: bestC = 4 bestGamma = 2
%# grid search, and cross-validation
cv_acc = zeros(numel(C),1);
for i=1:numel(C)
    cv_acc(i) = svmtrain(labels, data, sprintf('-c %f -g %f -v %d', 2^C(i), 2^gamma(i), folds));
end
%# pair (C,gamma) with best accuracy
[~,idx] = max(cv_acc);
%# contour plot of parameter selection
contour(C, gamma, reshape(cv_acc,size(C))), colorbar
hold on
plot(C(idx), gamma(idx), 'rx')
text(C(idx), gamma(idx), sprintf('Acc = %.2f %%',cv_acc(idx)), 'HorizontalAlignment','left', 'VerticalAlignment','top')
hold off
xlabel('log_2(C)'), ylabel('log_2(\gamma)'), title('Cross-Validation Accuracy')
%# now you can train your model using best_C and best_gamma
best_C = 2^C(idx);
best_gamma = 2^gamma(idx);
Now, I know that in 5-fold cross-validation, 4/5 of the dataset is used for training and 1/5 for testing, with the test portion changing each time, in order to obtain the best C and gamma for the RBF kernel. However, in the dataset the first 1000 examples are positive while the last 3000 are all negative. Does cross-validation via svmtrain() shuffle the data, or could it be the case that the 1/5 used for testing contains only negative examples? I am asking because if it does not shuffle the data, the accuracy is not realistic.
I appreciate your assistance.
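One quick way to sidestep the concern entirely, as a sketch of my own (not part of the question): shuffle the rows once up front, after which the original class ordering cannot influence which examples land in which fold:
%# shuffle rows once before the grid search, so the original ordering
%# of 1000 positives followed by 3000 negatives cannot matter
perm = randperm(numel(labels));
labels = labels(perm);
data = data(perm, :);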

How to get the error vs. epochs (iterations) plot in matlab when using svm classification?

I use svmtrain to train my data set and svmclassify to predict the test set. I want to look at the optimization process, i.e. the error vs. epochs (iterations) plot. I looked into the usage and the code and found that there is no information regarding this; the only thing I can control is the maximum number of iterations.
How to get the error vs. epochs (iterations) plot in matlab when using SVM classification?
Here is the code I modified, but it is not what I want; I want the error at each epoch. Has anybody done such an analysis before? Thank you.
Best regards!
%# load dataset
load fisheriris %# load iris dataset
Groups = ismember(species,'setosa'); %# create a two-class problem
MaxIterValue = 210; %# maximum iterations
ErrVsIter = zeros(MaxIterValue, 2); %# store error data
%# Control maximum iterations
for N = 200:MaxIterValue
    % options.MaxIter = N;
    option = statset('MaxIter', N);
    %# 5-fold cross-validation
    k = 5;
    cvFolds = crossvalind('Kfold', Groups, k); %# get indices of 5-fold CV
    cp = classperf(Groups); %# init performance tracker
    for i = 1:k %# for each fold
        testIdx = (cvFolds == i); %# get indices of test instances
        trainIdx = ~testIdx; %# get indices of training instances
        %# train an SVM model over training instances
        svmModel = svmtrain(meas(trainIdx,:), Groups(trainIdx), ...
            'options',option, 'Autoscale',true, 'Showplot',false, 'Method','QP', ...
            'BoxConstraint',2e-1, 'kernel_function','linear');
        %#plotperform(svmModel);
        %# test using test instances
        pred = svmclassify(svmModel, meas(testIdx,:), 'Showplot',false);
        %# evaluate and update performance object
        cp = classperf(cp, pred, testIdx);
    end
    %# get error rate
    ErrVsIter(N, 1) = N;
    ErrVsIter(N, 2) = cp.ErrorRate;
end
plot(ErrVsIter(1:MaxIterValue,1),ErrVsIter(1:MaxIterValue,2));
You are doing it all correctly; the problem is that the SVM finds a perfect solution every time, so each epoch has CorrectRate = 1. Try typing cp.CorrectRate in your code to see it.
The problem is in the line below:
Groups = ismember(species,'setosa');
The data is too simple for the SVM to struggle with: setosa is linearly separable from the other two classes.
Also, plot it like this:
plot(ErrVsIter(200:MaxIterValue,1), ErrVsIter(200:MaxIterValue,2));

How to fix the fisheriris cross classification

I tried to run this code found online, but it does not work. The error is
Error using svmclassify (line 53)
The first input should be a struct generated by SVMTRAIN.
Error in fisheriris_classification (line 27)
pred = svmclassify(svmModel, meas(testIdx,:), 'Showplot',false);
Can anyone help me fix this problem? Thank you so much!
clear all;
close all;
load fisheriris %# load iris dataset
groups = ismember(species,'setosa'); %# create a two-class problem
%# number of cross-validation folds:
%# If you have 50 samples, divide them into 10 groups of 5 samples each,
%# then train with 9 groups (45 samples) and test with 1 group (5 samples).
%# This is repeated ten times, with each group used exactly once as a test set.
%# Finally the 10 results from the folds are averaged to produce a single
%# performance estimation.
k=10;
cvFolds = crossvalind('Kfold', groups, k); %# get indices of 10-fold CV
cp = classperf(groups); %# init performance tracker
for i = 1:k %# for each fold
    testIdx = (cvFolds == i); %# get indices of test instances
    trainIdx = ~testIdx; %# get indices of training instances
    %# train an SVM model over training instances
    svmModel = svmtrain(meas(trainIdx,:), groups(trainIdx), ...
        'Autoscale',true, 'Showplot',false, 'Method','QP', ...
        'BoxConstraint',2e-1, 'Kernel_Function','rbf', 'RBF_Sigma',1);
    %# test using test instances
    pred = svmclassify(svmModel, meas(testIdx,:), 'Showplot',false);
    %# evaluate and update performance object
    cp = classperf(cp, pred, testIdx);
end
%# get accuracy
cp.CorrectRate
%# get confusion matrix
%# columns:actual, rows:predicted, last-row: unclassified instances
cp.CountingMatrix
% with the output:
% ans =
%     0.99333
% ans =
%    100     1
%      0    49
%      0     0
The reason for the issue seems to be the way MATLAB finds functions on the search path. I am fairly certain that your code is still attempting to use the LIBSVM function rather than the built-in MATLAB function. Here is more information about the search path:
http://www.mathworks.com/help/matlab/matlab_env/what-is-the-matlab-search-path.html
To verify whether this is the issue, please try the following command in the command window:
>> which -all svmtrain
You should find that the built-in function is being shadowed by the LIBSVM function. You can either remove LIBSVM from the MATLAB search path using the "Set Path" tool in the Toolstrip, or run your code from a different directory that does not contain the LIBSVM files. I would recommend the first option. To read more about the built-in MATLAB functions, check these links:
http://www.mathworks.com/help/stats/svmtrain.html
http://www.mathworks.com/help/stats/svmclassify.html
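For example, a sketch with a hypothetical install path; use whatever folder which -all svmtrain actually reports:
% remove the LIBSVM MATLAB folder so the built-in svmtrain is found first
rmpath('C:\toolboxes\libsvm\matlab'); % hypothetical path - adjust to yours
savepath; % optional: persist the change across sessions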
If you would like to continue using LIBSVM, I would recommend checking out the following site:
https://www.csie.ntu.edu.tw/~cjlin/index.html
Hope this helps.

Multivariate Linear Regression in MATLAB

I already have my data prepared in terms of:
p1=input1 %load of today current hour
p2=input2 %load of today past one hour
p3=input3 %load of today past two hours
a1=output %load of next day current hour
I have the following code below:
%Input Set 1 For Weekday Load(d+1,t)
%(d,t),(d,t-1), (d,t-2)
L=xlsread('input_set1_weekday.xlsx',1); %2011
k=1;
size(L,1);
for a=5:2:size(L,1)-48 % L load for 2011
    P(1,k)= L(a,1);
    P(2,k)= L(a-2,1);
    P(3,k)= L(a-4,1);
    P(4,k)= L(a+48,1);
    k=k+1;
end
I have my data arranged in such a way that in every column, p1, p2, p3 are my predictor variables and a1 is my response variable.
How do I now fit a linear model to this set of data to check the performance of my predictions? By the way it is electrical load forecasting model.
My other doubt is that in the examples shown by most sources, they use the last column of the data as the response variable, and this is the part I'm struggling with.
fitlm will be able to do this for you quite nicely. You use fitlm to train a linear regression model, so you provide it the predictors as well as the responses. Once you do this, you can then use predict to predict the new responses based on new predictors that you put in.
The basic way for you to call this is:
lmModel = fitlm(X, y, 'linear', 'RobustOpts', 'on');
X is a data matrix where each column is a predictor and each row is an observation. Therefore, you would have to transpose your matrix before running this function. Basically, you would do P(1:3,:).' as you only want the first three rows (now columns) of your data. y would be your output values for each observation and this is a column vector that has the same number of rows as your observations. Regarding your comment about using the "last" column as the response vector, you don't have to do this at all. You specify your response vector in a completely separate input variable, which is y. As such, your a1 would serve here, while your predictors and observations would be stored in X. You can totally place your response vector as a column in your matrix; you would just have to subset it accordingly.
As such, y would be your a1 variable; make sure it's a column vector, which you can guarantee with a1(:). The 'linear' flag specifies linear regression, but that is the default anyway. 'RobustOpts' is recommended so that you perform robust linear regression. For your case, you would call fitlm this way:
lmModel = fitlm(P(1:3,:).', a1(:), 'linear', 'RobustOpts', 'on');
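Equivalently, since the fourth row of P already holds the response (L(a+48,1) in your loop), here is a quick sketch assuming P is built exactly as in the question:
%// predictors are the first three rows of P; the response is the fourth row
X = P(1:3,:).'; %// observations in rows, predictors in columns
y = P(4,:).'; %// response as a column vector
lmModel = fitlm(X, y, 'linear', 'RobustOpts', 'on');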
Now to predict new responses, you would do:
ypred = predict(lmModel, Xnew);
Xnew would be your new observations that follow the same style as X. You have to have the same number of columns as X, but you can have as many rows as you want. The output ypred will give you the predicted response for each observation of X that you have. As an example, let's use a dataset that is built into MATLAB, split up the data into a training and test data set, fit a model with the training set, then use the test dataset and see what the predicted responses are. Let's split up the data so that it's a 75% / 25% ratio. We will use the carsmall dataset which contains 100 observations for various cars and have descriptors such as Weight, Displacement, Model... typically used to describe cars. We will use Weight, Cylinders and Acceleration as the predictor variables, and let's try and predict the miles per gallon MPG as our outcome. Once I do this, let's calculate the difference between the predicted values and the true values and compare between them. As such:
load carsmall; %// Load in dataset
%// Build predictors and outcome
X = [Weight Cylinders Acceleration];
y = MPG;
%// Set seed for reproducibility
rng(1234);
%// Generate training and test data sets
%// Randomly select 75 observations for the training
%// dataset. First generate the indices to select the data
indTrain = randperm(100, 75);
%// The above may generate an error if you have anything below R2012a
%// As such, try this if the above doesn't work
%//indTrain = randperm(100);
%//indTrain = indTrain(1:75);
%// Get those indices that haven't been selected as the test dataset
indTest = 1 : 100;
indTest(indTrain) = [];
%// Now build our test and training data
trainX = X(indTrain, :);
trainy = y(indTrain);
testX = X(indTest, :);
testy = y(indTest);
%// Fit linear model
lmModel = fitlm(trainX, trainy, 'linear', 'RobustOpts', 'on');
%// Now predict
ypred = predict(lmModel, testX);
%// Show differences between predicted and true test output
diffPredict = abs(ypred - testy);
This is what happens when you echo out what the linear model looks like:
lmModel =
Linear regression model (robust fit):
y ~ 1 + x1 + x2 + x3
Estimated Coefficients:
                    Estimate         SE        tStat      pValue
                   __________    _________    _______    __________
    (Intercept)        52.495       3.7425     14.027    1.7839e-21
    x1             -0.0047557    0.0011591    -4.1031    0.00011432
    x2                -2.0326      0.60512     -3.359     0.0013029
    x3               -0.26011       0.1666    -1.5613       0.12323
Number of observations: 70, Error degrees of freedom: 66
Root Mean Squared Error: 3.64
R-squared: 0.788, Adjusted R-Squared 0.778
F-statistic vs. constant model: 81.7, p-value = 3.54e-22
This all comes from statistical analysis, but for a novice, what matters are the p-values for each of our predictors. The smaller the p-value, the more suitable the predictor is for your model. You can see that the first two predictors, Weight and Cylinders, are good predictors of MPG. Acceleration... not so much. What this means is that this variable is not a meaningful predictor, so you should probably use something else. In fact, if you were to remove this predictor and retrain your model, you would most likely see that the predicted values closely match those obtained when Acceleration was included.
This is a truly bastardized version of interpreting p-values, so I defer you to an actual regression modelling or statistics course for more details.
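As a quick sketch of that last point (my addition, not from the original answer): drop the third column, refit, and compare the new predictions to the old ones:
%// refit without the Acceleration predictor (3rd column) and re-predict
lmModel2 = fitlm(trainX(:,1:2), trainy, 'linear', 'RobustOpts', 'on');
ypred2 = predict(lmModel2, testX(:,1:2));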
This is what we have predicted the values to be, given our test set and beside it what the true values are:
>> [ypred testy]
ans =
17.0324 18.0000
12.9886 15.0000
13.1869 14.0000
14.1885 NaN
16.9899 14.0000
29.1824 24.0000
23.0753 18.0000
28.6148 28.0000
28.2572 25.0000
29.0365 26.0000
20.5819 22.0000
18.3324 20.0000
20.4845 17.5000
22.3334 19.0000
12.2569 16.5000
13.9280 13.0000
14.7350 13.0000
26.6757 27.0000
30.9686 36.0000
30.4179 31.0000
29.7588 36.0000
30.6631 38.0000
28.2995 26.0000
22.9933 22.0000
28.0751 32.0000
The fourth actual output value from the test data set is NaN, which denotes that the value is missing. However, when we run the observation corresponding to this output value through our linear model, it predicts a value anyway, which is to be expected. You have other observations that helped train the model, so when using this observation to find a prediction, the model naturally draws from those other observations.
When we compute the difference between these two, we get:
diffPredict =
0.9676
2.0114
0.8131
NaN
2.9899
5.1824
5.0753
0.6148
3.2572
3.0365
1.4181
1.6676
2.9845
3.3334
4.2431
0.9280
1.7350
0.3243
5.0314
0.5821
6.2412
7.3369
2.2995
0.9933
3.9249
As you can see, there are some instances where the prediction was quite close, and others where it was far from the truth; that's the crux of any prediction algorithm, really. You'll have to play around with which predictors you use, as well as with the training options. Have a look at the fitlm documentation for more details on what you can tweak.
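If you want a single error number out of this comparison, here is a small sketch of mine (not in the original answer) that skips the missing response when averaging:
%// summarize the test error, ignoring the missing (NaN) response
valid = ~isnan(testy);
mae = mean(abs(ypred(valid) - testy(valid))) %// mean absolute error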
Edit - July 30th, 2014
As you don't have fitlm, you can easily use LinearModel.fit. You call it with the same inputs as fitlm. As such:
lmModel = LinearModel.fit(trainX, trainy, 'linear', 'RobustOpts', 'on');
This should give you exactly the same results. predict should exist pre-R2014a, so that should be available to you.
Good luck!

How can I get predicted values in SVM using MATLAB?

I am trying to get a prediction column matrix in MATLAB but I don't quite know how to go about coding it. My current code is -
load DataWorkspace.mat
groups = ismember(Num,'Yes');
k=10;
%# number of cross-validation folds:
%# If you have 50 samples, divide them into 10 groups of 5 samples each,
%# then train with 9 groups (45 samples) and test with 1 group (5 samples).
%# This is repeated ten times, with each group used exactly once as a test set.
%# Finally the 10 results from the folds are averaged to produce a single
%# performance estimation.
cvFolds = crossvalind('Kfold', groups, k);
cp = classperf(groups);
for i = 1:k
    testIdx = (cvFolds == i);
    trainIdx = ~testIdx;
    svmModel = svmtrain(Data(trainIdx,:), groups(trainIdx), ...
        'Autoscale',true, 'Showplot',false, 'Method','SMO', ...
        'Kernel_Function','rbf');
    pred = svmclassify(svmModel, Data(testIdx,:), 'Showplot',false);
    %# evaluate and update performance object
    cp = classperf(cp, pred, testIdx);
end
cp.CorrectRate
cp.CountingMatrix
The issue is that it's actually calculating the accuracy 11 times in total: 10 times, once for each fold, and one final time as the average. But if I take the individual predictions of each fold and print pred for each loop iteration, the accuracy understandably drops greatly.
However, I need a column matrix of the predicted values for each row of the data. Any ideas on how I can go about modifying the code?
The whole idea of cross-validation is to get an unbiased estimate of the performance of a classifier.
Once that is done, you usually just train a model over the entire data. That model will be used to predict future instances.
So just do:
svmModel = svmtrain(Data, groups, ...);
pred = svmclassify(svmModel, otherData, ...);
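That said, if you do still want one prediction per row out of the CV loop itself, here is a sketch of mine reusing the loop from the question: collect each fold's predictions into a preallocated column vector.
%# collect per-row CV predictions (one slot per observation)
allPred = nan(numel(groups), 1);
for i = 1:k
    testIdx = (cvFolds == i);
    trainIdx = ~testIdx;
    svmModel = svmtrain(Data(trainIdx,:), groups(trainIdx), ...
        'Autoscale',true, 'Showplot',false, 'Method','SMO', ...
        'Kernel_Function','rbf');
    allPred(testIdx) = svmclassify(svmModel, Data(testIdx,:));
end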