How can I get predicted values in SVM using MATLAB? - matlab

I am trying to get a prediction column matrix in MATLAB but I don't quite know how to go about coding it. My current code is -
load DataWorkspace.mat
groups = ismember(Num,'Yes');
k=10;
%# number of cross-validation folds:
%# If you have 50 samples, divide them into 10 groups of 5 samples each,
%# then train with 9 groups (45 samples) and test with 1 group (5 samples).
%# This is repeated ten times, with each group used exactly once as a test set.
%# Finally the 10 results from the folds are averaged to produce a single
%# performance estimation.
cvFolds = crossvalind('Kfold', groups, k);
cp = classperf(groups);
for i = 1:k
testIdx = (cvFolds == i);
trainIdx = ~testIdx;
svmModel = svmtrain(Data(trainIdx,:), groups(trainIdx), ...
'Autoscale',true, 'Showplot',false, 'Method','SMO', ...
'Kernel_Function','rbf');
pred = svmclassify(svmModel, Data(testIdx,:), 'Showplot',false);
%# evaluate and update performance object
cp = classperf(cp, pred, testIdx);
end
cp.CorrectRate
cp.CountingMatrix
The issue is that it's actually calculating the accuracy 11 times in total - 10 times for each fold and one final time as an average. But if I take the individual predictions of each fold and print pred for each loop, the accuracy understandable reduces greatly.
However, I need a column matrix of the predicted values for each row of the data. Any ideas on how I can go about modifying the code?

The whole idea of cross-validation is get an unbiased estimate of the performance of a classifier.
Once that done, you usually just train a model over the entire data. This model will be used to predict future instances.
So just do:
svmModel = svmtrain(Data, groups, ...);
pred = svmclassify(svmModel, otherData, ...);

Related

Matlab cross-validation on images with multiple class SVM

I am trying to perform cross-validation on images for my SVM, where I have 3 categories of labels for the classification, "Good", "Ok" and "Bad".
For my data set, I have a 120 * 20 cell array, mainly 19 columns of features and with the last column being the class label for 120 distinct images.
The SVM train is performed using 2 different train label as such
SVMStruct = svmtrain(normalizedTrainingSet , train_label, 'kernel_function', 'linear');
SVMStruct1 = svmtrain(normalizedTrainingSet , train_label1, 'kernel_function', 'linear');
Where "normalizedTrainingSet" is the numeric matrix for my data set. train_label is the label for Bad vs Normal&Good; train_label1 is the label for Good vs Normal&Bad, and I performed some if else statements to sort them out.
I want to perform cross-validation with 5 folds, and during each fold, I want to split the images equally for each category. For example, 4 for testing, and 16 for training during each fold, equally for all 3 categories.
Below is my code for the cross-validation.
K = 5; % The number of folds
N = size(DataSet, 1);
idx = crossvalind('Kfold', N, K);
cp = classperf(train_label3); %train_label3 is the combination of all 3 categories in one array.
for i = 1:K
Data_Set = DataSet(idx ~= i, :); % data to train on, 90% of the total.
training_label = train_label3(idx ~= i, :); % class labels of training data.
Test_Set = DataSet(idx == i, :); % data to test on, 10% of the total.
testing_label = train_label3(idx == i, :); % class labels of test data.
I am stucked trying to perform the cross-validation and need some help on how to continue.

How to get the error vs. epochs (iterations) plot in matlab when using svm classification?

I use svmtrain to train my data set and svmclassify to predict test set. I want to look at the optimization process, the error vs. epochs (iterations) plot. I look into the usage and the code and find out that there are no information regarding such problem. The only thing I can get is control of the Maximum Iteration.
How to get the error vs. epochs (iterations) plot in matlab when using SVM classification?
Here is the code I modified. But not the one I want, I want the error at each epoch. Anybody did such analysis before? Thank you.
Best regards!
%# load dataset
load fisheriris %# load iris dataset
Groups = ismember(species,'setosa'); %# create a two-class problem
MaxIterValue = 210; %# maximum iterations
ErrVsIter = zeros(MaxIterValue, 2); %# store error data
%# Control maximum iterations
for N = 200: MaxIterValue
% options.MaxIter = N;
option = statset('MaxIter', N);
%# 5-fold Cross-validation
k = 5;
cvFolds = crossvalind('Kfold', Groups, k); %# get indices of 5-fold CV
cp = classperf(Groups); %# init performance tracker
for i = 1:k %# for each fold
testIdx = (cvFolds == i); %# get indices of test instances
trainIdx = ~testIdx; %# get indices training instances
%# train an SVM model over training instances
svmModel = svmtrain(meas(trainIdx,:), Groups(trainIdx), ...
'options',option, 'Autoscale',true, 'Showplot',false, 'Method','QP', ...
'BoxConstraint',2e-1, 'kernel_function','linear');
%#plotperform(svmModel);
%# test using test instances
pred = svmclassify(svmModel, meas(testIdx,:), 'Showplot',false);
%# evaluate and update performance object
cp = classperf(cp, pred, testIdx);
end
%# get error rate
ErrVsIter(N, 1) = N;
ErrVsIter(N, 2) = cp.ErrorRate;
end
plot(ErrVsIter(1:MaxIterValue,1),ErrVsIter(1:MaxIterValue,2));
You do it all correct, the problem is SVM is finding solution every time! So each epoch has CorrectRate=1, try and type cp.CorrectRate in your codes to see it.
The problem is in below line:
Groups = ismember(species,'setosa');
The data is so simple for SVM to solve.
and also plot it like this:
plot(ErrVsIter(200:MaxIterValue,1),ErrVsIter(200:MaxIterValue,2));

How to fix the fisheriris cross classification

I tried to run this code found online, but it does not work. The error is
Error using svmclassify (line 53)
The first input should be a `struct` generated by `SVMTRAIN`.
Error in fisheriris_classification (line 27)
pred = svmclassify(svmModel, meas(testIdx,:), 'Showplot',false);
Can anyone help me fix this problem? Thank you so much!
clear all;
close all;
load fisheriris %# load iris dataset
groups = ismember(species,'setosa'); %# create a two-class problem
%# number of cross-validation folds:
%# If you have 50 samples, divide them into 10 groups of 5 samples each,
%# then train with 9 groups (45 samples) and test with 1 group (5 samples).
%# This is repeated ten times, with each group used exactly once as a test set.
%# Finally the 10 results from the folds are averaged to produce a single
%# performance estimation.
k=10;
cvFolds = crossvalind('Kfold', groups, k); %# get indices of 10-fold CV
cp = classperf(groups); %# init performance tracker
for i = 1:k %# for each fold
testIdx = (cvFolds == i); %# get indices of test instances
trainIdx = ~testIdx; %# get indices training instances
%# train an SVM model over training instances
svmModel = svmtrain(meas(trainIdx,:), groups(trainIdx), ...
'Autoscale',true, 'Showplot',false, 'Method','QP', ...
'BoxConstraint',2e-1, 'Kernel_Function','rbf', 'RBF_Sigma',1);
%# test using test instances
pred = svmclassify(svmModel, meas(testIdx,:), 'Showplot',false);
%# evaluate and update performance object
cp = classperf(cp, pred, testIdx);
end
%# get accuracy
cp.CorrectRate
%# get confusion matrix
%# columns:actual, rows:predicted, last-row: unclassified instances
cp.CountingMatrix
%with the output:
%ans =
% 0.99333
%ans =
% 100 1
% 0 49
% 0 0
The reason for the issue seems to me the way MATLAB finds functions on the search path. I am fairly certain that it is still attempting to use the LIBSVM function rather than the built-in MATLAB function. Here is more information about the search path:
http://www.mathworks.com/help/matlab/matlab_env/what-is-the-matlab-search-path.html
To verify whether this is the issue, please try the following command in the command window:
>> which -all svmtrain
You should find that the built-in function is being shadowed by the LIBSVM function. You can either remove LIBSVM from the MATLAB search path using the "Set Path" tool in the Toolstrip, or run your code from a different directory that does not contain the LIBSVM files. I would recommend the first option. To read more about the built-in MATLAB functions, check these links:
http://www.mathworks.com/help/stats/svmtrain.html
http://www.mathworks.com/help/stats/svmclassify.html
If you would like to continue use LIBSVM, I would recommend checking the following site out.
https://www.csie.ntu.edu.tw/~cjlin/index.html
Hope this helps.

Calculate cross validation for Generalized Linear Model in Matlab

I am doing a regression using Generalized Linear Model.I am caught offguard using the crossVal function. My implementation so far;
x = 'Some dataset, containing the input and the output'
X = x(:,1:7);
Y = x(:,8);
cvpart = cvpartition(Y,'holdout',0.3);
Xtrain = X(training(cvpart),:);
Ytrain = Y(training(cvpart),:);
Xtest = X(test(cvpart),:);
Ytest = Y(test(cvpart),:);
mdl = GeneralizedLinearModel.fit(Xtrain,Ytrain,'linear','distr','poisson');
Ypred = predict(mdl,Xtest);
res = (Ypred - Ytest);
RMSE_test = sqrt(mean(res.^2));
The code below is for calculating cross validation for mulitple regression as obtained from this link. I want something similar for Generalized Linear Model.
c = cvpartition(Y,'k',10);
regf=#(Xtrain,Ytrain,Xtest)(Xtest*regress(Ytrain,Xtrain));
cvMse = crossval('mse',X,Y,'predfun',regf)
You can either perform the cross-validation process manually (training a model for each fold, predict outcome, compute error, then report the average across all folds), or you can use the CROSSVAL function which wraps this whole procedure in a single call.
To give an example, I will first load and prepare a dataset (a subset of the cars dataset which ships with the Statistics Toolbox):
% load regression dataset
load carsmall
X = [Acceleration Cylinders Displacement Horsepower Weight];
Y = MPG;
% remove instances with missing values
missIdx = isnan(Y) | any(isnan(X),2);
X(missIdx,:) = [];
Y(missIdx) = [];
clearvars -except X Y
Option 1
Here we will manually partition the data using k-fold cross-validation using cvpartition (non-stratified). For each fold, we train a GLM model using the training data, then use the model to predict output of testing data. Next we compute and store the regression mean squared error for this fold. At the end, we report the average RMSE across all partitions.
% partition data into 10 folds
K = 10;
cv = cvpartition(numel(Y), 'kfold',K);
mse = zeros(K,1);
for k=1:K
% training/testing indices for this fold
trainIdx = cv.training(k);
testIdx = cv.test(k);
% train GLM model
mdl = GeneralizedLinearModel.fit(X(trainIdx,:), Y(trainIdx), ...
'linear', 'Distribution','poisson');
% predict regression output
Y_hat = predict(mdl, X(testIdx,:));
% compute mean squared error
mse(k) = mean((Y(testIdx) - Y_hat).^2);
end
% average RMSE across k-folds
avrg_rmse = mean(sqrt(mse))
Option 2
Here we can simply call CROSSVAL with an appropriate function handle which computes the regression output given a set of train/test instances. See the doc page to understand the parameters.
% prediction function given training/testing instances
fcn = #(Xtr, Ytr, Xte) predict(...
GeneralizedLinearModel.fit(Xtr,Ytr,'linear','distr','poisson'), ...
Xte);
% perform cross-validation, and return average MSE across folds
mse = crossval('mse', X, Y, 'Predfun',fcn, 'kfold',10);
% compute root mean squared error
avrg_rmse = sqrt(mse)
You should get a similar result compared to before (slightly different of course, on account of the randomness involved in the cross-validation).

KNN algo in matlab

I am working on thumb recognition system. I need to implement KNN algorithm to classify my images. according to this, it has only 2 measurements, through which it is calculating the distance to find the nearest neighbour but in my case I have 400 images of 25 X 42, in which 200 are for training and 200 for testing. I am searching for few hours but I am not finding the way to find the distance between the points.
EDIT:
I have reshaped 1st 200 images in to 1 X 1050 and stored them in a matrix trainingData of 200 X 1050. similarly I made testingData.
Here is an illustration code for k-nearest neighbor classification (some functions used require the Statistics toolbox):
%# image size
sz = [25,42];
%# training images
numTrain = 200;
trainData = zeros(numTrain,prod(sz));
for i=1:numTrain
img = imread( sprintf('train/image_%03d.jpg',i) );
trainData(i,:) = img(:);
end
%# testing images
numTest = 200;
testData = zeros(numTest,prod(sz));
for i=1:numTest
img = imread( sprintf('test/image_%03d.jpg',i) );
testData(i,:) = img(:);
end
%# target class (I'm just using random values. Load your actual values instead)
trainClass = randi([1 5], [numTrain 1]);
testClass = randi([1 5], [numTest 1]);
%# compute pairwise distances between each test instance vs. all training data
D = pdist2(testData, trainData, 'euclidean');
[D,idx] = sort(D, 2, 'ascend');
%# K nearest neighbors
K = 5;
D = D(:,1:K);
idx = idx(:,1:K);
%# majority vote
prediction = mode(trainClass(idx),2);
%# performance (confusion matrix and classification error)
C = confusionmat(testClass, prediction);
err = sum(C(:)) - sum(diag(C))
If you want to compute the Euclidean distance between vectors a and b, just use Pythagoras. In Matlab:
dist = sqrt(sum((a-b).^2));
However, you might want to use pdist to compute it for all combinations of vectors in your matrix at once.
dist = squareform(pdist(myVectors, 'euclidean'));
I'm interpreting columns as instances to classify and rows as potential neighbors. This is arbitrary though and you could switch them around.
If have a separate test set, you can compute the distance to the instances in the training set with pdist2:
dist = pdist2(trainingSet, testSet, 'euclidean')
You can use this distance matrix to knn-classify your vectors as follows. I'll generate some random data to serve as example, which will result in low (around chance level) accuracy. But of course you should plug in your actual data and results will probably be better.
m = rand(nrOfVectors,nrOfFeatures); % random example data
classes = randi(nrOfClasses, 1, nrOfVectors); % random true classes
k = 3; % number of neighbors to consider, 3 is a common value
d = squareform(pdist(m, 'euclidean')); % distance matrix
[neighborvals, neighborindex] = sort(d,1); % get sorted distances
Take a look at the neighborvals and neighborindex matrices and see if they make sense to you. The first is a sorted version of the earlier d matrix, and the latter gives the corresponding instance numbers. Note that the self-distances (on the diagonal in d) have floated to the top. We're not interested in this (always zero), so we'll skip the top row in the next step.
assignedClasses = mode(neighborclasses(2:1+k,:),1);
So we assign the most common class among the k nearest neighbors!
You can compare the assigned classes with the actual classes to get an accuracy score:
accuracy = 100 * sum(classes == assignedClasses)/length(classes);
fprintf('KNN Classifier Accuracy: %.2f%%\n', 100*accuracy)
Or make a confusion matrix to see the distribution of classifications:
confusionmat(classes, assignedClasses)
yes, there is a function for knn : knnclassify
Play around with the number of neighbors you want to keep in order to get the best result (use a confusion matrix). This function takes care of the distance, of course.