I am trying to perform cross-validation on images for my SVM, where I have 3 class labels for the classification: "Good", "Ok" and "Bad".
For my data set, I have a 120 x 20 cell array: 19 columns of features, with the last column being the class label, for 120 distinct images.
The SVM training is performed using 2 different training label vectors, as follows:
SVMStruct = svmtrain(normalizedTrainingSet , train_label, 'kernel_function', 'linear');
SVMStruct1 = svmtrain(normalizedTrainingSet , train_label1, 'kernel_function', 'linear');
Where "normalizedTrainingSet" is the numeric matrix for my data set. train_label is the label for Bad vs Normal&Good; train_label1 is the label for Good vs Normal&Bad, and I performed some if else statements to sort them out.
I want to perform cross-validation with 5 folds, and during each fold I want to split the images equally across the 3 categories; for example, 4 for testing and 16 for training per category in each fold.
Below is my code for the cross-validation.
K = 5; % The number of folds
N = size(DataSet, 1);
idx = crossvalind('Kfold', N, K);
cp = classperf(train_label3); %train_label3 is the combination of all 3 categories in one array.
for i = 1:K
    Data_Set = DataSet(idx ~= i, :); % data to train on, 80% of the total
    training_label = train_label3(idx ~= i, :); % class labels of training data
    Test_Set = DataSet(idx == i, :); % data to test on, 20% of the total
    testing_label = train_label3(idx == i, :); % class labels of test data
I am stuck trying to perform the cross-validation and need some help on how to continue.
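A minimal sketch of one way to continue, not a definitive solution: crossvalind stratifies the folds per class if you pass it the label vector instead of N, and inside the loop you can train both binary SVMs on the training rows, classify the test rows, and accumulate results in the classperf object. This assumes normalizedTrainingSet is the full 120 x 19 numeric feature matrix and train_label3 is a matching cell array of 'Good'/'Ok'/'Bad' strings:
K = 5;
idx = crossvalind('Kfold', train_label3, K); % passing the labels stratifies the folds per class
cp = classperf(train_label3);
for i = 1:K
    testMask = (idx == i);
    trainMask = ~testMask;
    % one-vs-rest binary labels for this fold
    isBadTrain = strcmp(train_label3(trainMask), 'Bad');
    isGoodTrain = strcmp(train_label3(trainMask), 'Good');
    SVMStruct = svmtrain(normalizedTrainingSet(trainMask,:), isBadTrain, 'kernel_function', 'linear');
    SVMStruct1 = svmtrain(normalizedTrainingSet(trainMask,:), isGoodTrain, 'kernel_function', 'linear');
    % combine the two binary decisions into one three-class prediction
    predBad = svmclassify(SVMStruct, normalizedTrainingSet(testMask,:));
    predGood = svmclassify(SVMStruct1, normalizedTrainingSet(testMask,:));
    pred = repmat({'Ok'}, sum(testMask), 1);
    pred(predBad) = {'Bad'};
    pred(predGood & ~predBad) = {'Good'};
    cp = classperf(cp, pred, testMask); % accumulate performance over the folds
end
cp.CorrectRate % overall cross-validated accuracy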
I need to use a SVM to distinguish 28x28 matrices into 9 classes. There are 60,000 training inputs and 10,000 testing inputs.
My current program is as follows:
clear;
load mnist.mat
xtest = xtest ./ 255; % normalize the data
xtrain = xtrain ./ 255;
SVMModels = cell(9,1);
classes = unique(ytrain);
rng(1); % For reproducibility
blah = fitcsvm(xtrain, ytrain);
for j = 1:numel(classes)
    indx = strcmp(ytrain,classes(j)); % Create binary classes for each classifier
    SVMModels{j} = fitcsvm(xtrain,indx, 'KernelFunction','rbf','BoxConstraint',1);
end
I believe the problem is due to the fact that the inputs are 28x28 matrices. How do I fix this?
Additional info:
xtest: 28x28x10000
ytest: 1x10000
xtrain: 28x28x60000
ytrain: 1x60000
You are correct. fitcsvm requires that the input training examples be an N x P matrix, where N is the total number of samples and P is the total number of features. What you have to do in your case is reshape your arrays so that xtrain is 60000 x 784 and xtest is 10000 x 784. The 784 comes from 28 x 28. Specifically, you must unroll each slice of your 3D matrix so that it fits into a single vector. Similarly, the class labels must be N x 1, so you just need to transpose ytrain and ytest.
To achieve the desired reshaping, you use reshape like so:
xtrain_final = reshape(xtrain, 784, 60000).';
xtest_final = reshape(xtest, 784, 10000).';
ytrain_final = ytrain.';
ytest_final = ytest.';
Now the reshaping of the training and testing examples is a bit odd. MATLAB reshapes on a column-major basis, meaning it fills the output a column at a time. Because your matrix is 28 x 28 x 60000, each slice of your 3D matrix is 28 x 28, and the column-major ordering unrolls each 2D slice into a single column. You thus get 60000 columns corresponding to the 60000 training examples. The last thing you need to do is transpose this result to get what fitcsvm requires.
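A tiny demonstration of that column-major behaviour, with made-up numbers:
A = reshape(1:8, [2 2 2]); % two 2x2 slices: A(:,:,1) = [1 3; 2 4], A(:,:,2) = [5 7; 6 8]
B = reshape(A, [4 2]).'; % each slice unrolled into one row
% B(1,:) is [1 2 3 4]: the first slice read down its columns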
Now that this is done, you can train your model.
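With the reshaped data in hand, the one-vs-all loop from the question should work after two small changes: train on xtrain_final, and, assuming the labels are numeric, compare with == instead of strcmp. A sketch:
classes = unique(ytrain_final); % the class labels (assumed numeric)
SVMModels = cell(numel(classes),1);
rng(1); % for reproducibility
for j = 1:numel(classes)
    indx = (ytrain_final == classes(j)); % binary labels: class j vs. the rest
    SVMModels{j} = fitcsvm(xtrain_final, indx, 'KernelFunction','rbf', 'BoxConstraint',1);
end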
I want to use a 10-fold cross-validation method to test which polynomial form (first, second, or third order) gives a better fit. I want to divide my data set into 10 subsets and remove 1 subset from the 10 data sets, derive a regression model without this subset, predict the output values for this subset using the derived regression model, and compute the residuals. Finally, repeat the calculation routine for each subset and sum the squares of the resulting residuals.
I already coded the following in MATLAB 2013b, which samples the data and tests the regression on the training data. I am stuck on how to repeat this for every subset and how to compare which polynomial form gives a better fit.
% Sample the data
parm = [AT];
n = length(parm);
k = 10; % how many parts to use
allix = randperm(n); % all data indices, randomly ordered
numineach = ceil(n/k); % at least one part must have this many data points
allix = reshape([allix NaN(1,k*numineach-n)],k,numineach);
for p=1:k
    testix = allix(p,:); % indices to use for testing
    testix(isnan(testix)) = []; % remove NaNs if necessary
    trainix = setdiff(1:n,testix); % indices to use for training
    %train = parm(trainix); %gives the training data
    %test = parm(testix); %gives the testing data
end
% Derive regression on the training data
Sal = Salinity(trainix);
Temp = Temperature(trainix);
At = parm(trainix);
xyz =[Sal Temp At];
% Fit a Polynomial Surface
surffit = fit([xyz(:,1), xyz(:,2)],xyz(:,3), 'poly11');
% The second output of fit holds the goodness-of-fit stats (rsquare, rmse)
[b, gof] = fit([xyz(:,1), xyz(:,2)], xyz(:,3), 'poly11');
Regarding executing your code for every subset, you can put the fit inside the loop and store the results, e.g.
% Sample the data
parm = [AT];
n = length(parm);
k = 10; % how many parts to use
allix = randperm(n); % all data indices, randomly ordered
numineach = ceil(n/k); % at least one part must have this many data points
allix = reshape([allix NaN(1,k*numineach-n)],k,numineach);
bAll = []; gofAll = [];
for p=1:k
    testix = allix(p,:); % indices to use for testing
    testix(isnan(testix)) = []; % remove NaNs if necessary
    trainix = setdiff(1:n,testix); % indices to use for training
    %train = parm(trainix); %gives the training data
    %test = parm(testix); %gives the testing data
    % Derive regression on the training data
    Sal = Salinity(trainix);
    Temp = Temperature(trainix);
    At = parm(trainix);
    xyz = [Sal Temp At];
    % Fit a polynomial surface; one call gives both the surface object
    % and its goodness-of-fit stats (rsquare, rmse)
    [surffit, gof] = fit([xyz(:,1), xyz(:,2)], xyz(:,3), 'poly11');
    bAll = [bAll; coeffvalues(surffit)]; gofAll = [gofAll, gof];
end
Regarding the best fit, you can probably pick the one with the lowest rmse, e.g. [~, best] = min([gofAll.rmse]).
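Since the actual goal is to compare first-, second-, and third-order polynomial surfaces, one way to extend this, sketched under the assumption that Salinity, Temperature, and parm are column vectors, is to repeat the whole loop for each form ('poly11', 'poly22', 'poly33') and accumulate the sum of squared residuals on the held-out subset:
forms = {'poly11', 'poly22', 'poly33'}; % 1st-, 2nd-, 3rd-order surfaces
press = zeros(1, numel(forms)); % predictive residual sum of squares per form
for f = 1:numel(forms)
    for p = 1:k
        testix = allix(p,:); testix(isnan(testix)) = [];
        trainix = setdiff(1:n, testix);
        mdl = fit([Salinity(trainix), Temperature(trainix)], parm(trainix), forms{f});
        res = parm(testix) - mdl(Salinity(testix), Temperature(testix));
        press(f) = press(f) + sum(res.^2);
    end
end
[~, best] = min(press); % the form with the smallest summed squared residuals
fprintf('Best polynomial form: %s\n', forms{best});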
I was toying around with the MATLAB Neural Network Toolbox and encountered some things I did not expect. My questions concern a classification network with no hidden layers, only 1 input, and the tansig transfer function. I expect this classifier to divide a 1D dataset at some point defined by the learned input weight and bias.
First of all, I thought that the formula for computing the output y for a given input x is:
y = f(x*w + b)
with w the input weight and b the bias. What is the correct formula for calculating the output of the network?
I also expected that translating the whole dataset by a certain value (+77) would have a big effect on the bias and/or weight, but this doesn't seem to be the case. Why does translating the dataset have so little effect on the bias and weight?
This is my code:
% Generate data
p1 = randn(1, 1000);
t1(1:1000) = 1;
p2 = 3 + randn(1, 1000);
t2(1:1000) = -1;
% view data
figure, hist([p1', p2'], 100)
P = [p1 p2];
T = [t1 t2];
% train network without hidden layer
net1 = newff(P, T, [], {'tansig'}, 'trainlm');
[net1,tr] = train(net1, P, T);
% display weight and bias
w = net1.IW{1,1};
b = net1.b{1,1};
disp(w) % -6.8971
disp(b) % -0.2280
% make label decision
class_correct = 0;
outputs = net1(P);
for i = 1:length(outputs)
    % choose between -1 and 1
    if outputs(i) > 0
        outputs(i) = 1;
    else
        outputs(i) = -1;
    end
    % compare
    if T(i) == outputs(i)
        class_correct = class_correct + 1;
    end
end
% Calculate the correct classification rate (CCR)
CCR = (class_correct * 100) / length(outputs);
fprintf('CCR: %f \n', CCR);
% plot the errors
errors = gsubtract(T, outputs);
figure, plot(errors)
% I expect these to be equal and near 1
net1(0) % 0.9521
tansig(0*w + b) % -0.4680
% I expect these to be equal and near -1
net1(4) % -0.9991
tansig(4*w + b) % -1
% translate the dataset by +77
P2 = P + 77;
% train network without hidden layer
net2 = newff(P2, T, [], {'tansig'}, 'trainlm');
[net2,tr] = train(net2, P2, T);
% display weight and bias
w2 = net2.IW{1,1};
b2 = net2.b{1,1};
disp(w2) % -5.1132
disp(b2) % -0.1556
I generated an artificial dataset made of 2 normally distributed populations with different means. I plotted these populations in a histogram and trained the network with them.
I calculate the correct classification rate (CCR), which is the percentage of correctly classified instances over the whole dataset. This is somewhere around 92%, so I know the classifier works.
But I expected net1(x) and tansig(x*w + b) to give the same output, and this is not the case. What is the correct formula for calculating the output of my trained network?
And I expected net1 and net2 to have different weights and/or biases, because net2 is trained on a translated version (+77) of the dataset that net1 is trained on. Why does translating the dataset have so little effect on the bias and weight?
First off, your code leaves the default MATLAB input pre-processing intact. You can check this with:
net2.inputs{1}
When I put your code in I got this:
Neural Network Input
name: 'Input'
feedbackOutput: []
processFcns: {'fixunknowns', 'removeconstantrows', 'mapminmax'}
processParams: {1x3 cell array of 2 params}
processSettings: {1x3 cell array of 3 settings}
processedRange: [1x2 double]
processedSize: 1
range: [1x2 double]
size: 1
userdata: (your custom info)
The important part being processFcns, which includes mapminmax. According to the docs, mapminmax will "Process matrices by mapping row minimum and maximum values to [-1 1]". That's why shifting your inputs arbitrarily had no effect: the pre-processing maps them back onto the same range before they ever reach the weight and bias.
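That also explains the formula question: net1(x) is not tansig(x*w + b) but tansig applied after the stored input pre-processing (strictly, followed by the output processing too, which is effectively the identity here since the targets already span [-1 1]). A sketch, assuming mapminmax is the third processing function as in the listing above:
x = 0; % the raw input from the example above
settings = net1.inputs{1}.processSettings{3}; % stored settings of the mapminmax step
xp = mapminmax('apply', x, settings); % pre-process the raw input
y = tansig(xp * w + b) % should now match net1(0)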
I assume that by "calculating the output" you mean checking the performance of your network. You can do that by first simulating your network on a dataset (see doc nnet/sim) and then checking the "correctness" with the perform function. It will use the same cost function as you did when training:
% get predictions for the (shifted) inputs
Y = sim(net2, P2);
% see how well we did
perf = perform(net2, T, Y);
Boomshuckalucka. Hope that helps.
I am trying to get a prediction column matrix in MATLAB but I don't quite know how to go about coding it. My current code is:
load DataWorkspace.mat
groups = ismember(Num,'Yes');
k=10;
%# number of cross-validation folds:
%# If you have 50 samples, divide them into 10 groups of 5 samples each,
%# then train with 9 groups (45 samples) and test with 1 group (5 samples).
%# This is repeated ten times, with each group used exactly once as a test set.
%# Finally the 10 results from the folds are averaged to produce a single
%# performance estimation.
cvFolds = crossvalind('Kfold', groups, k);
cp = classperf(groups);
for i = 1:k
    testIdx = (cvFolds == i);
    trainIdx = ~testIdx;
    svmModel = svmtrain(Data(trainIdx,:), groups(trainIdx), ...
        'Autoscale',true, 'Showplot',false, 'Method','SMO', ...
        'Kernel_Function','rbf');
    pred = svmclassify(svmModel, Data(testIdx,:), 'Showplot',false);
    %# evaluate and update performance object
    cp = classperf(cp, pred, testIdx);
end
cp.CorrectRate
cp.CountingMatrix
The issue is that it's actually calculating the accuracy 11 times in total: once for each of the 10 folds and one final time as an average. But if I take the individual predictions of each fold and print pred for each loop iteration, the accuracy understandably drops greatly.
However, I need a column matrix of the predicted values for each row of the data. Any ideas on how I can go about modifying the code?
The whole idea of cross-validation is to get an unbiased estimate of the performance of a classifier. Once that is done, you usually just train a model over the entire data. This model will be used to predict future instances.
So just do:
svmModel = svmtrain(Data, groups, ...);
pred = svmclassify(svmModel, otherData, ...);
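If what you actually want is one out-of-fold prediction per row of Data (a column the same height as the data), a minimal sketch is to preallocate a vector and fill in each fold's test predictions inside the existing loop:
cvPred = nan(size(groups)); % one out-of-fold prediction per sample (0/1 here)
for i = 1:k
    testIdx = (cvFolds == i);
    svmModel = svmtrain(Data(~testIdx,:), groups(~testIdx), ...
        'Autoscale',true, 'Showplot',false, 'Method','SMO', ...
        'Kernel_Function','rbf');
    cvPred(testIdx) = svmclassify(svmModel, Data(testIdx,:));
end
% cvPred is now a column of predicted labels, one for each row of Data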
I am working on a thumb recognition system. I need to implement the KNN algorithm to classify my images. According to this, it has only 2 measurements, through which it calculates the distance to find the nearest neighbour, but in my case I have 400 images of 25 x 42, of which 200 are for training and 200 for testing. I have been searching for a few hours but cannot find a way to compute the distance between the points.
EDIT:
I have reshaped the first 200 images into 1 x 1050 vectors and stored them in a 200 x 1050 matrix trainingData. I made testingData similarly.
Here is some illustration code for k-nearest-neighbor classification (some of the functions used require the Statistics Toolbox):
%# image size
sz = [25,42];
%# training images
numTrain = 200;
trainData = zeros(numTrain,prod(sz));
for i=1:numTrain
    img = imread( sprintf('train/image_%03d.jpg',i) );
    trainData(i,:) = img(:);
end
%# testing images
numTest = 200;
testData = zeros(numTest,prod(sz));
for i=1:numTest
    img = imread( sprintf('test/image_%03d.jpg',i) );
    testData(i,:) = img(:);
end
%# target class (I'm just using random values. Load your actual values instead)
trainClass = randi([1 5], [numTrain 1]);
testClass = randi([1 5], [numTest 1]);
%# compute pairwise distances between each test instance vs. all training data
D = pdist2(testData, trainData, 'euclidean');
[D,idx] = sort(D, 2, 'ascend');
%# K nearest neighbors
K = 5;
D = D(:,1:K);
idx = idx(:,1:K);
%# majority vote
prediction = mode(trainClass(idx),2);
%# performance (confusion matrix and number of misclassified instances)
C = confusionmat(testClass, prediction);
err = sum(C(:)) - sum(diag(C))
If you want to compute the Euclidean distance between vectors a and b, just use Pythagoras. In Matlab:
dist = sqrt(sum((a-b).^2));
However, you might want to use pdist to compute it for all combinations of vectors in your matrix at once.
dist = squareform(pdist(myVectors, 'euclidean'));
I'm interpreting columns as instances to classify and rows as potential neighbors. This is arbitrary though and you could switch them around.
If you have a separate test set, you can compute the distance to the instances in the training set with pdist2:
dist = pdist2(trainingSet, testSet, 'euclidean')
You can use this distance matrix to knn-classify your vectors as follows. I'll generate some random data to serve as an example, which will result in low (around chance level) accuracy. But of course you should plug in your actual data and results will probably be better.
nrOfVectors = 100; nrOfFeatures = 10; nrOfClasses = 5; % example sizes so the snippet runs
m = rand(nrOfVectors,nrOfFeatures); % random example data
classes = randi(nrOfClasses, 1, nrOfVectors); % random true classes
k = 3; % number of neighbors to consider, 3 is a common value
d = squareform(pdist(m, 'euclidean')); % distance matrix
[neighborvals, neighborindex] = sort(d,1); % get sorted distances
Take a look at the neighborvals and neighborindex matrices and see if they make sense to you. The first is a sorted version of the earlier d matrix, and the latter gives the corresponding instance numbers. Note that the self-distances (on the diagonal in d) have floated to the top. We're not interested in this (always zero), so we'll skip the top row in the next step.
assignedClasses = mode(classes(neighborindex(2:1+k,:)),1); % majority class among the k nearest neighbors
So we assign the most common class among the k nearest neighbors!
You can compare the assigned classes with the actual classes to get an accuracy score:
accuracy = sum(classes == assignedClasses)/length(classes);
fprintf('KNN Classifier Accuracy: %.2f%%\n', 100*accuracy)
Or make a confusion matrix to see the distribution of classifications:
confusionmat(classes, assignedClasses)
Yes, there is a function for KNN: knnclassify.
Play around with the number of neighbors you want to keep in order to get the best result (use a confusion matrix). This function takes care of the distance, of course.
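For reference, a hedged usage sketch with the trainData/testData matrices built above (knnclassify lives in the Bioinformatics Toolbox and was removed in later releases, where fitcknn is the replacement):
%# classify each test row by its 5 nearest training rows (Euclidean by default)
predicted = knnclassify(testData, trainData, trainClass, 5);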