I'm doing fairly simple SVM classification at the moment. I use a precomputed kernel in LibSVM, with an RBF kernel over DTW distances.
When I compute the similarity (kernel) matrix, everything seems to work fine ... until I permute my data before computing the kernel matrix.
An SVM is of course invariant to permutations of the input data. In the Matlab code below, the line marked with '<- !!!!!!!!!!' determines the classification accuracy (not permuted: 100% -- permuted: 0% to 100%, depending on the seed of rng). But why does permuting the file string array (named fileList) make any difference? What am I doing wrong? Have I misunderstood the concept of 'permutation invariance', or is it a problem with my Matlab code?
My csv files are formatted as: LABEL, val1, val2, ..., valN, and all the csv files are stored in the folder dirName. So the string array contains the entries '10_0.csv 10_1.csv .... 11_7.csv, 11_8.csv' (not permuted) or some other order when permuted.
I also tried to permute the vector of sample serial numbers, but that makes no difference.
function [SimilarityMatrixTrain, SimilarityMatrixTest, trainLabels, testLabels, PermSimilarityMatrixTrain, PermSimilarityMatrixTest, permTrainLabels, permTestLabels] = computeDistanceMatrix(dirName, verificationClass, trainFrac)
fileList = getAllFiles(dirName);
fileList = fileList(1:36);
trainLabels = [];
testLabels = [];
trainFiles = {};
testFiles = {};
permTrainLabels = [];
permTestLabels = [];
permTrainFiles = {};
permTestFiles = {};
n = 0;
sigma = 0.01;
trainFiles = fileList(1:2:end);
testFiles = fileList(2:2:end);
rng(3);
permTrain = randperm(length(trainFiles))
%rng(3); <- !!!!!!!!!!!
permTest = randperm(length(testFiles));
permTrainFiles = trainFiles(permTrain)
permTestFiles = testFiles(permTest);
noTrain = size(trainFiles);
noTest = size(testFiles);
SimilarityMatrixTrain = eye(noTrain);
PermSimilarityMatrixTrain = eye(noTrain);
SimilarityMatrixTest = eye(noTest);
PermSimilarityMatrixTest = eye(noTest);
% UNPERM
%Train
for i = 1 : noTrain
    x = csvread(trainFiles{i});
    label = x(1);
    trainLabels = [trainLabels, label];
    for j = 1 : noTrain
        y = csvread(trainFiles{j});
        dtwDistance = dtwWrapper(x(2:end), y(2:end));
        rbfValue = exp((dtwDistance.^2)./(-2*sigma));
        SimilarityMatrixTrain(i, j) = rbfValue;
        n=n+1
    end
end
SimilarityMatrixTrain = [(1:size(SimilarityMatrixTrain, 1))', SimilarityMatrixTrain];
%Test
for i = 1 : noTest
    x = csvread(testFiles{i});
    label = x(1);
    testLabels = [testLabels, label];
    for j = 1 : noTest
        y = csvread(testFiles{j});
        dtwDistance = dtwWrapper(x(2:end), y(2:end));
        rbfValue = exp((dtwDistance.^2)./(-2*sigma));
        SimilarityMatrixTest(i, j) = rbfValue;
        n=n+1
    end
end
SimilarityMatrixTest = [(1:size(SimilarityMatrixTest, 1))', SimilarityMatrixTest];
% PERM
%Train
for i = 1 : noTrain
    x = csvread(permTrainFiles{i});
    label = x(1);
    permTrainLabels = [permTrainLabels, label];
    for j = 1 : noTrain
        y = csvread(permTrainFiles{j});
        dtwDistance = dtwWrapper(x(2:end), y(2:end));
        rbfValue = exp((dtwDistance.^2)./(-2*sigma));
        PermSimilarityMatrixTrain(i, j) = rbfValue;
        n=n+1
    end
end
PermSimilarityMatrixTrain = [(1:size(PermSimilarityMatrixTrain, 1))', PermSimilarityMatrixTrain];
%Test
for i = 1 : noTest
    x = csvread(permTestFiles{i});
    label = x(1);
    permTestLabels = [permTestLabels, label];
    for j = 1 : noTest
        y = csvread(permTestFiles{j});
        dtwDistance = dtwWrapper(x(2:end), y(2:end));
        rbfValue = exp((dtwDistance.^2)./(-2*sigma));
        PermSimilarityMatrixTest(i, j) = rbfValue;
        n=n+1
    end
end
PermSimilarityMatrixTest = [(1:size(PermSimilarityMatrixTest, 1))', PermSimilarityMatrixTest];
mdlU = svmtrain(trainLabels', SimilarityMatrixTrain, '-t 4 -c 0.5');
mdlP = svmtrain(permTrainLabels', PermSimilarityMatrixTrain, '-t 4 -c 0.5');
[pclassU, xU, yU] = svmpredict(testLabels', SimilarityMatrixTest, mdlU);
[pclassP, xP, yP] = svmpredict(permTestLabels', PermSimilarityMatrixTest, mdlP);
xU
xP
end
I'd be very thankful for any answer!
Regards
Benjamin
After cleaning up the code and letting a colleague of mine have a look at it, we finally found the bug. Of course, I have to compute the testing matrix from the training and testing samples together (so that the SVM can predict the testing data using the sum over the products of the alpha-values of the training vectors, which are zero for non-support-vectors). I hope this clarifies the problem for any of you. To make it more clear, see my revised code below. The libsvm examples on using precomputed kernels also show, if you look closely, that the testing matrix is computed from train and test vectors. Feel free to post comments and/or answers if you have any further remarks/questions/tips!
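To spell the reasoning out (this is standard SVM background, not anything specific to my code): the decision function is f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, where the sum runs over the training support vectors x_i. Every kernel value needed at prediction time therefore pairs a test sample x with a training sample x_i, which is why the test matrix needs one row per test sample and one column per training sample.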
function [tacc, testacc, mdl, SimilarityMatrixTrain, SimilarityMatrixTest, trainLabels, testLabels] = computeSimilarityMatrix(dirName)
fileList = getAllFiles(dirName);
fileList = fileList(1:72);
trainLabels = [];
testLabels = [];
trainFiles = {};
testFiles = {};
n = 0;
sigma = 0.01;
trainFiles = fileList(1:2:end);
testFiles = fileList(2:5:end);
noTrain = size(trainFiles);
noTest = size(testFiles);
permTrain = randperm(noTrain(1));
permTest = randperm(noTest(1));
trainFiles = trainFiles(permTrain);
testFiles = testFiles(permTest);
%Train
for i = 1 : noTrain(1)
    x = csvread(trainFiles{i});
    label = x(1);
    trainlabel = label;
    trainLabels = [trainLabels, label];
    for j = 1 : noTrain(1)
        y = csvread(trainFiles{j});
        dtwDistance = dtwWrapper(x(2:end), y(2:end));
        rbfValue = exp((dtwDistance.^2)./(-2*sigma.^2));
        SimilarityMatrixTrain(i, j) = rbfValue;
    end
end
SimilarityMatrixTrain = [(1:size(SimilarityMatrixTrain, 1))', SimilarityMatrixTrain];
%Test
for i = 1 : noTest(1)
    x = csvread(testFiles{i});
    label = x(1);
    testlabel = label;
    testLabels = [testLabels, label];
    for j = 1 : noTrain(1)
        y = csvread(trainFiles{j});
        dtwDistance = dtwWrapper(x(2:end), y(2:end));
        rbfValue = exp((dtwDistance.^2)./(-2*sigma.^2));
        SimilarityMatrixTest(i, j) = rbfValue;
    end
end
SimilarityMatrixTest = [(1:size(SimilarityMatrixTest, 1))', SimilarityMatrixTest];
mdlU = svmtrain(trainLabels', SimilarityMatrixTrain, '-t 4 -c 1000 -q');
fprintf('TEST: '); [pclassU, xU, yU] = svmpredict(testLabels', SimilarityMatrixTest, mdlU);
fprintf('TRAIN: ');[pclassT, xT, yT] = svmpredict(trainLabels', SimilarityMatrixTrain, mdlU);
tacc = xT(1);
testacc = xU(1);
mdl = mdlU;
end
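As a minimal, self-contained sketch of the shapes libsvm's precomputed-kernel mode expects (toy random data and a plain linear kernel as a stand-in for my RBF-over-DTW kernel; the variable names are my own placeholders, only the shapes and the serial-number column matter):
rng(1);
Xtr = randn(20, 5);  ytr = [ones(10,1); -ones(10,1)];
Xte = randn(8, 5);   yte = [ones(4,1);  -ones(4,1)];
Ktrain = Xtr * Xtr';                        % nTrain x nTrain: K(train_i, train_j)
Ktest  = Xte * Xtr';                        % nTest  x nTrain: K(test_i, train_j)
mdl = svmtrain(ytr, [(1:20)', Ktrain], '-t 4 -c 1 -q');
[pred, acc, dec] = svmpredict(yte, [(1:8)', Ktest], mdl);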
Regards
Benjamin
Related
I'm trying to find the optimal hyperparameters for my SVM model using a grid search, but it simply returns 1 for both hyperparameters.
function evaluations = inner_kfold_trainer(C,q,k,features_xy,labels)
features_xy_flds = kdivide(features_xy, k);
labels_flds = kdivide(labels, k);
evaluations = zeros(k,3);
for i = 1:k
    fprintf('Fold %i of %i\n',i,k);
    train_data = cell2mat(features_xy_flds(1:end ~= i));
    train_labels = cell2mat(labels_flds(1:end ~= i));
    test_data = cell2mat(features_xy_flds(i));
    test_labels = cell2mat(labels_flds(i));
    %AU1
    train_labels = train_labels(:,1);
    test_labels = test_labels(:,1);
    [k,~] = size(test_labels);
    %train
    sv = fitcsvm(train_data,train_labels, 'KernelFunction','polynomial', 'PolynomialOrder',q,'BoxConstraint',C);
    sv.predict(test_data);
    %Calculate evaluative measures
    %svm_outputs = zeros(k,1);
    sv_predictions = sv.predict(test_data);
    [precision,recall,F1] = evaluation(sv_predictions,test_labels);
    evaluations(i,1) = precision;
    evaluations(i,2) = recall;
    evaluations(i,3) = F1;
end
save('eval.mat', 'evaluations');
end
This is an inner-fold cross-validation function,
and below is the grid-search function where something seems to be going wrong:
function [q,C] = grid_search(features_xy,labels,k)
% n x n grid
n = 3;
q_grid = linspace(1,19,n);
C_grid = linspace(1,59,n);
tic
evals = zeros(n,n,3);
for i = 1:n
    for j = 1:n
        fprintf('## i=%i, j=%i ##\n', i, j);
        svm_results = inner_kfold_trainer(C_grid(i), q_grid(j),k,features_xy,labels);
        evals(i,j,:) = mean(svm_results(:,:));
        % precision only
        %evals(i,j,:) = max(svm_results(:,1));
        toc
    end
end
f = evals;
% retrieving the best value of the hyper parameters, to use in the outer
% fold
[M1,I1] = max(f);
[~,I2] = max(M1(1,1,:));
index = I1(:,:,I2);
C = C_grid(index(1))
q = q_grid(index(2))
end
When I run grid_search(features_xy,labels,8), for example, I get C=1 and q=1, for any value of k (the number of folds). Also, features_xy is a 500x98 matrix.
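One thing worth double-checking (my own guess at the cause, not something established above): max applied to the 3-D evals array operates along the first dimension only, so I1 is a 1 x n x 3 array of row indices rather than the (i,j) position of the single best grid cell. A sketch of a more direct way to pick the best cell, assuming the goal is to maximise mean F1 (the third page of evals) and keeping the variable names from the posted code:
F1page = evals(:,:,3);                          % n x n matrix of mean F1 per (C, q) pair
[~, linIdx] = max(F1page(:));                   % argmax over the whole grid as a linear index
[iBest, jBest] = ind2sub(size(F1page), linIdx);
C = C_grid(iBest)                               % the loops fill evals(i,j,:) using C_grid(i) ...
q = q_grid(jBest)                               % ... and q_grid(j)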
I wrote clamped cubic spline code.
But when I run
f_ = CubicSpline([0,1,2,3],[exp(0),exp(1),exp(2),exp(3)],exp(0),exp(3));
and read off the answer with sym2poly(f_(1)),
the result is quite different from my lecture notes. In fact, my cubic spline doesn't even match the prescribed derivative at the boundary...
I can't work out what the problem in my code is.
This is what I used for my algorithm.
function [f_] = CubicSpline(x0,f0,FPO,FPN)
syms x;
n = length(x0);
h = zeros(n,1);
alpha = zeros(n,1);
l = zeros(n,1);
u = zeros(n,1);
z = zeros(n,1);
a = zeros(n,1);
b = zeros(n,1);
c = zeros(n,1);
d = zeros(n,1);
for iter = 1:n-1
    h(iter) = x0(iter+1)-x0(iter);
end
alpha(1) = 3*(f0(2)-f0(1))/h(1)-3*FPO;
alpha(n) = 3*FPN-3*(f0(n)-f0(n-1))/h(n-1);
for iter = 1:n
    a(iter) = f0(iter);
end
for iter = 2:n-1
    alpha(iter) = 3/h(iter)*(f0(iter+1)-f0(iter))-3/h(iter-1)*(f0(iter)-f0(iter-1));
end
l(1) = 2*h(1);
u(1) = 0.5;
z(1) = f0(1)/l(1);
for iter = 2:n-1
    l(iter) = 2*(x0(iter+1)-x0(iter-1)) - h(iter-1)*u(iter-1);
    u(iter) = h(iter)/l(iter);
    z(iter) = (alpha(iter)-h(iter-1)*z(iter-1))/l(iter);
end
l(n) = h(n-1)*(2-u(n-1));
z(n) = (alpha(n)-h(n-1)*z(n-1))/l(n);
c(n) = z(n);
for iter = (n-1):-1:1
    c(iter) = z(iter)-u(iter)*c(iter+1);
    b(iter) = (f0(iter+1)-f0(iter))/h(iter)-h(iter)*(c(iter+1)+2*c(iter))/3;
    d(iter) = (c(iter+1)-c(iter))/(3*h(iter));
end
for iter = 1:n-1
    f_(iter) = a(iter) + b(iter)*(x-x0(iter)) + c(iter)*(x-x0(iter))^2 + d(iter)*(x-x0(iter))^3;
end
end
There is a typo in your code for step 4
z(1) = f0(1)/l(1);
should be
z(1) = alpha(1)/l(1);
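With that one-character fix in place, a quick sanity check (just a sketch; it assumes the Symbolic Math Toolbox is available, since CubicSpline uses syms x) is to verify the clamped end conditions S'(0) = exp(0) and S'(3) = exp(3):
f_ = CubicSpline([0 1 2 3], exp([0 1 2 3]), exp(0), exp(3));
double(subs(diff(f_(1)), 0))   % should come out close to exp(0) = 1
double(subs(diff(f_(3)), 3))   % should come out close to exp(3), about 20.09
sym2poly(f_(1))                % coefficients of the first piece, for comparison with the notes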
I'm trying to implement k-NN in MATLAB. I have a matrix of 214 samples with 9 attribute columns, the 10th column being the label. I want to measure loss with a 0-1 loss function over 10 cross-validation folds. I have the following code:
function q3(file)
data = knnfile(file);
loss(data(:,1:9),'KFold',data(:,10))
losses = zeros(25,3);
new_data = data;
new_data(:,10) = [];
sdd = std(new_data);
meand = mean(new_data);
for s = 1:214
    for q = 1:9
        new_data(s,q) = (new_data(s,q) - meand(q)) / sdd(q);
    end
end
new_data = [new_data data(:,10)];
for k = 1:25
    loss1 = 0;
    loss2 = 0;
    for j = 0:9
        index = floor(214/10)*j+1;
        curd1 = data([1:index-1,index+21:end],:);
        curd2 = new_data([1:index-1,index+21:end],:);
        for l = 0:20
            c1 = knn(curd1,k,data(index+l,:));
            c2 = knn(curd2,k,new_data(index+l,:));
            loss1 = loss1 + (c1 ~= data(index+l,10));
            loss2 = loss2 + (c2 ~= new_data(index+l,10));
        end
    end
    losses(k,1) = k;
    losses(k,2) = 100*loss1/210;
    losses(k,3) = 100*loss2/210;
end
function cluster = knn(Data,k,x)
distances = zeros(193,2);
for i = 1:size(Data,1)
    row = Data(i,:);
    d = norm(row(1:size(row,2)-1) - x(1:size(x,2)-1));
    distances(i,:) = [d row(10)];
end
distances = sortrows(distances,1);
cluster = mode(distances(1:k,2));
I'm getting 40%+ loss with almost no dependence on k, and I'm sure something here is wrong, but I can't see what.
Any help would be appreciated!
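As a sanity check against this hand-rolled version (just a sketch; it assumes the Statistics and Machine Learning Toolbox and that data is the 214x10 matrix described above, labels in column 10), the built-in k-NN with 10-fold cross-validation can give reference numbers for the same values of k:
X = data(:, 1:9);
Y = data(:, 10);
for k = 1:25
    mdl = fitcknn(X, Y, 'NumNeighbors', k, 'Standardize', true);
    cv  = crossval(mdl, 'KFold', 10);      % 10-fold cross-validation
    err = kfoldLoss(cv);                   % default loss is the misclassification rate (0-1 loss)
    fprintf('k = %2d   0-1 loss = %5.2f%%\n', k, 100*err);
end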
I executed this code using a 517x11 feature matrix and a 517x1 label matrix. But once the dimensions of the matrices change, the code can't be run. How can I fix this?
The error is:
Subscripted assignment dimension mismatch.
on this line:
edges(k,j) = quantlevels(a);
Here is my code:
function [features,weights] = MI(features,labels,Q)
if nargin < 3
    Q = 12;
end
edges = zeros(size(features,2),Q+1);
for k = 1:size(features,2)
    minval = min(features(:,k));
    maxval = max(features(:,k));
    if minval==maxval
        continue;
    end
    quantlevels = minval:(maxval-minval)/500:maxval;
    N = histc(features(:,k),quantlevels);
    totsamples = size(features,1);
    N_cum = cumsum(N);
    edges(k,1) = -Inf;
    stepsize = totsamples/Q;
    for j = 1:Q-1
        a = find(N_cum > j.*stepsize,1);
        edges(k,j) = quantlevels(a);
    end
    edges(k,j+2) = Inf;
end
S = zeros(size(features));
for k = 1:size(S,2)
    S(:,k) = quantize(features(:,k),edges(k,:))+1;
end
I = zeros(size(features,2),1);
for k = 1:size(features,2)
    I(k) = computeMI(S(:,k),labels,0);
end
[weights,features] = sort(I,'descend');
%% EOF
function [I,M,SP] = computeMI(seq1,seq2,lag)
if nargin < 3
    lag = 0;
end
if(length(seq1) ~= length(seq2))
    error('Input sequences are of different length');
end
lambda1 = max(seq1);
symbol_count1 = zeros(lambda1,1);
for k = 1:lambda1
    symbol_count1(k) = sum(seq1 == k);
end
symbol_prob1 = symbol_count1./sum(symbol_count1)+0.000001;
lambda2 = max(seq2);
symbol_count2 = zeros(lambda2,1);
for k = 1:lambda2
    symbol_count2(k) = sum(seq2 == k);
end
symbol_prob2 = symbol_count2./sum(symbol_count2)+0.000001;
M = zeros(lambda1,lambda2);
if(lag > 0)
    for k = 1:length(seq1)-lag
        loc1 = seq1(k);
        loc2 = seq2(k+lag);
        M(loc1,loc2) = M(loc1,loc2)+1;
    end
else
    for k = abs(lag)+1:length(seq1)
        loc1 = seq1(k);
        loc2 = seq2(k+lag);
        M(loc1,loc2) = M(loc1,loc2)+1;
    end
end
SP = symbol_prob1*symbol_prob2';
M = M./sum(M(:))+0.000001;
I = sum(sum(M.*log2(M./SP)));
function y = quantize(x, q)
x = x(:);
nx = length(x);
nq = length(q);
y = sum(repmat(x,1,nq)>repmat(q,nx,1),2);
I've run the function several times without getting any error.
As input for seq1 and seq2 I used arrays such as 1:10 and 11:20.
A possible error might arise in the loops
for k = 1:lambda1
symbol_count1(k) = sum(seq1 == k);
end
if seq1 and seq2 are defined as matrices, since sum will then return an array while
symbol_count1(k)
is expected to be a single value.
Another possible error might arise if seq1 and seq2 are not of integer type, since they are used as indices in
M(loc1,loc2) = M(loc1,loc2)+1;
Hope this helps.
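For instance, a tiny illustration of the first point (made-up data, not taken from the question):
seq1 = [1 2; 2 3];                    % a matrix instead of a vector
k = 2;
sum(seq1 == k)                        % sums along columns -> [1 1], a row vector
symbol_count1 = zeros(3,1);
symbol_count1(k) = sum(seq1 == k);    % errors: a 1x2 result cannot go into a single element
% A vector input, e.g. seq1 = [1 2 2 3], would give the scalar 2 as intended.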
I am very new to Matlab. What I am trying to do is classify the iris dataset using cross-validation (which means I have to split the dataset into three parts: training set, validation set, and test set). In my mind, everything I write here is OK (being a beginner is hard sometimes), so I could use a little help...
This is the script that splits the data (the first 35 samples of each class, 70% of the data, are the training set; the rest are the validation set (15%) and the test set (15%), which I will use later):
close all; clear ;
load fisheriris;
for i = 1:35
    for j = 1:4
        trainSeto(i,j) = meas(i,j);
    end
end
for i = 51:85
    for j = 1:4
        trainVers(i-50,j) = meas(i,j);
    end
end
for i = 101:135
    for j = 1:4
        trainVirg(i-100,j) = meas(i,j);
    end
end
for i = 36:43
    for j = 1:4
        valSeto(i-35,j) = meas(i,j);
    end
end
for i = 86:93
    for j = 1:4
        valVers(i-85,j) = meas(i,j);
    end
end
for i = 136:143
    for j = 1:4
        valVirg(i-135,j) = meas(i,j);
    end
end
for i = 44:50
    for j = 1:4
        testSeto(i-43,j) = meas(i,j);
    end
end
for i = 94:100
    for j = 1:4
        testVers(i-93,j) = meas(i,j);
    end
end
for i = 144:150
    for j = 1:4
        testVirg(i-143,j) = meas(i,j);
    end
end
And this is the main script:
close all; clear;
%% the 3 types of iris
run divinp
% the representation of the 3 classes(their coding)
a = [-1 -1 +1]';
b = [-1 +1 -1]';
c = [+1 -1 -1]';
%training set
trainInp = [trainSeto trainVers trainVirg];
%the targets
T = [repmat(a,1,length(trainSeto)) repmat(b,1,length(trainVers)) repmat(c,1,length(trainVirg))];
%%the training
trainCor = zeros(10,10);
valCor = zeros(10,10);
Xn = zeros(1,10);
Yn = zeros(1,10);
for k = 1:10,
    Yn(1,k) = k;
    for n = 1:10,
        Xn(1,n) = n;
        net = newff(trainInp,T,[k n],{},'trainbfg');
        net = init(net);
        net.divideParam.trainRatio = 1;
        net.divideParam.valRatio = 0;
        net.divideParam.testRatio = 0;
        net.trainParam.max_fail = 2;
        valInp = [valSeto valVers valVirg];
        valT = [repmat(a,1,length(valSeto)) repmat(b,1,length(valVers)) repmat(c,1,length(valVirg))];
        [net,tr] = train(net,trainInp,T);
        Y = sim(net,trainInp);
        [Yval,Pfval,Afval,Eval,perfval] = sim(net,valInp,[],[],valT);
        % calculate [%] of correct classifications
        trainCor(k,n) = 100 * length(find(T.*Y > 0)) / length(T);
        valCor(k,n) = 100 * length(find(valT.*Yval > 0)) / length(valT);
    end
end
figure
surf(Xn,Yn,trainCor/3);
view(2)
figure
surf(Xn,Yn,valCor/3);
view(2)
I get this error:
Error using trainbfg (line 120)
Inputs and targets have different numbers of samples.
Error in network/train (line 106)
[net,tr] = feval(net.trainFcn,net,X,T,Xi,Ai,EW,net.trainParam);
Error in ClassIris (line 38)
[net,tr] = train(net,trainInp,T);
close all; clear ;
load fisheriris;
trainSetoIndx = 1:35;
trainVersIndx = 51:85; % or: trainVersIndx = trainSetoIndx + 50;
trainVirgIndx = 101:135;
colIndx = 1:4;
trainSeto = meas(trainSetoIndx, colIndx);
trainVers = meas(trainVersIndx, colIndx);
trainVirg = meas(trainVirgIndx, colIndx);
valSetoIndx = 36:43;
valVersIndx = 86:93;
valVirgIndx = 136:143
valSeto = meas(valSetoIndx, colIndx);
valVers = meas(valVersIndx, colIndx);
valVirg = meas(valVirgIndx, colIndx);
testSetoIndx = 44:50;
testVersIndx = 94:100;
testVirgIndx = 144:150
testSeto = meas(testSetoIndx, colIndx);
testVers = meas(testVersIndx, colIndx);
testVirg = meas(testVirgIndx, colIndx);
I have written it with ":" as well, and I still get the same problem. It's something with repmat... I don't know how to use it properly, or newff :D
Just to get you started, you can rewrite your code loops as follows:
trainSetoIndx = 1:35;
trainVersIndx = 51:85; % or: trainVersIndx = trainSetoIndx + 50;
trainVirgIndx = 101:135; % or: trainVirgIndx = trainSetoIndx + 100;
colIndx = 1:4; % can't tell if this is all the columns in meas
trainSeto = meas(trainSetoIndx, colIndx);
trainVers = meas(trainVersIndx, colIndx);
trainVirg = meas(trainVirgIndx, colIndx);
Then do the same thing for all the others:
valSetoIndx = 36:43;
etc.
Next, simply type whos at the command prompt and you will see the sizes of all the arrays you have created. See whether the ones that need to be the same size have, in fact, the same dimensions.
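In the same spirit, here is a rough sketch of the layout the error message is complaining about (my own guess, not tested against the full script): newff and train expect one column per sample, so the input matrix and the target matrix must have the same number of columns. With the three 35x4 training blocks above, that would look something like:
trainInp = [trainSeto; trainVers; trainVirg]';            % 4 x 105: one column per sample
T = [repmat(a,1,size(trainSeto,1)), ...
     repmat(b,1,size(trainVers,1)), ...
     repmat(c,1,size(trainVirg,1))];                      % 3 x 105: same number of columns
valInp = [valSeto; valVers; valVirg]';                    % 4 x 24
valT = [repmat(a,1,size(valSeto,1)), repmat(b,1,size(valVers,1)), repmat(c,1,size(valVirg,1))];
whos trainInp T valInp valT                               % sample counts should now line up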