Cross validation for SVM-regression

I want to perform cross-validation to select the best parameters gamma and C for the RBF kernel of an SVR (Support Vector Regression). I'm using LIBSVM. I have a database that contains 4 groups of 3D meshes.
My question is:
Is the approach I am using OK for 4-fold cross-validation? I think that, to select the parameters C and gamma of the RBF kernel, I must minimize the error between the predicted values and the ground-truth values.
I also have another problem: I get a NaN value during the cross-validation (Squared correlation coefficient = nan (regression)).
Here is the code I wrote:
[C,gamma] = meshgrid(-5:2:15, -15:2:3); % range of values for C and gamma
%# grid search, and cross-validation
for m=1:numel(C)
    for k=1:4
        fid1 = fopen(sprintf('list_learning_%d.txt',k), 'rt');
        i=1;
        while feof(fid1) == 0
            tline = fgetl(fid1);
            v= load(tline);
            v=normalize(v);
            matrix_feature_tmp(i,:)=v;
            i=i+1;
        end
        fclose(fid1);
        % I fill matrix_feature_train of size m by n via matrix_feature_tmp
        %% construction of the test matrix
        fid2 = fopen(sprintf('liste_features_test%d.txt',k), 'rt');
        i=1;
        while feof(fid2) == 0
            tline = fgetl(fid2);
            v= load(tline);
            v=normalize(v);
            matrice_feature_test_tmp(i,:)=v;
            i=i+1;
        end
        fclose(fid2);
        % I fill matrix_feature_test of size m by k via matrix_feature_test_tmp
        mos_learning=load(sprintf('mos_learning_%d.txt',k));
        mos_wanted=load(sprintf('mos_test%d.txt',k));
        model = svmtrain(mos_learning, matrix_feature_train', ...
            sprintf('-s %f -t %f -c %f -g %f -p %f ',3,2 ,2^C(m),2^gamma(m),1 ));
        [y_hat, Acc, projection] = svmpredict(mos_wanted, ...
            matrix_feature_test', model);
        MSE_Test = mean((y_hat-mos_wanted).^2);
        vecc_error(k)=MSE_Test;
    end
    mean_vec_error_fold(m)=mean(vecc_error);
end
% select the best gamma and C
[~,idx]=min(mean_vec_error_fold);
best_C = 2^C(idx);
best_gamma = 2^gamma(idx);
% training with best parameters
% for example
model = svmtrain(mos_learning1, matrice_feature_train1', ...
    sprintf('-s %f -t %f -c %f -g %f -p %f ',3,2 ,best_C, best_gamma,1 ));
[y_hat_final, Acc, projection] = svmpredict(mos_test1,matrice_feature_test1', ...
    model);

Based on your description, without reading your code, it sounds like you are NOT doing cross-validation. Cross-validation requires you to pick a parameter set (i.e. a value for C and gamma) and, holding those parameters constant, train on k-1 folds and test on the remaining fold. You do this k times so that each fold is used as the test set exactly once. Then you aggregate the error / accuracy measure over these k tests; that aggregate is the measure you use to rank those parameters for a model trained on ALL the data. Call this the cross-validation error for the parameter set you used. You then repeat the process for a range of different parameters and choose the parameter set with the best accuracy / lowest CV error. Your final model is trained on all your data.
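The procedure described above can be sketched as two nested loops. This is only an illustration under stated assumptions: the data are synthetic, and a ridge regression solved with the backslash operator stands in for the SVR so that the snippet runs without LIBSVM. The names foldId, lambdas and cvErr are made up for the example; with LIBSVM, the two marked lines would become calls to svmtrain and svmpredict, and the single hyperparameter would become a (C, gamma) pair.

```matlab
% Sketch: k-fold cross-validation to rank hyperparameter candidates.
% Ridge regression is a stand-in learner (assumption, not the asker's SVR).
n = 80;
X = randn(n, 3);                              % synthetic features
y = X * [2; -1; 0.5] + 0.1 * randn(n, 1);     % synthetic targets

k = 4;
foldId  = mod(randperm(n), k)' + 1;           % random fold label 1..k per sample
lambdas = 2 .^ (-5:2:5);                      % candidate hyperparameter values
cvErr   = zeros(size(lambdas));

for p = 1:numel(lambdas)                      % outer loop: parameter candidates
    foldErr = zeros(k, 1);
    for f = 1:k                               % inner loop: each fold is the test set once
        te = (foldId == f);
        tr = ~te;
        % train on the other k-1 folds (replace with svmtrain)
        w = (X(tr,:)' * X(tr,:) + lambdas(p) * eye(size(X, 2))) \ (X(tr,:)' * y(tr));
        % predict on the held-out fold (replace with svmpredict)
        yHat = X(te,:) * w;
        foldErr(f) = mean((yHat - y(te)).^2);
    end
    cvErr(p) = mean(foldErr);                 % CV error for this candidate
end

[~, best] = min(cvErr);                       % lowest CV error wins
bestLambda = lambdas(best);
% final model: retrain on ALL the data with the selected value
wFinal = (X' * X + bestLambda * eye(size(X, 2))) \ (X' * y);
```

Note that the parameters are held fixed inside the inner loop and only the train/test split changes; the parameters change only in the outer loop.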
Your code doesn't really make sense to me. Looking at this snippet
folds = 4;
for i=1:numel(C)
    cv_acc(i) = svmtrain(ground_truth, matrice_feature_train', ...
        sprintf(' -s %d -t %d -c %f -g %f -p %d -v %d',3,2, ...
        2^C(i), 2^gamma(i), 1, 4)); %Kernel RBF
end
What is it that cv_acc contains? To me it contains the actual SVM model (an SVMStruct if you use the MATLAB toolbox, something else if you used LIBSVM). This would be OK IF you were using your loop to change which folds are used as the training set. However, you have used it to change the value of your gamma and C parameters, which is incorrect. You later call min(cv_acc), so I'm now guessing that you think the call to svmtrain actually returned the training error? I don't see how you can meaningfully call min on an array of structures like that, but I could be wrong. Even so, you aren't actually interested in minimising your training error; you want to minimise your cross-validation error, which is the aggregate of the test errors from your k runs and has nothing to do with your training error.
Now it's impossible to actually know if you've done this bit wrong, since you don't show us the vectors of gamma and C, but it's strange to have only 1 loop rather than a nested loop to iterate through these (unless you have arranged them like a truth table, but I doubt that). You need to test each potential value of C paired with each value of gamma. Currently it looks like you're only trying 1 value of gamma for each value in C.
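For reference, a single loop does cover every pairing when the two ranges have been expanded with meshgrid, which is the truth-table arrangement. A minimal sketch, using the exponent ranges from the question (the variable names log2C, log2G, Cg, Gg and pairs are chosen here for illustration):

```matlab
% Expand two 1-D ranges of exponents into every (C, gamma) combination.
log2C = -5:2:15;                       % 11 candidate exponents for C
log2G = -15:2:3;                       % 10 candidate exponents for gamma
[Cg, Gg] = meshgrid(log2C, log2G);     % both are 10-by-11 matrices

% Linear indexing walks all 110 cells, so one loop visits every pair once.
pairs = zeros(numel(Cg), 2);
for idx = 1:numel(Cg)
    pairs(idx, :) = [2^Cg(idx), 2^Gg(idx)];   % actual C and gamma values
end
```

If C and gamma were plain vectors instead, the same single loop would only pair the i-th C with the i-th gamma, and a nested loop would be needed.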
Have a look at this answer to see an example of cross-validation used with SVM.

Related

matlab libsvm: unable to predict

I am using libsvm on Matlab. I want to build a model and use this model for prediction.
It is weird that the outputs of svmpredict ([predict_label, accuracy_all, prob_values]) are empty. Here is my simple code:
svm_model = svmtrain([train_label],[train],'-t 2, -c 100 -q');
[predict_label, accuracy_all, prob_values] = svmpredict(testlabels,testdata,svm_model,'-q, -b 1');
[predict_label, accuracy_all, prob_values] are all 0x0 matrices. Matlab also prints this usage message:
Usage: [predicted_label, accuracy, decision_values/prob_estimates] = svmpredict(testing_label_vector, testing_instance_matrix, model, 'libsvm_options')
Parameters:
model: SVM model structure from svmtrain.
libsvm_options:
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); one-class SVM not supported yet
Returns:
predicted_label: SVM prediction output vector.
accuracy: a vector with accuracy, mean squared error, squared correlation coefficient.
prob_estimates: If selected, probability estimate vector.
Can anyone help me?

What is q in the SVM model? Where is its value?
There are two parameters in the SVM that need to be well defined, c and g. You have given c the value 100, but there is no value for q (or perhaps you meant g, which is called gamma).
This is what you need:
cmd = ['-t 2 -c ',num2str(C), ' -g ',num2str(gamma) ];
model = svmtrain2(trainClass, trainData, cmd);
[predClass, acc, decVals] = svmpredict(testClass, testData, model);
Also, I think svmtrain should be renamed svmtrain2 to avoid confusion with MATLAB's own svmtrain function.

libsvm cross validation with precomputed kernel in matlab

I am trying to do 5-fold cross-validation with libsvm (matlab) using a precomputed kernel, but I get the following error message:
Undefined function 'ge' for input arguments of type 'struct'.
This is because LIBSVM returns a structure instead of a value in cross-validation. How can I solve this problem? This is my code:
load('iris.dat')
data=iris(:,1:4);
class=iris(:,5);
% normalize the data
range=repmat((max(data)-min(data)),size(data,1),1);
data=(data-repmat(min(data),size(data,1),1))./range;
% train
tr_data=[data(1:5,:);data(52:56,:);data(101:105,:)];
tr_lbl=[ones(5,1);2*ones(5,1);3*ones(5,1)];
% kernel computation
sigma=.8
rbfKernel = @(X,Y,sigma) exp((-pdist2(X,Y,'euclidean').^2)./(2*sigma^2));
Ktr=[(1:15)',rbfKernel(tr_data,tr_data,sigma)];
kts=[ (1:150)',rbfKernel(data,tr_data,sigma)];
% svmptrain
bestcv = 0;
for log2c = -1:3
cmd = ['Ktr -t 4 -v 5 -c ', num2str(2^log2c)];
cv = svmtrain2(tr_lbl,tr_data, cmd);
if (cv >= bestcv)
bestcv = cv;
bestc = 2^log2c;
end
end
cmd=['-s 0 -c ', num2str(bestc), 'Ktr -t 4']
model=svmtrain2(tr_lbl,tr_data,cmd)
% svm predict
labels=svmpredict(class,data,model,kts)

The function svmtrain2 you are using is not part of standard MATLAB, and its output is not a structure. If you insist on using it, you can calculate a score for the data with the companion function:
[f,K] = svmeval(X_eval,varargin)
which evaluates the trained SVM using the outputs from svmtrain2. However, I would first try the standard functions built into MATLAB. The standard MATLAB library has:
SVMStruct = svmtrain(Training,Group)
which returns a structure, SVMStruct, containing information about the trained support vector machine (SVM) classifier, or
SVMModel = fitcsvm(X,Y)
which returns a support vector machine classifier SVMModel, trained on predictors X and class labels Y for one- or two-class classification. You can then get a score for each prediction using:
[label,Score] = predict(SVMModel,X)
which returns class likelihood measures, i.e. either scores or posterior probabilities.

You get that error because you are trying to compare a struct and a number.
If what you want is to find the best performance on the training set (as it seems from your comparison), I don't think you can get it directly from the structure returned by svmtrain. You should first call svmpredict with the training set and the trained model, and then read the accuracy from its output.

Bad results when testing libsvm in matlab

Can someone help me solve this?
I want to test whether this classification is good or not, so I tried using the training data as the testing data. It should give 100% accuracy if the classification is good.
This is the code that I found on this site:
data= [170 66 ;
160 50 ;
170 63 ;
173 61 ;
168 58 ;
184 88 ;
189 94 ;
185 88 ]
labels=[-1;-1;-1;-1;-1;1;1;1];
numInst = size(data,1);
numLabels = max(labels);
testVal = [1 2 3 4 5 6 7 8];
trainLabel = labels(testVal,:);
trainData = data(testVal,:);
testData=data(testVal,:);
testLabel=labels(testVal,:);
numTrain = 8; numTest =8
%# train one-against-all models
model = cell(numLabels,1);
for k=1:numLabels
model{k} = svmtrain(double(trainLabel==k), trainData, '-c 1 -t 2 -g 0.2 -b 1');
end
%# get probability estimates of test instances using each model
prob = zeros(numTest,numLabels);
for k=1:numLabels
[~,~,p] = svmpredict(double(testLabel==k), testData, model{k}, '-b 1');
prob(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end
%# predict the class with the highest probability
[~,pred] = max(prob,[],2);
acc = sum(pred == testLabel) ./ numel(testLabel) %# accuracy
C = confusionmat(testLabel, pred) %# confusion matrix
and these are the results:
optimization finished, #iter = 16
nu = 0.645259 obj = -2.799682,
rho = -0.437644 nSV = 8, nBSV = 1 Total nSV = 8
Accuracy = 100% (8/8) (classification)
acc =
0.3750
C =
0 5
0 3
I don't know why there are two accuracies, and why they differ. The first one is 100% and the second one is 0.375. Is my code wrong? It should be 100%, not 37.5%. Can you help me correct this code?

If you're using libsvm then you should change the name of the MEX file, since Matlab already has an SVM toolbox function named svmtrain. However, the code is running, so it seems you did change the name, just not in the code you provided.
The second one is wrong; I don't know exactly why. However, I can tell you that you will almost always get 100% accuracy if you use test_Data = training_Data. That result really does not mean anything, since the algorithm can be overfit without it showing in your results. Test your algorithm against new data and that will give you a realistic accuracy.

Is that the code you're using? I don't think your svmtrain invocation is valid. You should have svmtrain(MAT, VECT, ...) where MAT is a matrix of data and VECT is a vector with the labels of each row of MAT. The remaining parameters are string-value pairs, meaning you'll have a string identifier and its corresponding value.
When I ran your code (Linux, R2011a) I got an error on the svmtrain call. Running with svmtrain(trainData, double(trainLabel==k)) gave a valid output (for that line). Of course, it appears that you're not using pure matlab, as your svmpredict call isn't native matlab, but rather a matlab binding from LIBSVM...

C = confusionmat(testLabel, pred)
swap their positions
C= confusionmat(pred,testLabel)
or use this
[ConMat,order] = confusionmat(pred,testLabel)
shows the confusion matrix and the class order

The problem is in
[~,~,p] = svmpredict(double(testLabel==k), testData, model{k}, '-b 1');
p does not contain the predicted labels, it has the probability estimates of the labels being correct. LIBSVM's svmpredict already calculates accuracy for you correctly, that's why it says 100% in the debug output.
The fix is simple:
[p,~,~] = svmpredict(double(testLabel==k), testData, model{k}, '-b 1');
According to LIBSVM's Matlab bindings README:
The function 'svmpredict' has three outputs. The first one,
predictd_label, is a vector of predicted labels. The second output,
accuracy, is a vector including accuracy (for classification), mean
squared error, and squared correlation coefficient (for regression).
The third is a matrix containing decision values or probability
estimates (if '-b 1' is specified). If k is the number of classes
in training data, for decision values, each row includes results of
predicting k(k-1)/2 binary-class SVMs. For classification, k = 1 is a
special case. Decision value +1 is returned for each testing instance,
instead of an empty vector. For probabilities, each row contains k values
indicating the probability that the testing instance is in each class.
Note that the order of classes here is the same as 'Label' field
in the model structure.

I am sorry to say that all the other answers are totally wrong!!
The main error in the code is:
numLabels = max(labels);
because it returns 1, whereas it should return 2 (which it would if the labels were positive numbers), so that the svmtrain/svmpredict loop runs twice.
Anyway, change the line labels=[-1;-1;-1;-1;-1;1;1;1];
to labels=[2;2;2;2;2;1;1;1];
and it will work successfully ;)
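A small sketch of why max(labels) misbehaves here, and of one alternative that does not depend on the numeric values of the labels. Random numbers stand in for the real probability estimates (an assumption for illustration only), and the names numLabelsWrong, classes and labelIdx are made up for this example.

```matlab
labels = [-1; -1; -1; -1; -1; 1; 1; 1];

% max(labels) is 1, so only one one-vs-rest model is trained and the
% probability matrix has a single column.
numLabelsWrong = max(labels);
prob = rand(numel(labels), numLabelsWrong);   % stand-in for real probabilities
[~, pred] = max(prob, [], 2);                 % always 1 when there is one column
accWrong = mean(pred == labels);              % only the three labels equal to 1 match

% Counting distinct labels works for any label values.
classes   = unique(labels);                   % [-1; 1]
numLabels = numel(classes);                   % 2
[~, labelIdx] = ismember(labels, classes);    % labels mapped to 1..numLabels
```

Here accWrong comes out as 3/8 = 0.375, which matches the second accuracy reported in the question. With the unique-based version, the loop would compare against classes(k) rather than k, and predicted column indices would be mapped back with classes(pred).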

tensile tests in matlab

The problem says:
Three tensile tests were carried out on an aluminum bar. In each test the strain was measured at the same values of stress. The results were
where the units of strain are mm/m. Use linear regression to estimate the modulus of elasticity of the bar (modulus of elasticity = stress/strain).
I used this program for this problem:
function coeff = polynFit(xData,yData,m)
% Returns the coefficients of the polynomial
% a(1)*x^(m-1) + a(2)*x^(m-2) + ... + a(m)
% that fits the data points in the least squares sense.
% USAGE: coeff = polynFit(xData,yData,m)
% xData = x-coordinates of data points.
% yData = y-coordinates of data points.
A = zeros(m); b = zeros(m,1); s = zeros(2*m-1,1);
for i = 1:length(xData)
temp = yData(i);
for j = 1:m
b(j) = b(j) + temp;
temp = temp*xData(i);
end
temp = 1;
for j = 1:2*m-1
s(j) = s(j) + temp;
temp = temp*xData(i);
end
end
for i = 1:m
for j = 1:m
A(i,j) = s(i+j-1);
end
end
% Rearrange coefficients so that coefficient
% of x^(m-1) is first
coeff = flipdim(gaussPiv(A,b),1);
The problem is solved without a program as follows
MY ATTEMPT
T=[34.5,69,103.5,138];
D1=[.46,.95,1.48,1.93];
D2=[.34,1.02,1.51,2.09];
D3=[.73,1.1,1.62,2.12];
Mod1=T./D1;
Mod2=T./D2;
Mod3=T./D3;
xData=T;
yData1=Mod1;
yData2=Mod2;
yData3=Mod3;
coeff1 = polynFit(xData,yData1,2);
coeff2 = polynFit(xData,yData2,2);
coeff3 = polynFit(xData,yData3,2);
x1=(0:.5:190);
y1=coeff1(2)+coeff1(1)*x1;
subplot(1,3,1);
plot(x1,y1,xData,yData1,'o');
y2=coeff2(2)+coeff2(1)*x1;
subplot(1,3,2);
plot(x1,y2,xData,yData2,'o');
y3=coeff3(2)+coeff3(1)*x1;
subplot(1,3,3);
plot(x1,y3,xData,yData3,'o');
What do I have to do to get this result?

As general advice:
avoid for loops wherever possible.
avoid using i and j as variable names, as they are Matlab built-in names for the imaginary unit (I really hope that disappears in a future release...)
Because Matlab is an interpreted language, for-loops can be very slow compared to their compiled alternatives. Matlab is named for MATrix LABoratory, meaning it is highly optimized for matrix/array operations. Usually, when there is an operation that cannot be done without a loop, Matlab has a built-in function for it that runs way, way faster than a for-loop in Matlab ever will. For example: the mean of the elements in an array is mean(x), the sum of all elements is sum(x), the standard deviation is std(x), etc. Matlab's power comes from these built-in functions.
So, your problem. You have a linear regression problem. The easiest way in Matlab to solve this problem is this:
%# your data
stress = [ %# in Pa
34.5 69 103.5 138] * 1e6;
strain = [ %# in m/m
0.46 0.95 1.48 1.93
0.34 1.02 1.51 2.09
0.73 1.10 1.62 2.12]' * 1e-3;
%# make linear array for the data
yy = strain(:);
xx = repmat(stress(:), size(strain,2),1);
%# re-formulate the problem into linear system Ax = b
A = [xx ones(size(xx))];
b = yy;
%# solve the linear system
x = A\b;
%# modulus of elasticity is coefficient
%# NOTE: the y-offset is relatively small and can be ignored
E = 1/x(1)
What you did in the function polynFit is done by A\b, but the \-operator does it way faster, more robustly and more flexibly than what you tried to do yourself. I'm not saying you shouldn't try to make these things yourself (please keep on doing that, you learn a lot from it!), I'm saying that for the "real" results, always use the \-operator (and check your own results against it as well).
The backslash operator (type help \ on the command prompt) is extremely useful in many situations, and I advise you learn it and learn it well.
I leave you with this: here's how I would write your polynFit function:
function coeff = polynFit(X,Y,m)
    if numel(X) ~= numel(Y)
        error('polynFit:size_mismatch',...
            'number of elements in matrices X and Y must be equal.');
    end
    %# bad condition number, rank errors, etc. taken care of by \
    coeff = bsxfun(@power, X(:), m:-1:0) \ Y(:);
end
I leave it up to you to figure out how this works.
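As a hint, the one-liner can be unpacked on a small example. The data below are made up for illustration. bsxfun with @power raises the column of x values to each exponent in m:-1:0, which builds a Vandermonde-style matrix, and the backslash operator then solves the resulting least-squares problem.

```matlab
x = [0; 1; 2; 3; 4];                     % made-up sample points
y = [1.1; 2.9; 9.2; 18.8; 33.1];         % made-up measurements
m = 2;                                   % polynomial degree

% Each column of V is x raised to one power: [x.^2, x.^1, x.^0].
V = bsxfun(@power, x, m:-1:0);

% Least-squares solution of V * coeff = y, highest power first.
coeff = V \ y;

% polyfit solves the same problem and returns the coefficients as a row.
p = polyfit(x, y, m);
```

One difference from the original polynFit is worth noting: here m is the polynomial degree, so m+1 coefficients come back, whereas in the original m was the number of coefficients.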

Retraining after Cross Validation with libsvm

I know that cross-validation is used for selecting good parameters. After finding them, I need to re-train on the whole data without the -v option.
But the problem I face is that after I train with the -v option, I get the cross-validation accuracy (e.g. 85%). There is no model and I can't see the values of C and gamma. In that case, how do I retrain?
Btw, I am applying 10-fold cross-validation.
e.g.
optimization finished, #iter = 138
nu = 0.612233
obj = -90.291046, rho = -0.367013
nSV = 165, nBSV = 128
Total nSV = 165
Cross Validation Accuracy = 98.1273%
I need some help with this.
To get the best C and gamma, I use this code, which is available in the LIBSVM FAQ:
bestcv = 0;
for log2c = -6:10,
for log2g = -6:3,
cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
cv = svmtrain(TrainLabel,TrainVec, cmd);
if (cv >= bestcv),
bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
end
fprintf('(best c=%g, g=%g, rate=%g)\n',bestc, bestg, bestcv);
end
end
Another question: is the cross-validation accuracy obtained with the -v option similar to the accuracy we get when we train without the -v option and use that model to predict? Are the two accuracies similar?
Another question: cross-validation basically improves the accuracy of the model by avoiding overfitting. So it needs to have a model in place before it can improve it. Am I right? Besides that, if I have a different model, then the cross-validation accuracy will be different. Am I right?
One more question: in the cross-validation accuracy, what are the values of C and gamma?
The graph is something like this
Then the values are C = 2 and gamma = 0.0078125. But when I retrain the model with the new parameters, the value is not the same as 99.63%. What could be the reason?
Thanks in advance...

The -v option here is really meant to be used as a way to avoid the overfitting problem: instead of using the whole data for training, LIBSVM performs an N-fold cross-validation, training on N-1 folds and testing on the remaining fold, one at a time, then reports the average accuracy. Thus it only returns the cross-validation accuracy (assuming you have a classification problem; otherwise the mean squared error for regression) as a scalar number instead of an actual SVM model.
If you want to perform model selection, you have to implement a grid search using cross-validation (similar to the grid.py helper Python script) to find the best values of C and gamma.
This shouldn't be hard to implement: create a grid of values using MESHGRID, iterate over all pairs (C,gamma), training an SVM model with, say, 5-fold cross-validation, and choose the values with the best CV accuracy...
Example:
%# read some training data
[labels,data] = libsvmread('./heart_scale');
%# grid of parameters
folds = 5;
[C,gamma] = meshgrid(-5:2:15, -15:2:3);
%# grid search, and cross-validation
cv_acc = zeros(numel(C),1);
for i=1:numel(C)
cv_acc(i) = svmtrain(labels, data, ...
sprintf('-c %f -g %f -v %d', 2^C(i), 2^gamma(i), folds));
end
%# pair (C,gamma) with best accuracy
[~,idx] = max(cv_acc);
%# contour plot of parameter selection
contour(C, gamma, reshape(cv_acc,size(C))), colorbar
hold on
plot(C(idx), gamma(idx), 'rx')
text(C(idx), gamma(idx), sprintf('Acc = %.2f %%',cv_acc(idx)), ...
'HorizontalAlign','left', 'VerticalAlign','top')
hold off
xlabel('log_2(C)'), ylabel('log_2(\gamma)'), title('Cross-Validation Accuracy')
%# now you can train your model using best_C and best_gamma
best_C = 2^C(idx);
best_gamma = 2^gamma(idx);
%# ...

If you use your entire dataset to determine your parameters, then train on that same dataset, you are going to overfit your data. Ideally, you would divide the dataset, do the parameter search on one portion (with CV), then use the other portion to train and test with CV. Will you get better results if you use the whole dataset for both? Of course, but your model is likely not to generalize well. If you want to determine the true performance of your model, you need to do parameter selection separately.
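A minimal sketch of such a split, assuming a 50/50 division and a hypothetical sample count n; neither number comes from the question, and the names selIdx and evalIdx are made up for illustration.

```matlab
n = 100;                          % hypothetical number of samples
perm = randperm(n);               % shuffle the sample indices
nSel = round(0.5 * n);            % assumed 50/50 split; adjust as needed

selIdx  = perm(1:nSel);           % used only for the (C, gamma) grid search
evalIdx = perm(nSel+1:end);       % used only to estimate performance
```

The grid search with -v would then see only the rows in selIdx, and the reported performance would come from cross-validating with the chosen parameters on the rows in evalIdx.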
