Why does `libsvm` in MATLAB give me all-1 predictions?

I use SVM in R and MATLAB with the same dataset.
My R code works fine and gives me reasonable predictions:
matdat <- readMat(con = "data.mat")
svm.model <- svm(x = matdat$normalize.X, y = matdat$Yt)
pred <- predict(svm.model, newdata = matdat$normalize.X)
pred <- sapply(pred, function(x){ifelse(x > 0, 1, -1)})
sum(pred == matdat$Yt)/length(matdat$Yt)
But my MATLAB code predicts 1 for every sample in the training data:
load('data.mat')
model2 = svmtrain(Yt, normalize_X,'-s 3 -c 1 -t 2 -p 0.1');
[predicted_label,accuracy, decision_values] = svmpredict(Yt, normalize_X, model2);
I have checked the default parameters of svm {e1071}, which as far as I can tell agree with the MATLAB version.
I use the e1071 package, version 1.6-7, in R, and the latest libsvm from the official page.
So what can I do to find the reason? Any ideas?
==== update ====
Before feeding the data to libsvm in MATLAB, I now apply mapstd to normalize it, which R does automatically. With that change I get the same trained model in both R and MATLAB.
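The normalization mentioned in the update can be sketched in a few lines. This is only an illustration in Python, assuming the usual z-score definition (zero mean and unit sample standard deviation per feature); `standardize_columns` and the toy matrix are made up for the example. As far as I recall, mapstd works on rows rather than columns, so a samples-by-features matrix may need transposing first; check its documentation.

```python
import statistics

def standardize_columns(rows):
    """Scale each column (feature) to zero mean and unit sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # statistics.stdev uses the n-1 denominator (sample standard deviation).
    stds = [statistics.stdev(col) for col in columns]
    scaled = [
        [
            (value - m) / s if s > 0 else 0.0  # constant columns become all zeros
            for value, m, s in zip(row, means, stds)
        ]
        for row in rows
    ]
    return scaled, means, stds

# Hypothetical toy data: 4 samples, 2 features on very different scales.
X = [[1.0, 200.0], [2.0, 180.0], [3.0, 220.0], [4.0, 240.0]]
X_scaled, means, stds = standardize_columns(X)
print(means, stds)
```

The returned means and standard deviations are what you would reuse to scale any later test data, so that both tools see identically scaled inputs.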

In MATLAB you use the -s 3 option, which selects regression (epsilon-SVR), not classification.
As a starting point, don't assume anything about default parameters; specify them explicitly in both R and MATLAB.
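Because a regression model returns real-valued outputs rather than class labels, the outputs have to be thresholded before they can be compared with the labels, which is what the R snippet in the question does with `ifelse(x > 0, 1, -1)`. A minimal Python sketch of that post-processing step, with made-up values and hypothetical helper names:

```python
def to_labels(outputs, threshold=0.0):
    """Map real-valued regression outputs to +1 / -1 class labels."""
    return [1 if value > threshold else -1 for value in outputs]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Made-up regression outputs and true labels, for illustration only.
raw_outputs = [0.8, -0.3, 0.1, -1.2]
true_labels = [1, -1, -1, -1]

labels = to_labels(raw_outputs)
print(labels, accuracy(labels, true_labels))
```

Applying the same thresholding to the outputs of both tools makes the two accuracies comparable.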

Related

Why do the principal component values from Scipy and MATLAB not agree?

I was trying to do some PCA reconstructions of MNIST in Python and compare them to my (old) reconstructions in MATLAB, and I happened to discover that my reconstructions don't agree. After some debugging I decided to print a unique characteristic of the principal components of each one to reveal whether they were the same, and I discovered to my surprise that they were not. I printed the sum of all components and got different numbers. I did the following in MATLAB:
[coeff, ~, ~, ~, ~, mu] = pca(X_train);
U = coeff(:,1:K)
U_fingerprint = sum(U(:))
%print 31.0244
and in python/scipy:
pca = PCA(n_components=k)  # same PCA object as in the snippets further down
pca = pca.fit(X_train)
U = pca.components_
print 'U_fingerprint', np.sum(U)
# prints 12.814
Why are the two PCAs not computing the same values?
All my attempts at solving this issue:
I discovered this because, when I was reconstructing my MNIST images, the Python reconstructions were much closer to the original images. I got an error of 0.0221556788645 in Python, while in MATLAB I got errors of size 29.07578. To figure out where the difference was coming from, I decided to fingerprint the data sets (maybe they were normalized differently). So I got two independent copies of the MNIST data set (both normalized by dividing by 255) and computed their fingerprints (the sum of all numbers in the data set):
print np.sum(x_train) # from keras
print np.sum(X_train)+np.sum(X_cv) # from TensorFlow
6.14628e+06
6146269.1585420668
which are (essentially) the same (one copy is from TensorFlow MNIST and the other from Keras MNIST; note that the TensorFlow training set has about 1000 fewer training examples, so you need to append the missing ones). To my surprise, my MATLAB data had the same fingerprint:
data_fingerprint = sum(X_train(:))
% prints data_fingerprint = 6.1463e+06
meaning the data sets are exactly the same. Good, so data normalization is not the issue.
In my MATLAB script I am actually computing the reconstruction manually as follows:
U = coeff(:,1:K)
X_tilde_train = (U * U' * X_train);
train_error_PCA = (1/N_train)*norm( X_tilde_train - X_train ,'fro')^2
%train_error_PCA = 29.0759
so I thought that might be the problem because I was using the interface python gave for computing the reconstructions as in:
pca = PCA(n_components=k)
pca = pca.fit(X_train)
X_pca = pca.transform(X_train) # M_train x K
#print 'X_pca' , X_pca.shape
X_reconstruct = pca.inverse_transform(X_pca)
print 'tensorflow error: ',(1.0/X_train.shape[0])*LA.norm(X_reconstruct_tf - X_train)
print 'keras error: ',(1.0/x_train.shape[0])*LA.norm(X_reconstruct_keras - x_train)
#tensorflow error: 0.0221556788645
#keras error: 0.0212030354818
which results in very different error values, 0.022 vs 29.07. A shocking difference!
Thus, I decided to code that exact reconstruction formula in my python script:
pca = PCA(n_components=k)
pca = pca.fit(X_train)
U = pca.components_
print 'U_fingerprint', np.sum(U)
X_my_reconstruct = np.dot( U.T , np.dot(U, X_train.T) )
print 'U error: ',(1.0/X_train.shape[0])*LA.norm(X_reconstruct_tf - X_train)
# U error: 0.0221556788645
To my surprise, it has the same error as the one I got using the interface. Thus I conclude that I don't have the misconception about PCA that I thought I had.
All of that led me to check what the principal components actually were, and to my surprise scipy and MATLAB have different fingerprints for their PCA values.
Does anyone know why, or what's going on?
As Warren suggested, the PCA components (eigenvectors) might have different signs. After computing a fingerprint by adding all components in magnitude only, I discovered they have the same fingerprint:
[coeff, ~, ~, ~, ~, mu] = pca(X_train);
K=12;
U = coeff(:,1:K)
U_fingerprint = sumabs(U(:))
% U_fingerprint = 190.8430
and for python:
k=12
pca = PCA(n_components=k)
pca = pca.fit(X_train)
U = pca.components_
print 'U_fingerprint', np.sum(np.absolute(U))
# U_fingerprint 190.843
which means the difference must be due to the different signs of the (PCA) U vectors. I find that very surprising: I didn't think the sign would matter, so I never even considered that it could make such a big difference. I guess I was wrong?
I don't know if this is the problem, but it certainly could be. Principal component vectors are like eigenvectors: if you multiply the vector by -1, it is still a valid PCA vector. Some of the vectors computed by matlab might have a different sign than those computed in python. That will result in very different sums.
For example, the matlab documentation has this example:
coeff = pca(ingredients)
coeff =
-0.0678 -0.6460 0.5673 0.5062
-0.6785 -0.0200 -0.5440 0.4933
0.0290 0.7553 0.4036 0.5156
0.7309 -0.1085 -0.4684 0.4844
I have my own python PCA code, and with the same input as in matlab, it produces this coefficient array:
[[ 0.0678 0.646 -0.5673 0.5062]
[ 0.6785 0.02 0.544 0.4933]
[-0.029 -0.7553 -0.4036 0.5156]
[-0.7309 0.1085 0.4684 0.4844]]
So, instead of simply summing the coefficient array, try summing the absolute values of the coefficients. Alternatively, ensure that all the vectors have the same sign convention before summing. You could do that by, say, multiplying each column by the sign of the first element in that column (assuming none of them are zero).
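The sign-convention idea can be checked directly on the two coefficient arrays quoted above. The following is a small Python sketch using plain lists; the helper names (`total`, `abs_total`, `normalize_signs`) are made up for illustration, and the normalization assumes that no first-row entry is zero.

```python
import math

# Coefficient arrays copied from the MATLAB documentation example and
# from the Python output quoted in the answer.
matlab_coeff = [
    [-0.0678, -0.6460,  0.5673, 0.5062],
    [-0.6785, -0.0200, -0.5440, 0.4933],
    [ 0.0290,  0.7553,  0.4036, 0.5156],
    [ 0.7309, -0.1085, -0.4684, 0.4844],
]
python_coeff = [
    [ 0.0678,  0.6460, -0.5673, 0.5062],
    [ 0.6785,  0.0200,  0.5440, 0.4933],
    [-0.0290, -0.7553, -0.4036, 0.5156],
    [-0.7309,  0.1085,  0.4684, 0.4844],
]

def total(matrix):
    """Plain sum of all entries (depends on the sign of each column)."""
    return sum(sum(row) for row in matrix)

def abs_total(matrix):
    """Sum of absolute values (independent of column signs)."""
    return sum(sum(abs(v) for v in row) for row in matrix)

def normalize_signs(matrix):
    """Flip each column so that its first entry is positive."""
    signs = [1.0 if v > 0 else -1.0 for v in matrix[0]]
    return [[v * s for v, s in zip(row, signs)] for row in matrix]

print(total(matlab_coeff), total(python_coeff))          # plain sums differ
print(abs_total(matlab_coeff), abs_total(python_coeff))  # magnitudes agree
print(normalize_signs(matlab_coeff) == normalize_signs(python_coeff))
```

After the columns are brought to a common sign convention, the two arrays can be compared entry by entry instead of through a single summary number.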

CIFAR-10 pixelwise training with libSVM matlab

Training on the 50000 training images, with feature vectors of 32x32x3 = 3072 dimensions, makes my computer get stuck. Is there a workaround I'm missing to use libSVM efficiently for multiclass SVM classification? A day passes and the SVM is still running for only one class in a one-vs-all training framework.
*I am aware that using raw pixel values is a terrible way to do classification, yet I still want to run this as a lower-bound benchmark for a study.
Code:
clc;close all;clear all;
addpath(genpath('./libsvm-3.21'));
addpath(genpath('./liblinear-2.1'));
%load all images:
M1 = load('../Data/cifar-10-batches-mat/data_batch_1.mat');
M2 = load('../Data/cifar-10-batches-mat/data_batch_2.mat');
M3 = load('../Data/cifar-10-batches-mat/data_batch_3.mat');
M4 = load('../Data/cifar-10-batches-mat/data_batch_4.mat');
M5 = load('../Data/cifar-10-batches-mat/data_batch_5.mat');
M = [M1.data; M2.data; M3.data; M4.data; M5.data];
M_labels = [M1.labels; M2.labels; M3.labels; M4.labels; M5.labels];
M_labels_double = double(M_labels);
M_double = double(M)/255.0;
%M_double is the dataset of [50000x3072]
%M_labels_double are the labels and has size of [50000x1]
model=cell(10,1);
for i=1:10
model{i} = svmtrain(double(M_labels_double==i),M_double,'-t 0 -c 1 -g 0.2 -b 1 -m 4000');
end

matlab libsvm: unable to predict

I am using libsvm on Matlab. I want to build a model and use this model for prediction.
It is weird that the return values of svmpredict ([predict_label, accuracy_all, prob_values]) are empty. Here is my simple code:
svm_model = svmtrain([train_label],[train],'-t 2, -c 100 -q');
[predict_label, accuracy_all, prob_values] = svmpredict(testlabels,testdata,svm_model,'-q, -b 1');
[predict_label, accuracy_all, prob_values] are 0x0 matrices. MATLAB also shows the following usage message:
Usage: [predicted_label, accuracy, decision_values/prob_estimates] = svmpredict(testing_label_vector, testing_instance_matrix, model, 'libsvm_options')
Parameters:
model: SVM model structure from svmtrain.
libsvm_options:
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); one-class SVM not supported yet
Returns:
predicted_label: SVM prediction output vector.
accuracy: a vector with accuracy, mean squared error, squared correlation coefficient.
prob_estimates: If selected, probability estimate vector.
Can anyone help me?
What is q in the SVM model? Where is its value?
There are two parameters in the SVM that need to be well defined, c and g. You have set c to 100, but there is no value for q (or did you mean g, which is called gamma?).
This is what you need:
cmd = ['-t 2 -c ',num2str(C), ' -g ',num2str(gamma) ];
model = svmtrain2(trainClass, trainData, cmd);
[predClass, acc, decVals] = svmpredict(testClass, testData, model);
Also, I think that svmtrain should be renamed svmtrain2 to avoid confusion with MATLAB's own svmtrain function.
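The option string built in the snippet above is just a sequence of flag/value pairs separated by spaces, with no commas. A rough Python equivalent of that string construction is shown below; `build_libsvm_options` is a hypothetical helper, and the values 100 and 0.5 are placeholders. As far as I know, -q in libsvm is the quiet-mode switch and takes no value, so it is not a model parameter like -c or -g.

```python
def build_libsvm_options(kernel_type, cost, gamma):
    """Assemble a space-separated libsvm option string (no commas)."""
    return "-t {:d} -c {:g} -g {:g}".format(kernel_type, cost, gamma)

# -t 2 selects the RBF kernel; the cost and gamma values are placeholders.
cmd = build_libsvm_options(2, 100, 0.5)
print(cmd)
```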

libsvm cross validation with precomputed kernel in matlab

I am trying to do a 5-fold cross-validation with libsvm (MATLAB) using a precomputed kernel, but I get the following error message:
Undefined function 'ge' for input arguments of type 'struct'.
This is because libsvm returns a structure instead of a value in cross-validation. How can I solve this problem? This is my code:
load('iris.dat')
data=iris(:,1:4);
class=iris(:,5);
% normalize the data
range=repmat((max(data)-min(data)),size(data,1),1);
data=(data-repmat(min(data),size(data,1),1))./range;
% train
tr_data=[data(1:5,:);data(52:56,:);data(101:105,:)];
tr_lbl=[ones(5,1);2*ones(5,1);3*ones(5,1)];
% kernel computation
sigma=.8
rbfKernel = @(X,Y,sigma) exp((-pdist2(X,Y,'euclidean').^2)./(2*sigma^2));
Ktr=[(1:15)',rbfKernel(tr_data,tr_data,sigma)];
kts=[ (1:150)',rbfKernel(data,tr_data,sigma)];
% svmptrain
bestcv = 0;
for log2c = -1:3
cmd = ['Ktr -t 4 -v 5 -c ', num2str(2^log2c)];
cv = svmtrain2(tr_lbl,tr_data, cmd);
if (cv >= bestcv)
bestcv = cv;
bestc = 2^log2c;
end
end
cmd=['-s 0 -c ', num2str(bestc), 'Ktr -t 4']
model=svmtrain2(tr_lbl,tr_data,cmd)
% svm predict
labels=svmpredict(class,data,model,kts)
The function svmtrain2 you are using is not part of standard MATLAB, and its output is not a structure. But if you insist on using it, you can calculate a score for the data using the other existing function:
[f,K] = svmeval(X_eval,varargin)
that evaluates the trained SVM using the outputs from svmtrain2. But I would prefer to first use the standard functions built into MATLAB. In the standard MATLAB library there is:
SVMStruct = svmtrain(Training,Group)
that returns a structure, SVMStruct, containing information about the trained support vector machine (SVM) classifier. or
SVMModel = fitcsvm(X,Y)
that returns a support vector machine classifier SVMModel, trained by predictors X and class labels Y for one- or two-class classification. and then you can get some score for each prediction using:
[label,Score] = predict(SVMModel,X)
that returns class likelihood measures, i.e., either scores or posterior probabilities.
You get that error because you are trying to compare a struct and a number.
If what you want is to find the best performance on the training set (as it seems from your comparison), I don't think you can get it directly from the structure returned by svmtrain. You should first use svmpredict with the training set and the trained model, and then you can get the accuracy from the result.
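For reference, the kernel computation in the question's code (an RBF kernel matrix with a leading column of serial numbers) can be sketched as follows. This is a plain-Python illustration with made-up data and hypothetical function names; as I understand the libsvm README, with -t 4 it is this matrix, not the raw data, that is passed as the instance matrix.

```python
import math

def rbf_kernel(x, y, sigma):
    """exp(-||x - y||^2 / (2 * sigma^2)), as in the question's rbfKernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def precomputed_kernel(rows, reference_rows, sigma):
    """Kernel matrix with a leading 1-based serial-number column."""
    return [
        [float(i)] + [rbf_kernel(row, ref, sigma) for ref in reference_rows]
        for i, row in enumerate(rows, start=1)
    ]

# Made-up training points: 3 samples with 2 features each.
train = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K_train = precomputed_kernel(train, train, sigma=0.8)
for row in K_train:
    print(row)
```

For test data the same function is called with the test rows first and the training rows as the reference set, which mirrors how `kts` is built in the question.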

Test if a data distribution follows a Gaussian distribution in MATLAB

I have some data points and their mean. I need to find out whether those data points (with that mean) follow a Gaussian distribution. Is there a function in MATLAB that can do that kind of test? Or do I need to write a test of my own?
I tried looking at different statistical functions provided by MATLAB. I am very new to MATLAB so I might have overlooked the right function.
cheers
Check this documentation page on all available hypothesis tests.
From those, for your purpose you can use:
Chi-square goodness-of-fit test
Lilliefors test
z-test
t-test
Kolmogorov-Smirnov test
... among others
You can also use some visual tests like:
hist
normplot
cdfplot
I like Spiegelhalter's test (D. J. Spiegelhalter, 'Diagnostic tests of distributional shape,' Biometrika, 1983):
function pval = spiegel_test(x)
% compute pvalue under null of x normally distributed;
% x should be a vector;
xm = mean(x);
xs = std(x);
xz = (x - xm) ./ xs;
xz2 = xz.^2;
N = sum(xz2 .* log(xz2));
n = numel(x);
ts = (N - 0.73 * n) / (0.8969 * sqrt(n)); %under the null, ts ~ N(0,1)
pval = 1 - abs(erf(ts / sqrt(2))); %2-sided test.
Whenever hacking statistical tests, always test them under the null! Here's a simple example:
pvals = nan(10000,1);
for j=1:numel(pvals);
pvals(j) = spiegel_test(randn(300,1));
end
nnz(pvals < 0.05) ./ numel(pvals)
I get the results:
ans =
0.0505
Similarly
nnz(pvals > 0.95) ./ numel(pvals)
I get
ans =
0.0475
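For readers without MATLAB, here is a rough Python port of the same test and of the check under the null. It follows the MATLAB code above line by line, with one deviation that is my own assumption: terms where the standardized value is exactly zero are skipped, because z^2 * log(z^2) tends to 0 there and `math.log(0)` would raise an error. The seed and the number of trials are arbitrary.

```python
import math
import random

def spiegel_test(x):
    """Two-sided p-value under the null that x is normally distributed."""
    n = len(x)
    xm = sum(x) / n
    # Sample standard deviation (n-1 denominator), like MATLAB's std.
    xs = math.sqrt(sum((v - xm) ** 2 for v in x) / (n - 1))
    total = 0.0
    for v in x:
        z2 = ((v - xm) / xs) ** 2
        if z2 > 0.0:  # skip exact zeros: z2*log(z2) -> 0 in the limit
            total += z2 * math.log(z2)
    ts = (total - 0.73 * n) / (0.8969 * math.sqrt(n))  # ~ N(0,1) under the null
    return 1.0 - abs(math.erf(ts / math.sqrt(2.0)))

# Check the test under the null: the rejection rate at the 5% level
# should come out close to 0.05.
random.seed(0)
trials = 2000
pvals = [
    spiegel_test([random.gauss(0.0, 1.0) for _ in range(300)])
    for _ in range(trials)
]
rejection_rate = sum(p < 0.05 for p in pvals) / trials
print(rejection_rate)
```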
For testing in general, look up the Kolmogorov-Smirnov test, also in the Stats Toolbox, as kstest, and its two-sample version, kstest2. You feed it your empirical data (and, for the two-sample version, the data from a possible distribution such as the Gaussian), and it tests how likely it is that your sample was drawn from the normal distribution (or from the one you supplied for the two-sample version). The nice thing is that it works for any distribution.
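To show what the one-sample Kolmogorov-Smirnov test measures, here is a minimal Python sketch of the statistic: the largest gap between the empirical CDF and a reference CDF. It is not a reimplementation of kstest. The function names are made up, the reference distribution must be fully specified in advance, and the constant 1.36 / sqrt(n) is the commonly quoted large-sample critical value at the 5% level. If the mean and standard deviation are estimated from the same data, that critical value no longer applies and the Lilliefors test is the appropriate variant.

```python
import math
from statistics import NormalDist

def ks_statistic(sample, dist):
    """Largest gap between the empirical CDF of sample and dist's CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def looks_like(sample, dist):
    """Rough 5% decision using the asymptotic critical value 1.36 / sqrt(n)."""
    return ks_statistic(sample, dist) < 1.36 / math.sqrt(len(sample))

std_normal = NormalDist(0.0, 1.0)
# Deterministic illustration: evenly spaced standard normal quantiles,
# and the same points shifted by 3.
sample = [std_normal.inv_cdf((i + 0.5) / 100) for i in range(100)]
shifted = [v + 3.0 for v in sample]

print(ks_statistic(sample, std_normal), looks_like(sample, std_normal))
print(ks_statistic(shifted, std_normal), looks_like(shifted, std_normal))
```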