LIBSVM in MATLAB/Octave - what's the output of libsvmread?

The second output of the libsvmread command is a set of features for each given training example.
For example, in the following MATLAB command:
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
This second variable (heart_scale_inst) holds content in a form that I don't understand, for example:
<1, 1> -> 0.70833
What is the meaning of it? How is it to be used (I can't plot it, the way it is)?
PS. If anyone could please recommend a good LIBSVM tutorial, I'd appreciate it. I haven't found anything useful and the README file isn't very clear... Thanks.

The definitive LIBSVM tutorial for beginners is called A Practical Guide to SVM Classification; it is available from the site of the authors of LIBSVM.
The second output returned is called the instance matrix. It is a matrix, let's call it M, where M(1,:) holds the features of data point 1, and so on. The matrix is sparse, which is why it prints out weirdly. If you want to see it in full, print full(M).
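To make the sparse display concrete: an entry such as (1, 1) -> 0.70833 reads as "row 1, column 1 holds 0.70833", and any position that is not listed is zero. The sketch below, in plain Python rather than MATLAB, parses one line in the LIBSVM text format (label followed by index:value pairs) and expands it to a dense row. The sample line and the helper names are made up for illustration; this is not the code libsvmread uses.

```python
def parse_libsvm_line(line):
    """Split one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def to_dense(features, num_features):
    """Fill in the zeros that the sparse form leaves out (indices are 1-based)."""
    return [features.get(i, 0.0) for i in range(1, num_features + 1)]


# Hypothetical sample line: features 2 and 5 are absent, so they are zero.
line = "+1 1:0.70833 3:1 4:-0.32"
label, sparse_row = parse_libsvm_line(line)
dense_row = to_dense(sparse_row, 5)

print(label)       # the class label
print(sparse_row)  # only the non-zero entries, keyed by column
print(dense_row)   # the same row with zeros filled in
```

The dense row is the Python counterpart of one row of full(M), which is the form you would plot.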
[heart_scale_label, heart_scale_inst] = libsvmread('../heart_scale');
With heart_scale_label and heart_scale_inst you should be able to train an SVM by issuing:
mod = svmtrain(heart_scale_label,heart_scale_inst,'-c 1 -t 0');
I strongly suggest you read the guide mentioned above to learn how to set the c parameter (and, in the case of an RBF kernel, the gamma parameter), but the line above is how you would train with that data.

I think it is the probability with which the test case has been predicted to belong to the heart_scale_label category.

Related

No Model Summary For GLMs in Pyspark / SparkML

I'm familiarizing myself with Pyspark and SparkML at the moment. To do so I use the titanic dataset to train a GLM for predicting the 'Fare' in that dataset.
I'm closely following the Spark documentation. I do get a working model (which I call glm_fare), but when I try to assess the trained model using summary I get the following error message:
RuntimeError: No training summary available for this GeneralizedLinearRegressionModel
Why is this?
The code for training was as such:
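from pyspark.ml.regression import GeneralizedLinearRegression

# Gamma GLM with a log link, predicting Fare from the assembled feature vector
glm_fare = GeneralizedLinearRegression(
    labelCol="Fare",
    featuresCol="features",
    predictionCol="prediction",
    family="gamma",
    link="log",
    weightCol="wght",
    maxIter=20,
)
glm_fit = glm_fare.fit(training_df)
glm_fit.summary  # this is the line that raises the RuntimeError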
Just in case someone comes across this question, I ran into this problem as well and it seems that this error occurs when the Hessian matrix is not invertible. This matrix is used in the maximization of the likelihood for estimating the coefficients.
The matrix is not invertible if one of its eigenvalues is 0, which occurs when there is perfect multicollinearity in your variables. This means that one of the variables can be written exactly as a linear combination of the other variables. Consequently, the individual effects of those variables cannot be identified.
A possible solution would be to find the variables that are (multi)collinear and remove one of them from the regression. Note however that multicollinearity is only a problem if you want to interpret the coefficients and not when the model is used for prediction.
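A small numeric illustration of the point above, as a sketch rather than what Spark does internally: for a GLM the Hessian has the form X^T W X with positive weights W, so it is singular exactly when the columns of X are linearly dependent. The example takes W as the identity for simplicity and uses made-up data in which the third column equals the first plus twice the second.

```python
from fractions import Fraction


def gram(X):
    """Return X^T X for a design matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def determinant(M):
    """Determinant by Gaussian elimination over exact fractions."""
    A = [[Fraction(v) for v in row] for row in M]
    n, det = len(A), Fraction(1)
    for k in range(n):
        pivot = next((r for r in range(k, n) if A[r][k] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot in this column: matrix is singular
        if pivot != k:
            A[k], A[pivot] = A[pivot], A[k]
            det = -det
        det *= A[k][k]
        for r in range(k + 1, n):
            factor = A[r][k] / A[k][k]
            A[r] = [x - factor * y for x, y in zip(A[r], A[k])]
    return det


# Hypothetical data: third column = first + 2 * second (perfectly collinear).
X_collinear = [[1, 2, 5], [2, 1, 4], [3, 5, 13], [4, 3, 10]]
# Same first two columns, but an unrelated third column.
X_independent = [[1, 2, 0], [2, 1, 1], [3, 5, 0], [4, 3, 2]]

det_collinear = determinant(gram(X_collinear))
det_independent = determinant(gram(X_independent))
print(det_collinear, det_independent)
```

A zero determinant for the collinear design means X^T X cannot be inverted, which is the situation in which no standard errors, and hence no summary, can be computed.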
The GeneralizedLinearRegressionModel docs note that a model may have no summary available.
However, you can do an initial check to avoid the error:
glm_fit.hasSummary, which is a public boolean property in PySpark (no parentheses).
Use it as:
if glm_fit.hasSummary:
    print(glm_fit.summary)
Here is a direct link to the PySpark source code
and to the GeneralizedLinearRegressionTrainingSummary class source code, where the error is thrown.
Make sure the input values for your one-hot encoder start from 0.
One mistake that caused my summary not to be created: I fed quarter (1, 2, 3, 4) directly to the one-hot encoder and got a vector of length 4 in which one column was always 0. I converted quarter to 0, 1, 2, 3 and the problem was solved.
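The effect described above can be imitated in a few lines. This is a simplified stand-in, not Spark's OneHotEncoder; it assumes the encoder reads its input as 0-based category indices and drops the last category, which is my understanding of Spark's default behaviour. The function name is hypothetical.

```python
def one_hot_drop_last(index, num_categories):
    """Encode a 0-based category index, leaving out the last category."""
    vec = [0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1
    return vec


quarters = [1, 2, 3, 4, 1, 2, 3, 4]

# Raw quarters 1..4 read as indices imply categories 0..4, i.e. 5 categories,
# so each vector has length 4 and index 0 never occurs.
raw = [one_hot_drop_last(q, max(quarters) + 1) for q in quarters]

# Shifted quarters 0..3 imply 4 categories, so each vector has length 3.
shifted = [one_hot_drop_last(q - 1, 4) for q in quarters]

raw_column_sums = [sum(col) for col in zip(*raw)]
shifted_column_sums = [sum(col) for col in zip(*shifted)]
print(raw_column_sums)      # first column is never set
print(shifted_column_sums)  # every column is used
```

An all-zero feature column makes the design matrix rank deficient, which ties this answer back to the non-invertible Hessian explanation.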

how to transform output SVM into probability using 'Platt scaling' method?

I am working on handwriting recognition from images, using support vector machines as the classifier. The matrix score below shows the scores returned by the SVM for 5 samples; the number of classes is also 5. I want to transform this matrix into probabilities.
score = [ 0.2590 -0.6033 -1.1350 -1.2347 -0.9776
         -1.4727 -0.2136 -0.9649  0.1480 -1.4761
         -0.9637 -0.8662  0.0674 -1.0051 -1.1293
         -2.1230 -0.8805 -0.9808 -0.0520 -0.0836
         -1.6976 -1.1578 -0.9205 -1.1101  1.0796];
From my research on existing methods, Platt scaling seems the most appropriate in my case. I found an implementation of this method at this link (Platt scaling), but I don't understand the third parameter it expects. Please help me understand this implementation and get it running.
Thank you in advance for your answers.
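For orientation, here is a minimal sketch of what Platt scaling does for one class against the rest: it fits two numbers A and B so that P(y = 1 | f) = 1 / (1 + exp(A*f + B)), using known true labels for the scored samples. This is not the linked implementation, and it says nothing about that code's third parameter. Plain gradient descent is swapped in for the more careful optimizer used in published versions, and the scores and labels below are made-up toy data.

```python
import math


def fit_platt(scores, labels, lr=0.05, iters=5000):
    """Fit A, B in P(y=1|f) = 1 / (1 + exp(A*f + B)) by gradient descent."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    # Platt's smoothed targets, used instead of hard 0/1 labels.
    t_pos = (n_pos + 1.0) / (n_pos + 2.0)
    t_neg = 1.0 / (n_neg + 2.0)
    targets = [t_pos if y else t_neg for y in labels]
    a = 0.0
    b = math.log((n_neg + 1.0) / (n_pos + 1.0))
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for f, t in zip(scores, targets):
            p = 1.0 / (1.0 + math.exp(a * f + b))
            # Gradient of the cross-entropy loss with respect to A*f + B is (t - p).
            grad_a += (t - p) * f
            grad_b += (t - p)
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


def platt_probability(score, a, b):
    """Map a raw SVM score to a probability with fitted A and B."""
    return 1.0 / (1.0 + math.exp(a * score + b))


# Hypothetical one-vs-rest data: raw scores and whether each sample is in the class.
toy_scores = [-2.1, -1.5, -0.9, -0.2, 0.3, 1.1]
toy_labels = [0, 0, 0, 1, 0, 1]

a_hat, b_hat = fit_platt(toy_scores, toy_labels)
p_low = platt_probability(-2.1, a_hat, b_hat)
p_high = platt_probability(1.1, a_hat, b_hat)
print(a_hat, b_hat, p_low, p_high)
```

For a 5-class score matrix, one option is to fit a separate (A, B) pair per column in this one-vs-rest fashion and then normalize each row so the five probabilities sum to 1. Either way, the true labels of the samples are needed; the score matrix alone is not enough.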

Matlab libsvm svmpredict accuracy verbose

I have a question about an annoying behavior. I'm using libsvm with MATLAB and I'm able to predict using:
predicted_label = svmpredict(Ylabel, Xlabel, model);
but every time I make a prediction, this appears:
Accuracy = X% (y/n) (classification)
I find this annoying because I repeat this procedure many times, and it also slows things down because it prints to the screen.
I think what I want is to stop svmpredict from being verbose.
Can anyone help me with this? Thanks in advance.
-Jessica
I found that a much better approach than editing the source code of the C library is to use MATLAB's evalc, which returns any printed output as its first output argument.
[~, predicted_label] = evalc('svmpredict(Ylabel, Xlabel, model)');
Because the string to be evaluated is fixed, there should be no performance decrease.
Use the '-q' (quiet mode) option:
predicted_label = svmpredict(Ylabel, Xlabel, model, '-q');
From the manual:
Usage: [predicted_label, accuracy, decision_values/prob_estimates] = svmpredict(testing_label_vector, testing_instance_matrix, model, 'libsvm_options')
[predicted_label] = svmpredict(testing_label_vector, testing_instance_matrix, model, 'libsvm_options')
Parameters:
model: SVM model structure from svmtrain.
libsvm_options:
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); one-class SVM not supported yet
-q : quiet mode (no outputs)
If you are using MATLAB code, just find the line that displays this information (usually via 'disp', 'sprintf', or 'fprintf') and comment it out using the comment character %.
example:
disp(['Accuracy= ' num2str(x)]);
change it to:
% disp(['Accuracy= ' num2str(x)]);
If you are using the main libsvm library, you need to modify it before building.
1- Open the file 'svmpredict.c'
2- Find this line of code:
info("Accuracy = %g%% (%d/%d) (classification)\n",
(double)correct/total*100,correct,total);
3- Comment out both lines of the statement using the // comment marker
4- Save and close the file
5- Rebuild the project

Problems in HMM toolbox

I've recently been training an HMM with the HMM toolbox, but I've run into some problems that I can't resolve.
I train my HMM as shown below. There are no problems here.
[LL, prior1, transmatrix1, observematrix1] = dhmm_em(data, prior0, transmatrix0, observematrix0);
I use the Viterbi algorithm to find the most-probable path through the HMM state trellis.
function path = viterbi_path(prior, transmat, obslik);
Now there's a problem. I don't know what "obslik" means. Is it the observematrix1?
I want to get the probability of a sequence, but I don't know whether I should use the "fwdback" function or not. If I should, what does "obslik" mean there?
function [alpha, beta, gamma, loglik, xi_summed, gamma2] = fwdback(init_state_distrib, transmat, obslik, varargin);
Thanks!!!
I didn't understand the comments at first. Now I understand them.
The "obslik" here isn't equal to observematrix1. Before using the viterbi_path function, we should compute obslik:
obslik = multinomial_prob(data(m,:), observematrix1);
the data matrix is the observematrix0, observe-matrix before training.
Am I right?
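To show how the pieces relate, here is a plain-Python sketch, not the toolbox code. It assumes obslik(i,t) means the probability of the observation at time t given state i, and that the sequence passed in is a list of observed symbol indices; the post does not show its data, so both are assumptions. Symbols are 0-based here, whereas MATLAB indexing is 1-based, and the two-state model is made up.

```python
def observation_likelihood(obs_seq, obsmat):
    """obslik[i][t] = P(symbol observed at time t | state i)."""
    return [[row[o] for o in obs_seq] for row in obsmat]


def viterbi_path(prior, transmat, obslik):
    """Most probable state sequence given the per-time observation likelihoods."""
    n_states, T = len(prior), len(obslik[0])
    delta = [prior[i] * obslik[i][0] for i in range(n_states)]
    back = []
    for t in range(1, T):
        prev, step, delta = delta, [], []
        for j in range(n_states):
            best = max(range(n_states), key=lambda i: prev[i] * transmat[i][j])
            step.append(best)
            delta.append(prev[best] * transmat[best][j] * obslik[j][t])
        back.append(step)
    state = max(range(n_states), key=lambda i: delta[i])
    path = [state]
    for step in reversed(back):
        state = step[state]
        path.append(state)
    return path[::-1]


def sequence_likelihood(prior, transmat, obslik):
    """Forward pass: P(observations), summed over all state paths."""
    n_states, T = len(prior), len(obslik[0])
    alpha = [prior[i] * obslik[i][0] for i in range(n_states)]
    for t in range(1, T):
        alpha = [
            sum(alpha[i] * transmat[i][j] for i in range(n_states)) * obslik[j][t]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical trained model with 2 states and 2 symbols.
prior = [0.5, 0.5]
transmat = [[0.9, 0.1], [0.1, 0.9]]
obsmat = [[0.8, 0.2], [0.2, 0.8]]
observed = [0, 0, 1, 1]

obslik = observation_likelihood(observed, obsmat)
path = viterbi_path(prior, transmat, obslik)
likelihood = sequence_likelihood(prior, transmat, obslik)
print(path, likelihood)
```

In this sketch obslik is built from the trained observation matrix and one observed sequence, so it has one row per state and one column per time step. The forward pass gives the probability of the sequence, which is the quantity the question hopes to get from fwdback.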

Matlab Weka Interface AdaBoost Issues: Out of Bounds Exception

I'm doing some cross-validation using a Matlab Weka Interface that I got from file exchange. My loop structure seems to work fine for Weka's Logistic classifier. However, when I try to do the exact same thing for AdaBoostM1, it throws the following error:
??? Java exception occurred: java.lang.ArrayIndexOutOfBoundsException
Error in ==> wekaClassify at 24
classProbs(t+1,:) = (classifier.distributionForInstance(testData.instance(t)))';
Error in ==> classifier_search at 225
[pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I have determined through some testing that this only occurs when the number of instances in the training set is greater than the number of instances in the test set. I am sure you can see why that is a problem for me, since in most situations the training set is greater than the test set in size.
Is there something different about how I should format my inputs when using Adaboost rather than Logistic? Any information you can give regarding this problem would be so helpful.
I downloaded this code from this page: http://www.mathworks.com/matlabcentral/fileexchange/21204-matlab-weka-interface
Emails bounce from the account of the guy who made it, and he doesn't seem to respond to comments on the page - I'm hoping that maybe someone here has used this.
EDIT: Here is the code that I use to train and test the classifier:
classifier = trainWekaClassifier(matlab2weka('training', featurelabels, train), 'meta.AdaBoostM1', { strcat('-P 100 -S 1 -I ', num2str(r), '-W weka.classifiers.trees.DecisionStump')});
[pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I haven't used this combination of software, so I can only take a guess at what could cause this.
Are your training/testing data matrices the right way round? They should be N-by-D (N instances, D features).
If you were passing in a D-by-N training matrix and a D-by-M testing matrix, then I would expect it to work only when M < N - which is what you describe - and even then, it wouldn't give a meaningful result.