I want to use one-class classification using LibSVM in MATLAB.
I want to train the data and use cross validation, but I don't know how to label the outliers.
If for example I have this data:
trainData = [1,1,1; 1,1,2; 1,1,1.5; 1,1.5,1; 20,2,3; 2,20,2; 2,20,5; 20,2,2];
labelTrainData = [-1 -1 -1 -1 0 0 0 0];
(The first four are examples of the 1 class, the other four are examples of outliers, just for the cross validation)
And I train the model using this:
model = svmtrain(labelTrainData, trainData , '-s 2 -t 0 -d 3 -g 2.0 -r 2.0 -n 0.5 -m 40.0 -c 0.0 -e 0.0010 -p 0.1 -v 2' );
I'm not sure which value to use to label the one-class data and which to use for the outliers. Does someone know how to do this?
Thanks in advance.
-Jessica
According to http://www.joint-research.org/wp-content/uploads/2011/07/lukashevich2009Using-One-class-SVM-Outliers-Detection.pdf: "Due to the lack of class labels in the one-class SVM, it is not possible to optimize the kernel parameters using cross-validation".
However, according to the LIBSVM FAQ that is not quite correct:
Q: How do I choose parameters for one-class SVM as training data are in only one class?
You have pre-specified true positive rate in mind and then search for parameters which achieve similar cross-validation accuracy.
Furthermore the README for the libsvm source says of the input data:
"For classification, label is an integer indicating the class label ... For one-class SVM, it's not used so can be any number."
I think your outliers should not be included in the training data; libsvm will ignore the training labels anyway. What you are trying to do is find a hypersphere that contains the good data but not the outliers. If you train with outliers in the data, LIBSVM will try to find a hypersphere that includes the outliers, which is exactly what you don't want. So you will need a training dataset without outliers, a validation dataset with outliers for choosing parameters, and a final test dataset to see whether your model generalizes.
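Concretely, the FAQ's advice amounts to grid-searching the parameters (e.g. nu and gamma) and keeping the setting whose detection rate on the validation set is closest to your pre-specified target. The selection logic is language-independent; here is a minimal sketch in pure Python (the function name is mine, and +1/-1 are the labels libsvm's one-class predictor emits):

```python
def detection_rates(predicted, is_inlier):
    """Compute true-positive and false-positive rates for a one-class model.

    predicted : list of +1/-1 labels from the model (+1 = predicted inlier)
    is_inlier : list of booleans, True for known-good validation points
    """
    tp = sum(1 for p, t in zip(predicted, is_inlier) if t and p == 1)
    fp = sum(1 for p, t in zip(predicted, is_inlier) if not t and p == 1)
    n_inliers = sum(is_inlier)
    n_outliers = len(is_inlier) - n_inliers
    tpr = tp / n_inliers if n_inliers else 0.0
    fpr = fp / n_outliers if n_outliers else 0.0
    return tpr, fpr
```

You would call this once per parameter setting on the validation set (which is where the labeled outliers belong) and pick the setting whose true-positive rate best matches your target.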
Related
I'm trying to fit a model for binary classification and predict the probability of values belonging to each class.
My first problem is that I can't interpret the results. I have a training set in which labels=0 and labels=1 (not -1 and +1).
I run the model:
vw train.vw -f model.vw --link=logistic
Next:
vw test.vw -t -i model.vw -p pred.txt
Then I have a file pred.txt with these values:
0.5
0.5111
0.5002
0.5093
0.5
I don't understand what 0.5 means. All the values in pred.txt are around 0.5. I wrote a script that subtracts 0.5 from the results, and I get these lines:
0
0.111
0.002
0.093
0
Is that my desired probability?
And here is my second problem: my target classes are unbalanced. I have 95% negative (0) and 5% positive (1) examples. How can I tell VW to compensate for the class imbalance, e.g. with weights like {class 0: 0.1, class 1: 0.9}?
Or should this be done when preparing the dataset?
For binary classification in VW, the labels need to be converted (from 0 and 1) to -1 and +1, e.g. with sed -e 's/^0/-1/'.
In addition to --link=logistic you need to use also --loss_function=logistic if you want to interpret the predictions as probabilities.
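For intuition about why the predictions cluster around 0.5: with the logistic link, VW maps its raw score s to a probability via the standard sigmoid p = 1/(1+e^{-s}), so predictions near 0.5 mean raw scores near zero. A minimal sketch of that mapping in pure Python (the function name is mine):

```python
import math

def logistic_link(raw_score):
    """Map a raw linear score to a probability via the logistic (sigmoid) link."""
    return 1.0 / (1.0 + math.exp(-raw_score))

# A raw score of 0 maps to exactly 0.5, which is why predictions hovering
# around 0.5 indicate the model's raw scores are all close to zero.
```

Subtracting 0.5, as in the question, does not give a probability; the sigmoid output itself is the probability once both --link=logistic and --loss_function=logistic are used.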
For unbalanced classes, you need to use importance weighting and tune the importance weight constant on heldout set (or cross-validation) with some external evaluation metric of your choice (e.g. AUC or F1).
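In VW's input format the per-example importance weight goes right after the label. For a 95/5 split, a starting point is to weight the rare positive class by roughly the class ratio (about 19 here) and then tune that constant on the heldout set. The feature names below are made up for illustration:

```
1 19 | height:0.4 width:1.2
-1 1 | height:0.9 width:0.3
```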
See also:
Calculating AUC when using Vowpal Wabbit
Vowpal Wabbit Logistic Regression
How to perform logistic regression using vowpal wabbit on very imbalanced dataset
I would like to draw learning curves for a given SVM classifier. Thus, in order to do this, I would like to compute the training, cross-validation and test error, and then plot them while varying some parameter (e.g., number of instances m).
How to compute training, cross-validation and test error on libsvm when used with MATLAB?
I have seen other answers (see example) that suggest solutions for other languages.
Isn't there a compact way of doing it?
Given a set of instances described by:
a set of features featureVector;
their corresponding labels (e.g., either 0 or 1),
if a model was previously inferred via libsvm, the MSE can be computed as follows:
[predictedLabels, accuracy, ~] = svmpredict(labels, featureVectors, model,'-q');
MSE = accuracy(2);
Notice that predictedLabels contains the labels that were predicted by the classifier for the given instances.
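For clarity on what accuracy(2) holds: it is the mean squared error between the predicted and true labels. The same quantity in pure Python (the function name is mine):

```python
def mse(predicted, actual):
    """Mean squared error between predictions and true values,
    the quantity libsvm's svmpredict returns in accuracy(2)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
```

To draw the learning curve, compute this on the training, cross-validation, and test sets while varying the number of training instances m, and plot the three curves against m.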
I copied the code from Multi-class classification in libsvm to get the probability estimates for each class.
However, I have an error in that my probability estimates p only has one column when there should be two columns. I checked my model and it states there are two classes (model.nr_class = 2) and (model.Label = [0;1]).
Can someone explain?
My probability estimates range from -0.35 to 1.2057 so they are not between 0 and 1.
Giving the option -b 0 and -b 1 returns the same result.
I guess you are missing the -b option in training. Check the libsvm documentation.
Since you have 2 classes, if the probability of class A is p, the probability of the other class is 1 - p, which you can easily calculate yourself.
I'm using LIBSVM for matlab. When I use a regression SVM the probability estimates it outputs are an empty matrix, whereas this feature works fine when using classification. Is this a normal behavior, because in the LIBSVM readme it says:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates,
0 or 1 (default 0)
[~,~,P] = svmpredict(y, x, model, '-b 1');
The output P is the probability that y belongs to class 1 and class -1 respectively (an m*2 array), and it only makes sense for classification problems.
For regression problems, the pairwise probability information is included in your trained model, in model.ProbA.
I am new to Matlab. Is there any sample code for classifying some data (with 41 features) with a SVM and then visualize the result? I want to classify a data set (which has five classes) using the SVM method.
I read the "A Practical Guide to Support Vector Classification" article and I saw some examples. My dataset is kdd99. I wrote the following code:
%% Load Data
[data,colNames] = xlsread('TarainingDataset.xls');
groups = ismember(colNames(:,42),'normal.');
TrainInputs = data;
TrainTargets = groups;
%% Design SVM
C = 100;
svmstruct = svmtrain(TrainInputs,TrainTargets,...
'boxconstraint',C,...
'kernel_function','rbf',...
'rbf_sigma',0.5,...
'showplot','false');
%% Test SVM
[dataTset,colNamesTest] = xlsread('TestDataset.xls');
TestInputs = dataTset;
groups = ismember(colNamesTest(:,42),'normal.');
TestOutputs = svmclassify(svmstruct,TestInputs,'showplot','false');
but I don't know how to get the accuracy or MSE of my classification. Also, when I set showplot to true in svmclassify, I get this warning:
The display option can only plot 2D training data
Could anyone please help me?
I recommend using another SVM toolbox, libsvm. The link is as follows:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
After adding it to the MATLAB path, you can train and use your model like this:
model=svmtrain(train_label,train_feature,'-c 1 -g 0.07 -h 0');
% the parameters can be modified
[label, accuracy, probability]=svmpredict(test_label,test_feature,model);
train_label must be a vector; if there are more than two distinct label values (not just 0/1), libsvm will automatically train a multi-class SVM.
train_feature is an n*L matrix for n samples. You'd better preprocess (e.g. scale) the features before using them, and the test data must be preprocessed in the same way.
The accuracy you want will be shown when the test is finished, but it is only for the whole dataset.
If you need the accuracy for positive and negative samples separately, you have to calculate it yourself from the predicted labels.
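That per-class calculation is just a matter of grouping the predictions by true label. A language-independent sketch in pure Python (the function name is mine; the same loop translates directly to MATLAB):

```python
def per_class_accuracy(predicted, actual):
    """Accuracy computed separately for each distinct true label value."""
    acc = {}
    for cls in set(actual):
        # Indices of the examples whose true label is this class
        idx = [i for i, a in enumerate(actual) if a == cls]
        correct = sum(1 for i in idx if predicted[i] == cls)
        acc[cls] = correct / len(idx)
    return acc
```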
Hope this will help you!
Your feature space has 41 dimensions, and plotting more than 3 dimensions is impossible.
A good way to better understand your data and the way the SVM works is to begin with a linear SVM. This type of SVM is interpretable, which means that each of your 41 features has a weight (or 'importance') associated with it after training. You can then use plot3() with your data on 3 of the 'best' features from the linear SVM. Note how well your data is separated with those features and choose a basis function and other parameters accordingly.
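Picking the 'best' features from a trained linear SVM usually means ranking them by the absolute value of their weights. A minimal sketch in pure Python (the function name is mine; with libsvm in MATLAB the weight vector for a linear kernel can be recovered as model.SVs' * model.sv_coef):

```python
def top_features(weights, k=3):
    """Return the indices of the k features with the largest absolute weight."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return ranked[:k]
```

Feed the three returned feature indices to plot3() to visualize the separation.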