Matlab train svm regressor using LIBSVM with a sparse matrix of text - matlab

I am conducting a bag-of-words analysis in which I have an 80x60452 sparse matrix. Rows 1:80 correspond to the days on which text was collected, so each row represents the collection of tweets for a particular day. The experiment is part of a univariate time series prediction, and I am not sure how to implement it.
I am using the Text Matrix Generator (TMG) to construct this sparse matrix, which will be used against a response column vector of price observations (80x1).
I will use the first 70 observations to train the SVM regressor and then predict future prices for a test set made up of the remaining observations.
For example:
T = (length(priceData)-10);
trainBoW = BoW(1:T,1:end);
testBoW = BoW(T+1:end,1:end);
trainPrice = pData(1:T);
testPrice = pData(T+1:end);
model = svmtrain(trainBoW, trainPrice,'-c 1 -g 0.07 -b 1');
Running this code throws the error:
Error using svmtrain (line 274)
SVMTRAIN only supports classification into two groups. GROUP contains 62 groups.
How can I train an SVM regression model using libsvm to predict future price instances?
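A minimal sketch of one likely fix, under stated assumptions. The error text above is produced by the Statistics Toolbox function svmtrain, which suggests that the toolbox function is shadowing LIBSVM's MEX function of the same name. LIBSVM's MATLAB interface expects the labels first and the instance matrix second, and selects epsilon-SVR with -s 3. The data below are synthetic stand-ins (fewer columns than the real 80x60452 matrix), and the -p value is an arbitrary placeholder, not a recommendation.

```matlab
% Synthetic stand-ins for the question's data (assumptions, not real tweets or prices)
BoW   = sprand(80, 500, 0.05);         % sparse bag-of-words matrix, one row per day
pData = 100 + cumsum(randn(80, 1));    % price series as an 80x1 double column vector

% Hold out the last 10 days for testing
T = numel(pData) - 10;
trainBoW   = BoW(1:T, :);
testBoW    = BoW(T+1:end, :);
trainPrice = pData(1:T);
testPrice  = pData(T+1:end);

% Check that svmtrain resolves to a MEX file (LIBSVM) and not a toolbox .m file
[~, ~, ext] = fileparts(which('svmtrain'));
haveLibsvm  = strcmp(ext, ['.' mexext]);

if haveLibsvm
    % LIBSVM argument order: labels, instances, options.
    % -s 3 = epsilon-SVR, -t 2 = RBF kernel, -p = epsilon of the loss (placeholder value)
    model = svmtrain(trainPrice, trainBoW, '-s 3 -t 2 -c 1 -g 0.07 -p 0.1');
    % For regression, the second output holds the mean squared error and the
    % squared correlation coefficient (its first entry is only used for classification)
    [predPrice, evalStats] = svmpredict(testPrice, testBoW, model);
else
    disp('LIBSVM svmtrain MEX file is not first on the path; add the libsvm matlab folder with addpath.');
end
```

The -b 1 option from the original call is left out here because it requests probability estimates, which are not needed for a plain price prediction.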

Related

SVM Classifications on set of images of digits in Matlab

I have to use an SVM classifier on a digits dataset. The dataset consists of 28x28 images of digits, 2000 images in total.
I tried to use svmtrain, but MATLAB gave an error saying that svmtrain has been removed, so now I am using fitcsvm.
My code is as below:
labelData = zeros(2000,1);
for i=1:1000
    labelData(i,1)=1;
end
for j=1001:2000
    labelData(j,1)=1;
end
SVMStruct = fitcsvm(trainingData,labelData)
% where trainingData is the set of images of digits.
I need to know how I can predict the outputs for test data using the SVM. Also, is my code correct?
The function that you are looking for is predict. It takes the SVM-object as input followed by a data-matrix and returns the predicted labels.
Make sure that you do not train your model on all data but on a reasonable subset (usually 70%). You can use the cross-validation preparation:
% create cross-validation object
cvp = cvpartition(Lbl,'HoldOut',0.3);
% extract logical vectors for training and testing data
lgTrn = cvp.training;
lgTst = cvp.test;
% train SVM
mdl = fitcsvm(Dat(lgTrn,:),Lbl(lgTrn));
% test / predict SVM
Lbl_prd = predict(mdl,Dat(lgTst,:));
Note that your labeling produces a vector containing only ones, because both loops assign the value 1, so the classifier never sees a second class.
The reason MathWorks replaced svmtrain with fitcsvm is clarity. The name now shows whether the model is for "classification" (fitcsvm) or "regression" (fitrsvm).
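The labeling problem noted in the answer can be sketched as follows. This assumes the first 1000 images belong to one class and the last 1000 to another; the class values 1 and 2 are placeholders, not taken from the question.

```matlab
% Assumption: images 1..1000 are one class, images 1001..2000 another.
% The values 1 and 2 are placeholders; use the actual digit labels instead.
labelData = [ones(1000, 1); 2 * ones(1000, 1)];

% Sanity check before training: both classes should be present
classes = unique(labelData);
counts  = [sum(labelData == 1); sum(labelData == 2)];

% After predicting on held-out data, accuracy is the share of matching labels, e.g.
%   accuracy = mean(Lbl_prd == Lbl(lgTst));
```

Building the vector by concatenation avoids the two loops and makes the class boundary easy to see.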

Does transposing the training set affect the results with SVM

I am working on human age classification, where I have to classify data into two classes, namely Young and Old. I am using an SVM as the classifier, and this is what I have done so far to prepare the data:
The TrainingSet has size (11264, 284), where each column corresponds to an observation (a person). This means that I have 284 persons for the training task and 11264 features.
The TestSet is formatted in the same way as the TrainingSet.
The Groups (labels) is a matrix Groups(284, 1) filled with 1 for Old and -1 for Young.
I trained the SVM using the MATLAB built-in function to obtain SvmStruct.
SvmStruct = svmtrain(TrainingSet, Groups')
Then I pass the TestSet to this MATLAB function to get the classification results.
SvmClassify = svmclassify(SvmStruct, TestSet)
After reviewing the MATLAB help on SVM, I deduced that the data have to be passed to the SVM classifier so that each row of the TrainingSet corresponds to an observation (a person in my case) and each column corresponds to a feature. So what I was doing so far was passing those matrices (TrainingSet and TestSet) transposed.
Was what I did wrong, and are all the results I got wrong?
I looked at the source code for svmtrain, and it transposes the training data if the number of groups does not match the number of rows (svmtrain.m, line 249 ff, MATLAB 2015b):
% make sure data is the right size
if size(training,1) ~= size(groupIndex,1)
    if size(training,2) == size(groupIndex,1)
        training = training';
    else
        error(message('stats:svmtrain:DataGroupSizeMismatch'))
    end
end
So no, your training results are not wrong.
However, svmclassify does not transpose the test data; it only checks for the right number of features (svmclassify.m, line 63 ff.):
if size(sample,2)~=size(svmStruct.SupportVectors,2)
    error(message('stats:svmclassify:TestSizeMismatch'));
end
So this should have triggered an error (sample is your TestSet).
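To avoid relying on the automatic transpose, the matrices can be put into the documented layout (one row per observation) before they are passed on. A minimal sketch with random stand-in data; the number of test persons (50) and the even split of labels are assumptions.

```matlab
% Stand-in data in the question's layout: features x observations
TrainingSet = rand(11264, 284);
TestSet     = rand(11264, 50);              % 50 test persons is an assumed number
Groups      = [ones(142, 1); -ones(142, 1)];  % 1 = Old, -1 = Young (assumed even split)

% Transpose so that each row is an observation and each column a feature
Xtrain = TrainingSet';    % 284 x 11264
Xtest  = TestSet';        % 50 x 11264

% Same checks that the toolbox functions perform, made explicit
rowsMatchLabels  = size(Xtrain, 1) == numel(Groups);
featureCountSame = size(Xtest, 2) == size(Xtrain, 2);
```

With both checks true, the training and test matrices can be passed in the same orientation to whichever training and prediction functions are used.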

Matlab predict function not working

I am trying to train a linear SVM on data that has 100 dimensions. I have 80 instances for training. I train the SVM using the fitcsvm function in MATLAB and check the result by calling predict on the training data. When I classify the training data with the SVM, all the data points are assigned to a single class.
SVM = fitcsvm(votes,b,'ClassNames',unique(b)');
predict(SVM,votes);
This gives all 0's as output, which corresponds to class 0. b contains 1's and 0's indicating the class to which each data point belongs.
The data used, i.e. the matrix votes and the vector b, are given at the following link
Make sure you use a non-linear kernel, such as a Gaussian kernel, and that the kernel parameters are tuned. Just as a starting point:
SVM = fitcsvm(votes,b,'KernelFunction','RBF', 'KernelScale','auto');
bp = predict(SVM,votes);
That said, you should split your data into a training set and a testing set; otherwise you cannot detect overfitting.

Grouping variable must be a vector Error in KNN Classifier

I am working on a KNN classifier using MATLAB's function:
knnclassify(gp,trainingClass, gpTest),
where gp is a <849x36 double> matrix and gpTest is the matrix to test, but it raises the following error:
Error using grp2idx (line 39) Grouping variable must be a vector or a character array.
Error in knnclassify (line 81) [gindex,groups] = grp2idx(group);
Error in test (line 1) knnclassify(gp,trainingClass, gpTest);
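The error is fairly clear: the third argument (here gpTest) is treated as the grouping variable, so it should be a vector with one group label for each training sample, i.e. the same length as the number of rows of the second argument (here trainingClass). The labels can be either numeric or a character array. Given the variable names, the intended call was probably knnclassify(gpTest, gp, trainingClass).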
To clarify this, knnclassify (in its simplest form) is defined as
CLASS = knnclassify(SAMPLE,TRAINING,GROUP)
Here SAMPLE contains the n points that you want to classify based on the m training samples in TRAINING, each of which is defined as belonging to a class given in GROUP. The classifier then predicts the class of each of the n samples in SAMPLE based on the k nearest neighbours in TRAINING. SAMPLE and TRAINING should contain the same number of columns. By default, k is 1, so each point is classified according to its nearest training sample using the Euclidean distance.
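The roles of the three arguments can be illustrated with a hand-written 1-nearest-neighbour loop. This is a sketch of the default behaviour described above (k = 1, Euclidean distance) on made-up toy data, not the implementation of knnclassify itself.

```matlab
% Made-up toy data: 4 training points in 2-D, two classes
TRAINING = [0 0; 0 1; 5 5; 6 5];
GROUP    = [1; 1; 2; 2];           % one label per row of TRAINING
SAMPLE   = [0.2 0.1; 5.2 4.9];     % points to classify, same number of columns

CLASS = zeros(size(SAMPLE, 1), 1);
for i = 1:size(SAMPLE, 1)
    % Squared Euclidean distance from sample i to every training row
    diffs = TRAINING - repmat(SAMPLE(i, :), size(TRAINING, 1), 1);
    d2    = sum(diffs .^ 2, 2);
    % The label of the closest training row becomes the predicted class
    [~, nearest] = min(d2);
    CLASS(i) = GROUP(nearest);
end
```

Note that GROUP has as many entries as TRAINING has rows, which is exactly the condition that the error message complains about.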

Implementing Logistic Regression in MATLAB

I have a data set of 13 attributes where some are categorical and some are continuous (can be converted to categorical). I need to use logistic regression to create a model that predicts the responses of a row and find the prediction's accuracy, sensitivity, and specificity.
Can/Should I use cross validation to divide my data set and get the results?
Is there any code sample on how to go about doing this? (I'm new to all of this)
Should I be using mnrfit/mnrval or glmfit/glmval? What's the difference and how do I choose?
Thanks!
If you want to determine how well the model can predict unseen data you can use cross validation. In Matlab, you can use glmfit to fit the logistic regression model and glmval to test it.
Here is a sample of Matlab code that illustrates how to do it. X is the feature matrix, Labels is the class label for each case, num_shuffles is the number of repetitions of the cross-validation, and num_folds is the number of folds:
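Fit = cell(num_shuffles, num_folds); % predicted probabilities per repetition and fold
for j = 1:num_shuffles
    indices = crossvalind('Kfold',Labels,num_folds); % random fold assignment
    for i = 1:num_folds
        test = (indices == i); train = ~test;
        % Logistic regression: binomial distribution with a logit link
        [b,dev,stats] = glmfit(X(train,:),Labels(train),'binomial','link','logit');
        Fit{j,i} = glmval(b,X(test,:),'logit')'; % predicted probabilities for fold i
    end
end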
Fit then holds the fitted logistic regression estimates (predicted probabilities) for each test fold. Thresholding these yields the predicted class for each test case. Performance measures are then calculated by comparing the predicted class labels against the actual class labels. Averaging the performance measures across all folds and repetitions gives an estimate of the model performance on unseen data.
Originally answered by BGreene on #Stats.SE.
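The thresholding and performance step described in the answer can be sketched as follows for a single test fold. The probabilities, labels, and the 0.5 cut-off are made-up values for illustration and are not part of the original answer.

```matlab
% Made-up predicted probabilities and true 0/1 labels for one test fold
p      = [0.9 0.2 0.7 0.4 0.1 0.8];
actual = [1   0   1   1   0   0];

threshold = 0.5;                  % assumed cut-off; tune it for your application
predicted = p >= threshold;       % predicted class labels (logical)

TP = sum( predicted &  actual);   % true positives
TN = sum(~predicted & ~actual);   % true negatives
FP = sum( predicted & ~actual);   % false positives
FN = sum(~predicted &  actual);   % false negatives

accuracy    = (TP + TN) / numel(actual);
sensitivity = TP / (TP + FN);     % true positive rate
specificity = TN / (TN + FP);     % true negative rate
```

Repeating this for every fold and repetition and averaging the three measures gives the cross-validated estimates.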