I need to calculate the mutual information between various features for designing a classification model using logistic regression. I am facing the following problems:
I need to divide my data into n bins, each having approximately the same number of samples. How can I achieve this in Matlab?
Should I perform the above discretization on raw data or normalized data?
Thanks.
I guess what you want to do is similar to cross-validation; in Matlab you can use the function crossvalind, which allows you to split your dataset.
I have added the example shown on the documentation page, which splits the data into 10 bins (called 10-fold cross-validation).
load fisheriris
indices = crossvalind('Kfold',species,10);   % fold index (1..10) for each sample
cp = classperf(species);                     % object that accumulates performance
for i = 1:10
    test = (indices == i); train = ~test;
    class = classify(meas(test,:),meas(train,:),species(train,:));
    classperf(cp,class,test)                 % update the performance with this fold
end
cp.ErrorRate
ans =
0.0200
You should do this operation after pre-processing your data (normalisation / standardisation).
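If what you literally need is equal-frequency bins per feature (e.g. for estimating mutual information) rather than cross-validation folds, here is a minimal quantile-based sketch; it is not from the answer above and assumes the Statistics and Machine Learning Toolbox and a single feature column x:
n = 10;                                     % desired number of bins
edges = quantile(x, linspace(0, 1, n+1));   % bin edges at the empirical quantiles
edges = unique(edges);                      % ties in x can collapse some edges
edges(1) = -Inf; edges(end) = Inf;          % catch values at or beyond the extremes
binIdx = discretize(x, edges);              % bin index per sample, roughly equal counts per bin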
I have to use an SVM classifier on a digits dataset. The dataset consists of 28x28 images of digits, 2000 images in total.
I tried to use svmtrain, but Matlab gave an error that svmtrain has been removed, so now I am using fitcsvm.
My code is as below:
labelData = zeros(2000,1);
for i = 1:1000
    labelData(i,1) = 1;
end
for j = 1001:2000
    labelData(j,1) = 1;
end
SVMStruct = fitcsvm(trainingData,labelData)
% where trainingData is the set of images of digits.
I need to know how I can predict the outputs of the test data using the SVM. Also, is my code correct?
The function that you are looking for is predict. It takes the SVM object as input, followed by a data matrix, and returns the predicted labels.
Make sure that you do not train your model on all of the data but on a reasonable subset (usually 70%). You can prepare the split with cvpartition:
% create cross-validation object
cvp = cvpartition(Lbl,'HoldOut',0.3);
% extract logical vectors for training and testing data
lgTrn = cvp.training;
lgTst = cvp.test;
% train SVM
mdl = fitcsvm(Dat(lgTrn,:),Lbl(lgTrn));
% test / predict SVM
Lbl_prd = predict(mdl,Dat(lgTst,:));
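To get a quick performance number on the held-out set, one small addition (not part of the snippet above; it assumes Lbl is a numeric label vector, use strcmp for cell arrays of strings) is:
accuracy = mean(Lbl_prd == Lbl(lgTst));   % fraction of correctly predicted hold-out labels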
Note that your labeling produces a single vector of ones (both loops assign 1), so fitcsvm only sees one class.
The reason why The MathWorks replaced svmtrain with fitcsvm is to make the naming clearer: it is now obvious whether a function fits a classification model (fitcsvm) or a regression model (fitrsvm).
load fisheriris;
y = species; %label
X = meas;
% Create a random partition for a stratified 10-fold cross-validation.
c = cvpartition(y,'KFold',10);
% split training/testing sets
[trainIdx, testIdx] = crossvalind('HoldOut', y, 0.6);
crossvalind is used to perform cross-validation by randomly splitting the entire feature set X into training and testing data and returning the corresponding indices. Using these indices, we can create the train and test data as X(trainIdx,:) and X(testIdx,:) respectively. cvpartition also splits the data, using both stratified and non-stratified methods, but it does not return the indices. I have not seen examples showing whether crossvalind is a stratified or a non-stratified technique.
Question: Can crossvalind and cvpartition be used together?
I want to do stratified cross-validation. But I don't understand how to divide the data sets into train and test and get the indices.
Cross-validation and train/test partitioning are two different ways of estimating the performance of a model, not different ways of building the model itself. Usually you should build a model using all the data that you have, but also use one of these techniques (which build and score one or more additional models using subsets of that data) to estimate how good the main model is likely to be.
Cross-validation averages the outcome of multiple train/test splits so is usually expected to give a more realistic i.e. more pessimistic estimate of model performance.
Of the two functions you mention, crossvalind appears to be specific to the Bioinformatics Toolbox and is rather old. The help for cvpartition gives an example of how to do a stratified cross-validation:
Examples
Use a 10-fold stratified cross-validation to compute the misclassification error for classify on iris data.
load('fisheriris');
CVO = cvpartition(species,'k',10);
err = zeros(CVO.NumTestSets,1);
for i = 1:CVO.NumTestSets
    trIdx = CVO.training(i);
    teIdx = CVO.test(i);
    ytest = classify(meas(teIdx,:),meas(trIdx,:),...
                     species(trIdx,:));
    err(i) = sum(~strcmp(ytest,species(teIdx)));
end
cvErr = sum(err)/sum(CVO.TestSize);
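Regarding the indices: cvpartition does expose them through its training and test methods, as the logical vectors trIdx and teIdx above. If you need numeric row indices, a hedged addition (reusing CVO, meas and species from the example) is to convert them with find:
trainRows = find(CVO.training(1));          % row numbers of fold 1's training set
testRows  = find(CVO.test(1));              % row numbers of fold 1's test set
Xtrain = meas(trainRows,:); ytrain = species(trainRows);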
The phow_caltech101 demo app in vlfeat creates a complete Bag of Words process for image classification on the Caltech101 dataset, roughly put:
Feature Extraction
Visual Vocabulary building
Spatial Histograms computation
SVM training
SVM testing and evaluation,
obtaining a model that can be used to later classify new, unclassified instances.
The only problem is that the computed histograms are spatial histograms. This means that with a visual vocabulary of size n, I would have expected the histograms to have size n x (size_collection), containing the occurrences of each visual word in each training instance.
The spatial histograms, however, are stored in a structure according to the specified model. By default the model has two spatial arguments, spatialX and spatialY, which results in a structure of size spatialX * spatialY * (size_vocabulary); this is later normalized, and it is the one used to train the SVM.
Now, what if I want to use the normal histogram (normalized or not), i.e. the one that gives a one-to-one correspondence between visual words and counts per image, or obtain this information from the spatial histogram? Also, how much better is the spatial histogram than the classical one I have in mind when I picture the Bag of Words process?
Any help appreciated.
UPDATE:
Here is the part of the code where the histograms are computed. You can see how, instead of ending up with a histogram vector of size (number_visual_words), you end up with a histogram of size (spatialX * spatialY * number_visual_words). To clarify, in this case the model is defined to have spatialX = [2 4] and spatialY = [2 4].
for i = 1:length(model.numSpatialX)
    binsx = vl_binsearch(linspace(1,width,model.numSpatialX(i)+1), frames(1,:)) ;
    binsy = vl_binsearch(linspace(1,height,model.numSpatialY(i)+1), frames(2,:)) ;

    % combined quantization (binsa, the visual word assigned to each feature,
    % is computed earlier in the demo)
    bins = sub2ind([model.numSpatialY(i), model.numSpatialX(i), numWords], ...
                   binsy, binsx, binsa) ;
    hist = zeros(model.numSpatialY(i) * model.numSpatialX(i) * numWords, 1) ;
    hist = vl_binsum(hist, ones(size(bins)), bins) ;
    hists{i} = single(hist / sum(hist)) ;
end
hist = cat(1, hists{:}) ;
hist = hist / sum(hist) ;
And part of the problem is that I haven't worked with spatial histograms either, so I'm not sure how much better they are than "normal" histograms. Maybe someone who has worked with this kind of histogram before could give a more helpful insight.
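For what it's worth, a plain per-word histogram can be recovered from one spatial level by summing out the spatial cells. This is a hedged sketch, not part of the demo; it reuses model, hists and numWords from the snippet above:
i = 1;                                   % pick one spatial level, e.g. the 2x2 one
h = reshape(hists{i}, model.numSpatialY(i), model.numSpatialX(i), numWords);
wordHist = squeeze(sum(sum(h, 1), 2));   % numWords-by-1 occurrence histogram
wordHist = wordHist / sum(wordHist);     % optional re-normalisation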
I am new to Matlab. Is there any sample code for classifying some data (with 41 features) with an SVM and then visualizing the result? I want to classify a data set (which has five classes) using the SVM method.
I read the "A Practical Guide to Support Vector Classification" article and saw some examples. My dataset is kdd99. I wrote the following code:
%% Load Data
[data,colNames] = xlsread('TarainingDataset.xls');
groups = ismember(colNames(:,42),'normal.');
TrainInputs = data;
TrainTargets = groups;
%% Design SVM
C = 100;
svmstruct = svmtrain(TrainInputs,TrainTargets,...
'boxconstraint',C,...
'kernel_function','rbf',...
'rbf_sigma',0.5,...
'showplot','false');
%% Test SVM
[dataTest,colNamesTest] = xlsread('TestDataset.xls');
TestInputs = dataTest;
groups = ismember(colNamesTest(:,42),'normal.');
TestOutputs = svmclassify(svmstruct,TestInputs,'showplot','false');
but I don't know how to get the accuracy or MSE of my classification. Also, I use showplot in my svmclassify, but when it is true I get this warning:
The display option can only plot 2D training data
Could anyone please help me?
I recommend you use another SVM toolbox, libsvm. The link is as follows:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
After adding it to the Matlab path, you can train and use your model like this:
model = svmtrain(train_label, train_feature, '-c 1 -g 0.07 -h 0');
% the parameters can be modified
[label, accuracy, probability] = svmpredict(test_label, test_feature, model);
train_label must be a column vector; if there are more than two kinds of labels (not just 0/1), libsvm will automatically perform multi-class (n-class) SVM classification.
train_feature is an n*L matrix for n samples. You should preprocess the features before using them, and the test features must be preprocessed in the same way.
The accuracy you want will be shown when testing is finished, but it is only for the whole dataset.
If you need the accuracy for positive and negative samples separately, you still have to calculate it yourself from the predicted labels.
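For example, a hedged sketch (not in the original answer; it assumes test_label and the predicted label are 0/1 column vectors):
pos  = (test_label == 1);
sens = mean(label(pos)  == 1);   % accuracy on positive samples (sensitivity)
spec = mean(label(~pos) == 0);   % accuracy on negative samples (specificity)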
Hope this will help you!
Your feature space has 41 dimensions, and plotting more than 3 dimensions is impossible.
A good way to better understand your data and the way the SVM works is to begin with a linear SVM. This type of SVM is interpretable, which means that each of your 41 features has a weight (or 'importance') associated with it after training. You can then use plot3() with your data on 3 of the 'best' features from the linear SVM. Note how well your data is separated with those features and choose a kernel (basis) function and other parameters accordingly.
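A minimal sketch of that idea (not from the answer above; it assumes a binary label vector y and a numeric feature matrix X with 41 columns; for your five classes you would repeat this per class or use fitcecoc):
mdl = fitcsvm(X, y, 'KernelFunction', 'linear', 'Standardize', true);
[~, order] = sort(abs(mdl.Beta), 'descend');   % Beta holds one weight per feature
top3 = order(1:3);                             % the three highest-weighted features
pos = (y == 1);
plot3(X(pos,top3(1)),  X(pos,top3(2)),  X(pos,top3(3)),  'bo'); hold on
plot3(X(~pos,top3(1)), X(~pos,top3(2)), X(~pos,top3(3)), 'r+'); grid on
xlabel(sprintf('feature %d', top3(1)));
ylabel(sprintf('feature %d', top3(2)));
zlabel(sprintf('feature %d', top3(3)));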
I have a data set of 13 attributes, where some are categorical and some are continuous (they can be converted to categorical). I need to use logistic regression to create a model that predicts the response for each row and to find the prediction's accuracy, sensitivity, and specificity.
Can/Should I use cross validation to divide my data set and get the results?
Is there any code sample on how to go about doing this? (I'm new to all of this)
Should I be using mnrfit/mnrval or glmfit/glmval? What's the difference and how do I choose?
Thanks!
If you want to determine how well the model can predict unseen data, you can use cross-validation. In Matlab, you can use glmfit to fit the logistic regression model and glmval to test it.
Here is a sample of Matlab code that illustrates how to do it, where X is the feature matrix, Labels is the class label for each case, num_shuffles is the number of repetitions of the cross-validation, and num_folds is the number of folds:
for j = 1:num_shuffles
    indices = crossvalind('Kfold',Labels,num_folds);
    for i = 1:num_folds
        test = (indices == i); train = ~test;
        [b,dev,stats] = glmfit(X(train,:),Labels(train),'binomial','link','logit'); % Logistic regression
        Fit{j,i} = glmval(b,X(test,:),'logit')'; % predicted probabilities for this test fold
    end
end
Fit then holds the fitted logistic regression estimate (the predicted probability) for each case in each test fold. Thresholding these values yields an estimate of the predicted class for each test case. Performance measures are then calculated by comparing the predicted class labels against the actual class labels. Averaging the performance measures across all folds and repetitions gives an estimate of the model's performance on unseen data.
Originally answered by BGreene on Stats.SE.
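As a hedged follow-up (not part of BGreene's answer), thresholding the stored probabilities at 0.5 yields predicted classes, from which the accuracy, sensitivity, and specificity the question asks for can be computed. The sketch below assumes Labels is a 0/1 vector and evaluates the last fold processed by the loop; inside the inner loop you would do the same per fold and then average:
p     = Fit{j,i}(:);                    % probabilities for the last fold processed
pred  = (p >= 0.5);                     % thresholded class predictions
truth = (Labels(test) == 1);            % true labels of that same fold
accuracy    = mean(pred == truth);
sensitivity = mean(pred(truth));        % true positive rate
specificity = mean(~pred(~truth));      % true negative rate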