How to scale input features for SVM classification? - matlab

I am trying to perform a two-class classification using SVM in MATLAB. The two classes are 'Normal' and 'Infected' for classifying cell images into Normal or Infected respectively.
I use a training set which consists of 1000 Normal cell images and 300 Infected cell images. I extract 72 features from each of these cells. So my training feature set matrix is 72x1300 where each row represents a features and each column represents the corresponding feature value measured from the corresponding image.
data: 72x1300 double
My class label vector is initialized as:
cellLabel(1:1000) = {'normal'};
cellLabel(1001:1300) = {'infected'};
As suusgested in these links:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf and svm scaling input values, I set about scaling the feature values doing this:
for i=1:1:size(data,1)
mu(i) = mean(data(:,i));
sd(i) = std(data(:,i));
scaledData(:,i) = (data(:,i) - mu(i))./sd(i);
end
For testing, I read a test image and compute a 72x1 feature vector. Before I classify, I scale the test vector using the corresponding mean and standard deviation values from the `data' and then classify. If I do this, I am getting a 0% training accuracy.
However, if I scale each from each class separately and concatenate, I am getting a 98% training accuracy. Can someone explain if my method is correct? For training accuracy, I knew what image I was using and hence read the mean and SD value. How should I do it for a case where the image's label is unknown?
This is how I train:
[idx,z] = rankfeatures(data,cellLabel,'Criterion','wilcoxon','NUMBER',7);
rnkData = data(idx,:);
rnkData = rnkData';
cellLabel = cellLabel';
SVMModel = fitcsvm(rnkData,cellLabel,'Standardize',true,'KernelFunction','RBF','KernelScale','auto');
You can see I tried using the in-built scaling property but the classification tends to show 'normal' class irrespective of the input.

corresponding mean and standard deviation values
What do you mean by that?
Do you have mean and std. dev. for each feature? Why not using actual min/max than?
I'm not sure how feasible is this to implement in Matlab, but in my OpenCV/SVM code I store all min/max values from the training data for each feature and use these min/max values to scale the test data of a corresponding feature.
If the value of test-data is often outside the range of min/max from training data, this is a strong hint of insufficient amount of training data.
Using mean and std. dev. values you won't detect this so explicitly.

Related

Principal component analysis in matlab?

I have a training set with the size of (size(X_Training)=122 x 125937).
122 is the number of features
and 125937 is the sample size.
From my little understanding, PCA is useful when you want to reduce the dimension of the features. Meaning, I should reduce 122 to a smaller number.
But when I use in matlab:
X_new = pca(X_Training)
I get a matrix of size 125973x121, I am really confused, because this not only changes the features but also the sample size? This is a big problem for me, because I still have the target vector Y_Training that I want to use for my neural network.
Any help? Did I badly interpret the results? I only want to reduce the number of features.
Firstly, the documentation of the PCA function is useful: https://www.mathworks.com/help/stats/pca.html. It mentions that the rows are the samples while the columns are the features. This means you need to transpose your matrix first.
Secondly, you need to specify the number of dimensions to reduce to a priori. The PCA function does not do that for you automatically. Therefore, in addition to extracting the principal coefficients for each component, you also need to extract the scores as well. Once you have this, you simply subset into the scores and perform the reprojection into the reduced space.
In other words:
n_components = 10; % Change to however you see fit.
[coeff, score] = pca(X_training.');
X_reduce = score(:, 1:n_components);
X_reduce will be the dimensionality reduced feature set with the total number of columns being the total number of reduced features. Also notice that the number of training examples does not change as we expect. If you want to make sure that the number of features are along the rows instead of the columns after we reduce the number of features, transpose this output matrix as well before you proceed.
Finally, if you want to automatically determine the number of features to reduce to, one method to do so is to calculate the variance explained of each feature, then accumulate the values from the first feature up to the point where we exceed some threshold. Usually 95% is used.
Therefore, you need to provide additional output variables to capture these:
[coeff, score, latent, tsquared, explained, mu] = pca(X_training.');
I'll let you go through the documentation to understand the other variables, but the one you're looking at is the explained variable. What you should do is find the point where the total variance explained exceeds 95%:
[~,n_components] = max(cumsum(explained) >= 95);
Finally, if you want to perform a reconstruction and see how well the reconstruction into the original feature space performs from the reduced feature, you need to perform a reprojection into the original space:
X_reconstruct = bsxfun(#plus, score(:, 1:n_components) * coeff(:, 1:n_components).', mu);
mu are the means of each feature as a row vector. Therefore you need add this vector across all examples, so broadcasting is required and that's why bsxfun is used. If you're using MATLAB R2018b, this is now implicitly done when you use the addition operation.
X_reconstruct = score(:, 1:n_components) * coeff(:, 1:n_components).' + mu;

Classification using GMM with MATLAB

I'm trying to classify a testset using GMM. I have a trainset (n*4 matrix) with labels {1,2,3}, n means the number of training examples, which have 4 properties. And I also have a testset (m*4) to be classified.
My goal is to have a probability matrix (m*3) for each testing example giving each label P(x_test|labels). Just like soft clustering.
first, I create a GMM with k=9 components over the whole trainset. I know in some papers, the author create a GMM for each label in trainset. But I want to deal with the data from all of the classes.
GMModel = fitgmdist(trainset,k_component,'RegularizationValue',0.1,'Start','plus');
My problem is, I want to confirm the relationship P(component|labels)between components and labels. So I write a code as below, but not sure if it's right,
idx_ex_of_c1 = find(trainset_label==1);
idx_ex_of_c2 = find(trainset_label==2);
idx_ex_of_c3 = find(trainset_label==3);
[~,~,post] = cluster(GMModel,trainset);
cita_c_k = zeros(3,k_component);
for id_k = 1:k_component
cita_c_k(1,id_k) = sum(post(idx_ex_of_c1,id_k))/numel(idx_ex_of_c1);
cita_c_k(2,id_k) = sum(post(idx_ex_of_c2,id_k))/numel(idx_ex_of_c2);
cita_c_k(3,id_k) = sum(post(idx_ex_of_c3,id_k))/numel(idx_ex_of_c3);
end
cita_c_k is a (3*9) matrix to store the relationships. idx_ex_of_c1 is the index of examples, whose label is '1' in the trainset.
For the testing process. I first apply the GMModel to testset
[P,~] = posterior(GMModel,testset); % P is a m*9 matrix
And then, sum all components,
P_testset = P*cita_c_k';
[a,b] = max(P_testset,3);
imagesc(b);
The result is ok, But not good enough. Can anyone give me some tips?
Thanks!
You can take following steps:
Increase target error and/or use optimal network size in training, but over-training and network size increase usually won't help
Most important, shuffle training data while training and use only important data points for a label to train (ignore data points that may belong to more than one labels)
SEPARABILITY
Verify separability of data using properties using correlation.
Correlation of all data in a label (X) should be high (near to one)
Cross-correlation of all data in label (X) with data in label (!=X) should be low (near zero).
If you observe that data points in a label have low correlation and data points across labels have high correlation - It puts a question on selection of properties (there could be properties which actually won't make data separable). Being so do follows:
Add more relevant properties to data points and remove less relevant properties (technique to use this is PCA)
Use derived parameters like top frequency component etc. from data points to train rather than direct points
Use a time delay network to train time series (always)

KNN Classifier for simple digit recognition

Actually , i have an assignment where it is required to recognize individual decimal digits as a part of the text recognition process.I am already given a set of JPEG formatted images of some digits. Each image is of size 160 x 160 pixels.After checking some resources here i managed to write this code but :
1)I am not sure if reading the images and resizing them in matrices for holding them is right or not.
2)Supposing that i have 30 train data images for numbers [0-9] each number has three images and i have 10 images for test each image is of only one digit.How to calculate distance between every test and train in a loop ? Because in my part of code for calculating Euclidean it gives an output zero.
3)How to calculate accuracy by using confusion matrix ?
% number of train data
Train = 30;
%number of test data
Test =10;
% to store my images
tData = uint8(zeros(160,160,30));
tTest = uint8(zeros(160,160,10));
for k=1:Test
s1='im-';
s2=num2str(k);
t = strcat('testy/im-',num2str(k),'.jpg');
im=rgb2gray(imread(t));
I=imresize(im,[160 160]);
tTest(:,:,k)=I;
%case testing if it belongs to zero
for l=1:3
ss1='zero-';
ss2=num2str(l);
t1 = strcat('data/zero-',num2str(l),'.jpg');
im1=rgb2gray(imread(t1));
I1=imresize(im1,[160 160]);
tData(:,:,l)=I1;
% Euclidean distance
distance= sqrt(sum(bsxfun(#minus, tData(:,:,k), tTest(:,:,l)).^2, 2));
[d,index] = sort(distance);
%k=3
% index_close(l) = index(l:3);
%x_close = I(index_close,:);
end
end
First of all i think 10 test data is not enough.
Just use the below function, data_test is your training data() and data_label is their labels. re size your images to smaller sizes!
I think the default distance measure is Euclidean distance but you can choose other ways such as City-block method for example.
Class = knnclassify(data_test, data_train, lab_train, 11);
fprintf('11-NN Accuracy: %f\n', sum(Class == lab_test')/length(lab_test));
Class = knnclassify(data_test, data_train, lab_train, 1, 'cityblock');
fprintf('1-NN Accuracy (cityblock): %f\n', sum(Class == lab_test')/length(lab_test));
Ok now you have the overall accuracy but this is not a good measure, it's better to calculate the accuracy separately for each class and then calculate their mean.
you can calculate a specific class (id) accuracy like this,,
idLocations = (lab_test == id);
NumberOfId = sum(idLocations);
NumberOfCurrect =sum (lab_test (idLocations) == Class(idLocations));
NumberOfCurrect/NumberOfId %Class id accuracy
as your questions are:
1) image re-sizing does affects the accuracy of the whole process.
Ans: As you mentioned in your question your images are already of the size 160 by 160, imresize will not affect it, but if your image is too small in size say 60*60 it will perform interpolation to increase the spatial dimensions of the image, which may affects structure and shape of the digit, to tackle these kind of variability, your training data should have much more samples(at least 50 samples per class), and some pre-processing should be apply on data like de-skewing of the digit image.
2) euclidean distance is good measure but not the best to deal with these kind of problems, as its distribution is a spherical distribution it may give same distance for to different digits. if you are working in MATLAB beware of of variable casting, you are taking difference so both the variable should be double in nature. it may be one of the reason of wrong distance calculation. in this line distance= sqrt(sum(bsxfun(#minus, tData(:,:,k), tTest(:,:,l)).^2, 2)); you are summing matrices column wise so output of this will be a row vector(1 X 160) which have sum along each corner. i think it should be like this:distance= sqrt(sum(sum(bsxfun(#minus, tData(:,:,k), tTest(:,:,l)).^2, 2))); i have just added one more sum there for getting sum of differences for whole matrix try it whether it helps or not.
3) For checking accuracy of your classifier precisely you have to have a large training dataset,by the way, Confusion matrix created during the process of cross-validation, where you split your training data into training samples and testing samples, so you know output classes in both the sample, now perform classification process, prepare a matrix for num_classe X num_classes(in your case 10 X 10), where rows resembles actual classes and columns belongs to prediction. take a sample from test and predict output class, suppose your classifier predict 5 and sample's actual class is also 5 put +1 in the confusion_matrix(5,5); if your classifier have predicted it as 3, you should do +1 at confusion_matrix(5,3). finally add diagonal elements of the confusion_mat and divide it by the total number of the test samples. output will be accuracy of your classifier.
P.S. Try to have atleast 50 samples per class and during cross-validation divide the training data 85:10 ratio where 90% sample should be used for training and rest 10 % should be used for testing the classifier.
Hope it have helps you.
feel free to share your thoughts.
Thank You

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit'-'), the resulting plot of the model essentially looked like the lower 1/4th of the 'S' shaped plot that is typical with logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and `ii = 1:length(loopVal)'.
The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mrnfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB document states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is this only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, that means since your input data is one dimensional, that there is some value of x such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional this means you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:
This will only return values of y = 0 or y = 1 when the expression within the exponential in the denominator is at negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.

gaussian mixture model probability matlab

I have a data of dimension 50x100000. (100000 features, each has a dimension of 50).
I would like to fit a gaussian mixture model using this data. I used the following code.
obj = gmdistribution.fit(X',3);
What I need is when I give a new data Y I should be able to get the likelihood probabilities $p(Y|\theta)$, where $\theta$ are the gaussing mixture model parameters.
I used the following code to get the probability values.
P = pdf(obj,X');
But I am getting very low values all are about 0. Whay it is happning? How can i get the appropreate probability values?
In one dimension, the maximum value of the pdf of the Gaussian distribution is 1/sqrt(2*PI). So in 50 dimensions, the maximum value is going to be 1/(sqrt(2*PI)^50) which is around 1E-20. So the values of the pdf are all going to be of that order of magnitude, or smaller.