Unable to make sense of the confusion matrix returned by SVM - matlab

I am trying to understand why the SVM classifier is not able to correctly classify my data. I have included only 10 samples (XX) out of the 2000 samples in my original data. I cannot make sense of the confusion matrix returned by MATLAB. I used an SVM classifier. Is my code wrong, especially the way I did cross-validation?
XX is normalized to X, and Y is the label. Each feature vector is of length 8.
**Question:** Can somebody please help me figure out how to tackle this issue?
         pred 0   pred 1
actual 0    100        0
actual 1    100        0
Thank you

You have:
- an unbalanced data set (7 samples in one class, 3 in the other),
- an 8-dimensional feature space with only those 7 and 3 samples, which is nowhere near enough to fill it (see the curse of dimensionality), and
- only half of those samples used for training, which leaves the feature space even emptier.
Thus, I am not surprised that the generalization that the SVM came up with is to classify everything as "class 0".
Try using only one of the features (first column of XX), and use leave-one-out cross validation.
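A minimal sketch of that suggestion, assuming the normalized features are in X and the 0/1 labels in Y (fitcsvm is the current MATLAB SVM trainer; adapt if you are on the older svmtrain):
% Train an SVM on a single feature with leave-one-out cross-validation.
cvModel = fitcsvm(X(:,1), Y, 'Standardize', true, 'Leaveout', 'on');
err = kfoldLoss(cvModel);         % leave-one-out misclassification rate
pred = kfoldPredict(cvModel);     % one held-out prediction per sample
confusionmat(Y, pred)             % confusion matrix built from held-out predictions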

Related

Learning curves for neural networks

I am trying to find the optimal parameters of my neural network model, implemented in Octave. The model is used for binary classification, with 122 features (inputs) and 25 hidden units (1 hidden layer). For this I have 4 matrices/vectors:
size(X_Train): 125973 x 122
size(Y_Train): 125973 x 1
size(X_Test): 22543 x 122
size(Y_test): 22543 x 1
I have used 20% of the training set to generate a validation set (XVal and YVal)
size(X): 100778 x 122
size(Y): 100778 x 1
size(XVal): 25195 x 122
size(YVal): 25195 x 1
size(X_Test): 22543 x 122
size(Y_test): 22543 x 1
The goal is to generate the learning curves of the NN. I have learned (the hard way xD) that this is very time consuming, because I used the full X and XVal for it.
I don't know if there is an alternative solution. I am thinking of reducing the size of the training matrix X (to, say, 5000 samples), but I don't know whether I can do that, or whether the results will be biased since I'd only be using a portion of the training set.
Bests,
The total number of parameters above is around 3k (122*25 + 25*1), which is not huge. Since the number of examples is large, you might want to use stochastic gradient descent or mini-batches instead of full-batch gradient descent.
Note that MATLAB and Octave are slow in general, especially with loops.
You need to write code that uses matrix (vectorized) operations rather than loops for the speed to be manageable in MATLAB/Octave.
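As an illustration only, here is a rough mini-batch sketch in Octave; nnCostGrad, alpha, and batchSize are hypothetical placeholders, not code from the question:
% Mini-batch gradient descent: each update uses only batchSize rows, and the
% cost/gradient call is vectorized over the whole mini-batch (no per-sample loop).
% theta is the unrolled parameter vector, initialized elsewhere.
batchSize = 256;
alpha = 0.01;                                % learning rate, to be tuned
for epoch = 1:20
    order = randperm(size(X, 1));            % reshuffle the training set every epoch
    for s = 1:batchSize:numel(order)
        idx = order(s : min(s + batchSize - 1, numel(order)));
        [J, grad] = nnCostGrad(theta, X(idx, :), Y(idx));   % hypothetical vectorized routine
        theta = theta - alpha * grad;        % one parameter update per mini-batch
    end
end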

Anomaly in accuracy calculation

I am classifying a dataset with four classes using pretrained VGG19. To calculate accuracy, I used this formula:
accuracy = sum(predictedLabels==testLabels)/numel(predictedLabels) (Eq. 1)
Then I calculated the confusion matrix using:
confMat = confusionmat(testLabels, predictedLabels) **(Eq. 2)**
From which I got a matrix with 4 rows and 4 columns since I had 4 classes.
Now, we know that the accuracy formula is also:
Accuracy = (TP + TN) / (TP + TN + FP + FN) **(Eq. 3)**
So I also calculated accuracy from the confusion matrix formed through Eq. 2 above, where
TP=value in (row==column),
FP=sum of column-TP,
FN=sum of row-TP,
TN=sum of the diagonal-TP
If I am doing the above steps right, then my confusion is that I get different accuracies from the two methods, Eq. 1 and Eq. 3. The accuracy I get with Eq. 1 is equivalent to the formula TP/(TP+TN). So, if that is the case, then Eq. 1 is the wrong formula for calculating accuracy. But this formula is used across all the MATLAB deep learning examples.
So either MATLAB is doing something wrong (which has probability 0, I know) or I am doing something wrong. Unfortunately, I am unable to pinpoint my mistake.
Now, the question is,
Am I doing it wrong? Where am I missing the step? How to correct it? What is the logical explanation of this anomaly?
EDIT
This anomaly in the accuracy calculation happens because of the class imbalance problem, i.e. when there are different numbers of samples in each class. Therefore, the regular accuracy formula in Eq. 3 will not work in such cases.
The main issue is that negative and positive apply to a binary prediction (is this a cat or not?), while you are doing classification with more than two categories. The classifier does not give you a positive/negative answer, so its outputs cannot be labelled as true positives, false positives, etc. for the whole problem at once. Therefore Eq. 3, as you applied it, is meaningless, and so is the method for computing TP, TN, etc. For example, if TP is the value at row == column, as you defined, then these are the correctly classified counts on the diagonal of confMat. But what is TN? According to your definition it is the sum of the diagonal minus TP, which is just the correct predictions of the other classes, not true negatives. I hope this helps put things on the right track.
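For concreteness, here is a minimal sketch (variable names follow the question) of how the overall accuracy and one-vs-rest per-class counts relate to a multi-class confusion matrix:
% Rows of confMat are actual classes, columns are predicted classes.
confMat = confusionmat(testLabels, predictedLabels);
overallAccuracy = trace(confMat) / sum(confMat(:));   % same value as Eq. 1
% One-vs-rest counts for a single class k (class k vs. everything else):
k = 1;
TP = confMat(k,k);
FP = sum(confMat(:,k)) - TP;
FN = sum(confMat(k,:)) - TP;
TN = sum(confMat(:)) - TP - FP - FN;                  % not "sum of the diagonal - TP"
accuracy_k = (TP + TN) / (TP + TN + FP + FN);         % Eq. 3, but only for class k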

The size of my PCA coefficients is not correct [duplicate]

I want to select the top N = 10,000 principal components of a matrix. After pca completes, MATLAB should return a p x p coefficient matrix, but it doesn't!
>> size(train_data)
ans =
400 153600
>> [coefs,scores,variances] = pca(train_data);
>> size(coefs)
ans =
153600 399
>> size(scores)
ans =
400 399
>> size(variances)
ans =
399 1
Shouldn't coefs be 153600 x 153600 and scores be 400 x 153600?
When I use the code below, it gives me an Out of Memory error:
>> [V D] = eig(cov(train_data));
Out of memory. Type HELP MEMORY for your options.
Error in cov (line 96)
xy = (xc' * xc) / (m-1);
I don't understand why MATLAB returns a lower-dimensional matrix. At full size, coefs alone would need 153600*153600*8 bytes = 188 GB, so I would expect pca to run out of memory as well.
Error with eigs:
>> eigs(cov(train_data));
Out of memory. Type HELP MEMORY for your options.
Error in cov (line 96)
xy = (xc' * xc) / (m-1);
Foreword
I think you are falling prey to the XY problem: trying to find 153,600 dimensions in your data is completely non-physical, so please ask about the actual problem (X) and not your proposed solution (Y) in order to get a meaningful answer. I will use this post only to explain why PCA is not a good fit here. I cannot tell you what will solve your problem, since you have not told us what that is.
This is a mathematically unsound problem, as I will try to explain here.
PCA
PCA is, as user3149915 said, a way to reduce dimensions. This means that somewhere in your problem you have one-hundred-fifty-three-thousand-six-hundred dimensions floating around. That's a lot. A heck of a lot. Explaining a physical reason for the existence of all of them might be a bigger problem than trying to solve the mathematical problem.
Trying to fit that many dimensions to only 400 observations will not work. Even if all observations are linearly independent vectors in your feature space, you can extract only 399 dimensions, because the remaining ones simply cannot be found: there are no observations to pin them down. You can fit at most N-1 unique dimensions through N points; the remaining dimensions have an infinite number of possible locations. It is like trying to fit a plane through two points: you can fit a line through them, and the third dimension will be perpendicular to that line but undefined in the rotational direction, so you are left with an infinite number of possible planes through those two points.
After the first 399 components, there are no dimensions left. You are fitting a void after that. You used all your data to get those dimensions and cannot create more. Impossible. All you can do is get more observations, some 1.5M of them, and do the PCA again.
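As a quick numerical sanity check of the N-1 claim (using a small random matrix, not your data, so it fits in memory):
Xs = rand(400, 1000);                % 400 observations, 1000 features of pure noise
Xc = bsxfun(@minus, Xs, mean(Xs));   % center the columns, as pca does internally
rank(Xc)                             % returns 399: N points span at most N-1 directions about their mean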
More observations than dimensions
Why do you need more observations than dimensions, you might ask? Easy: you cannot fit a unique line through one point, nor a unique plane through two points, nor a unique 153,600-dimensional hyperplane through 400 points.
So, if I get 153,600 observations, I'm set?
Sadly, no. If you have two points and fit a line through them, you get a 100% fit. No error, yay! Done for the day, let's go home and watch TV! Sadly, your boss will call you in the next morning, because your fit is rubbish. Why? Well, if you had, for instance, 20 points scattered around, the fit would not be error-free, but it would be much closer to representing your actual data, since the first two points could be outliers; see this very illustrative figure (not reproduced here), where the red points would be your first two observations.
If you were to extract the first 10,000 components, that would be 399 exact fits and 9,601 zero-variance dimensions. You might as well not even attempt to calculate beyond the 399th dimension, and instead stick the result into a zero array with 10,000 entries.
TL;DR You cannot use PCA and we cannot help you solve your problem as long as you do not tell us what your problem is.
PCA is a dimensionality reduction algorithm: it tries to reduce the number of features to principal components (PCs), each of which is a linear combination of the original features. All of this is done in order to reduce the dimensions of the feature space, i.e. to transform the large feature space into one that is more manageable but still retains most, if not all, of the information.
Now for your problem: you are trying to explain the variance across your 400 observations using 153,600 features. However, you do not need that much information; 399 PCs will explain 100% of the variance across your sample (I will be very surprised if that is not the case). The reason for that is basically overfitting: the algorithm finds noise that explains every observation in your sample.
So what rayryeng was telling you is correct: if you want to reduce your feature space to 10,000 PCs, you will need on the order of 100,000 observations for the PCs to mean anything (that is a rule of thumb, but a rather stable one).
And the reason MATLAB was giving you 399 PCs is that it could only extract 399 linear combinations, which together explain some percentage of the variance across your sample.
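As a sketch of what you can do with the components that do exist (assuming train_data is the 400 x 153600 matrix from the question), keep however many of the 399 available components are needed to reach, say, 95% of the variance instead of asking for 10,000:
[coefs, scores, variances] = pca(train_data);           % at most 399 components here
explained = 100 * cumsum(variances) / sum(variances);   % cumulative % of variance explained
k = find(explained >= 95, 1);                           % smallest k reaching 95%
reduced = scores(:, 1:k);                               % 400 x k low-dimensional representation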
If, on the other hand, what you are after are the most relevant features, then you are not looking for dimensionality reduction methods but rather feature selection or elimination processes. These keep only the most relevant features while discarding the irrelevant ones.
So just to make this clear: if your feature space is rubbish and there is no information in it, just noise, then the explained variance is irrelevant, with each component capturing only a sliver of it. For example, see the following:
data = rand(400,401);
[coefs,scores,variances] = pca(data);
numel(variances)   % 399: at most N-1 components from 400 observations
disp(['Var explained: ' num2str(100*cumsum(variances)'/sum(variances)) ' %'])
Again, if you want to reduce your feature space there are ways to do that even with a small m, but PCA is not one of them.
Good luck.
MATLAB simply avoids wasting resources computing components it cannot estimate from the data; by default pca returns the economy-size outputs.
You can still request the full-size outputs with:
pca(train_data, 'Economy', false)
Be aware, though, that the full coefficient matrix would be 153600 x 153600, which is exactly the ~188 GB you computed, so this is rarely what you actually want.

Naive Bayes: the within-class variance in each feature of TRAINING must be positive

When trying to fit Naive Bayes:
training_data = sample;
target_class = K8;
% train model
nb = NaiveBayes.fit(training_data, target_class);
% prediction
y = nb.predict(cluster3);
I get an error:
??? Error using ==> NaiveBayes.fit>gaussianFit at 535
The within-class variance in each feature of TRAINING
must be positive. The within-class variance in feature
2 5 6 in class normal. are not positive.
Error in ==> NaiveBayes.fit at 498
obj = gaussianFit(obj, training, gindex);
Can anyone shed light on this and how to solve it? Note that I have read a similar post here, but I am not sure what to do. It seems as if it's trying to fit based on columns rather than rows; the class variance should be based on the probability of each row belonging to a specific class. If I delete those columns then it works, but obviously that isn't what I want to do.
Assuming that there is no bug anywhere in your code (or in the NaiveBayes code from MathWorks), and again assuming that your training_data is in the form N x D, where there are N observations and D features, then columns 2, 5, and 6 have zero variance (are constant) within at least one class. This can happen if you have relatively little training data and a high number of classes, in which case a single class may be represented by only a few observations. Since NaiveBayes by default treats every feature as normally distributed, it cannot work with a feature whose variance is zero within some class: there is no way for NaiveBayes to fit a normal distribution to that feature for that specific class (note: the default for Distribution is normal).
Take a look at the nature of your features. If they do not seem to follow a normal distribution within each class, then normal is not the option you want to use. Maybe your data is closer to a multinomial model, mn:
nb = NaiveBayes.fit(training_data, target_class, 'Distribution', 'mn');
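As a quick diagnostic (a sketch only, assuming training_data is N x D as above), you can list the columns whose variance is zero within at least one class, which are exactly the features the error message complains about:
[grp, classNames] = grp2idx(target_class);      % handles numeric, cellstr, or categorical labels
zeroVarFeatures = [];
for c = 1:numel(classNames)
    v = var(training_data(grp == c, :), 0, 1);  % per-feature variance within class c
    zeroVarFeatures = union(zeroVarFeatures, find(v == 0));
end
disp(zeroVarFeatures)                           % e.g. columns 2, 5, 6 for the "normal" class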