Anomaly in accuracy calculation - MATLAB

I am classifying a dataset with four classes using a pretrained VGG19. To calculate accuracy, I used this formula:
accuracy = sum(predictedLabels == testLabels)/numel(predictedLabels)   (Eq. 1)
Then I calculated the confusion matrix using:
confMat = confusionmat(testLabels, predictedLabels)   (Eq. 2)
From which I got a matrix with 4 rows and 4 columns since I had 4 classes.
Now, we know that the accuracy formula is also:
Accuracy = (TP+TN)/(TP+TN+FP+FN)   (Eq. 3)
So I also calculated accuracy from the confusion matrix formed through Eq. 2 above, where
TP = value where row == column,
FP = sum of the column - TP,
FN = sum of the row - TP,
TN = sum of the diagonal - TP
If I am doing the above steps right, then my confusion is that I am getting different accuracies from the two methods, Eq. 1 and Eq. 3. The accuracy I get with Eq. 1 is equivalent to the formula TP/(TP+TN). So, if this is the case, then Eq. 1 is the wrong formula for calculating accuracy. But this formula is used across all the MATLAB deep learning examples.
So either MATLAB is doing something wrong (which has probability 0, I know) or I am doing something wrong. Unfortunately, I am unable to pinpoint my mistake.
Now, the question is:
Am I doing it wrong? Which step am I missing? How do I correct it? What is the logical explanation of this anomaly?
EDIT
This anomaly in accuracy calculation happens because of the class imbalance problem, that is, when there are different numbers of samples in each class. Therefore, the regular accuracy formula in Eq. 3 will not work in such cases.

The main issue is that negative and positive refer to a binary prediction (is this a cat or not), while you are doing classification with more than two categories. The classifier doesn't give you positive and negative (for an "is it a cat" prediction), so it is not possible to treat the answers as true positives or false positives etc. Therefore Eq. 3 is meaningless, and so is the method for computing TP, TN etc. For example, if TP is row == column as you defined, then these are the correct values on the diagonal of confMat. But what is TN? According to your definition it is the sum of the diagonal (i.e. all the TPs) minus TP, which is not a count of true negatives. I hope this helps put things on the right track.
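If you still want to tie Eq. 1 back to the confusion matrix, here is a minimal sketch (assuming confMat was built as in Eq. 2, so rows are true classes and columns are predicted classes); the per-class counts require a one-vs-rest view where all other classes are pooled as the "negative" class:

% Overall accuracy: the correct predictions sit on the diagonal of confMat
overallAcc = sum(diag(confMat)) / sum(confMat(:));   % identical to Eq. 1

% One-vs-rest counts for a single class k
k  = 1;                                   % example class index
TP = confMat(k, k);
FP = sum(confMat(:, k)) - TP;             % predicted as k, but true class differs
FN = sum(confMat(k, :)) - TP;             % true class k, predicted as something else
TN = sum(confMat(:)) - TP - FP - FN;      % everything not involving class k
accK = (TP + TN) / (TP + TN + FP + FN);   % Eq. 3, but only for class k vs. the rest

accK is a per-class (one-vs-rest) accuracy and will generally differ from overallAcc, which is where the apparent anomaly comes from.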

Related

Rank of matrix contradicts the number of independent columns

I have a 50x49 matrix A that has 49 linearly independent columns. However, my software (Octave) tells me its rank is 44:
Is it due to some computational error? If so, then how to prevent such errors?
If the software was able to correctly calculate rref(A), then why did it fail with rank(A)? Does it mean that calculating rank(A) is more error prone than calculating rref(A), or vice versa? I mean rref(A) actually tells you the rank, but here's a contradiction.
P.S. I've checked, Python makes the same error.
EDIT 1: Here is the matrix A itself. The first 9 columns were given. The rest was obtained with polynomial features.
EDIT 2: I was able to find a similar issue. Here is a 10x10 matrix B of rank 10 (and Octave calculates its rank correctly). However, Octave says that rank(B * B) = 9, which is impossible.
The distinction between an invertible matrix (i.e. full rank) and a non-invertible one is clear-cut in theory, but not so in practice. A matrix B with large condition number (as in your example) can be inverted, but computing the inverse is numerically unstable. It roughly corresponds to B having a determinant that is "small" (using an appropriate, relative measure of "small"), so the matrix is almost singular. As a result, the inverse matrix will be computed with bad accuracy. In your example B, the condition number (computed with cond) is 2.069e9.
Another way to look at this is: when the condition number is large, it well could be that B is "really" singular, but small numerical errors from previous computations make it look barely non-singular. So you can't be sure.
The rank and rref functions use different algorithms (singular-value decomposition for rank, Gauss-Jordan elimination with partial pivoting for rref). For well-behaved matrices the numerical errors will be small in both cases, and the results will be consistent. But for a badly conditioned matrix the numerical errors will be large and potentially different in each case, giving inconsistent results.
This is a well-known issue in numerical linear algebra. In general, avoid inverting matrices with a large condition number.
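As a quick diagnostic, here is a minimal sketch of how to see this effect (the matrices A and B from the question are not reproduced here, so a Hilbert matrix is used as a stand-in for an ill-conditioned matrix; the exact outputs depend on your Octave/MATLAB version):

B = hilb(12);            % stand-in example: Hilbert matrices are notoriously ill-conditioned
cond(B)                  % huge condition number: effectively singular in double precision
svd(B)                   % the smallest singular values sit near machine epsilon
rank(B)                  % SVD-based: counts singular values above a default tolerance
rank(B, 1e-17)           % a different tolerance can change the reported rank
R = rref(B);             % Gauss-Jordan elimination
sum(any(R, 2))           % rank implied by rref (nonzero rows): may disagree with rank(B)

The disagreement between the last two numbers is exactly the rank/rref inconsistency described above.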

Pearson correlation coefficient

This question of mine is not tightly related to MATLAB, but is relevant to it:
I'm trying to figure out how to fill in the matrix [[a,b,c],[d,e,f]] in a few nontrivial ways so that as many entries as possible in
corrcoef([a,b,c],[d,e,f])
are zero. My attempts yield NaN results in most cases.
Given the current comments, you are trying to understand how two series of random draws from two distributions can have zero correlation. Specifically, exercise 4.6.9 to which you refer mentions draws from two normal distributions.
An issue with your approach is that you are hoping to derive a link between a theoretical property and experimentation, in this case using MATLAB. And, as you seem to have noticed, unless you are looking at specific degenerate cases, your experimentation will fail. That is because although the true correlation parameter rho in the exercise might be zero, a sample of random draws will always have some level of correlation. Here is an illustration, and as you'll notice if you run it, the actual correlations span the whole spectrum between -1 and 1 despite their average being zero (as it should be, since both generators are pseudo-uncorrelated):
n = 1e4;                                 % number of repeated experiments
experiment = nan(n,1);
for i = 1:n
    r = corrcoef(rand(4,1), rand(4,1));  % sample correlation of two small uniform draws
    experiment(i) = r(2);                % off-diagonal entry is the correlation coefficient
end
hist(experiment);
title(sprintf('Average correlation: %.4f', mean(experiment)));
If you look at the definition of the Pearson correlation in Wikipedia, you will see that the only way it can be zero is when the numerator is zero, i.e. E[(X-Xbar)(Y-Ybar)] = 0. Though this might be the case asymptotically, you will be hard-pressed to find a non-degenerate case where this happens in a small sample. Nevertheless, to show you that such degenerate cases can be derived, let's dig a bit further. If you want the expectation of this product to be zero, you could make either the left-hand or the right-hand factor zero whenever the other is non-zero. For one factor to be zero, the draw must be exactly equal to the average of the draws. Therefore we can imagine creating such a pair of variables using this technique:
we create two vectors of 4 draws each, and alternate which draw will be equal to the average.
let's say we want X to average 1 and Y to average 2, and we make even-indexed draws equal to the average for X and odd-indexed draws equal to the average for Y.
one such generation would be: X=[0,1,2,1], Y=[2,0,2,4], and you can check that corrcoef([0,1,2,1],[2,0,2,4]) does in fact produce an identity matrix. This is because every time a component of X differs from its average of 1, the component in Y is equal to its average of 2.
another example, where the average of X is 3 and that of Y is 4, is: X=[3,-5,3,11], Y=[1008,4,-1000,4], and so on.
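A quick check of both constructions (the deviations from the mean of one vector are zero exactly where the other's are nonzero, so the cross-product term vanishes):

corrcoef([0, 1, 2, 1], [2, 0, 2, 4])            % returns eye(2): sample correlation is exactly 0
corrcoef([3, -5, 3, 11], [1008, 4, -1000, 4])   % same result for the second example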
If you wanted to know how to create samples from non-correlated distributions altogether, that would be an entirely different question, though (perhaps) more interesting in terms of understanding statistics. If this is your case, and given that the exercise you mention discusses normal distributions, I would suggest you take a look at generating antithetic variables using the Box-Muller transform.
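For reference, a minimal sketch of the Box-Muller transform just mentioned (it turns two uniform draws into two independent standard normal draws; the antithetic-variates refinement is left aside here):

n  = 1e4;
u1 = rand(n, 1);
u2 = rand(n, 1);
z1 = sqrt(-2 * log(u1)) .* cos(2 * pi * u2);  % standard normal
z2 = sqrt(-2 * log(u1)) .* sin(2 * pi * u2);  % standard normal, independent of z1
r  = corrcoef(z1, z2);
r(2)                                          % close to zero, but not exactly zero in a finite sample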
Happy randomizing!

Kalman filter + process noise covariance

I have been working on Kalman filters in recent times. There are surely a lot of terms involved, which need to be understood thoroughly to implement and optimize the filter well. I have no clear understanding when it comes to deciding upon the process error covariance, the process noise covariance, and the measurement noise covariance.
The error covariances are still okay to work with, as they basically define the uncertainty in the actual state versus the assumed/estimated state, and the correlation between the uncertainties of the state vector components. These covariances are updated on each successive iteration and gradually converge to a minimum value as the estimates become more accurate compared to the actual state over time.
For the process noise and measurement noise covariance matrices, I started with an assumed value of a 3x3 identity matrix as Q (pure trial and error). The results didn't look so promising, so I tried plugging in this matrix:
Q = [(T^5)/20, (T^4)/8, (T^3)/6;
(T^4)/8, (T^3)/3, (T^2)/2;
(T^3)/6, (T^2)/2, T];
(Found this matrix in some paper, T is the sampling time)
This seemed to work fine and provide good results. It worked, but the reasoning behind it isn't clear to me. I also tried various other matrices, like:
Q = 0.0001*diag([0.1 0.1 0.1]);
Even this seemed to give good results. I read in a few places on the net that choosing an overly large value for Q will result in a misbehaving filter.
Kindly help me with how to choose the Q matrix. Are there any guidelines for this?
Coming to the measurement noise covariance matrix R: after reading up a bit on the net, I decided to use a calculated noise covariance as the measurement noise covariance. This again gave inaccurate results, so I had to fall back on trial and error and ended up choosing R as [1].
This works fine for now, but again I'm not satisfied with this trial-and-error method of choosing values.
It would be great if someone could help me with these clarifications.
Thanks
If you are still interested in the question, here is the answer.
As long as the real dynamics of the object you are tracking with the Kalman filter correspond to the dynamics of your filter model (written in matrix A), you don't need the covariance matrix Q at all. In that case the gain coefficients of your filter decrease from step to step. That is correct, because the filter knows your object better and better at each step and finally doesn't need measurements at all.
But! If the object dynamics differ from matrix A, then the filter lag error increases from step to step for the same reason. Matrix Q solves this problem. Q stands for the expected difference between the real object dynamics and matrix A.
For instance, matrix A equals
(1 T; 0 1)
for a two-dimensional state. If the object you are tracking accelerates, the expected difference equals
(T^2/2; T)*acceleration
Therefore, the additional prediction error covariance equals
(T^4/4 T^3/2; T^3/2 T^2)*acceleration^2
That is matrix Q. I hope it helps you.
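As a minimal sketch of the idea above (sigma_a here is an assumed standard deviation of the unmodelled acceleration, i.e. a tuning parameter that is not part of the original post):

T       = 0.1;                   % sampling time
A       = [1 T; 0 1];            % constant-velocity model (position, velocity)
G       = [T^2/2; T];            % how an unknown acceleration enters the state
sigma_a = 0.5;                   % assumed std of the unmodelled acceleration
Q       = G * G' * sigma_a^2;    % = [T^4/4 T^3/2; T^3/2 T^2] * sigma_a^2

A larger sigma_a makes the filter trust the model less and the measurements more; a smaller one does the opposite, which is why an overly large (or overly small) Q degrades the behaviour.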

Query about NaiveBayes Classifier

I am building a text classifier for classifying reviews as positive or negative. I have a query about the Naive Bayes classifier formula:
                    P(label) * P(f1|label) * ... * P(fn|label)
P(label|features) = -------------------------------------------
                                  P(features)
As per my understanding, probabilities are multiplied if the events occur together, e.g. what is the probability of A and B occurring together? Is it appropriate to multiply the probabilities in this case? I would appreciate it if someone could explain this formula in a bit more detail. I am trying to do some manual classification (just to check some algorithm-generated classifications which seem a tad off; this will enable me to identify the exact reason for the misclassification).
In basic probability terms, to calculate p(label|feature1,feature2), we have to multiply the probabilities to calculate the occurrence of feature 1 and feature 2 together. But in this case I am not trying to calculate a standard probability, rather the strength of positivity/negativity of the text. So if I sum up the probabilities, I get a number which can identify the positivity/negativity quotient. This is a bit unconventional, but do you think this can give good results? The reason is that the sum and the product can be quite different, e.g. 2*2 = 4 but 3*1 = 3.
The class-conditional probabilities P(feature|label) can be multiplied together if the features are statistically independent given the label. However, it has been found in practice that Naive Bayes still produces good results even when that independence assumption is violated. Thus, you can compute the individual class-conditional probabilities P(feature|label) by simple counting and then multiply them together.
One thing to note is that in some applications these probabilities can be extremely small, resulting in potential numerical underflow. Thus, you may want to add together the logs of the probabilities rather than multiply the probabilities themselves.
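A minimal sketch of that last point (the numbers and variable names below are hypothetical; in a real classifier logPrior and logLik would come from counting word frequencies per class):

labels   = {'positive', 'negative'};
logPrior = log([0.5; 0.5]);                 % hypothetical class priors
logLik   = log([0.10 0.05; 0.02 0.01]);     % rows: labels, cols: features present in the review
scores   = logPrior + sum(logLik, 2);       % log P(label) + sum_i log P(f_i|label)
[~, best] = max(scores);
labels{best}                                % same argmax as multiplying, but without underflow

Because the logarithm is monotonic, the label with the largest sum of logs is the same label that would have the largest product of probabilities.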
I understand it if the features were different, like: what is the probability of a person being male if the height is 170 cm and the weight 200 pounds? Then these probabilities have to be multiplied together, as these conditions (events) occur together. But in the case of text classification this is not valid, as it really doesn't matter whether the events occur together. E.g., if the probability of a review being positive given the occurrence of the word "best" is 0.1, and the probability of a review being positive given the occurrence of the word "polite" is 0.05, then the probability of the review being positive given the occurrence of both words ("best" and "polite") is not 0.1*0.05. A more indicative number would be the sum of the probabilities (which would need to be normalized).

Minimizing error of a formula in MATLAB (Least squares?)

I'm not too familiar with MATLAB or computational mathematics, so I was wondering how I might solve an equation involving a sum of squares, where each term involves two vectors, one known and one unknown. This formula is supposed to represent the error, and I need to minimize the error. I think I'm supposed to use least squares, but I don't know too much about it, and I'm wondering what function is best for doing that and what arguments would represent my equation. My teacher also mentioned something about taking derivatives, and he formed a matrix using derivatives, which confused me even more. Am I required to take derivatives?
The problem that you must be trying to solve is
min u'*u = min sum_i u_i^2, with u = y - X*beta, where u is the error, y is the vector of dependent variables you are trying to explain, X is the matrix of independent variables, and beta is the vector of coefficients you want to estimate.
Since sum_i u_i^2 is differentiable (and convex), you can find its minimum by computing the derivative with respect to beta and setting it equal to zero.
If you do that, you find that beta = inv(X'*X)*X'*y. This may be calculated using the MATLAB function regress (http://www.mathworks.com/help/stats/regress.html) or by writing the formula directly in MATLAB. However, you should be careful how you evaluate the inverse of (X'*X); see Most efficient matrix inversion in MATLAB.
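A minimal sketch of the options (X and y below are made-up data just to show the calls):

X    = [ones(10,1), (1:10)'];             % made-up design: intercept plus one regressor
y    = 3 + 2*(1:10)' + 0.1*randn(10,1);   % made-up noisy observations
beta = X \ y;                             % preferred: QR-based least squares, no explicit inverse
% beta = regress(y, X);                   % Statistics Toolbox alternative, same estimate
% beta = inv(X'*X)*X'*y;                  % textbook formula, numerically the least robust

The backslash operator solves the least-squares problem directly and avoids forming inv(X'*X), which is exactly the numerical issue the linked question warns about.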