Query about NaiveBayes Classifier - classification

I am building a text classifier for classifying reviews as positive or negative. I have a query on NaiveBayes classifier formula:
P(label|features) = [ P(label) * P(f1|label) * ... * P(fn|label) ] / P(features)
As per my understanding, probabilities are multiplied if the events occur together, e.g. to find the probability of A and B occurring together. Is it appropriate to multiply the probabilities in this case? I would appreciate it if someone could explain this formula in a bit more detail. I am trying to do some manual classification (just to check some algorithm-generated classifications which seem a tad off; this will enable me to identify the exact reason for misclassification).
In basic probability terms, to calculate P(label|feature1,feature2), we have to multiply the probabilities to calculate the occurrence of feature 1 and feature 2 together. But in this case I am not trying to calculate a standard probability, rather the strength of positivity/negativity of the text. So if I sum up the probabilities, I get a number which can identify the positivity/negativity quotient. This is a bit unconventional, but do you think this can give some good results? The reason is that the sum and product can be quite different, e.g. 2+2 and 3+1 both give 4, but 2*2 = 4 while 3*1 = 3.

The class-conditional probabilities P(feature|label) can be multiplied together if the features are conditionally independent given the label (this is the "naive" assumption). However, it's been found in practice that Naive Bayes still produces good results even when the features are not independent. Thus, you can compute the individual class-conditional probabilities P(feature|label) from simple counting and then multiply them together.
One thing to note is that in some applications, these probabilities can be extremely small, resulting in potential numerical underflow. Thus, you may want to add together the logs of the probabilities (rather than multiply the probabilities).
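As a rough illustration of the two points above, here is a minimal sketch that computes the numerator of the formula both as a direct product and as a sum of logs. The priors and word probabilities are made-up numbers chosen for the example, not values from the question.

```matlab
% Hypothetical numbers for a two-class review classifier (assumed, not from the question)
prior = [0.5 0.5];                 % P(positive), P(negative)
% Rows: labels (positive, negative); columns: P(word|label) for two words
likelihood = [0.10 0.05;
              0.02 0.01];

% Direct product: P(label) * P(f1|label) * P(f2|label)
joint = prior .* prod(likelihood, 2)';

% Same quantity in log space: summing logs avoids numerical underflow
logJoint = log(prior) + sum(log(likelihood), 2)';

% Normalise by P(features) in log space (log-sum-exp)
m = max(logJoint);
logEvidence = m + log(sum(exp(logJoint - m)));
posterior = exp(logJoint - logEvidence);

[~, best] = max(logJoint);         % index of the predicted label (1 = positive)
```

Because the log is monotonic, the label with the largest log score is the same label that has the largest product, so the classification does not change; only the numerical behaviour does.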

I would understand if the features were different, like the probability of a person being male if the height was 170 cm and the weight 200 pounds. Then these probabilities have to be multiplied together, as these conditions (events) occur together. But in the case of text classification this is not valid, as it really doesn't matter if the events occur together. E.g. if the probability of a review being positive given the occurrence of the word "best" is 0.1 and the probability of a review being positive given the occurrence of the word "polite" is 0.05, then the probability of the review being positive given the occurrence of both words ("best" and "polite") is not 0.1*0.05. A more indicative number would be the sum of the probabilities (which needs to be normalized).

Pearson correlation coefficient

This question of mine is not tightly related to Matlab, but is relevant to it:
I'm looking how to fill in the matrix [[a,b,c],[d,e,f]] in a few nontrivial ways so that as many places as possible in
corrcoef([a,b,c],[d,e,f])
are zero. My attempts yield NaN result in most cases.
Given the current comments, you are trying to understand how two series of random draws from two distributions can have zero correlation. Specifically, exercise 4.6.9 to which you refer mentions draws from two normal distributions.
An issue with your approach is that you are hoping to derive a link between a theoretical property and experimentation, in this case using Matlab. And, as you seem to have noticed, unless you are looking at specific degenerate cases, your experimentation will fail. That is because, although the true correlation parameter rho in the exercise might be zero, a sample of random draws will almost always have some level of correlation. Here is an illustration; as you'll notice if you run it, the actual correlations span the whole spectrum between -1 and 1 despite their average being close to zero (as it should be, since the two pseudo-random draws are generated independently):
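n = 1e4;                     % number of repeated experiments
experiment = nan(n,1);
for i = 1:n
    % correlation matrix of two independent 4-element uniform samples
    r = corrcoef(rand(4,1), rand(4,1));
    experiment(i) = r(2);    % off-diagonal entry is the sample correlation
end
hist(experiment);
title(sprintf('Average correlation: %.4f', mean(experiment)));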
If you look at the definition of Pearson correlation on Wikipedia, you will see that the only way this can be zero is when the numerator is zero, i.e. E[(X-Xbar)(Y-Ybar)]=0. Though this might be the case asymptotically, you will be hard-pressed to find a non-degenerate case where this will happen in a small sample. Nevertheless, to show that you can derive some such degenerate cases, let's dig a bit further. If you want the expectation of this product to be zero, you could make either the left or the right factor zero whenever the other is non-zero. For one factor to be zero, the draw must be exactly equal to the average of the draws. Therefore we can imagine creating such a pair of variables using this technique:
We create two vectors of 4 variables, and alternate which draw will be equal to the average.
Let's say we want X to average 1 and Y to average 2, and we make even-indexed draws equal to the average for X and odd-indexed draws equal to the average for Y.
One such generation would be: X=[0,1,2,1], Y=[2,0,2,4], and you can check that corrcoef([0,1,2,1],[2,0,2,4]) does in fact produce an identity matrix. This is because every time a component of X differs from its average of 1, the corresponding component of Y is equal to its average of 2.
Another example, where the average of X is 3 and that of Y is 4, is: X=[3,-5,3,11], Y=[1008,4,-1000,4]. And so on.
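The two examples above can be checked with a short sketch. It computes the numerator of the sample correlation (the sum of products of deviations from the mean) directly, then calls corrcoef on the same vectors; the variable names are only for illustration.

```matlab
% Vectors taken from the two examples in the text
X1 = [0 1 2 1];          Y1 = [2 0 2 4];
X2 = [3 -5 3 11];        Y2 = [1008 4 -1000 4];

% Numerator of the sample Pearson correlation: sum of products of deviations
num1 = sum((X1 - mean(X1)) .* (Y1 - mean(Y1)));
num2 = sum((X2 - mean(X2)) .* (Y2 - mean(Y2)));

% 2x2 correlation matrices; the off-diagonal entries are the correlation
r1 = corrcoef(X1, Y1);
r2 = corrcoef(X2, Y2);
disp(r1);
disp(r2);
```

In every position, one of the two deviations is exactly zero, so each product vanishes and the numerator is zero regardless of how large the other deviation is.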
If you wanted to know how to create samples from non-correlated distributions altogether, that would be an entirely different question, though (perhaps) more interesting in terms of understanding statistics. If this is your case, and given that the exercise you mention discusses normal distributions, I would suggest you take a look at generating antithetic variables using the Box-Muller transform.
Happy randomizing!

Anomaly in accuracy calculation

I am classifying a dataset with four classes using pretrained VGG19. To calculate accuracy, I used this formula:
accuracy = sum(predictedLabels==testLabels)/numel(predictedLabels) (Eq. 1)
Then I calculated the confusion matrix using:
confMat = confusionmat(testLabels, predictedLabels) (Eq. 2)
From which I got a matrix with 4 rows and 4 columns since I had 4 classes.
Now, we know that the accuracy formula is also:
Accuracy = (TP+TN)/(TP+TN+FP+FN) (Eq. 3)
So I also calculated the accuracy from my confusion matrix formed through Eq. 2 above, where
TP=value in (row==column),
FP=sum of column-TP,
FN=sum of row-TP,
TN=sum of the diagonal-TP
If I am doing the above steps right, then my confusion is that I am getting different accuracies from the two methods, Eq. 1 and Eq. 3. The accuracy I am getting with Eq. 1 is equivalent to the formula TP/(TP+TN). So, if this is the case, then Eq. 1 is the wrong formula for calculating accuracy. But this formula has been used across all MATLAB deep learning code.
So, either MATLAB is doing something wrong (which has probability 0, I know) or I am doing something wrong. But, unfortunately, I am unable to pinpoint my mistake.
Now, the question is,
Am I doing it wrong? Where am I missing the step? How to correct it? What is the logical explanation of this anomaly?
EDIT
This anomaly in the accuracy calculation happens due to the class imbalance problem, that is, when there are different numbers of samples in each class. Therefore, the regular accuracy formula in Eq. 3 will not work in such cases.
The main issue is that positive and negative apply to binary prediction (is this a cat or not), while you are doing classification with more than two categories. The classifier doesn't give you a positive or negative answer (as in "is it a cat?"), so its answers cannot be labelled as true positives, false positives, etc. Therefore Eq. 3 is meaningless here, and so is your method for computing TP, TN, etc. For example, if TP is row==column as you defined, then these are the correctly classified counts on the diagonal of confMat. But what is TN? According to your definition it is the sum of the diagonal minus TP, which is the diagonal minus itself if TP already stands for the whole diagonal. I hope this helps put things on the right track.
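To make the point concrete, here is a minimal sketch with made-up labels for a 4-class problem. It shows that the overall accuracy from Eq. 1 equals the sum of the confusion matrix diagonal divided by the total count. The one-vs-rest definitions of TP, FP, FN and TN at the end are one common convention for multi-class problems and are an assumption of this sketch, not something taken from the answer; accumarray is used only so that the example does not depend on confusionmat.

```matlab
% Hypothetical labels for a 4-class problem (classes 1..4), for illustration only
testLabels      = [1 1 1 2 2 3 3 3 3 4];
predictedLabels = [1 1 2 2 2 3 3 4 3 4];

% Eq. 1 from the question
accuracy1 = sum(predictedLabels == testLabels) / numel(predictedLabels);

% Confusion matrix: rows are true classes, columns are predicted classes
confMat = accumarray([testLabels(:) predictedLabels(:)], 1, [4 4]);

% Multi-class accuracy: correct predictions (diagonal) over all samples
accuracy2 = trace(confMat) / sum(confMat(:));

% One-vs-rest counts per class (assumed convention)
TP = diag(confMat)';
FP = sum(confMat, 1) - TP;            % predicted as class k, actually another class
FN = sum(confMat, 2)' - TP;           % actually class k, predicted as another class
TN = sum(confMat(:)) - TP - FP - FN;  % everything not involving class k
perClassAccuracy = (TP + TN) ./ (TP + TN + FP + FN);
```

Under this convention, Eq. 3 yields one accuracy value per class (each class treated as "positive" in turn), which is a different quantity from the single overall accuracy of Eq. 1.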

Which scaling technique does it use?

I have a matrix X, the size of which is 100*2000 double. I want to know which kind of scaling technique is applied to matrix X in the following command, and why it does not use z-score to do scaling?
X = X./repmat(sqrt(sum(X.^2)),size(X,1),1);
That scaling comes from linear algebra. That's what we call normalizing by producing a unit vector. Assuming that each row is an observation and each column is a feature, what's happening here is that we are going through every observation that you collected and normalizing each feature value over all observations such that the overall length / magnitude of a particular feature for all observations is set to 1.
The denominator looks at each feature (column) and computes its norm, or magnitude, over all observations. Once you have these magnitudes, you take each feature value for each observation and divide it by the magnitude of its column.
The reason why unit vectors are often employed is to describe a point in feature space with respect to a set of basis vectors. Normalizing by producing unit vectors gives you the smallest possible way to represent one component in feature space and so what's probably happening here is that the observations are now being transformed such that each component / feature is being represented in terms of a set of basis vectors. Each basis vector is one feature in the data.
Check out the Wikipedia article on Unit Vectors for more details: http://en.wikipedia.org/wiki/Unit_vector
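A small sketch may help to see what the command does. It applies the same expression to a tiny made-up matrix and checks that every column ends up with length 1. The z-score at the end is written out by hand purely for comparison and is not part of the original command.

```matlab
% Small hypothetical matrix: rows are observations, columns are features
X = [3 1; 4 2; 0 2];

% Command from the question: divide each column by its Euclidean (L2) norm
Xn = X ./ repmat(sqrt(sum(X.^2)), size(X,1), 1);

% Each column of Xn now has length 1
colNorms = sqrt(sum(Xn.^2));

% For comparison, a z-score by hand: zero mean and unit standard deviation per column
Xz = (X - repmat(mean(X), size(X,1), 1)) ./ repmat(std(X), size(X,1), 1);
```

The unit-norm scaling does not centre the data, so column means are generally non-zero afterwards, whereas the z-score removes the mean and fixes the standard deviation rather than the vector length.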

Probability of generating a particular random number, such as in MATLAB

In real probability, there is a 0% chance that a random number p, selected from all of the real numbers in the interval (0,1), will be 0.5. However, what are the odds that
rand == 0.5
in MATLAB? I suppose this is like asking how many double-precision numbers are between zero and one, or maybe there are other factors at play.
No particular info on MATLAB's generator...
In general even simple pseudo-random generators have long enough cycles which would cover all values representable by double.
If MATLAB uses some other form of generating random numbers it would be even better - so assuming it uniformly covers whole range of double values.
I believe the probability would be the distance between representable numbers around the value you are interested in, divided by the length of the interval. See "What is the minimal step in double data type? (.NET)" for a discussion of that distance.
Looking at this question, we see that there are 2^62 - 2^52 doubles in the interval (0 1). Therefore, the probability of picking any single one (like 0.5) would be roughly equal to one divided by this number, or
>> p = 1/(2^62-2^52)
p =
2.170523997312134e-019
However, as horchler already indicates, it also depends on the type of random number generator you use, as well as MATLAB's implementation thereof. Sadly, I have only basic knowledge of the implementation details for each, but you can look here for a list of available random number generators in MATLAB and google a bit further for more precise numbers.
I am not sure whether Alexei was trying to say this, but inspired by him I think the probability will indeed be approximately the distance between numbers around 0.5.
Therefore I expect the probability to be approximately:
eps(0.5)
Which evaluates to 1.1102e-16
Given the monotonic nature of the difference between double numbers I would actually think this holds:
eps(0.5-eps(0.5)) <= yourprobability <= eps(0.5)
Implying a range of 5.5511e-17 to 1.1102e-16
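The estimates quoted in the answers above can be reproduced with a few lines. This is only a sketch of the arithmetic; it says nothing about how MATLAB's generators actually pick values.

```matlab
% Spacing of doubles just above and just below 0.5
stepAbove = eps(0.5);              % 2^-53
stepBelow = eps(0.5 - eps(0.5));   % 2^-54

% Estimate based on the count of doubles in (0,1) quoted above
pCount = 1 / (2^62 - 2^52);

fprintf('%.4e <= p <= %.4e (spacing argument)\n', stepBelow, stepAbove);
fprintf('p ~ %.4e (every double in (0,1) equally likely)\n', pCount);
```

The two figures differ because they rest on different assumptions: the count-based estimate treats every representable double as equally likely, while the spacing argument assumes the draw is uniform over the interval and then rounded to a nearby double.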

How to select top 100 features(a subset) which are most relevant after pca?

I performed PCA on a 63*2308 matrix and obtained a score matrix and a coefficient matrix. The score matrix is 63*2308 and the coefficient matrix is 2308*2308 in dimensions.
How do I extract the column names for the top 100 features which are most important, so that I can perform regression on them?
PCA should give you both a set of eigenvectors (your coefficient matrix) and a vector of eigenvalues (1*2308), often referred to as lambda. You might need to use a different PCA function in MATLAB to get them.
The eigenvalues indicate how much of the variance in your data each eigenvector explains. A simple method would be to select the 100 principal components (the new features) with the highest eigenvalues. This gives you a set of components which explain most of the variance in the data.
If you need to justify your approach for a write-up, you can calculate the amount of variance explained per eigenvector and cut off at, for example, 95% variance explained.
Bear in mind that selecting based solely on eigenvalue might not give you the set of features most important to your regression, so if you don't get the performance you expect you might want to try a different feature selection method such as recursive feature selection. I would suggest using Google Scholar to find a couple of papers doing something similar and see what methods they use.
A quick MATLAB example of taking the top 100 principal components using PCA:
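% princomp returns the components ordered by decreasing eigenvalue
[eigenvectors, projected_data, eigenvalues] = princomp(X);
[foo, feature_idx] = sort(eigenvalues, 'descend');
% keep the scores on the 100 components with the largest eigenvalues
selected_projected_data = projected_data(:, feature_idx(1:100));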
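The 95% variance cut-off mentioned above can be sketched as follows. The eigenvalues here are made-up numbers standing in for the vector returned by PCA, and the 0.95 threshold is just the example value from the text.

```matlab
% Hypothetical eigenvalues (variances of the principal components), sorted descending
eigenvalues = [5 3 1 0.6 0.3 0.1];

explained  = eigenvalues / sum(eigenvalues);   % fraction of variance per component
cumulative = cumsum(explained);                % running total of explained variance

% Smallest number of components whose cumulative share reaches the threshold
k = find(cumulative >= 0.95, 1);
```

With real data, k would then replace the fixed 100 when selecting columns of the score matrix.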
Have you tried this?
B = sort(your_matrix,2,'descend');
C = B(:,1:100);
Be careful!
With just 63 observations and 2308 variables, your PCA result will be meaningless because the data is underspecified. As a rule of thumb, you should have at least dimensions*3 observations.
With 63 observations, you can at most define a 62-dimensional hyperspace!