Why does classifier accuracy drop after PCA, even though 99% of the total variance is covered? - matlab

I have a 500x1000 feature matrix, and principal component analysis says that over 99% of the total variance is covered by the first component. So I replace each 1000-dimensional point with a 1-dimensional one, giving a 500x1 feature matrix (using Matlab's pca function). But my classifier accuracy, which was initially around 80% with 1000 features, now drops to 30% with 1 feature, even though more than 99% of the variance is accounted for by this feature. What could explain this, or are my methods wrong?
(This question partly arises from my earlier question Significance of 99% of variance covered by the first component in PCA)
Edit:
I used Weka's principal components method to perform the dimensionality reduction and a support vector machine (SVM) classifier.

Principal components do not necessarily have any correlation with classification accuracy. There could be a 2-variable situation where 99% of the variance corresponds to the first PC, but that PC has no relation to the underlying classes in the data, whereas the second PC (which contributes only 1% of the variance) is the one that can separate the classes. If you only keep the first PC, then you lose the feature that actually provides the ability to classify the data.
In practice, smaller (lower-variance) PCs are often associated with noise, so there can be a benefit in removing them, but there is no guarantee of this.
Consider a case where you have two variables: a person's mass (in grams) and body temperature (in degrees Celsius). You want to predict which people have the flu and which do not. In this case, mass has a much greater variance but probably no correlation with the flu, whereas temperature, which has low variance, has a strong correlation with the flu. After the principal components transformation, the first PC will be strongly aligned with mass (since it has much greater variance), so if you dropped the second PC you would be losing almost all of your classification accuracy.
It is important to remember that principal component analysis is an unsupervised transformation of the data. It does not consider the labels of your training data when computing the transformation (as opposed to something like Fisher's linear discriminant).
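To make this concrete, here is a minimal MATLAB sketch (the data, variable names, and parameter values are invented purely for illustration): a two-feature dataset where nearly all of the variance lies along a direction that carries no class information, so keeping only PC1 destroys the accuracy of a linear SVM that works almost perfectly on the full data.
% Synthetic example: the high-variance direction carries no class information.
rng(1);
n  = 200;
x1 = 100*randn(2*n,1);                               % huge variance, class-independent
x2 = 0.5*randn(2*n,1) + [zeros(n,1); 3*ones(n,1)];   % tiny variance, separates the classes
X  = [x1 x2];
y  = [zeros(n,1); ones(n,1)];

[coeff, score, latent] = pca(X);
latent ./ sum(latent)            % PC1 accounts for well over 99% of the variance

mdlAll = fitcsvm(score, y);      % linear SVM on both PCs: near-perfect accuracy
mdlPC1 = fitcsvm(score(:,1), y); % linear SVM on PC1 only: roughly chance level
[1 - resubLoss(mdlAll), 1 - resubLoss(mdlPC1)]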

Related

Kalman Filter: How do the measurement noise covariance matrix and process noise help in the working of the Kalman filter? Can someone explain intuitively?

How do the process noise covariance and measurement noise covariance help the Kalman filter work better?
Can someone explain intuitively, without significant equations and math, please?
Well, it's difficult to explain mathematical things (like Kalman filters) without mathematics, but here's my attempt:
There are two parts to a Kalman filter, a time update part and a measurement part. In the time update part we estimate the state at the time of observation; in the measurement part we combine (via least squares) our 'predictions' (i.e. the estimate from the time update) with the measurements to get a new estimate of the state.
So far, no mention of noise. There are two sources of noise: one in the time update part (sometimes called process noise) and one in the measurement part (observation noise). In each case what we need is a measure of the 'size' of that noise, i.e. the covariance matrix. These are used when we combine the predictions with the measurements. When we view our predictions as very uncertain (that is, they have a large covariance matrix), the combination will be closer to the measurements than to the predictions; on the other hand, when we view our predictions as very good (small covariance), the combination will be closer to the predictions than to the measurements.
So you could look upon the process and observation noise covariances as saying how much to trust (the parts of) the predictions and observations. Increasing, say, the variance of a particular component of the predictions is to say: trust this prediction less; while increasing the variance of a particular measurement is to say: trust this measurement less. This is mostly an analogy, but it can be made more precise. A simple case is when the covariance matrices are diagonal. In that case the cost, i.e. the contribution to what we are trying to minimise, of a difference between a measurement and the computed value is the square of that difference divided by that observation's variance. So the higher an observation's variance, the lower the cost, and the less influence that measurement has on the combined estimate.
Note that out of the measurement part we also get a new state covariance matrix; this is used (along with the process noise and the dynamics) in the next time update when we compute the predicted state covariance.
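To make the 'trust' trade-off concrete, here is a minimal one-dimensional sketch (the model, noise values, and variable names are made up for illustration): a constant state observed with noise, where the Kalman gain K shows directly how the process noise Q and measurement noise R weight the prediction against the measurement.
% Minimal 1-D Kalman filter: estimate a constant state from noisy measurements.
rng(0);
trueState = 5;
z = trueState + sqrt(0.5)*randn(50,1);   % measurements with variance 0.5

Q = 1e-4;   % process noise variance: small = trust the "state barely changes" model
R = 0.5;    % measurement noise variance: large = trust each measurement less

x = 0; P = 10;                            % initial estimate and its (large) variance
for k = 1:numel(z)
    % time update: the state is modelled as constant, uncertainty grows by Q
    xPred = x;
    PPred = P + Q;
    % measurement update: the gain K weights prediction against measurement
    K = PPred / (PPred + R);              % larger R (or smaller PPred) -> smaller K
    x = xPred + K*(z(k) - xPred);
    P = (1 - K)*PPred;
end
x   % close to 5; raising R or lowering Q makes each update trust the measurement less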
I think the question of why the covariance is the appropriate measure of the size of the noise is rather a deep one, as is why least squares is the appropriate way to combine the predictions and the measurements. The shallow answer is that Kalman filtering and least squares have been found, over decades (centuries in the case of least squares), to work well in many application areas. In the case of Kalman filtering I find its derivation from hidden Markov models (From Hidden Markov Models to Linear Dynamical Systems by T. Minka, though this is rather mathematical) convincing. In hidden Markov models we seek the (conditional) probability of the states given the measurements so far; Minka shows that if the measurements are linear functions of the states, the dynamics are linear, and all probability distributions are Gaussian, then we get the Kalman filter.

TensorFlow: Binary classification accuracy

In the context of a binary classification, I use a neural network with 1 hidden layer using a tanh activation function. The input comes from a word2vec model and is normalized.
The classifier accuracy is between 49%-54%.
I used a confusion matrix to get a better understanding of what's going on. I am studying the impact of the number of input features and the number of neurons in the hidden layer on the accuracy.
What I can observe from the confusion matrix is that, depending on the parameters, the model sometimes predicts most of the rows as positive and sometimes most of them as negative.
Any suggestion as to why this happens? And which other factors (other than input size and hidden layer size) might impact the accuracy of the classification?
Thanks
It's a bit hard to guess given the information you provide.
Are the labels balanced (50% positives, 50% negatives)? If so, this would mean your network is not training at all, as your performance roughly corresponds to random performance. Is there maybe a bug in the preprocessing? Or is the task too difficult? What is the training set size?
I don't believe that the number of neurons is the issue, as long as it's reasonable, i.e. hundreds or a few thousand.
Alternatively, you can try another loss function, namely cross entropy, which is standard for multi-class classification and can also be used for binary classification:
https://www.tensorflow.org/api_docs/python/nn/classification#softmax_cross_entropy_with_logits
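For reference, the binary form of that loss, for predicted probabilities p_i and labels y_i in {0,1}, is the standard definition (written here in LaTeX):
\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]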
Hope this helps.
The data set is well balanced, 50% positive and negative.
The training set shape is (411426,X)
The test set shape is (68572,X)
X is the number of features coming from word2vec, and I try values in the range [100,300].
I have 1 hidden layer, and the number of neurons that I test varies in the range [100,300].
I also tested with much smaller feature/neuron sizes: 2-20 features and 10 neurons in the hidden layer.
I also use cross-entropy as the cost function.

Matlab fitcsvm Feature Coefficients

I'm running a series of SVM classifiers for a binary classification problem, and am getting very nice results as far as classification accuracy.
The next step of my analysis is to understand how the different features contribute to the classification. According to the documentation, Matlab's fitcsvm function returns a class, SVMModel, which has a field called "Beta", defined as:
Numeric vector of trained classifier coefficients from the primal linear problem. Beta has length equal to the number of predictors (i.e., size(SVMModel.X,2)).
I'm not quite sure how to interpret these values. I assume higher values represent a greater contribution of a given feature to the support vector? What do negative weights mean? Are these weights somehow analogous to beta parameters in a linear regression model?
Thanks for any help and suggestions.
----UPDATE 3/5/15----
In looking closer at the equations describing the linear SVM, I'm pretty sure Beta must correspond to w in the primal form.
The only other parameter is b, which is just the offset.
Given that, and given this explanation, it seems that taking the square or absolute value of the coefficients provides a metric of relative importance of each feature.
As I understand it, this interpretation only holds for the linear binary SVM problem.
Does that all seem reasonable to people?
Intuitively, one can think of the absolute value of a feature's weight as a measure of its importance. However, this is not true in the general case, because the weights represent how much a marginal change in the feature value would affect the output, which means they depend on the feature's scale. For instance, if we have a feature for "age" that is measured in years, but then we change it to months, the corresponding coefficient will be divided by 12, but clearly it doesn't mean that age is less important now!
The solution is to scale the data (which is usually a good practice anyway).
If the data is scaled, your intuition is correct, and in fact there is a feature selection method that does just that: choosing the features with the highest absolute weights. See http://jmlr.csail.mit.edu/proceedings/papers/v3/chang08a/chang08a.pdf
Note that this is correct only for a linear SVM.
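For example, here is a rough MATLAB sketch of that idea (synthetic data and invented variable names, with a linear kernel and standardized predictors so the |Beta| values are comparable across features):
% Synthetic data in which only feature 3 carries class information.
rng(0);
n = 200;
X = randn(2*n, 5);
y = [zeros(n,1); ones(n,1)];
X(:,3) = X(:,3) + 2*y;                     % make feature 3 informative

% Linear SVM with standardized predictors, then rank features by |Beta|.
SVMModel = fitcsvm(X, y, 'KernelFunction', 'linear', 'Standardize', true);
[~, featureRank] = sort(abs(SVMModel.Beta), 'descend');
featureRank'                               % feature 3 should come out on top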

One Class SVM using LibSVM in Matlab - Conceptual

Perhaps this is an easy question, but I want to make sure I understand the conceptual basis of the LibSVM implementation of one-class SVMs and if what I am doing is permissible.
I am using one class SVMs in this case for outlier detection and removal. This is used in the context of a greater time series prediction model as a data preprocessing step. That said, I have a Y vector (which is the quantity we are trying to predict and is continuous, not class labels) and an X matrix (continuous features used to predict). Since I want to detect outliers in the data early in the preprocessing step, I have yet to normalize or lag the X matrix for use in prediction, or for that matter detrend/remove noise/or otherwise process the Y vector (which is already scaled to within [-1,1]). My main question is whether it is correct to model the one class SVM like so (using libSVM):
svmod = svmtrain(ones(size(Y,1),1),Y,'-s 2 -t 2 -g 0.00001 -n 0.01');
[od,~,~] = svmpredict(ones(size(Y,1),1),Y,svmod);
The resulting model does yield performance somewhat in line with what I would expect (99% or so prediction accuracy, meaning 1% of the observations are outliers). But the reason I ask is that in other questions regarding one-class SVMs, people appear to be using their X matrices where I use Y. Thanks for your help.
What you are doing here is nothing more than a fancy range check. If you are not willing to use X to find outliers in Y (even though you really should), it would be a lot simpler and better to just check the distribution of Y to find outliers instead of this improvised SVM solution (for example remove the upper and lower 0.5-percentiles from Y).
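For instance, a minimal sketch of that range check, assuming Y is the vector from the question and using the 0.5-percentile cutoffs mentioned above:
% Flag Y values outside the central 99% as outliers (simple range check).
lo = prctile(Y, 0.5);
hi = prctile(Y, 99.5);
isOutlier = (Y < lo) | (Y > hi);
Ytrimmed  = Y(~isOutlier);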
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting Y values as outliers without considering any context (e.g. X). Why are you using an RBF kernel, and how did you come up with that specific value for gamma? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (Y). A kitten dies every time this happens. One-class SVM attempts to build a model which recognizes the training data, it should not be used on the same data it was built with. Please, think of the kittens.
Additionally, note that the nu parameter of one-class SVM controls the fraction of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4): "It is proved that nu is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors." In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace "can" with "should".
So when you say that the resulting model "does yield performance somewhat in line with what I would expect" ... of course it does, by definition. Since you have set nu=0.01, roughly 1% of the data is rejected by the model and thus flagged as outliers.
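If you do want to keep the one-class SVM, here is a rough sketch of a cleaner setup that at least avoids scoring the model on its own training data (the 70/30 split is arbitrary, and the libsvm options are copied from the question, including the kernel settings questioned above):
% Hold out part of Y so the one-class model is evaluated on unseen data.
n      = numel(Y);
idx    = randperm(n);
nTrain = round(0.7*n);
Ytrain = Y(idx(1:nTrain));
Ytest  = Y(idx(nTrain+1:end));

model      = svmtrain(ones(nTrain,1), Ytrain, '-s 2 -t 2 -g 0.00001 -n 0.01');
[pred,~,~] = svmpredict(ones(n-nTrain,1), Ytest, model);
mean(pred == -1)   % fraction of held-out points flagged as outliers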

Principal component analysis

I have to write a classifier (Gaussian mixture model) to use for human action recognition.
I have 4 datasets of video. I choose 3 of them as the training set and 1 of them as the testing set.
Before I apply the GM model to the training set, I run PCA on it:
pca_coeff = princomp(training_data);
score = training_data * pca_coeff;
training_data = score(:,1:min(size(score,2),numDimension));
During the testing step, what should I do? Should I execute a new princomp on the testing data:
new_pca_coeff=princomp(testing_data);
score = testing_data * new_pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
or should I use the pca_coeff that I computed for the training data?
score = testing_data * pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
The classifier is being trained on data in the space defined by the principal components of the training data. It doesn't make sense to evaluate it in a different space - therefore, you should apply the same transformation to the testing data as you did to the training data, so don't compute a different pca_coeff.
Incidentally, if your testing data is drawn independently from the same distribution as the training data, then for large enough training and test sets the principal components should be approximately the same.
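A rough sketch of that second option, using the variable names from the question (note that princomp centres the data before projecting, so the test data should be shifted by the training mean rather than by its own):
% Fit PCA on the training data only, then reuse the same coefficients
% (and the training mean) to project the testing data.
[pca_coeff, training_score] = princomp(training_data);
mu = mean(training_data, 1);

training_data = training_score(:, 1:min(size(training_score,2), numDimension));

testing_score = bsxfun(@minus, testing_data, mu) * pca_coeff;
testing_data  = testing_score(:, 1:min(size(testing_score,2), numDimension));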
One method for choosing how many principal components to use involves examining the eigenvalues from the PCA decomposition. You can get these from the princomp function like this:
[pca_coeff, score, eigenvalues] = princomp(data);
The eigenvalues variable will then be an array where each element describes the amount of variance accounted for by the corresponding principal component. If you do:
plot(eigenvalues);
you should see that the first eigenvalue will be the largest, and they will rapidly decrease (this is called a "scree plot", and should look like this: http://www.ats.ucla.edu/stat/SPSS/output/spss_output_pca_5.gif, though yours may have up to 800 points instead of 12).
Principal components with small corresponding eigenvalues are unlikely to be useful, since the variance of the data in those dimensions is so small. Many people choose a threshold value and then select all principal components whose eigenvalue is above that threshold. An informal way of picking the threshold is to look at the scree plot and choose the threshold to be just after the line 'levels out' - in the image I linked earlier, a good value might be ~0.8, selecting 3 or 4 principal components.
IIRC, you could do something like:
proportion_of_variance = sum(eigenvalues(1:k)) ./ sum(eigenvalues);
to calculate "the proportion of variance described by the low dimensional data".
However, since you are using the principal components for a classification task, you can't really be sure that any particular number of PCs is optimal; the variance of a feature doesn't necessarily tell you anything about how useful it will be for classification. An alternative to choosing PCs with the scree plot is simply to try classification with various numbers of principal components and see what the best number is empirically.