Significance of 99% of variance covered by the first component in PCA - matlab

What does it mean/signify when the first component accounts for more than 99% of the total variance in a PCA analysis?
I have a feature matrix of size 500x1000 on which I used Matlab's pca function, which returns [coeff,score,latent,tsquared,explained]. The variable 'explained' gives the percentage of variance accounted for by each component.

The explained output tells you how accurately you could represent the data by using just that principal component. In your case it means that using only the first principal component, you can describe the data very accurately (to 99%).
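For example, here is a minimal sketch of how you might use explained to decide how many components to keep, assuming your 500x1000 data is stored in a matrix X (rows are observations, columns are variables):

[coeff, score, latent, tsquared, explained] = pca(X);
cumVar = cumsum(explained);        % cumulative percentage of variance
k = find(cumVar >= 99, 1);         % smallest number of components covering 99% of the variance
Xreduced = score(:, 1:k);          % the data expressed in the first k principal components

In your case k would come out as 1, since the first component alone already exceeds 99%.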
Let's make a 2D example. Imagine you have data that is 100x2 and you do PCA.
The result could look something like this (figure taken from the web):
This data will give you an explained value for the first principal component (the PCA 1st dimension, the big green arrow in the figure) of around 90%.
What does that mean?
It means that if you project all your data onto that line, you will reconstruct the points with 90% accuracy (of course, you will lose the information in the PCA 2nd dimension direction).
In your example, with 99%, it visually means that almost all the blue points lie on the big green arrow, with very little variation in the direction of the small green arrow.
Of course it is way more difficult to visualize with 1000 dimensions instead of 2, but I hope you understand.
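If you want to see this with numbers rather than a picture, here is a minimal 2D sketch with made-up data that is strongly elongated along one direction (the variable names and noise level are just assumptions for illustration):

rng(0);                                     % for reproducibility
t = randn(100, 1);
X = [t, 0.3*t + 0.1*randn(100, 1)];         % 100x2 data, almost all variation along one axis
[coeff, score, ~, ~, explained] = pca(X);
explained(1)                                % roughly 99: the first component captures nearly everything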

Related

About argument of PCA function in Matlab

I have a 115x8000 data matrix where 115 is the number of features. When I use the pca function of Matlab like this
[coeff,score,latent,tsquared,explained,mu] = pca(data);
on my data, I get some values. I have read on here how I can reduce my data, but one thing confuses me. The explained output shows how much a feature weighs in the calculation, but do the features get reorganized in this process, or are they in exactly the same order as I gave them to the function?
Also, I give 115 features but explained shows 114 values. Why does this happen?
The data is not "reorganized" in PCA, it is transformed to a new space. When you crop the PCA space, that is still your data, but you are not going to be able to visualize/understand it there; you need to convert it back to the "normal" space, using the eigenvectors and such.
explained gives you 114 values rather than 115 because pca centers the data first: after subtracting the mean, the 115 observations are linearly dependent, so at most 114 components carry any variance, and together those 114 already explain 100% of it.
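A minimal sketch that reproduces this behaviour with made-up numbers (the rand data is only an illustration):

data = rand(115, 8000);              % 115 rows are treated as observations by pca
[~, ~, ~, ~, explained] = pca(data);
numel(explained)                     % 114: after mean-centering, at most 114 components carry variance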
Read about it further in this answer: Significance of 99% of variance covered by the first component in PCA
PCA does not "choose" some of your features and remove the rest.
So you should not still be thinking about the original features after running PCA.
It is well-explained here on Wikipedia. You are converting your samples from the space defined by your original features to a space where features are linearly uncorrelated and called "principal components". Note: these components are no longer the original features.
An example of this in 2D: say you have a vector z=(2,3) defined in your Euclidean space. It needs 2 features (the x and the y coordinates). If we change the space and define it using the basis vectors v=(2,3) and w, a vector orthogonal to v, then z has coordinates (1,0), i.e. z = 1·v + 0·w, and can now be represented with only 1 feature (the first coordinate!).
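A minimal sketch of this change of basis in Matlab (w = (-3,2) is just one possible choice of orthogonal vector):

z = [2; 3];      % the point, in the original (x, y) coordinates
v = [2; 3];      % first basis vector, chosen along z
w = [-3; 2];     % any vector orthogonal to v
c = [v w] \ z;   % coordinates of z in the new basis: c = [1; 0]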
The link that you shared explains exactly (in the selected answer) how you can go about using the outputs of the pca function to reduce your dimensionality.
(As noted by Ander, you do not care about the last components, since these are the weakest anyway and you want to drop them.)

Principal component analysis and feature reduction

I have a matrix composed of 35 features, and I need to reduce those features because I think many variables are dependent. I understood PCA could help me do that, so using Matlab I calculated:
[coeff,score,latent] = pca(list_of_features)
I notice "coeff" contains a matrix which, as I understood it (correct me if I'm wrong), has the most important column on the left, the second column with less importance, and so on. However, it's not clear to me which column of "coeff" relates to which column of my original "list_of_features", so that I could know which variable is more important.
PCA doesn't give you an order relation on your original features (which feature is more 'important' than others); rather, it gives you directions in feature space, ordered according to the variance, from high variance (1st direction, or principal component) to low variance. A direction is generally a linear combination of your original features, so you can't expect to get information about a single feature.
What you can do is throw away a direction (one or more), or in other words project your data onto the sub-space spanned by a subset of the principal components. Usually you want to throw away the directions with low variance, but that's really a choice that depends on your application.
Let's say you want to keep only the first k principal components:
x = score(:,1:k) * coeff(:,1:k)';   % project onto the first k components and map back to the original variable space
Note however that pca centers the data, so you actually get the projection of the centered version of your data.
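If you want to get back to the original scale, a minimal sketch would be to ask pca for the column means (mu) and add them back after the projection (k = 10 is just a placeholder value):

[coeff, score, ~, ~, ~, mu] = pca(list_of_features);
k = 10;                                      % number of components to keep (example value)
xCentered = score(:, 1:k) * coeff(:, 1:k)';  % projection of the centered data
xApprox   = xCentered + mu;                  % add the column means back (implicit expansion, R2016b+; use bsxfun on older versions)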

Algorithm to detect a linear behaviour in a data set

I posted a question some time ago about an algorithm to make a polynomial fit of a part of a data set, and received some suggestions for doing what I wanted. But I face another problem now that I try to apply the ideas suggested in the answers.
My goal was to find the best linear fit of a data set, in which only a part of it was linear.
Here is an example of what I must do:
We have these two data sets, and I must fit a linear trend to the linear part of the data that is to the left of the dashed line. In red, we have the ideal data set, which has a linear part from the beginning until the dashed line. In blue, we have the 'problematic' data set, which has a plateau. The bold part is the part that I have to use for the linear fit of the data.
My problem is that I tried to do as mentioned in the question linked above: I computed the second-order derivative of the smoothed data and looked for where it was not 'close enough' to 0. But here are my results for the problematic data set (first image) and for the ideal data set (second image):
(Sorry for quality, I don't know why it is so blurred)
In both images, I plotted the first-order derivative and, in red, the second-order derivative. In the first image, we see peaks in the second-derivative values. But the problem is that the peaks are not very 'high', making it difficult to establish a threshold that would tell whether the set is linear or not... On the contrary, the peak of the first derivative is quite high, making it easy to see visually.
I thought that calculating the mean of the first-derivative values and looking for where the values differ too much from that mean would be enough... But when I take the mean of the first-derivative values in order to see where the values differ from it, there is a sort of offset due to the peak.
How can I remove this offset efficiently, so that I take the mean value only of the data to the right of the peak? (The data to the left of the discontinuity seen in Image 1 could be non-linear, or linear but with a different value from the values on the right!)
The mean operator (as you have noticed) is very sensitive to outliers (peaks). You may wish to use more robust estimators, such as the median or the x-percentile of the values (which should be more appropriate for your case).
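As a rough sketch of that idea, assuming your signal is in a vector y (the smoothing span, percentile, and threshold factor below are assumptions you would need to tune; smooth and prctile come from the Curve Fitting and Statistics toolboxes respectively):

dy       = diff(smooth(y, 11));               % first derivative of the smoothed data
center   = median(dy);                        % robust estimate of the typical (linear-part) slope
spread   = prctile(abs(dy - center), 90);     % robust estimate of the spread of the slope
outlier  = abs(dy - center) > 3 * spread;     % samples whose slope departs from the linear trend
breakIdx = find(outlier, 1, 'first');         % first index where linearity seems to be lost (may be empty)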

How to fix ROC curve with points below diagonal?

I am building receiver operating characteristic (ROC) curves to evaluate classifiers using the area under the curve (AUC) (more details on that at the end of the post). Unfortunately, points on the curve often go below the diagonal. For example, I end up with graphs that look like the one here (ROC curve in blue, identity line in grey):
The third point (0.3, 0.2) goes below the diagonal. To calculate the AUC I want to fix such recalcitrant points.
The standard way to do this, for point (fp, tp) on the curve, is to replace it with a point (1-fp, 1-tp), which is equivalent to swapping the predictions of the classifier. For instance, in our example, our troublesome point A (0.3, 0.2) becomes point B (0.7, 0.8), which I have indicated in red in the image linked to above.
This is about as far as my references go in treating this issue. The problem is that if you add the new point into a new ROC (and remove the bad point), you end up with a nonmonotonic ROC curve as shown (red is the new ROC curve, and dotted blue line is the old one):
And here I am stuck. How can I fix this ROC curve?
Do I need to re-run my classifier with the data or classes somehow transformed to take into account this weird behavior? I have looked over a relevant paper, but if I am not mistaken, it seems to be addressing a slightly different problem than this.
In terms of some details: I still have all the original threshold values, fp values, and tp values (and the output of the original classifier for each data point, an output which is just a scalar from 0 to 1 that is a probability estimate of class membership). I am doing this in Matlab starting with the perfcurve function.
Note: based on some very helpful emails about this from the people who wrote the articles cited above, and on the discussion above, the right answer seems to be: do not try to "fix" individual points in an ROC curve unless you build an entirely new classifier, and then be sure to leave out some test data to see whether that was a reasonable thing to do.
Getting points below the identity line is something that simply happens. It's like getting an individual classifier that scores 45% correct even though chance performance is 50%. That's just part of the variability with real data sets, and unless it is significantly less than expected by chance, it isn't something you should worry too much about. E.g., if your classifier gets 20% correct, then clearly something is amiss and you might look into the specific reasons and fix your classifier.
Yes, swapping a point for (1-fp, 1-tp) is theoretically effective, but increasing sample size is a safe bet too.
It does seem that your system has a non-monotonic response characteristic so be careful not to bend the rules of the ROC too much or you will impact the robustness of the AUC.
That said, you could try to use a Pareto Frontier Curve (Pareto Front). If that fits the requirements of "Repairing Concavities" then you'll basically sort the points so that the ROC curve becomes monotonic.
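A minimal sketch of that kind of repair, assuming fp and tp are the vectors returned by perfcurve (this simply discards points that are dominated by another point with a lower false positive rate and a higher true positive rate; it is not the only way to repair concavities):

pts = sortrows([fp(:) tp(:)], [1 -2]);  % sort by fp ascending, tp descending on ties
fpS = pts(:, 1);
tpS = pts(:, 2);
keep = tpS >= cummax(tpS);              % keep a point only if no earlier point has a higher tp
fpFront = fpS(keep);
tpFront = tpS(keep);
aucFront = trapz(fpFront, tpFront);     % AUC of the repaired, monotonic curve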

Process for comparing two datasets

I have two datasets at a time (in the form of vectors) and I plot them on the same axes to see how they relate to each other, and I specifically note and look for places where both graphs have a similar shape (i.e. places where both have a seemingly positive/negative gradient over approximately the same intervals). Example:
So far I have been working through the data graphically, but I realize that since the amount of data is so large, plotting it every time I want to check how two sets correlate will take far too much time.
Are there any ideas, scripts or functions that might be useful in order to automate this process somewhat?
The first thing you have to think about is the nature of the criteria you want to apply to establish the similarity. There is a wide variety of ways to measure similarity, and the more precisely you can describe what "similar" should mean in your problem, the easier it will be to implement, regardless of the programming language.
Having said that, here are some of the things you could look at:
correlation of the two datasets (see the sketch after this list)
difference of the derivative of the datasets (but I don't think it would be robust enough)
spectral analysis, as mentioned by thron of three
etc. ...
Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.
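For the first suggestion (correlation), here is a minimal sketch, assuming A and B are column vectors of equal length and that a window of 50 samples is a reasonable scale for "similar shape" (both are assumptions to adjust):

R = corrcoef(A, B);                      % overall linear correlation between the two vectors
overallCorr = R(1, 2);

win = 50;                                % sliding-window length (assumption)
n = numel(A) - win + 1;
localCorr = zeros(n, 1);
for i = 1:n
    r = corrcoef(A(i:i+win-1), B(i:i+win-1));
    localCorr(i) = r(1, 2);              % close to 1 where the two curves move together
end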
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data either with a simple averaging filter (Matlab 'smooth') or the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace).
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
Note: a) You could do the smoothing after step 3 rather than after step 1. b) Re 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', and the vectors descend to baseline before ascending in the next hill. Then 5) would misidentify the hill as coming between the initial descent and subsequent ascent. To avoid this, you could also require that the points in A and B in between the two points of velocity similarity have high absolute values.
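A minimal Matlab sketch of steps 1) to 5), assuming A and B are column vectors of equal length; the smoothing span and the threshold are assumptions you would eyeball from your own data:

As = smooth(A, 11);            % 1) smooth both vectors (span of 11 samples)
Bs = smooth(B, 11);
dA = diff(As);                 % 2) differentiate: the "velocity" of each vector
dB = diff(Bs);
C  = dA + dB;                  % 3) element-wise sum of the two velocities
thr = 0.5;                     % 4) threshold on |C|, to be eyeballed from the data
pos = find(C >  thr);          %    strong shared upward velocity
neg = find(C < -thr);          %    strong shared downward velocity
% 5) a segment where A and B have similar curves lies between a point in pos
%    and the next point in neg (or vice versa)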