Principal component analysis and feature reduction - matlab

I have a matrix composed of 35 features, and I need to reduce those
features because I think many of the variables are dependent. I understood that PCA
could help me do that, so using MATLAB, I calculated:
[coeff,score,latent] = pca(list_of_features)
I notice "coeff" contains matrix which I understood (correct me if I'm wrong) have column with high importance on the left, and second column with less importance and so on. However, it's not clear for me which column on "coeff" relate to which column on my original "list_of_features" so that I could know which variable is more important.

PCA doesn't give you an order relation on your original features (which feature is more 'important' than others); rather, it gives you directions in feature space, ordered according to the variance, from high variance (1st direction, or principal component) to low variance. A direction is generally a linear combination of your original features, so you can't expect to get information about a single feature.
What you can do is throw away a direction (one or more), or in other words project your data onto the subspace spanned by a subset of the principal components. Usually you want to throw away the directions with low variance, but that's really a choice that depends on your application.
Let's say you want to keep only the first k principal components:
x = score(:,1:k) * coeff(:,1:k)';
Note however that pca centers the data, so you actually get the projection of the centered version of your data.
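For completeness, a minimal sketch (under the same setup as above) of the full round trip, using the mu output of pca to undo that centering; the choice k = 10 is just a placeholder:
[coeff, score, latent, ~, ~, mu] = pca(list_of_features);
k = 10;                                      % placeholder; choose k for your data
x_centered = score(:,1:k) * coeff(:,1:k)';   % projection of the centered data
x_approx = x_centered + mu;                  % add the column means back (implicit expansion, R2016b+)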


About argument of PCA function in Matlab

I have 115*8000 data where 115 is the number of features. When I use MATLAB's pca function on my data like this
[coeff,score,latent,tsquared,explained,mu] = pca(data);
I get some values. I read here how I can reduce my data, but one thing confuses me. The explained output shows how much a feature weighs in the calculation, but do the features get reorganized in this process, or are the features in exactly the same order as I gave them to the function?
Also, I give 115 features but explained shows 114. Why does that happen?
The data is not "reorganized" in PCA, is transformed to a new space. When you crop the PCA space, that is your data, but you are not going to be able to visualize/understand it there, you need to convert it back to "normal" space, using eigenvectors and such.
explained gives you 114 because you now what is the answer with 115! 100% of the data can be explained with the whole data!
Read about it further in this answer: Significance of 99% of variance covered by the first component in PCA
PCA does not "choose" some of your features and remove the rest.
So you should not still be thinking about the original features after running PCA.
It is well-explained here on Wikipedia. You are converting your samples from the space defined by your original features to a space where features are linearly uncorrelated and called "principal components". Note: these components are no longer the original features.
An example of this in 2D could be: you have a vector z=(2,3) defined in your Euclidean space. It needs 2 features (the x and the y coordinates). If we change the space and define it using the basis vectors v=(2,3) and w, an orthogonal vector to v, then z=(1,0), i.e. z = 1*v + 0*w, and it can now be represented with only 1 feature (the first coordinate!).
The link that you shared explains exactly (in the selected answer) how you can go about using the outputs of the pca function to reduce your dimensionality.
(As noted by Ander, you do not care about the last components since these are the weakest anyway and you want to drop them.)
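To make the link between explained and dimensionality reduction concrete, here is a minimal sketch (mine, not from the answers above) that keeps just enough components to cover, say, 95% of the variance; the 95% figure is only an example:
[coeff, score, latent, tsquared, explained, mu] = pca(data);
k = find(cumsum(explained) >= 95, 1);   % smallest k covering 95% of the variance
reduced = score(:, 1:k);                % samples expressed in the first k components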

In DBSCAN, what does eps represent actually?

Suppose that I have already found the eps for all densities. I applied the methodology from here: http://ijiset.com/v1s4/IJISET_V1_I4_48.pdf
If you don't mind, please open page 5 and look at the Proposed Algorithm section. At step 10.1, the paper tells us to calculate the number of objects in the eps-neighborhood.
What does eps actually represent? It is a radius to draw a circle, right? So why is the radius so small, smaller than the distance between two objects? If that is the case, MinPts will be 0 forever.
Yes, if used with Euclidean distance, then it is a radius.
It is not infinitely small (it does not tend to 0). It's just supposed to be small compared to the data set extent, but the authors could have named it "r" instead.
Use the original paper to understand the algorithm, not some Indian journal variant of it.
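To illustrate what "the number of objects in the eps-neighborhood" (step 10.1) means in practice, here is a minimal sketch assuming a numeric data matrix X (one point per row), chosen values Eps and MinPts, and pdist2 from the Statistics and Machine Learning Toolbox:
D = pdist2(X, X);                      % pairwise Euclidean distances
counts = sum(D <= Eps, 2);             % size of each point's Eps-neighborhood (includes the point itself, as in the original paper)
corePoints = find(counts >= MinPts);   % points whose neighborhood is dense enough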
With Euclidean distance, it is the radius. Selecting Eps is a little difficult.
This problem is related to model selection, i.e., the selection of a particular model and its corresponding parametrization. In the case of k-means (which requires the number of clusters as input from the user), there is a plethora of measures in the literature that can help in the selection of the best number of clusters, for instance: Silhouette, C-index, Dunn, Davies-Bouldin. These measures are the so-called relative validity criteria.
In the case of density-based clustering algorithms, there are some measures too, for instance: CDbw and DBCV.
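Not from these answers, but a common heuristic from the original DBSCAN paper is the sorted k-distance plot: plot each point's distance to its MinPts-th nearest neighbor in descending order and read Eps off the "knee" of the curve. A minimal sketch, assuming knnsearch from the Statistics and Machine Learning Toolbox and the same data matrix X:
minPts = 4;                                 % a common default; tune for your data
[~, D] = knnsearch(X, X, 'K', minPts + 1);  % each point's first neighbor is itself
kdist = sort(D(:, end), 'descend');         % distance to the minPts-th real neighbor
plot(kdist); xlabel('points sorted by k-distance'); ylabel('k-distance');   % Eps is roughly the value at the knee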

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g., bank name, account type). Numerical types are, e.g., salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of allowing you to continue your (probably matrix-based) implementation with this kind of data, but a much simpler way, and appropriate for most distance-based methods, is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment to find which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance based algorithms, this is the most difficult part, optimizing your domain specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set theoretic measures which may be much more appropriate in your case. For example, Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
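As an illustration of such a modified distance function (a sketch of the idea, not the answerer's code): combine a metric on the normalized numeric columns with a mismatch count on the categorical columns, scaled by a weight you tune for your domain. The names numIdx, catIdx and w are assumptions:
function d = mixedDistance(a, b, numIdx, catIdx, w)
% a, b: two rows of the data set; numIdx/catIdx: indices of the numeric and categorical columns; w: weight of the categorical part
dNum = norm(a(numIdx) - b(numIdx));   % Euclidean part on (already normalized) numeric columns
dCat = sum(a(catIdx) ~= b(catIdx));   % number of categorical mismatches
d = dNum + w * dCat;                  % scaled combination; tune w experimentally
end
If you want to plug something like this into pdist2 or knnsearch, note that their custom-distance interface expects a function that takes one row against a matrix of rows, so this sketch would need to be vectorized accordingly.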
The most straightforward way to convert categorical data into numeric is by using indicator vectors. See the reference I posted in my previous comment.
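For reference, a minimal sketch of building such indicator (one-hot) vectors in MATLAB, assuming a hypothetical categorical column bank and dummyvar from the Statistics and Machine Learning Toolbox:
bank = categorical({'HSBC'; 'Barclays'; 'HSBC'; 'FNB'});   % hypothetical categorical field
indicators = dummyvar(bank);                               % one 0/1 column per bank, one row per sample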
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

How to find the "optimal" cut-off point (threshold)

I have a set of weighted features for machine learning. I'd like to reduce the feature set and just use those with a very large or very small weight.
So, given the image below of sorted weights, I'd only like to use the features whose weights are above the upper or below the lower yellow line.
What I'm looking for is some kind of slope change detection so I can discard all the features until the first/last slope coefficient increase/decrease.
While I (think I) know how to code this myself (with first and second numerical derivatives), I'm interested in any established methods. Perhaps there's some statistic or index that computes something like that, or anything I can use from SciPy?
Edit:
At the moment, I'm using 1.8*positive.std() as positive and 1.8*negative.std() as negative threshold (fast and simple), but I'm not mathematician enough to determine how robust this is. I don't think it is, though. ⍨
If the data are (approximately) Gaussian distributed, then just using a multiple of the standard deviation is sensible.
If you are worried about heavier tails, then you may want to base your analysis on order statistics. Since you've plotted it, I'll assume you're willing to sort all of the data.
Let N be the number of data points in your sample, and let x[i] be the i'th value in the sorted list of values. Then 0.5*(x[int(0.8413*N)] - x[int(0.1587*N)]) is an estimate of the standard deviation which is more robust against outliers. This estimate of the std can be used as you indicated above. (The magic numbers above are the fractions of the data that are less than [mean + 1 sigma] and [mean - 1 sigma], respectively.)
There are also conditions where just keeping the highest 10% and lowest 10% would be sensible as well; these cutoffs are easily computed if you have the sorted data on hand.
These are somewhat ad hoc approaches based on the content of your question. The general sense of what you're trying to do is (a form of) anomaly detection, and you can probably do a better job of it if you're careful in defining/estimating what the shape of the distribution is near the middle, so that you can tell when the features are getting anomalous.
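A minimal sketch of that order-statistic estimate, written in MATLAB for consistency with the rest of this thread (the same few lines translate directly to NumPy/SciPy); weights is an assumed vector of feature weights, and the 1.8 multiplier is kept only because the asker already uses it:
w = sort(weights(:));
N = numel(w);
robustStd = 0.5 * (w(max(1, round(0.8413*N))) - w(max(1, round(0.1587*N))));
keep = (weights > 1.8*robustStd) | (weights < -1.8*robustStd);   % features to retain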

Process for comparing two datasets

I have two datasets at a time (in the form of vectors), and I plot them on the same axes to see how they relate to each other. I specifically look for places where both graphs have a similar shape (i.e., places where both have a seemingly positive/negative gradient over approximately the same intervals). Example:
So far I have been working through the data graphically, but I realize that since the amount of data is so large, plotting it every time I want to check how two sets correlate will take far too much time.
Are there any ideas, scripts or functions that might be useful in order to automate this process somewhat?
The first thing you have to think about is the nature of the criteria you want to apply to establish the similarity. There is a wide variety of ways to measure similarity, and the more precisely you can describe what you want "similar" to mean in your problem, the easier it will be to implement, regardless of the programming language.
Having said that, here are some of the things you could look at:
correlation of the two datasets (see the sketch below)
difference of the derivatives of the datasets (but I don't think it would be robust enough)
spectral analysis, as mentioned by thron of three
etc.
Knowing the origin of the datasets and their variability can also help a lot in formulating robust enough algorithms.
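As one concrete way to pursue the correlation suggestion above, here is a minimal sketch (mine, not the answerer's) of a sliding-window correlation between the two vectors A and B; the window length is an assumption you would tune:
win = 50;                               % window length; tune to the scale of the features you care about
n = numel(A) - win + 1;
r = zeros(n, 1);
for i = 1:n
    c = corrcoef(A(i:i+win-1), B(i:i+win-1));
    r(i) = c(1, 2);                     % local correlation of the two stretches
end
plot(r);                                % values near +1 mark similarly shaped regions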
Sure. Call your two vectors A and B.
1) (Optional) Smooth your data either with a simple averaging filter (MATLAB 'smooth') or the 'filter' command. This will get rid of local changes in velocity ("gradient") that appear to be essentially noise (as in the ascending component of the red trace).
2) Differentiate both A and B. Now you are directly representing the velocity of each vector (Matlab 'diff').
3) Add the two differentiated vectors together (element-wise). Call this C.
4) Look for all points in C whose absolute value is above a certain threshold (you'll have to eyeball the data to get a good idea of what this should be). Points above this threshold indicate highly similar velocity.
5) Now look for where a high positive value in C is followed by a high negative value, or vice versa. In between these two points you will have similar curves in A and B.
Note: a) You could do the smoothing after step 3 rather than after step 1. b) Re 5), you could have a situation in which a 'hill' in your data is at the edge of the vector and so is 'cut in half', and the vectors descend to baseline before ascending in the next hill. Then 5) would misidentify the hill as coming between the initial descent and subsequent ascent. To avoid this, you could also require that the points in A and B in between the two points of velocity similarity have high absolute values.
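For reference, a minimal sketch of steps 1) to 4) in MATLAB (smooth is in the Curve Fitting Toolbox; the threshold value is a placeholder you would set by eyeballing C, as described above):
A_s = smooth(A);              % step 1: optional smoothing
B_s = smooth(B);
dA = diff(A_s);               % step 2: "velocity" of each vector
dB = diff(B_s);
C = dA + dB;                  % step 3: combined velocity
thr = 0.5;                    % step 4: placeholder threshold, set by inspecting C
idx = find(abs(C) > thr);     % samples where the combined velocity is high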