K-means distance parameters in Matlab - Varying results

I have a 300x5000 matrix I am working with, and I wanted to test which distance calculation parameter is the most effective. I got the following results:
'Sqeuclidean' = 17 iterations, total sum of distances = 25175.4
'Correlation' = 9 iterations, total sum of distances = 32.7
'Cityblock' = 34 iterations, total sum of distances = 105175.3
'Cosine' = 11 iterations, total sum of distances = 11.9
I am having trouble understanding why the results vary so much and how to choose the most effective distance parameter. Any advice?
EDIT:
I have 300 features with 5000 instances of each feature.
the function looks like this:
[idx, ctrs, sumd, d] = kmeans(matrix, 25, 'Distance', 'cityblock', 'Replicates', 20)
with the distance parameter changed each run. The features were already normalized.
Thanks!

As slayton commented, you really need to define what 'best' means for your particular problem.
The only thing that matters is how well the distance function clusters the data. In general, clustering is highly dependent on the distance function. The two metrics that you've selected (number of iterations, sum of distances) are pretty irrelevant to how well the clustering works.
You need to know what you're trying to achieve with clustering, and you need some metric for how well you've achieved that goal. If there's an objective metric to determine how good your clusters are, then use that. Often, the metric is fuzzier: does this look right when I visualize the data? Look at your data, and look at how each distance function clusters the data. Select the distance function that seems to generate the best clusters. Do this for several subsets of your data to make sure that your intuition is correct. You should also try to understand the result that each distance function gives you.
Lastly, some problems lend themselves to a particular distance function. If your problem has spatial features, then a Euclidean (geometric) distance is often a natural choice. Other distance functions will perform better for different problems.
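If you do want an objective number to compare the distance choices on your own data, one option (my suggestion, not something the answer above prescribes) is the mean silhouette width, which MATLAB can compute with the same metric names kmeans accepts:
% Sketch only: compare distance choices by mean silhouette width (higher is better).
% 'matrix' is your 300x5000 data; the other names are my own.
metrics = {'sqeuclidean', 'cityblock', 'cosine', 'correlation'};
meanSil = zeros(size(metrics));
for m = 1:numel(metrics)
    idx = kmeans(matrix, 25, 'Distance', metrics{m}, 'Replicates', 20);
    s = silhouette(matrix, idx, metrics{m});   % silhouette under the same metric
    meanSil(m) = mean(s);
end
[~, best] = max(meanSil);
fprintf('Best by silhouette: %s (%.3f)\n', metrics{best}, meanSil(best));
Bear in mind this is only a heuristic: silhouette values computed under different metrics are not perfectly comparable either, so treat it as a sanity check rather than a definitive answer.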

Distance values from different distance functions, data sets, or normalizations are generally not comparable. A simple example from reality: measure distances in "meters" or in "inches", and you get very different numbers. The result in meters is not better just because it is measured on a different scale. So you must not compare these totals across different distance functions.
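To make the meters/inches point concrete, here is a small sketch (my own illustration, not part of the answer): the same partition of the same data reports very different distance totals when only the units change.
% Same clustering, different units: the partition is identical,
% but the 'total sum of distances' is not comparable across scales.
rng(0);
X  = randn(200, 5);                 % toy data in "meters"
C0 = X(randperm(200, 3), :);        % fixed starting centroids

[idx_m,  ~, sumd_m]  = kmeans(X,         3, 'Start', C0);
[idx_in, ~, sumd_in] = kmeans(X * 39.37, 3, 'Start', C0 * 39.37);   % same data in "inches"

isequal(idx_m, idx_in)              % true: the clusters are the same
[sum(sumd_m), sum(sumd_in)]         % wildly different totals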
Notice that k-means is meant to be used with Euclidean distance only, and may not converge with other distance functions. IMHO, L_p norms should be fine, and on TF-IDF maybe also cosine, but I do not know a proof of that.
Oh, and k-means works really badly with high-dimensional data. It is meant for low dimensionality.

Related

Understanding 3D distance outputs in matlab

Being neither great at math nor coding, I am trying to understand the output I am getting when I try to calculate the linear distance between pairs of 3D points. Essentially, I have the 3D points of a bird that is moving in a confined area towards a stationary reward. I would like to calculate the distance of the animal to the reward at each point. However, when looking online for the best way to do this, I tried several options and got different results that I'm not sure how to interpret.
Example data:
reward = [0.381605200000000, 6.00214980000000, 0.596942400000000];
animal_path = [2.08638710671220, -1.06496059617432, 0.774253689976102; ...
               2.06262715454806, -1.01019576900787, 0.773933446776898; ...
               2.03912411242035, -0.954888684677576, 0.773408777383975; ...
               2.01583648760496, -0.898935333316342, 0.772602855030873];
distance1 = sqrt(sum((animal_path - reward).^2));
distance2 = norm(animal_path - reward);
distance3 = pdist2(animal_path, reward);
Distance 1 gives 3.33919107083497 13.9693378592353 0.353216791787775
Distance 2 gives 14.3672145652704
Distance 3 gives 7.27198528565078, 7.21319284516199, 7.15394253573951, 7.09412041863743
Why do these all yield different values (and different numbers of values)? Distance 3 seems to make the most sense for my purposes, even though the values are too large for the dimensions of the animal enclosure, which should be something like 3 or 4 meters.
Can someone please explain this in simple terms and/or point me to something less technical and jargon-y than the Matlab pages?
There are many things mathematicians call distance. What you normally associate with distance is the Euclidean distance: the length of the straight line between two points. This is what you want in this situation. Now to your problem: the Euclidean distance is also called the norm (or 2-norm).
For two points you can use the norm function, so with distance2 you are already close to a solution. The problem is that you input all your points at once; this does not calculate the distance for each point, it calculates the norm of the whole matrix, which is of no interest to you. This means you have to call norm once for each point (row) on the path:
k = nan(size(animal_path,1), 1);              % preallocate one distance per path point
for p = 1:size(animal_path,1)
    k(p) = norm(animal_path(p,:) - reward);   % Euclidean distance from point p to the reward
end
Alternatively, you can follow the idea you had in distance1. The only mistake you made there is that you summed over each column, where the sum over each row was needed. This is a simple fix, because you can control the direction of the sum with its second input argument:
distance1 = sqrt(sum((animal_path - reward).^2, 2))   % sum along dimension 2, i.e. across x, y, z
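As a side note (not part of the fix above): if you have the Statistics and Machine Learning Toolbox, your distance3 line was already computing one Euclidean distance per path point, so it should agree with the loop and with the corrected distance1:
k_alt = pdist2(animal_path, reward);   % 4x1 vector, one Euclidean distance per row of animal_path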

Principal component analysis in matlab?

I have a training set with size(X_Training) = 122 x 125937.
122 is the number of features
and 125937 is the sample size.
From my little understanding, PCA is useful when you want to reduce the dimension of the features. Meaning, I should reduce 122 to a smaller number.
But when I use in matlab:
X_new = pca(X_Training)
I get a matrix of size 125937x121. I am really confused, because this seems to change not only the number of features but also the sample size. This is a big problem for me, because I still have the target vector Y_Training that I want to use for my neural network.
Any help? Did I badly interpret the results? I only want to reduce the number of features.
Firstly, the documentation of the PCA function is useful: https://www.mathworks.com/help/stats/pca.html. It mentions that the rows are the samples while the columns are the features. This means you need to transpose your matrix first.
Secondly, you need to specify the number of dimensions to reduce to a priori. The PCA function does not do that for you automatically. Therefore, in addition to extracting the principal coefficients for each component, you also need to extract the scores as well. Once you have this, you simply subset into the scores and perform the reprojection into the reduced space.
In other words:
n_components = 10; % Change to however you see fit.
[coeff, score] = pca(X_training.');
X_reduce = score(:, 1:n_components);
X_reduce will be the dimensionality-reduced feature set, with the number of columns equal to the number of reduced features. Also notice that the number of training examples does not change, as we expect. If you want the features along the rows instead of the columns after the reduction, transpose this output matrix as well before you proceed.
Finally, if you want to automatically determine the number of features to reduce to, one method is to calculate the variance explained by each principal component, then accumulate the values from the first component up to the point where you exceed some threshold. Usually 95% is used.
Therefore, you need to provide additional output variables to capture these:
[coeff, score, latent, tsquared, explained, mu] = pca(X_training.');
I'll let you go through the documentation to understand the other variables, but the one you're looking at is the explained variable. What you should do is find the point where the total variance explained exceeds 95%:
[~,n_components] = max(cumsum(explained) >= 95);
Finally, if you want to see how well a reconstruction into the original feature space performs from the reduced features, you need to reproject into the original space:
X_reconstruct = bsxfun(@plus, score(:, 1:n_components) * coeff(:, 1:n_components).', mu);
mu contains the mean of each feature as a row vector. You need to add this vector back across all examples, so broadcasting is required, and that's why bsxfun is used. If you're using MATLAB R2016b or newer, this is done implicitly by the addition operation:
X_reconstruct = score(:, 1:n_components) * coeff(:, 1:n_components).' + mu;
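Putting the pieces together, a minimal end-to-end sketch might look like this (variable names other than X_training are mine, and the 95% threshold is just the usual convention):
X = X_training.';                                   % samples in rows, features in columns
[coeff, score, ~, ~, explained, mu] = pca(X);

n_components = find(cumsum(explained) >= 95, 1);    % smallest number of components covering 95% of the variance
X_reduce     = score(:, 1:n_components);            % reduced feature set, one row per sample

% Optional: reproject to the original space and check the reconstruction quality
X_reconstruct  = score(:, 1:n_components) * coeff(:, 1:n_components).' + mu;   % R2016b+ implicit expansion
relative_error = norm(X - X_reconstruct, 'fro') / norm(X, 'fro')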

Clustering with a Distance Matrix via Mahalanobis distance

I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (more in the future, potentially) describing the difference between the data, each fairly different in terms of magnitude and spread. Currently, I compute the distance matrix with the Euclidean distance across the three metrics, but I am fairly certain that the difference between the metrics is messing it up (e.g. the largest one is overpowering the other ones).
I thought a good way to deal with this is to use the Mahalanobis distance to combine the metrics. However, I obviously cannot compute the covariance matrix between the coordinates, but I can compute it for the distance metrics. Does this make sense? That is, if I get the distance between two objects i and j as:
D(i,j) = sqrt( dt S^-1 d )
where d is the 3-vector of the different distance metrics between i and j, dt is the transpose of d, and S is the covariance matrix of the distances, would D be a good, normalized metric for clustering?
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing by the standard deviation) and then simply staying with the Euclidean distance (in fact this essentially is the Mahalanobis distance, at least in some cases), or of switching to something like DBSCAN or EM, and have not ruled them out (though MDS then clustering might be a bit excessive). As a side note, any packages able to do all of this would be greatly appreciated. Thanks!
Consider using k-medoids (PAM) instead of a hacked k-means; it can work with arbitrary distance functions, whereas k-means is designed to minimize variances, not arbitrary distances.
EM will have the same problem - it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
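For the hierarchical option, a minimal MATLAB sketch (D and k are my placeholder names for your n-by-n distance matrix and the desired number of clusters):
% Hierarchical clustering straight from a precomputed distance matrix.
% D must be symmetric with a zero diagonal.
d   = squareform(D);                  % condense the square matrix into a pairwise-distance vector
Z   = linkage(d, 'average');          % average linkage; 'complete' or 'single' also work
idx = cluster(Z, 'maxclust', k);      % cut the dendrogram into k clusters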

How to select top 100 features (a subset) which are most relevant after PCA?

I performed PCA on a 63x2308 matrix and obtained a score matrix and a coefficient matrix. The score matrix is 63x2308 and the coefficient matrix is 2308x2308 in dimensions.
How do I extract the column names for the top 100 most important features so that I can perform regression on them?
PCA should give you both a set of eigenvectors (your coefficient matrix) and a vector of eigenvalues (1x2308), often referred to as lambda. You might need to use a different PCA function in Matlab to get them.
The eigenvalues indicate how much of your data each eigenvector explains. A simple method for selecting features would be to select the 100 features with the highest eigenvalues. This gives you a set of features which explains most of the variance in the data.
If you need to justify your approach for a write-up, you can calculate the amount of variance explained per eigenvector and cut off at, for example, 95% variance explained.
Bear in mind that selecting based solely on eigenvalue might not correspond to the set of features most important to your regression, so if you don't get the performance you expect you might want to try a different feature selection method, such as recursive feature selection. I would suggest using Google Scholar to find a couple of papers doing something similar and seeing what methods they use.
A quick Matlab example of taking the top 100 principal components using PCA.
[eigenvectors, projected_data, eigenvalues] = princomp(X);   % note: princomp is deprecated; pca gives equivalent outputs in newer Matlab
[foo, feature_idx] = sort(eigenvalues, 'descend');           % princomp already returns eigenvalues sorted, but this makes it explicit
selected_projected_data = projected_data(:, feature_idx(1:100));
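And if you would rather use the 95% variance cut-off mentioned earlier instead of a fixed 100, a small sketch using the eigenvalues from above:
explained = 100 * eigenvalues / sum(eigenvalues);        % percent of variance per component
n_keep = find(cumsum(explained) >= 95, 1);               % smallest number of components reaching 95%
selected_projected_data = projected_data(:, 1:n_keep);   % components are already sorted by variance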
Have you tried with
B = sort(your_matrix,2,'descend');
C = B(:,1:100);
Be careful!
With just 63 observations and 2308 variables, your PCA result will be meaningless because the data is underspecified. You should have at least (rule of thumb) dimensions*3 observations.
With 63 observations, you can at most define a 62-dimensional subspace!

Using triplequad to calculate density (in Matlab)

As I've explained in a previous question: I have a dataset consisting of a large, semi-random collection of points in three-dimensional Euclidean space. In this collection of points, I am trying to find the point that is closest to the area with the highest density of points.
As High Performance Mark answered:
the most straightforward thing to do would be to divide your subset of Euclidean space into lots of little unit volumes (voxels) and count how many points there are in each one. The voxel with the most points is where the density of points is at its highest. Perhaps initially dividing your space into 2 x 2 x 2 voxels, then choosing the voxel with most points and sub-dividing that in turn until your criteria are satisfied.
Mark suggested I use triplequad for this, but this is not a function I am familiar with, or understand very well. Does anyone have any pointers on how I could go about using this function in Matlab for what I am trying to do?
For example, say I have a random normally distributed matrix A = randn([300,300,300]); how could I use triplequad to find the point I am looking for? Because as I currently understand it, I also have to provide triplequad with a function fun when using it. Which function should that be for this problem?
Here's an answer which doesn't use triplequad.
For the purposes of exposition I define an array of data like this:
A = rand([30,3])*10;
which gives me 30 points uniformly distributed in the box (0:10,0:10,0:10). Note that in this explanation a point in 3D space is represented by each row in A. Now define a 3D array for the counts of points in each voxel:
counts = zeros(10,10,10);
Here I've chosen to have a 10x10x10 array of voxels, but this is just for convenience, it would be only a little more difficult to have chosen some other number of voxels in each dimension, and there don't have to be the same number of voxels along each axis. Then the code
for ix = 1:size(A,1)
    % increment the count of the voxel that point ix falls into
    counts(ceil(A(ix,1)), ceil(A(ix,2)), ceil(A(ix,3))) = counts(ceil(A(ix,1)), ceil(A(ix,2)), ceil(A(ix,3))) + 1;
end
will count up the number of points in each of the voxels in counts.
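To finish the job from here (my own continuation, not from the original answer): pick the voxel with the largest count, then the data point closest to that voxel's centre.
[~, linidx]  = max(counts(:));                     % voxel with the most points
[vx, vy, vz] = ind2sub(size(counts), linidx);
centre = [vx vy vz] - 0.5;                         % centre of that unit voxel
[~, nearest] = min(sum((A - centre).^2, 2));       % needs R2016b+ implicit expansion (or bsxfun)
densest_point = A(nearest, :)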
EDIT
Unfortunately I have to do some work this afternoon and won't be able to get back to wrestling with the triplequad solution until later. Hope this is OK in the meantime.