Spectral clustering distance/similarity

All papers about spectral clustering use a similarity matrix as the input to the spectral clustering algorithm.
Is it also possible to use a pairwise distance matrix? I haven't seen any version of spectral clustering code that uses pairwise distances.
I am implementing spectral clustering in MATLAB, which has the function pdist, and the output of that function is a pairwise distance matrix.

A similarity (or affinity) matrix captures how close the data points are to each other, whereas a distance matrix measures their dissimilarity. The easiest and most common way of turning pairwise distances into a similarity matrix is to apply a Gaussian kernel.
For points a and b, let D be their pairwise distance (e.g. as computed via pdist). The similarity can then be obtained as sim_ab = exp(-D^2 / (2*f^2)), where f is a scaling factor (the kernel bandwidth).
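For concreteness, here is a minimal MATLAB sketch of this conversion; the data X and the bandwidth f are placeholders to tune for your problem:
% Minimal sketch: pairwise distances -> Gaussian-kernel similarity matrix.
X = rand(100, 2);          % placeholder data: 100 points in 2-D
D = squareform(pdist(X));  % n-by-n matrix of pairwise Euclidean distances
f = 1;                     % kernel bandwidth / scaling factor (tune this)
S = exp(-D.^2 / (2*f^2));  % similarity in (0, 1]; S(i,i) = 1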

Related

Cholesky decomposition for simulating correlated random variables

I have a correlation matrix for N random variables, each of them uniformly distributed within [0,1]. I am trying to simulate these random variables; how can I do that? Note N > 2. I was trying to use the Cholesky decomposition, and these are my steps:
get the lower triangular Cholesky factor of the correlation matrix (L = N*N)
independently sample 10000 times for each of the N uniformly distributed random variables (S = N*10000)
multiply the two: L*S. This gives me correlated samples, but their range is no longer within [0,1].
How can I solve this problem?
I know that if I only have 2 random variables I can do something like
rho*x1 + sqrt(1-rho^2)*y1
to get my correlated sample. But with more than two correlated variables, I am not sure what to do.
You can get approximate solutions by generating correlated normals using the Cholesky factorization and then converting them to U(0,1) variates using the normal CDF. The solution is approximate because the normals have the desired correlation, but converting to uniforms is a non-linear transformation, and only linear transformations preserve correlation.
There is a transformation available that gives exact solutions if the transformed variance/covariance matrix is positive semidefinite, but that is not always the case. See the abstract at https://www.tandfonline.com/doi/abs/10.1080/03610919908813578.
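A minimal MATLAB sketch of this approximate approach; the target correlation matrix R is a made-up example:
% Gaussian-copula sketch: correlated normals via Cholesky, then normcdf.
R = [1 0.6 0.3; 0.6 1 0.5; 0.3 0.5 1];  % example correlation matrix, N = 3
L = chol(R, 'lower');                   % lower Cholesky factor, R = L*L'
Z = L * randn(3, 10000);                % correlated standard normals
U = normcdf(Z);                         % map each margin to U(0,1)
corr(U')                                % check: close to R, but not exact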

Matlab kmeans clustering for non-linearly separable data

I have non-linearly separable data at hand, and I want to cluster it using the k-means implementation in MATLAB. I want to get the cluster labels for each and every data point, to use them for another classification problem.
The problem is that k-means is not giving the results I expected. I'm attaching the cluster plot I obtained.
I expected k-means to give clusters shaped as concentric circles, as the data looks, but the output was arcs. I don't understand why this is happening.
Can you suggest any other clustering method to achieve my goal?
Before using an algorithm, you should try to understand it: what is its goal, and how does it achieve it? For k-means, Wikipedia tells us the following:
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
Three concentric circles would have the exact same mean, so k-means is not suitable to separate them. The result is really what you should expect from k-means here.
Now, if you know that your clusters will always be concentric circles, you can simply convert your Cartesian (x-y) coordinates to polar coordinates and use only the radius rho for clustering, since you know that the angle theta doesn't matter:
% Create random data
[x1,y1] = pol2cart(2*pi*rand(1000,1),rand(1000,1));
[x2,y2] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+2);
[x3,y3] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+4);
X = [x1,y1; x2,y2; x3,y3];
% Transform to polar
[theta,rho] = cart2pol(X(:,1),X(:,2));
% k-means clustering
idx = kmeans(rho,3);
% Plot results
hold on
plot(X(idx==1,1), X(idx==1,2), 'r.')
plot(X(idx==2,1), X(idx==2,2), 'g.')
plot(X(idx==3,1), X(idx==3,2), 'b.')
Or more generally: use a suitable kernel for k-means clustering, or use another algorithm.
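One such alternative, assuming a recent Statistics and Machine Learning Toolbox (spectralcluster exists since R2019b), is spectral clustering, which separates the rings directly:
% Alternative sketch: spectral clustering on the same X as built above.
idx = spectralcluster(X, 3);  % graph-based; no nearest-mean assumption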

How to produce quadruple density wavelet coefficients?

I refer to this paper, second page, right column, second paragraph, where it is stated how to produce quadruple-density wavelet coefficients:
If we do not downsample the wavelet coefficients, we generate wavelets with double density, where wavelets of level n are centered every 1/2*2^n. To generate the quadruple-density dictionary, we compute the scaling coefficients with double density by not downsampling them. The next step is to calculate double-density wavelet coefficients on the two sets of scaling coefficients - even and odd - separately.
I am confused about how to get the two sets of scaling coefficients - even and odd. What is meant by even and odd?
Is it that you split the original image matrix into two matrices, one with only the even-indexed entries (0,0), (0,2), ... and one with the odd-indexed entries (0,1), (0,3), ...? What is the advantage of that?
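For what it's worth, here is a minimal 1-D MATLAB sketch of that interpretation of the even/odd split. The signal and the Haar filters are placeholders, and this only illustrates the splitting step; it is not necessarily the paper's exact algorithm:
% Hypothetical 1-D illustration of the even/odd split described above.
x = randn(1, 64);                 % placeholder signal
h = [1 1] / sqrt(2);              % placeholder lowpass filter (Haar)
g = [1 -1] / sqrt(2);             % placeholder highpass filter (Haar)
c = conv(x, h, 'same');           % undecimated (double-density) scaling coeffs
c_even = c(1:2:end);              % even-indexed phase (MATLAB is 1-based)
c_odd  = c(2:2:end);              % odd-indexed phase
w_even = conv(c_even, g, 'same'); % wavelet coeffs on the even set
w_odd  = conv(c_odd,  g, 'same'); % wavelet coeffs on the odd set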
Those who are interested in the overcomplete wavelet transform may find these two links useful:
http://www.dahlsys.com/cuda/overcomplete_wavelet/index.html
http://eeweb.poly.edu/~onur/source.html

Clustering algorithm with different epsilons on different axes

I am looking for a clustering algorithm such as DBSCAN to deal with 3-D data, in which it is possible to set different epsilons depending on the axis - for instance, an epsilon of 10 m on the x-y plane and an epsilon of 0.2 m on the z axis.
Essentially, I am looking for large but flat clusters.
Note: I am an archaeologist; the algorithm will be used to look for potential correlations between objects scattered over large surfaces but within narrow vertical layers.
Solution 1:
Scale your data set so that a single epsilon fits all axes.
In your case, scale z by 50 (= 10 m / 0.2 m).
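A minimal MATLAB sketch of this, assuming a Statistics and Machine Learning Toolbox that provides dbscan (R2019a+); the coordinates X and the minpts value are placeholders:
% Solution 1 sketch: rescale z so one global epsilon works everywhere.
X = rand(100, 3);                  % placeholder n-by-3 coordinates
Xs = [X(:,1), X(:,2), 50*X(:,3)];  % 50 = 10 m / 0.2 m
labels = dbscan(Xs, 10, 5);        % eps = 10 m, minpts = 5 (placeholder)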
Solution 2:
Use a weighted distance function.
E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis.
This may be the most convenient option, since you are already using ELKI.
Just define a custom distance metric when computing the DBSCAN core points. The standard DBSCAN uses the Euclidean distance to find the points within an epsilon, so all dimensions are treated the same.
However, you could use the Mahalanobis distance to weigh each dimension differently. You can use a diagonal covariance matrix for flat clusters. You can use a full symmetric covariance matrix for flat tilted clusters, etc.
In your case, you would use a covariance matrix like:
100    0    0
  0  100    0
  0    0  0.04
In the pseudo code provided at the Wikipedia entry for DBSCAN just use one of the distance metrics suggested above in the regionQuery function.
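In MATLAB (assuming the Statistics and Machine Learning Toolbox), a sketch of this could look as follows; epsilon is 1 because the covariance matrix already encodes the per-axis tolerances, and minpts is a placeholder:
% Mahalanobis-DBSCAN sketch on placeholder n-by-3 coordinates.
X = rand(100, 3);                            % replace with your data
C = diag([100, 100, 0.04]);                  % 10 m in x-y, 0.2 m in z
D = squareform(pdist(X, 'mahalanobis', C));  % pairwise Mahalanobis distances
labels = dbscan(D, 1, 5, 'Distance', 'precomputed');  % minpts = 5 (placeholder)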
Update
Note: scaling the data is equivalent to using an appropriate metric.

Clustering with a Distance Matrix via Mahalanobis distance

I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (more in the future, potentially) describing the difference between the data, each fairly different in terms of magnitude and spread. Currently, I compute the distance matrix with the Euclidean distance across the three metrics, but I am fairly certain that the difference between the metrics is messing it up (e.g. the largest one is overpowering the other ones).
I thought a good way to deal with this is to use the Mahalanobis distance to combine the metrics. However, I obviously cannot compute the covariance matrix between the coordinates, but I can compute it for the distance metrics. Does this make sense? That is, if I get the distance between two objects i and j as:
D(i,j) = sqrt( d' * S^-1 * d )
where d is the 3-vector of the different distance metrics between i and j, d' is the transpose of d, and S is the covariance matrix of the distances, would D be a good, normalized metric for clustering?
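For concreteness, here is a minimal MATLAB sketch of this combined metric; the three distance matrices are placeholders built from random data:
% Placeholder pairwise distance matrices for three different metrics.
X = rand(50, 4);                         % illustrative data only
D1 = squareform(pdist(X));               % metric 1: Euclidean
D2 = squareform(pdist(X, 'cityblock'));  % metric 2: Manhattan
D3 = squareform(pdist(X, 'chebychev'));  % metric 3: Chebyshev
% Combine per the formula above: D(i,j) = sqrt( d' * S^-1 * d ).
d = [D1(:), D2(:), D3(:)];       % each row holds the 3 distances for one pair
S = cov(d);                      % covariance across the three metrics
M = sqrt(sum((d / S) .* d, 2));  % row-wise quadratic form d * inv(S) * d'
D = reshape(M, 50, 50);          % combined, scale-balanced distance matrix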
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing by the standard deviation) and then simply staying with the Euclidean distance (in fact, this seems to be essentially the Mahalanobis distance, at least in some cases), or of switching to something like DBSCAN or EM, and I have not ruled them out (though MDS followed by clustering might be a bit excessive). As a side note, any packages able to do all of this would be greatly appreciated.
Consider using k-medoids (PAM) instead of a hacked k-means; PAM can work with arbitrary distance functions, whereas k-means is designed to minimize variances, not arbitrary distances.
EM will have the same problem - it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
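For instance, a sketch with MATLAB's linkage, which takes the condensed (pdist-style) form of a precomputed distance matrix; the distance matrix and the cluster count are placeholders:
% Hierarchical clustering on a precomputed n-by-n distance matrix D.
D = squareform(pdist(rand(50, 3)));     % placeholder; use your own matrix
Z = linkage(squareform(D), 'average');  % average linkage on pairwise distances
labels = cluster(Z, 'maxclust', 3);     % cut the tree into 3 clusters (placeholder k)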