Partitioning criteria/method to improve clustering according to the points' grouping structure - cluster-analysis

I am trying to do some clustering on points based on 2d coordinates.
The point coordinates are here:
points.txt
I use Ward clustering for that, but the results are not as expected: the point colors in the image below correspond to the Ward-computed groups, while I would expect something that takes the structure of the points more into account (like the groups drawn by hand). I tried the different cluster distances available in scipy (single, complete, average, weighted, centroid, median), but the results were not improved.
I am quite new to clustering. Any idea which method/criterion could help group the points similarly to the groups drawn by hand?
Here is an extract of the code that does the clustering using scipy.cluster.hierarchy. Here, points is the array of point coordinates and cutLevel is a tuned parameter based on the highest jump in aggregation distance.
from scipy.cluster.hierarchy import linkage, fcluster

# build the dendrogram with Ward linkage ...
clustering = linkage(points, method="ward")
# ... and cut it at cutLevel to obtain flat cluster labels
clusters = fcluster(clustering, cutLevel, criterion="distance")
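For reference, a minimal sketch of how such a cut level can be derived from the largest jump in merge distances, continuing from the snippet above (the midpoint heuristic is an assumption about what "highest jump" means here):

import numpy as np

# column 2 of the linkage matrix holds the merge distances, in non-decreasing order
merge_dists = clustering[:, 2]

# find the largest jump between consecutive merges and cut midway through it
gaps = np.diff(merge_dists)
i = int(np.argmax(gaps))
cutLevel = (merge_dists[i] + merge_dists[i + 1]) / 2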

Related

K-means boundaries

Is there any way to find the boundaries (coordinates) for x-y data in k-means clustering? I produced 8 clusters from the x-y data, which look like the image below (each color represents one cluster). I need to get the coordinates of the boundaries of each cluster.
The ELKI tool that I usually use for clustering will generate the boundaries for you in the visualization. I don't know if it will also output the coordinates to a file though.
It's called a Voronoi diagram, and you need its dual, the Delaunay triangulation, to build it. You can easily find algorithms for that.
Beware that some edges will go to infinity (just imagine two clusters: what does their boundary look like? What are the coordinates of that boundary?).
Note that on your data set, this clustering does not appear to be very good. The boundaries between clusters look quite arbitrary to me.
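To illustrate the Voronoi idea outside ELKI, here is a minimal Python sketch using scipy.spatial.Voronoi on k-means centroids; the synthetic data and variable names are stand-ins for the actual x-y data:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.vq import kmeans2
from scipy.spatial import Voronoi, voronoi_plot_2d

# stand-in 2-D data; replace with the real x-y points
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))

# 8 clusters, as in the question
centers, labels = kmeans2(points, 8, seed=0)

# the Voronoi cells of the centroids are the k-means decision boundaries;
# the outermost cells are unbounded, i.e. some edges go to infinity
vor = Voronoi(centers)
voronoi_plot_2d(vor)
plt.scatter(points[:, 0], points[:, 1], c=labels, s=5)
plt.show()

# vor.vertices holds the finite boundary vertices; in vor.ridge_vertices,
# an index of -1 marks a ridge that extends to infinity
print(vor.vertices)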

Comparing k-means clustering

I have 150 images, 15 each of 10 different people. So basically I know which image should belong together, if clustered.
Each image is a 73-dimensional feature vector, and I clustered them into 10 clusters using the kmeans function in Matlab.
Later, I processed these 150 data points, reduced their dimension from 73 to 3 for my work, and applied the same kmeans function to them.
I want to compare the results obtained on these two data sets (processed and unprocessed) with the same k-means function, to find out whether the processing that reduced the dimension improves the k-means clustering or not.
I thought comparing the variance of each cluster could be one parameter for comparison, but I am not sure whether I can directly compare and evaluate my results (within-cluster sum of distances, etc.), as the two cases have different dimensions. Could anyone suggest a way to compare the k-means results, some way to normalize them, or any other comparison I can make?
I can think of three options; I am unaware of any well-developed methodology for doing this specifically with k-means clustering.
1. Look at the confusion matrices between the two approaches (see the sketch after this list).
2. Compare the Mahalanobis distances between the clusters, and between items in clusters and their nearest other clusters.
3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with 3 is that the distance metrics get skewed: 3-D and 73-D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on k-means if you are adamant about that path; rank speculation is fun, but standing on the shoulders of giants is better.
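A minimal sketch of option 1, assuming the known person labels are available as ground truth (the arrays here are hypothetical stand-ins for the real 73-D and 3-D feature sets):

import numpy as np
from scipy.cluster.vq import kmeans2

# hypothetical stand-ins: 150 images as 73-D and 3-D feature vectors,
# plus the known person IDs (15 images each of 10 people)
rng = np.random.default_rng(0)
X73 = rng.normal(size=(150, 73))
X3 = rng.normal(size=(150, 3))
truth = np.repeat(np.arange(10), 15)

def confusion(truth, pred, k=10):
    # rows: true person, columns: cluster label
    m = np.zeros((k, k), dtype=int)
    for t, p in zip(truth, pred):
        m[t, p] += 1
    return m

for name, X in (("73-D", X73), ("3-D", X3)):
    _, labels = kmeans2(X, 10, seed=0)
    print(name)
    print(confusion(truth, labels))

Note that cluster IDs are arbitrary, so the columns of the two matrices are not aligned; judge each matrix by how strongly each row concentrates in a single column, or use a permutation-invariant score instead.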

Center of clusters in RapidMiner

I have six features that are clustered using the k-means algorithm in RapidMiner, and I want to detect outliers in this data. There is a centroid table in RapidMiner that shows the center of each feature in each cluster. I want to detect outliers using the clustering method (k-means), so I have the average within-centroid distance per cluster, but I want to calculate the distance of each data point from the center of its cluster. How do I get the center point of a cluster with 6 features in RapidMiner, and, given the 6 feature values of each data point, how do I calculate the distance from each data point to its cluster center?
You can use the Cross Distances operator for this. This calculates the distances between all pairs of examples in two example sets. Use the Extract Cluster Prototype operator to find the cluster centroids and connect its output to one of the inputs of the Cross Distances operator. The original example set is connected to the other input. You can change the distance measure used in this operator; the default is Euclidean distance.
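Outside RapidMiner, the computation behind those operators is easy to express directly; a minimal NumPy sketch (the data, cluster count, and outlier cutoff are all assumptions for illustration):

import numpy as np
from scipy.cluster.vq import kmeans2

# hypothetical data: n examples with 6 features, as in the question
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

k = 3  # assumed cluster count
centers, labels = kmeans2(X, k, seed=0)

# Euclidean distance of every example to its own cluster centroid
dist_to_center = np.linalg.norm(X - centers[labels], axis=1)

# flag the most distant examples in each cluster as outlier candidates,
# e.g. anything beyond mean + 2 standard deviations within its cluster
for c in range(k):
    d = dist_to_center[labels == c]
    cutoff = d.mean() + 2 * d.std()
    print(f"cluster {c}: {np.sum(d > cutoff)} outlier candidates")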

Matlab calculate geographical distance to lat/lng polyline

In Matlab I would like to calculate the (shortest) distances between a set of independent points (an m-by-2 matrix of lat/lng) and a set of polylines (each an n-by-2 matrix of lat/lng). The resulting table should be an n-by-m matrix with distances in km.
I have rewritten this JavaScript implementation (http://www.bdcc.co.uk/Gmaps/BdccGeo.js) to Matlab, but it does not seem to perform well.
Currently I am working on a project with a relatively large data set and am running into performance issues. I have roughly 40,000 points and 150 polylines; the polylines are subsets of the original set of 40,000 points. At about 15 seconds per polyline, calculating all these distances can take up to an hour. Also, the intermediate 40000x150x3 matrices cause out-of-memory errors on my lesser machines.
Instead of optimizing or revising this implementation I am wondering if Matlab doesn't already have some (smarter) functions built in for this. But as far as I can see, the documentation mainly has information on how to display geodata as opposed to doing calculations on it.
Does anyone know of, or have experience with, this kind of calculation in Matlab? Has anything like this already been written that I can reuse, so I don't have to reinvent the wheel? And finally, is this the expected performance given these numbers, or should my function be able to perform much better?
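Not a Matlab answer, but for illustration, a minimal Python/NumPy sketch of the usual vectorized approach (project to a local plane, then do clamped point-to-segment projection); the function name and the equirectangular approximation are assumptions, and the same vectorization carries over to Matlab with implicit expansion:

import numpy as np

R = 6371.0  # Earth radius in km

def pts_to_polyline_km(latlng_pts, latlng_line):
    # approximate shortest distance (km) from each point to one polyline,
    # via a local equirectangular projection; adequate when distances are
    # small compared with the Earth's radius
    lat0 = np.radians(latlng_line[:, 0].mean())

    def proj(ll):
        ll = np.radians(ll)
        return np.column_stack((ll[:, 1] * np.cos(lat0) * R, ll[:, 0] * R))

    P = proj(latlng_pts)      # (m, 2) points in km
    V = proj(latlng_line)     # (n, 2) polyline vertices in km
    A, B = V[:-1], V[1:]      # segment endpoints, (n-1, 2)
    AB = B - A
    denom = (AB * AB).sum(-1)
    denom[denom == 0] = 1e-12           # guard against zero-length segments
    # project every point onto every segment, clamped to the segment
    AP = P[:, None, :] - A[None, :, :]  # (m, n-1, 2)
    t = np.clip((AP * AB).sum(-1) / denom, 0.0, 1.0)
    closest = A[None, :, :] + t[..., None] * AB
    d = np.linalg.norm(P[:, None, :] - closest, axis=-1)
    return d.min(axis=1)                # (m,) distance to nearest segment

Looping this over the 150 polylines keeps the intermediate arrays at m-by-(n-1) per polyline instead of materializing a 40000x150x3 block, which is where the memory pressure comes from.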

Hierarchical Cluster Analysis in Cluster 3.0

I'm new to this site as well as new to cluster analysis, so I apologize if I violate conventions.
I've been using Cluster 3.0 to perform hierarchical cluster analysis with Euclidean distance and average linkage. Cluster 3.0 outputs a .gtr file in which each node joins two items together with their similarity score. I've noticed that the first line in the .gtr file always links a gene with another gene, followed by the similarity score. But how do I reproduce this similarity score?
In my data set, I have 8 genes and create a distance matrix where d_{ij} contains the Euclidean distance between gene i and gene j. Then I normalize the matrix by dividing each element by the max value in the matrix. To get the similarity matrix, I subtract each element from 1. However, my result does not use the linkage type and differs from the output similarity score.
I am mainly confused about how linkages affect the similarity of the first node (the joining of the two closest genes) and how to compute the similarity score.
Thank you!
The algorithm compares clusters using some linkage method, not individual data points. However, in the first iteration of the algorithm each data point forms its own cluster; this means that your linkage method is actually reduced to the metric you use to measure the distance between data points (in your case, Euclidean distance). For subsequent iterations, the distance between clusters will be measured according to your linkage method, which in your case is average linkage. For two clusters A and B, this is calculated as follows:

d(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b)

where d(a,b) is the Euclidean distance between two data points. Convince yourself that when A and B each contain just one data point (as in the first iteration) this equation reduces to d(a,b). I hope this makes things a bit clearer. If not, please provide more details of what exactly you want to do.
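To make the reduction concrete, a small sketch that computes the average-linkage distance between two clusters from pairwise Euclidean distances (the gene vectors are hypothetical):

import numpy as np
from scipy.spatial.distance import cdist

def average_linkage(A, B):
    # mean of all pairwise Euclidean distances between clusters A and B
    return cdist(A, B).mean()

# two hypothetical single-gene clusters, as in the first iteration
A = np.array([[1.0, 2.0, 0.5]])
B = np.array([[1.5, 1.0, 0.0]])
# with one point per cluster this is just the Euclidean distance d(a, b)
print(average_linkage(A, B))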