Clusters based on distance - cluster-analysis

Here is my problem: I have a list of villages. For each village I computed the path distance between them and prepared a distance matrix. Now I want to identify clusters of villages which are close to each other.
I use Python 2.7 and have already used hierarchical clustering (provided by scipy) to cluster the distance matrix. Looking at the dendrogram, I can identify the nearest villages by eye, but I need to automate this: I need to get the elements that belong to each cluster.

I was also wondering how to retrieve the clusters once I had created and cut the dendrogram. Since this is unanswered and may come up for others with a similar question, I'll answer according to what I was looking for, making some assumptions since this is an old question.
The first step is that you need to determine where to cut the dendrogram. You can do this a variety of ways, but I'll assume you already know how to do this, since you're looking at the dendrogram and seem to have satisfied yourself that you have clustered the data. If you don't know where to cut, you could start with something simple like cutting at the max distance. But really, where to cut is a different, very long discussion which I will assume you have figured out how to do (since I had done so at this point in my search).
Now I assume you have a dendrogram, and you know where to cut it, and maybe you even have it plotted with the cut line. But you want to do something more with the clusters, so you need to label the points you clustered. This can be done using the flat cluster (fcluster()) function in scipy.
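from scipy.cluster.hierarchy import fcluster

# Z: linkage matrix from linkage(); distance: height at which to cut the dendrogram
clusters = fcluster(Z, distance, criterion='distance')
print(clusters)  # one cluster label per observation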
Z is the hierarchical linkage matrix (as returned by scipy's linkage() function), which I assume you have already created. distance is the height at which you are cutting the dendrogram (there are other ways to cut the dendrogram; see the fcluster documentation for its other criteria).
This returns a numpy array denoting which observation is in which cluster. Now you can append this to your data as a new column and go to town (or village) with it.
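As a minimal sketch of the whole path from a distance matrix to per-cluster membership lists: the village names, the distances, the choice of average linkage, and the cut height of 10 below are all made-up assumptions for illustration, not values from the question.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical villages and a symmetric path-distance matrix (zero diagonal).
villages = ["A", "B", "C", "D", "E"]
D = np.array([
    [0.0, 2.0, 3.0, 40.0, 42.0],
    [2.0, 0.0, 2.5, 41.0, 43.0],
    [3.0, 2.5, 0.0, 39.0, 41.0],
    [40.0, 41.0, 39.0, 0.0, 4.0],
    [42.0, 43.0, 41.0, 4.0, 0.0],
])

# linkage() expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(D), method="average")

# Cut the dendrogram at height 10: one integer label per village.
labels = fcluster(Z, t=10.0, criterion="distance")

# Group village names by cluster label.
groups = defaultdict(list)
for name, label in zip(villages, labels):
    groups[int(label)].append(name)

members = sorted(groups.values())
print(members)
```

With these distances the three nearby villages end up in one group and the remaining two in another; with real data the cut height has to be chosen from the dendrogram.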

Related

Providing Centroids and Then Clustering

I seem to find a lot of documentation on computing centroids and clustering, but what if I supply the centroid values myself?
Say I provide 14 different centroid vectors. How would I go about clustering my data to those 14 centroids?
Maybe this is an easy question, but I haven't found an answer online, so wanted to make sure.
If the centroids are predefined, then you are doing nearest-neighbor classification, not clustering. It's only clustering if the structure is not predefined.
Not sure this belongs in the python forum, but you just need to compute the distance from each of your points to each centroid, and then assign each point to that centroid that is closest. You then have your clusters, though some may be empty (no guarantee that a centroid will have at least one data point closest to it). You can do this by iterating over all of your points, or do it much more quickly in one step using matrices with numpy. I've got some code lying around somewhere if you need an example to get started.
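A minimal sketch of the one-step numpy version described above. The function name assign_to_centroids and the sample points and centroids are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def assign_to_centroids(points, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean)."""
    # Broadcasting gives an (n_points, n_centroids, n_dims) array of differences.
    diffs = points[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return dists.argmin(axis=1)

# Made-up data: four points, three predefined centroids.
points = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [4.8, 5.3]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [100.0, 100.0]])

labels = assign_to_centroids(points, centroids)
# Number of points assigned to each centroid; some may be zero.
counts = np.bincount(labels, minlength=len(centroids))
print(labels, counts)
```

Here the third centroid receives no points, which is the empty-cluster case mentioned above.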

Clustering in matlab

I have a 3d box with some points in it (1800).
Like this:
Now I have to cluster these points, and it can't be done with k-means because you don't know the number of clusters. Another problem is that the box is periodic, so points at the top and bottom sides can belong to each other. Like in this image:
The right and left sides belong to each other.
How can I define these clusters with a specific distance as threshold, and take into account that the box is periodic (so when you are at the end of one axis, look at the beginning to see whether those distances are below the threshold)?
Kind regards,
Glenn
The Wikipedia article on cluster analysis will answer your question.
Look for density based clustering algorithms, as your data looks very much like the design scenario of density based clustering to me.
Well, first things first, you can indeed use K-Means. Of course you will need to use a cluster validity index (google Silhouette width index, Calinski-Harabasz index, Dunn's index, etc.).
If you really don't want to use K-Means for some other reason, you may wish to use a hierarchical clustering algorithm such as the Ward Method (description in Wikipedia). You won't need to know the number of clusters a priori (however, can you truly claim that you are creating a taxonomy without being able to answer the most basic of questions: how many taxons are there?).
The fact that your box is periodic raises an interesting challenge. My first thought here is that the best way to approach the problem is not by changing the distance measure (which you could do), but by transforming the data (feature extraction).
Your box has 6 sides, but because it's periodic, it's as if it had 3 sides. So, the left side and right side are "the same" (as are the top and bottom, and the front and back).
How about redefining each object over three features? Each feature is the distance between the object and one of the "three" sides.
Best of luck!
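The alternative mentioned above, changing the distance measure, can be sketched as follows. This is an illustration in Python rather than MATLAB, under several assumptions: coordinates lie in [0, box length) on each axis, the wrap-around distance uses the minimum-image convention, and the box size, sample points, and threshold of 1.0 are made up. Single linkage cut at a distance threshold groups points that are chained together by gaps no larger than that threshold.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def periodic_distance_matrix(points, box):
    """Pairwise Euclidean distances in a periodic box (minimum-image convention)."""
    delta = np.abs(points[:, None, :] - points[None, :, :])
    # Along each axis, take the shorter of the direct and the wrapped separation.
    delta = np.minimum(delta, box - delta)
    return np.sqrt((delta ** 2).sum(axis=2))

box = np.array([10.0, 10.0, 10.0])
points = np.array([
    [0.2, 5.0, 5.0],   # near the left face
    [9.9, 5.0, 5.0],   # near the right face, close to the first point via wrap-around
    [5.0, 5.0, 5.0],
    [5.3, 5.0, 5.0],
])

D = periodic_distance_matrix(points, box)
Z = linkage(squareform(D), method="single")
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```

The first two points sit on opposite faces of the box but receive the same label, because their wrapped separation is below the threshold.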

Python Clustering Algorithms

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition to this, no a priori linking lengths are known (similar to this question).
I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.
Here is an example of what dbscan finds:
You can see that there really are two separate populations here, but however I adjust the epsilon factor (the maximum distance between neighboring points), I simply cannot get it to see those two populations of particles.
Are there any other algorithms which would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.
I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in SciKit-Learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).
Here's an example of what it gives:
I also want to point out that the documentation states that it may not scale well.
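A minimal sketch of Mean Shift with scikit-learn on synthetic data. The two Gaussian blobs and the bandwidth of 2.0 are assumptions chosen so the example is reproducible; when bandwidth is left unset, scikit-learn estimates it from the data.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Two made-up, well-separated populations of particles.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(100, 2))
X = np.vstack([blob_a, blob_b])

# The number of clusters is not given; it follows from the modes that are found.
ms = MeanShift(bandwidth=2.0).fit(X)
n_clusters = len(ms.cluster_centers_)
print(n_clusters, ms.cluster_centers_)
```

The bandwidth still acts as a length scale, so results on real data depend on how well the estimated or chosen value matches the cluster sizes.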
When using DBSCAN it can be helpful to scale/normalize the data or distances beforehand, so that the estimate of epsilon is relative.
There is an implementation of DBSCAN (I think it's the one Anony-Mousse somewhere described as 'floating around') which comes with an epsilon estimator function. It works as long as it's not fed large datasets.
There are several incomplete versions of OPTICS on GitHub. Maybe you can find one to adapt for your purpose. I'm still trying to figure out myself what effect minPts has when using one and the same extraction method.
You can try a minimum spanning tree (Zahn's algorithm) and then remove the longest edge, similar to alpha shapes. I used it with a Delaunay triangulation and a concave hull: http://www.phpdevpad.de/geofence. You can also try hierarchical clustering, for example with clusterfck.
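A minimal sketch of the minimum-spanning-tree idea using scipy (not the PHP code linked above). The sample points are made up, and only the single longest edge is removed, which always yields two groups; dropping every edge above a threshold would be the variant for an unknown number of clusters.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Two made-up groups of points.
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

# Build the minimum spanning tree of the complete distance graph.
D = squareform(pdist(points))
mst = minimum_spanning_tree(D).toarray()

# Remove the longest tree edge; the tree falls apart into two components.
i, j = np.unravel_index(mst.argmax(), mst.shape)
mst[i, j] = 0.0

n_clusters, labels = connected_components(csr_matrix(mst), directed=False)
print(n_clusters, labels)
```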
Your plot indicates that you chose the minPts parameter way too small.
Have a look at OPTICS, which no longer needs the epsilon parameter of DBSCAN.

Prediction avoiding landmass

I am working on a project where the following functions have to be implemented:
1. Predict the location of ships (in a maritime environment) at a future time (this can be done with a Kalman filter, an IMM filter and some other algorithms). Ships can be in any part of the world.
2. Avoid landmass during prediction.
3. Find the shortest path along the shorelines.
I am completely done with the first part, which is predicting without considering the shoreline information. I have problems with functions 2 and 3.
Problem in function 2
At times, the predicted location can fall on land, which is totally unacceptable.
I am using the following coastline shp file: http://openstreetmapdata.com/data/coastlines
This file has converted XY values of the world shoreline data.
I have loaded this shp file into PostgreSQL and use PostGIS to read it from the database.
So my idea is to go through all the polygons (the shoreline is defined as polygons) and check whether the line connecting the present location and the predicted location crosses a polygon. If it does, we have to find where the ship first intercepts the shoreline.
If I follow this approach and go through all the polygons, it is going to take forever (there are around 62000 polygons, each with thousands of points). Any advice on this? I thought about initially dividing the world map into hierarchical areas (level 1: 10 polygons; level 2: each polygon has 10 polygons inside), but I am not sure how to divide the world map from the above shp file into the levels of polygons I require.
Is there any PostGIS functionality that helps with this, or any other library for this purpose? I believe this kind of functionality should already be available, but I have not been able to figure it out so far.
Function 3
Since we now know where the ship first intercepts the shoreline, we can predict its course along the shoreline using a shortest-path algorithm, given that we know the destination. But to do this, the shoreline map has to be divided into a grid so that the shortest-path algorithm can be used.
So how can you make a grid along the shorelines? I am not doing image processing here; what I have now is this shp file. Any advice is appreciated. Or should I go with some image processing approach to build the shoreline grid? If so, please provide some links.
First, PostGIS is pretty fast, and with the proper indexes, as long as you keep your polygons reasonably small, you should be able to make up for the number of them with good indexing and overlapping operator support (overlapping polygons can use GIST and GIN indexes, with the latter performing better than the former for reads and worse for writes).
62000 polygons globally is nothing. Write back when you are having to check more than a few thousand whose bounding boxes overlap with your line....
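To illustrate why only polygons whose bounding boxes overlap the line matter, here is a plain-Python sketch of a bounding-box prefilter. A spatial index performs this kind of filtering without scanning every polygon; the linear scan below is only for illustration. The polygon names and coordinates are made up, and planar coordinates are assumed (no handling of the antimeridian).

```python
def bbox(points):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a point list."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def boxes_overlap(a, b):
    """True if two bounding boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def candidate_polygons(segment, polygons):
    """Names of polygons whose bounding box overlaps the segment's bounding box."""
    seg_box = bbox(segment)
    return [name for name, ring in polygons.items()
            if boxes_overlap(seg_box, bbox(ring))]

# Hypothetical coastline polygons and a track from present to predicted position.
polygons = {
    "island_a": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "island_b": [(50, 50), (52, 50), (52, 52), (50, 52)],
}
track = [(-1, 1), (3, 1)]

candidates = candidate_polygons(track, polygons)
print(candidates)
```

Only the surviving candidates need the expensive exact line-versus-polygon intersection test.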
For the third problem, you are going one direction, right? I am wondering how hard it would be to write a tangent(point, vector, polygon) function which would return the closest tangent to a polygon along a certain vector (a vector could be represented by a (point, point) tuple). If you were to combine this with KNN searches, you ought to be able to plot a course using a WITH RECURSIVE query.

Clustering on non-numeric dimensions

I recently started working on clustering and k-means algorithm and was trying to come up with a good use case and solve it.
I have the following data about the items sold in different cities.
Item    City
Item1   New York
Item2   Charlotte
Item1   San Francisco
...
I would like to cluster the data based on the variables city and item to find groups of cities that might have similar patterns for the items sold. The problem is that the k-means I use does not accept non-numeric input. Any idea how I should proceed with this to find a meaningful solution?
Thanks
SV
Clustering requires a distance definition. A cluster is only a cluster if the items are "closer" according to some distance function. The closer they are, the more likely they belong to the same cluster.
In your case, you can try to cluster based on various data related to the cities, like their geographical coordinates or demographic information, and see if the clusters overlap in the various cases!
In order for k-means to produce usable results, the means must be meaningful.
Even if you would e.g. use binary vectors, k-means on these would not make a lot of sense IMHO.
Probably the best use case to get started with k-means is color quantization. Take a picture, and use the RGB values of every pixel as 3d vectors. Then run k-means with k as the desired number of colors. The color centers are your final palette, and every pixel will be mapped to the closest center for color reduction.
The reasons why this works well with k-means are twofold:
the mean actually makes sense for finding the mean color of multiple pixels
the axes R, G and B have a similar meaning and scale, so there is no bias
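A minimal sketch of the color quantization use case, assuming scikit-learn's KMeans and a tiny synthetic 4x4 image in place of a real picture (the pixel values and k = 2 are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic image: top half reddish, bottom half bluish, with two slight variations.
image = np.zeros((4, 4, 3), dtype=np.uint8)
image[:2] = (250, 10, 10)
image[2:] = (10, 10, 250)
image[0, 0] = (240, 20, 20)
image[3, 3] = (20, 20, 240)

# Treat every pixel as a 3d vector (R, G, B).
pixels = image.reshape(-1, 3).astype(float)

# k is the desired number of colors in the palette.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_

# Map every pixel to its closest palette color.
quantized = palette[km.labels_].reshape(image.shape)
print(palette)
```

Each palette entry is the mean color of the pixels assigned to it, which is why the mean being meaningful matters here.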
If you want to go a step further, try to do the same in, e.g., HSB space. You'll run into difficulties if you want it to be really good, because the hue value is cyclic, which is inconsistent with the arithmetic mean. Assuming the hue is on 0-360 degrees, the "mean" hue of "1" and "359" is not 180 degrees, but 0. So on this data, k-means results will be suboptimal.
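The hue example can be checked with a short sketch that compares the arithmetic mean with a circular mean (averaging the angles as unit vectors). The helper name circular_mean_deg is hypothetical.

```python
import math

def circular_mean_deg(angles):
    """Mean direction of angles in degrees, computed via unit vectors."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

hues = [1.0, 359.0]
arithmetic = sum(hues) / len(hues)   # what a plain k-means centroid would use
circular = circular_mean_deg(hues)   # lands next to 0 (equivalently 360) degrees

print(arithmetic, circular)
```

The arithmetic mean lands on the opposite side of the color wheel, while the circular mean stays next to both input hues, up to floating-point rounding.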
See e.g. https://en.wikipedia.org/wiki/Color_quantization for details as well as the two dozen k-means questions here with respect to sparse and binary data.
You may still need to represent your data abstractly in numerical form. This may help:
http://www.analyticbridge.com/forum/topics/clustering-with-non-numeric?commentId=2004291%3AComment%3A40805
Try to re-analyze the problem and understand whether there is any relationship that you can take advantage of and represent in numerical form.
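One possible numerical form, sketched under the assumption that a city can be described by how often each item was sold there: a city-by-item count matrix. Whether k-means is appropriate on such vectors is a separate question; this only shows the representation, using the three rows from the question.

```python
from collections import Counter

# (item, city) rows as given in the question.
sales = [
    ("Item1", "New York"),
    ("Item2", "Charlotte"),
    ("Item1", "San Francisco"),
]

cities = sorted({city for _, city in sales})
items = sorted({item for item, _ in sales})

# Count how often each item was sold in each city.
counts = Counter((city, item) for item, city in sales)

# One row per city, one column per item.
matrix = [[counts[(city, item)] for item in items] for city in cities]

for city, row in zip(cities, matrix):
    print(city, row)
```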
I worked on a project where I had to represent colors by their RGB values. It worked pretty well.
Hope this helps