CurveRep in Hmisc for clustering longitudinal curves based on 3 time points - cluster-analysis

I am working on the following project and am exploring the CurveRep() clustering approach provided by Hmisc. (CurveRep() clusters individual subjects' longitudinal growth curves according to similar patterns, based on the CLARA clustering algorithm.) As I haven't found any publication using CurveRep() and there is very little discussion about it online, I would be grateful if you could let me know your experience with it or what you think about it!
- My project: I have about 200 metabolites measured in n=500 subjects at three time points (0, 30, 120 min). Individual time courses vary quite a bit, but in spaghetti plots there appear to be groups (e.g. straight and flat curves, peak-shaped curves, valley-shaped curves). I would like to cluster these curves into two or three representative time courses and would then fit a curve-specific regression model for each cluster. CurveRep() seems to be exactly what I am looking for, and it produces acceptable cluster solutions (although the solutions are based more on different y-axis intercepts than on different growth patterns).
Is it any good? Are there alternative clustering algorithms that group according to similar longitudinal change (e.g., cluster 1 = "linear rising", cluster 2 = "valley-shaped")?
Thanks a lot!
Chris

Three time points are too few for the usual time-series methods to work for you. Take DTW, for example - it is designed for much higher resolution.
Clustering algorithms such as k-means, PAM and CLARA could work for you. Look at the cluster centers.
It may be necessary to preprocess your data more carefully.
If you are interested in change instead of absolute values, encode your data accordingly. For example,
x1, x2, x3 -> x2 - x1, x3 - x2
or
x1, x2, x3 -> x1 - mu, x2 - mu, x3 - mu, with mu = (x1 + x2 + x3) / 3.
This will make the clustering results more likely to match your motivation.
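To make this concrete, here is a minimal sketch of the encode-then-cluster idea, written in Python/scikit-learn purely for illustration (the toy data, the choice of k = 3 and all variable names are assumptions; the same transformation can of course be applied in R before clustering):

import numpy as np
from sklearn.cluster import KMeans

# Toy data: one row per subject, columns = metabolite level at 0, 30, 120 min
rng = np.random.default_rng(0)
curves = rng.normal(size=(500, 3))

# Encode change rather than absolute level: (x2 - x1, x3 - x2)
deltas = np.diff(curves, axis=1)

# Alternative encoding: subtract each curve's own mean, (x1 - mu, x2 - mu, x3 - mu)
centered = curves - curves.mean(axis=1, keepdims=True)

# Cluster the encoded curves and inspect the cluster centers, which now
# represent typical shapes (flat, rising, peak) rather than intercepts
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(deltas)
print(km.cluster_centers_)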

Related

How to identify found clusters in Lumer Faieta Ant clustering

I have been experimenting with Lumer-Faieta clustering and I am getting promising results.
However, as the clusters form, how do I identify the final clusters? Do I run another clustering algorithm to identify them (that seems counter-productive)?
I had the idea of starting each data point in its own cluster. Then, when a laden ant drops a data point, it gets the same cluster as the data points that dominate its neighborhood. The problem with this is that if clusters are broken up, they still share the same cluster number.
I am stuck. Any suggestions?
To solve this problem, I employed DBSCAN as a post-processing step. The effect is as follows:
Given that we have a projection of a high-dimensional problem onto a 2D grid, with known distances and uniform densities, DBSCAN is ideal for this problem. Choosing the right value for epsilon and the minimum number of neighbours is trivial (I used 3 for both values). Once the clusters have been identified, they can be projected back to the n-dimensional space.
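A rough sketch of that post-processing step with scikit-learn's DBSCAN; the grid coordinates below are invented placeholders, and only eps = 3 and min_samples = 3 come from the answer above:

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2D grid positions where the ants dropped the data points
grid_positions = np.array([[0, 0], [0, 1], [1, 1],
                           [10, 10], [10, 11], [11, 10]])

# eps = 3 and min_samples = 3, as suggested above
labels = DBSCAN(eps=3, min_samples=3).fit_predict(grid_positions)

# labels[i] is the cluster id of point i (-1 would mean noise); these ids can
# then be carried back to the original high-dimensional points
print(labels)   # e.g. [0 0 0 1 1 1]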
See The 5 Clustering Algorithms Data Scientists Need to Know for a quick overview (and graphic demo) of DBSCAN and some other clustering algorithms.

Comparing k-means clustering

I have 150 images, 15 each of 10 different people. So basically I know which image should belong together, if clustered.
These images have 73 dimensions (feature vectors), and I clustered them into 10 clusters using the kmeans function in MATLAB.
Later, I processed these 150 data points, reduced their dimensionality from 73 to 3 for my work, and applied the same kmeans function to them.
I want to compare the results obtained on these two data sets (processed and unprocessed) with the same k-means function, and to know whether the processing that reduced the dimensionality improves the k-means clustering or not.
I thought comparing the variance of each cluster could be one criterion for comparison, but I am not sure I can directly compare and evaluate my results (within-cluster sum of distances etc.), since the two cases have different dimensionality. Could anyone please suggest a way to compare the k-means results, some way to normalize them, or any other comparison I could make?
I can think of three options. I am unaware of any well-developed methodology to do this specifically with k-means clustering.
1. Look at the confusion matrices between the two approaches (a small sketch of this follows below).
2. Compare the Mahalanobis distances between the clusters, and between items in clusters and their nearest other clusters.
3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with option 3 is that the distance metrics get skewed: 3-D and 73-D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on k-means if you are adamant about that path; rank speculation is fun, but standing on the shoulders of giants is better.
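As a rough illustration of option 1, here is how the confusion matrix between the two labelings could be computed with scikit-learn, plus an adjusted Rand index against the known identities (which goes beyond the three options above but uses the ground truth mentioned in the question); the data and the stand-in dimensionality reduction are placeholders:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, adjusted_rand_score

# Placeholder data: 150 feature vectors of dimension 73, and a stand-in
# "reduction" to 3 dimensions (replace with the real reduced data)
rng = np.random.default_rng(0)
X_full = rng.normal(size=(150, 73))
X_reduced = X_full[:, :3]

labels_full = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_full)
labels_reduced = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_reduced)

# Confusion matrix between the two labelings (rows: full, columns: reduced)
print(confusion_matrix(labels_full, labels_reduced))

# The adjusted Rand index against ground truth (15 images per person) is
# independent of the dimensionality, so the two runs can be compared directly
true_labels = np.repeat(np.arange(10), 15)
print(adjusted_rand_score(true_labels, labels_full))
print(adjusted_rand_score(true_labels, labels_reduced))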

Python Clustering Algorithms

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition no a priori linking lengths are known (similar to this question).
I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.
Here is an example of what dbscan finds:
You can see that there really are two separate populations here, yet even when adjusting the epsilon factor (the maximum-distance-between-neighboring-points parameter), I simply cannot get it to see those two populations of particles.
Are there any other algorithms that would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.
I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in SciKit-Learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).
Here's an example of what it gives:
I also want to point out that the documentation states it may not scale well.
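A minimal sketch of what using it might look like (the particle positions are placeholder data; estimate_bandwidth and its quantile argument are just one way to avoid supplying any parameter by hand):

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Placeholder particle positions: two well-separated groups
rng = np.random.default_rng(0)
positions = np.vstack([rng.normal(0, 1, (200, 2)),
                       rng.normal(8, 1, (200, 2))])

# The bandwidth can be estimated from the data itself, so no cluster count
# or linking length has to be supplied up front
bandwidth = estimate_bandwidth(positions, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit(positions)

print("clusters found:", len(ms.cluster_centers_))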
When using DBSCAN it can be helpful to scale/normalize the data or distances beforehand, so that the estimation of epsilon will be relative.
There is an implementation of DBSCAN - I think it's the one Anony-Mousse somewhere described as 'floating around' - which comes with an epsilon estimator function. It works as long as it isn't fed with large datasets.
There are several incomplete versions of OPTICS on GitHub. Maybe you can find one to adapt for your purpose. I am still trying to figure out myself what effect minPts has when using one and the same extraction method.
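The epsilon-estimation idea can be approximated with a k-nearest-neighbour distance heuristic; the sketch below is a generic version of it in scikit-learn, not the particular implementation referred to above, and the percentile rule is an assumption:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                    # placeholder data

X_scaled = StandardScaler().fit_transform(X)     # scale/normalize first

# Distance to the k-th nearest neighbour; a knee in the sorted values is a
# common heuristic for eps (here simply a high percentile is taken)
k = 4
dist, _ = NearestNeighbors(n_neighbors=k).fit(X_scaled).kneighbors(X_scaled)
eps_guess = np.percentile(dist[:, -1], 90)

labels = DBSCAN(eps=eps_guess, min_samples=k).fit_predict(X_scaled)
print(eps_guess, np.unique(labels))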
You can try a minimum spanning tree (Zahn's algorithm) and then remove the longest edges, similar to alpha shapes. I used it with a Delaunay triangulation and a concave hull: http://www.phpdevpad.de/geofence. You can also try hierarchical clustering, for example clusterfck.
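A sketch of that minimum-spanning-tree idea with SciPy (build the MST on the pairwise distances, cut the longest edge, read off the connected components); the data are placeholders and cutting only the single longest edge is the simplest possible variant:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

# Placeholder points: two well-separated groups
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (50, 2)),
                    rng.normal(5, 0.5, (50, 2))])

# Minimum spanning tree over the full pairwise distance matrix
mst = minimum_spanning_tree(squareform(pdist(points))).toarray()

# Remove the longest edge (repeat for more clusters)
i, j = np.unravel_index(np.argmax(mst), mst.shape)
mst[i, j] = 0

# The remaining connected components are the clusters
n_clusters, labels = connected_components(mst, directed=False)
print(n_clusters)   # 2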
Your plot indicates that you chose the minPts parameter way too small.
Have a look at OPTICS, which no longer needs the epsilon parameter of DBSCAN.
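A minimal sketch with scikit-learn's OPTICS (available in newer versions); only min_samples is supplied, no eps, and the data are placeholders:

import numpy as np
from sklearn.cluster import OPTICS

# Placeholder particle positions: two dense groups
rng = np.random.default_rng(0)
particles = np.vstack([rng.normal(0, 0.3, (100, 2)),
                       rng.normal(4, 0.3, (100, 2))])

# No eps needed; min_samples plays the role of DBSCAN's minPts
labels = OPTICS(min_samples=10).fit_predict(particles)
print(np.unique(labels))   # cluster ids, -1 = noise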

Expectation Maximization Issue - How to find the optimum number of gaussians within the data

Is there any algorithm or trick to determine the number of Gaussians that should be identified within a set of data before applying the expectation-maximization algorithm?
For example, in the plot of 2-dimensional data illustrated above, when I apply the expectation-maximization algorithm I try to fit 4 Gaussians to the data, and I obtain the following result.
But what if I didn't know the number of Gaussians within the data? Is there any algorithm or trick I could apply to find out this detail?
This might be a bit of a retread, since others have already linked the wiki article on actual cluster-number determination, but I found that article a little overly dense, so I thought I'd provide a brief, intuitive answer:
Basically, there isn't a universally 'correct' answer for the number of clusters in a data set -- the fewer clusters, the smaller the description length but the higher the variance, and in all non-trivial datasets the variance won't completely go away unless you have a Gaussian for each point, which renders the clustering useless (this is a case of the more general phenomenon known as the 'futility of bias-free learning': a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances).
So you basically have to pick some feature of your dataset to maximize via the number of clusters (see the wiki article on inductive bias for some example features).
In other sad news, in all such cases finding the number of clusters is known to be NP-hard, so the best you can expect is a good heuristic approach.
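One common, concrete way to "pick something to optimize" is to fit mixtures with different numbers of components and keep the one with the lowest BIC; this is a generic sketch with scikit-learn's GaussianMixture, not something specific to the answer above, and the data are placeholders:

import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data: four well-separated 2D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, (150, 2)) for loc in (0, 3, 6, 9)])

# Fit mixtures with 1..8 components and keep the one with the lowest BIC
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(1, 9)]
best = min(models, key=lambda m: m.bic(X))
print("chosen number of Gaussians:", best.n_components)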
Wikipedia has an article on this subject. I am not too familiar with the subject, but I've been told that clustering algorithms that don't require specifying the number of clusters instead need some density information about the clusters or some minimum distance between clusters.
Non-parametric Bayesian clustering is now getting a lot of attention. You don't need to specify the number of clusters.
AutoClass is an algorithm that automatically identifies the number of clusters from the mixture.
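As a hedged illustration of the non-parametric Bayesian idea, scikit-learn's BayesianGaussianMixture can be given a generous upper bound on the number of components and will drive the weights of unneeded components towards zero; the data and the bound of 10 are assumptions:

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Placeholder data: three 2D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.4, (200, 2)) for loc in (0, 4, 8)])

# Upper bound of 10 components; unused components end up with near-zero weight
bgm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

print(np.round(bgm.weights_, 3))   # only a few weights stay well above zero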

Find connected components in a graph in MATLAB

I have many 3D data points, and I wish to find 'connected components' in this graph. That is, clusters are formed that exhibit the following properties:
Each cluster contains points all of which are at most a given distance from another point in the cluster.
All points in two distinct clusters are at least that distance from each other.
This problem is described in the question and answer here.
Is there a MATLAB implementation of such an algorithm built-in or available on the FEX? Simple searches have not thrown up anything useful.
Perhaps a density-based clustering algorithm can be applied in this case. See this related question for a description of the DBscan algorithm.
I do not think that it is possible to satisfy both conditions in all cases.
If you decide to concentrate on the first condition, you can use complete-linkage hierarchical clustering, in which points or groups of points are merged based on the maximum distance between any two points. In MATLAB, this is implemented in CLUSTERDATA (see its help for the individual function steps).
To calculate your cluster indices, you'd run
clusterIndex = clusterdata(coordinates, 'criterion','distance', 'cutoff',maxDistance, 'linkage','complete', 'distance','euclidean')
In case you then want to simply eliminate points of different clusters that are less than minDistance apart, you can run pdist between clusters to clean up your connected components.
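For what it's worth, here is the same complete-linkage-with-cutoff idea sketched in Python/SciPy (the MATLAB call above is the actual answer; coordinates and maxDistance below are placeholder names mirroring it):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
coordinates = rng.normal(size=(100, 3))   # placeholder 3D points
maxDistance = 1.0                         # same role as maxDistance above

# Complete linkage, then cut the dendrogram at maxDistance
Z = linkage(pdist(coordinates), method='complete')
clusterIndex = fcluster(Z, t=maxDistance, criterion='distance')
print(len(np.unique(clusterIndex)))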
The k-means or k-medoids algorithm may be useful in this case.