Does MATLAB's k-means recompute the cluster membership of seeds?

I am unable to find details about how MATLAB's k-means handles seeds. Does MATLAB's k-means recompute the cluster assignment of the seeds Xs, which are a subset of the data matrix X?
Or are these seeds only used as initial centroid locations and not considered in the cluster-assignment phase?
I want to do semi-supervised clustering by seeding, as described by Sugato Basu et al.
It might be a naive question, but your answer would clear up this confusion.
Thanks in advance.

Did you check the documentation (doc kmeans)? There they use the term "initial cluster centroid positions" to refer to seeds.
In particular, look at the 'Start' parameter, which is used to specify the seeds, and the 'Replicates' parameter. Also see the Algorithms section, where they discuss the two phases of the process (batch update and online update). Finally, and perhaps best of all, you can look at the code directly with edit kmeans and step through it with the debugger.
It's not clear to me exactly what your question is but, from the above, I would answer that the seeds are used once, as the initial centroid positions set via 'Start'; after that, the batch update and online update reassign every observation, seeds included. This whole process is repeated according to the 'Replicates' parameter.
I don't know what "semi-supervised clustering by seeds" is, but I'm pretty sure it's not supported out of the box.
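For illustration, here is a minimal sketch of passing explicit seeds through the documented 'Start' option (the data and seed rows are made up):

% Sketch: seeding MATLAB's kmeans with explicit initial centroids.
X = rand(100, 2);             % made-up data: 100 observations, 2 features
seeds = X([1 50 100], :);     % three rows of X used as seeds
k = size(seeds, 1);

% 'Start' only fixes the initial centroid positions; afterwards every
% observation in X, including the seed rows, is reassigned during the
% batch-update and online-update phases.
[idx, C] = kmeans(X, k, 'Start', seeds);

% idx(1), idx(50), idx(100) need not be 1, 2, 3: the seed rows can end
% up in a different cluster once the centroids move.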

Related

K-means empty-cluster action with the Intel DAAL library

In the MATLAB version of the k-means algorithm, there is a very useful flag ('EmptyAction') that indicates the action to take if a cluster loses all member observations during the optimization. There are three possibilities in MATLAB:
treat an empty cluster as an error
remove any clusters that become empty
create a new cluster consisting of the one point furthest from its centroid
Does anyone know what happens in DAAL k-means in that case? I could not find anything in the documentation about this.
In the k-means implementation of Intel DAAL, clustering information on the feature vectors is automatically collected during execution of the program. The feature vectors furthest from their assigned centroids are selected as new cluster centers to compensate for empty clusters during an iteration.
Note that a good choice of cluster initialization can avoid empty clusters.
Refer to https://software.intel.com/en-us/daal-programming-guide-details-5 for further details.
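For reference, the MATLAB flag mentioned in the question is 'EmptyAction'; a minimal sketch of its three options on made-up data:

% Sketch: MATLAB's three empty-cluster policies for kmeans.
X = rand(50, 2);              % made-up data
k = 5;
idx1 = kmeans(X, k, 'EmptyAction', 'error');     % abort if a cluster empties
idx2 = kmeans(X, k, 'EmptyAction', 'drop');      % remove empty clusters (their centroid becomes NaN)
idx3 = kmeans(X, k, 'EmptyAction', 'singleton'); % reseed with the point furthest from its centroid

The 'singleton' option is the one that matches the DAAL behavior described above.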

Is there a function in matlab to evaluate homogeneity and completeness of a clustering solution?

I'm working on a clustering problem and I want to evaluate the performance. Given a data set of samples and a testing set containing the true class assignments of the samples, is there a function in MATLAB to evaluate homogeneity and completeness? Homogeneity: each cluster contains only members of a single class. Completeness: all members of a given class are assigned to the same cluster.
Yes.
Almost every "external" cluster evaluation method is based on these concepts. They are maybe not exactly what you described, because they attempt to solve the problem that clusters and classes usually do not align well. If MATLAB is any good for clustering (ok, it is not very common anymore...), it ought to have several such measures.
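MATLAB has no built-in function for these two scores as far as I know, but they follow directly from conditional entropies over the class/cluster contingency table. A minimal sketch (homogeneity_completeness is a made-up name; classLabels and clusterLabels are integer vectors of equal length):

function [h, c] = homogeneity_completeness(classLabels, clusterLabels)
% Sketch: h = 1 - H(class|cluster)/H(class), c = 1 - H(cluster|class)/H(cluster).
N = numel(classLabels);
[~, ~, ci] = unique(classLabels(:));
[~, ~, ki] = unique(clusterLabels(:));
P = accumarray([ci ki], 1) / N;        % joint distribution over (class, cluster)
pc = sum(P, 2);                        % class marginal
pk = sum(P, 1);                        % cluster marginal
Hc = -sum(pc(pc > 0) .* log(pc(pc > 0)));   % H(class)
Hk = -sum(pk(pk > 0) .* log(pk(pk > 0)));   % H(cluster)
Pk = repmat(pk, numel(pc), 1);         % P(cluster), broadcast over rows
Pc = repmat(pc, 1, numel(pk));         % P(class), broadcast over columns
nz = P > 0;
HcGivenK = -sum(P(nz) .* log(P(nz) ./ Pk(nz)));   % H(class | cluster)
HkGivenC = -sum(P(nz) .* log(P(nz) ./ Pc(nz)));   % H(cluster | class)
if Hc == 0, h = 1; else, h = 1 - HcGivenK / Hc; end
if Hk == 0, c = 1; else, c = 1 - HkGivenC / Hk; end
end

A perfectly homogeneous but incomplete clustering then scores h = 1 and c < 1, matching the two definitions in the question.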

Best way to validate DBSCAN Clusters

I have used the ELKI implementation of DBSCAN to identify fire hot spot clusters from a fire data set and the results look quite good. The data set is spatial and the clusters are based on latitude, longitude. Basically, the DBSCAN parameters identify hot spot regions where there is a high concentration of fire points (defined by density). These are the fire hot spot regions.
My question is, after experimenting with several different parameters and finding a pair that gives a reasonable clustering result, how does one validate the clusters?
Is there a suitable formal validation method for my use case? Or is this subjective depending on the application domain?
ELKI contains a number of evaluation functions for clusterings.
Use the -evaluator parameter to enable them, from the evaluation.clustering.internal package.
Some of them will not automatically run because they have quadratic runtime cost - probably more than your clustering algorithm.
I do not trust these measures. They are designed for particular clustering algorithms; and are mostly useful for deciding the k parameter of k-means; not much more than that. If you blindly go by these measures, you end up with useless results most of the time. Also, these measures do not work with noise, with either of the strategies we tried.
The cheapest are the label-based evaluators. These will automatically run, but apparently your data does not have labels (or they are numeric, in which case you need to set the -parser.labelindex parameter accordingly). Personally, I prefer the Adjusted Rand Index to compare the similarity of two clusterings. All of these indexes are sensitive to noise so they don't work too well with DBSCAN, unless your reference has the same concept of noise as DBSCAN.
If you can afford it, a "subjective" evaluation is always best.
You want to solve a problem, not a number. That is the whole point of "data science", being problem oriented and solving the problem, not obsessed with minimizing some random quality number. If the results don't work in reality, you failed.
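For reference, here is a minimal sketch of the Adjusted Rand Index mentioned above, computed via pair counting in MATLAB (adjusted_rand_index is a made-up name; if DBSCAN marks noise, you must decide how both labelings encode it before comparing):

function ari = adjusted_rand_index(labels1, labels2)
% Sketch: ARI from the contingency table of two labelings.
[~, ~, i1] = unique(labels1(:));
[~, ~, i2] = unique(labels2(:));
M = accumarray([i1 i2], 1);            % contingency table
n = numel(labels1);
comb2 = @(x) x .* (x - 1) / 2;         % vectorized nchoosek(x, 2)
sumij = sum(comb2(M(:)));              % agreeing pairs per cell
sumA  = sum(comb2(sum(M, 2)));         % pairs within labels1 clusters
sumB  = sum(comb2(sum(M, 1)));         % pairs within labels2 clusters
expectedIndex = sumA * sumB / comb2(n);
maxIndex = (sumA + sumB) / 2;
ari = (sumij - expectedIndex) / (maxIndex - expectedIndex);
end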
There are different methods to validate a DBSCAN clustering output. Generally we can distinguish between internal and external indices, depending on whether you have labeled data available or not. For DBSCAN there is a great internal validation index called DBCV.
External indices:
If you have some labeled data, external indices are great and can demonstrate how well the clustering did versus the labeled data. One example is the Rand index: https://en.wikipedia.org/wiki/Rand_index
Internal indices:
If you don't have labeled data, then internal indices can be used to give the clustering result a score. In general these indices calculate the distances of points within a cluster and to other clusters, and score the result based on compactness (how close are the points to each other in a cluster?) and separability (how much distance is between the clusters?).
For DBSCAN, there is one great internal validation index called DBCV by Moulavi et al. The paper is available here: https://epubs.siam.org/doi/pdf/10.1137/1.9781611973440.96
Python package: https://github.com/christopherjenness/DBCV
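As a cheap built-in alternative to DBCV, MATLAB's silhouette gives a compactness/separation score; a minimal sketch, noting that DBSCAN's noise points (label -1) must be excluded first because internal indices have no notion of noise (data and parameters are illustrative):

% Sketch: silhouette as a generic internal index (not DBCV).
X = rand(200, 2);                 % made-up data
idx = dbscan(X, 0.1, 5);          % requires R2019a+; eps/minpts are illustrative
keep = idx > 0;                   % drop noise points (idx == -1)
s = silhouette(X(keep, :), idx(keep));
meanSilhouette = mean(s)          % closer to 1 = compact, well-separated clusters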

Python Clustering Algorithms

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition to this, no a priori linking lengths are known (similar to this question).
I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.
Here is an example of what dbscan finds:
You can see that there really are two separate populations here, yet no matter how I adjust the epsilon factor (the maximum distance between neighboring points), I simply cannot get it to see those two populations of particles.
Are there any other algorithms that would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.
I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in scikit-learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).
Here's an example of what it gives:
I also want to point out that the documentation states it may not scale well.
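For readers without scikit-learn: mean shift itself is only a few lines. A rough MATLAB sketch with a flat kernel (mean_shift is a made-up name, and bandwidth is the one parameter the algorithm needs; scikit-learn's estimate_bandwidth guesses it from the data):

function [labels, modes] = mean_shift(X, bandwidth)
% Sketch: flat-kernel mean shift. Every point climbs to the mean of its
% neighbors within 'bandwidth' until it converges on a density mode.
shifted = X;
for it = 1:300
    old = shifted;
    for i = 1:size(X, 1)
        d = sqrt(sum((X - shifted(i, :)).^2, 2));  % implicit expansion, R2016b+
        nbrs = d < bandwidth;
        if any(nbrs)
            shifted(i, :) = mean(X(nbrs, :), 1);
        end
    end
    if max(abs(shifted(:) - old(:))) < 1e-6, break; end
end
% Points that converged to (nearly) the same mode share a label.
[modes, ~, labels] = uniquetol(shifted, bandwidth / 10, ...
                               'ByRows', true, 'DataScale', 1);
end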
When using DBSCAN it can be helpful to scale/normalize the data or distances beforehand, so that the estimation of epsilon is relative.
There is an implementation of DBSCAN - I think it's the one Anony-Mousse somewhere described as 'floating around' - which comes with an epsilon estimator function. It works as long as it's not fed large datasets.
There are several incomplete versions of OPTICS on GitHub. Maybe you can find one to adapt for your purpose. I'm still trying to figure out myself what effect minPts has when using one and the same extraction method.
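The usual epsilon estimator behind such functions is the k-distance plot: sort every point's distance to its k-th nearest neighbor and read epsilon off the "knee". A minimal MATLAB sketch (data and k are made up):

% Sketch: k-distance plot for choosing DBSCAN's epsilon.
X = rand(500, 2);                      % made-up data
k = 4;                                 % often minPts - 1
D = sort(pdist2(X, X), 2);             % row-wise sorted pairwise distances
kdist = sort(D(:, k + 1), 'descend');  % k-th neighbor; column 1 is the point itself
plot(kdist)
xlabel('points sorted by k-distance')
ylabel('distance to k-th nearest neighbor')  % epsilon = value at the knee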
You can try a minimum spanning tree (Zahn's algorithm) and then remove the longest edges, similar to alpha shapes. I used it with a Delaunay triangulation and a concave hull: http://www.phpdevpad.de/geofence. You can also try hierarchical clustering, for example clusterfck.
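A rough sketch of that MST approach with MATLAB's built-in graph functions (the number of edges to cut is an assumption; in Zahn's formulation you would instead cut edges that are inconsistently long compared to their neighborhood):

% Sketch: clustering by cutting the longest minimum-spanning-tree edges.
X = rand(100, 2);                      % made-up data
cutEdges = 1;                          % cutting m edges yields m + 1 clusters
D = squareform(pdist(X));              % dense pairwise distance matrix
G = graph(D);                          % complete weighted graph
T = minspantree(G);
[~, order] = sort(T.Edges.Weight, 'descend');
T = rmedge(T, order(1:cutEdges));      % drop the longest MST edges
labels = conncomp(T);                  % connected components are the clusters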
Your plot indicates that you chose the minPts parameter way too small.
Have a look at OPTICS, which no longer needs the epsilon parameter of DBSCAN.

Find connected components in a graph in MATLAB

I have many 3D data points, and I wish to find 'connected components' in this graph. This is where clusters are formed that exhibit the following properties:
Each cluster contains points all of which are at most some threshold distance from another point in the cluster.
All points in two distinct clusters are at least that threshold distance from each other.
This problem is described in the question and answer here.
Is there a MATLAB implementation of such an algorithm built-in or available on the FEX? Simple searches have not thrown up anything useful.
Perhaps a density-based clustering algorithm can be applied in this case. See this related question for a description of the DBSCAN algorithm.
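Since R2019a, a DBSCAN implementation ships with the Statistics and Machine Learning Toolbox; a minimal sketch, where epsilon plays the role of the distance threshold in the two conditions above (values are illustrative):

% Sketch: built-in density-based clustering (R2019a+).
X = rand(1000, 3);                 % made-up 3D points
epsilon = 0.05;                    % illustrative distance threshold
idx = dbscan(X, epsilon, 1);       % minpts = 1 makes this pure distance-linking
% idx holds cluster labels; with minpts > 1, idx == -1 marks noise.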
I do not think that it is possible to satisfy both conditions in all cases.
If you decide to concentrate on the first condition, you can use complete-linkage hierarchical clustering, in which points or groups of points are merged based on the maximum distance between any two points. In MATLAB, this is implemented in CLUSTERDATA (see the help for the individual function steps).
To calculate your cluster indices, you'd run:
clusterIndex = clusterdata(coordinates, 'Criterion', 'distance', 'Cutoff', maxDistance, 'Linkage', 'complete', 'Distance', 'euclidean')
In case you then want to simply eliminate points of different clusters that are less than minDistance apart, you can run pdist between clusters to clean up your connected components.
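A minimal sketch of that cleanup check, assuming the clusterIndex from above and a made-up minDistance threshold:

% Sketch: flag cluster pairs that violate the minimum-separation condition.
labels = unique(clusterIndex);
for a = 1:numel(labels)
    for b = a + 1:numel(labels)
        Pa = coordinates(clusterIndex == labels(a), :);
        Pb = coordinates(clusterIndex == labels(b), :);
        if min(min(pdist2(Pa, Pb))) < minDistance
            fprintf('clusters %d and %d are too close\n', labels(a), labels(b));
        end
    end
end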
The k-means or k-medoids algorithm may be useful in this case.