What is the best way to cluster a dataset with no labels and no idea of the number of clusters required?
For example, using the Iris dataset with no labels or knowledge of the number of label classes.
My idea:
* Compute the mean squared distance from each of the existing cluster centers for a sample.
* If that distance exceeds some threshold, scaled by a factor that penalizes larger k, add the sample as a candidate "new" cluster center.
* If a new cluster was added, find the new "best" k+1 cluster centers.
* If no new cluster was added, move on to the next row.
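For concreteness, here is a rough MATLAB sketch of one possible reading of that idea; the threshold value and the exact form of the k-dependent penalty are assumptions, not something the description above specifies.

% One possible reading of the idea above (a sketch, not a tested algorithm).
% X is an n-by-d data matrix; the threshold and penalty form are assumptions.
threshold = 1.0;
centers = X(1,:);                                  % start with a single center
for i = 2:size(X,1)
    d2 = min(sum((centers - X(i,:)).^2, 2));       % squared distance to nearest center
    k = size(centers,1);
    if d2 > threshold * (1 + log(k))               % penalty grows with k (assumed form)
        centers(end+1,:) = X(i,:);                 % add the sample as a candidate center
        [~, centers] = kmeans(X(1:i,:), k+1, 'Start', centers);  % refit the k+1 centers
    end
end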
What you can do is plot the elbow curve for different values of k, as described here.
Specifically,
1) The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 10 in the examples above), and for each value of k calculate the sum of squared errors (SSE).
2) Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, the "elbow" on the arm is the best value of k.
3) So our goal is to choose a small value of k that still has a low SSE; the elbow usually marks the point where increasing k starts to give diminishing returns.
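A minimal MATLAB sketch of this procedure, assuming the samples are the rows of a matrix X (for the Iris data, X = meas after load fisheriris):

load fisheriris                                    % Iris measurements, 150-by-4, in meas
X = meas;
maxK = 10;
sse = zeros(maxK,1);
for k = 1:maxK
    [~, ~, sumd] = kmeans(X, k, 'Replicates', 5);  % sumd: within-cluster sums of squared distances
    sse(k) = sum(sumd);
end
plot(1:maxK, sse, '-o');
xlabel('k'); ylabel('SSE');                        % look for the "elbow" in this curve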
Dozens of methods have been proposed on how to choose k.
Some variants such as X-means can adjust k dynamically; you only need to give the maximum, and choose a quality criterion such as AIC or BIC.
A proof of concept prototype I have to do for my final year project is to implement K-Means Clustering on a big data set and display the results on a graph. I only know object-oriented languages like Java and C# and decided to give MATLAB a try. I notice that with a functional language the approach to solving problems is very different, so I would like some insight on a few things if possible.
Suppose I have the following data set:
raw_data
400.39 513.29 499.99 466.62 396.67
234.78 231.92 215.82 203.93 290.43
15.07 14.08 12.27 13.21 13.15
334.02 328.79 272.2 306.99 347.79
49.88 52.2 66.35 47.69 47.86
732.88 744.62 687.53 699.63 694.98
And I picked row 2 and 4 to be the 2 centroids:
centroids
234.78 231.92 215.82 203.93 290.43 % Centroid 1
334.02 328.79 272.2 306.99 347.79 % Centroid 2
I now want to compute the Euclidean distance of each point to each centroid, then assign each point to its closest centroid and display this on a graph. Let's say I want to classify the centroids as blue and green. How can I do this in MATLAB? If this was Java I would initialise each row as an object and add them to separate ArrayLists (representing the clusters).
If rows 1, 2 and 3 all belong to the first centroid / cluster, and rows 4, 5 and 6 belong to the second centroid / cluster - how can I classify these to display them as blue or green points on a graph? I am new to MATLAB and really curious about this. Thanks for any help.
(To begin with, Matlab has a flexible distance-measuring function, pdist2, and also a kmeans implementation, but I'm assuming that you want to build your code from scratch.)
In Matlab, you try to implement everything as matrix algebra, without loops over elements.
In your case, if R is the raw_data matrix and C is the centroids matrix,
you can shift the dimension that represents centroid number to the 3rd place by
permC=permute(C,[3 2 1]); Then the bsxfun function allows you to subtract the permuted C from R, expanding the singleton dimensions as necessary: D=bsxfun(@minus,R,permC). Element-wise squaring followed by summation across columns, SqD=sum(D.^2,2), will give you the squared distances of each observation from each centroid. Performing all these operations within a single statement and shifting the third (centroid) dimension back to the 2nd place looks like this:
SqD=permute(sum(bsxfun(@minus,R,permute(C,[3 2 1])).^2,2),[1 3 2])
Picking the centroid of minimal distance is now straightforward: [minDist,minCentroid]=min(SqD,[],2)
If this looks complex, I recommend inspecting the product of each sub-step and reading the help of each command.
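Putting the pieces together, here is a minimal sketch that computes the assignments and plots the two clusters in blue and green, assuming R is the raw_data matrix and C is the centroids matrix as above (the observations are 5-dimensional, so only the first two coordinates are plotted here):

SqD = permute(sum(bsxfun(@minus, R, permute(C,[3 2 1])).^2, 2), [1 3 2]);
[~, minCentroid] = min(SqD, [], 2);                 % index of the nearest centroid per row
cluster1 = R(minCentroid == 1, :);                  % rows assigned to centroid 1
cluster2 = R(minCentroid == 2, :);                  % rows assigned to centroid 2
plot(cluster1(:,1), cluster1(:,2), 'b.', 'MarkerSize', 15); hold on;
plot(cluster2(:,1), cluster2(:,2), 'g.', 'MarkerSize', 15);
plot(C(:,1), C(:,2), 'kx', 'MarkerSize', 12, 'LineWidth', 2); hold off;
legend('Cluster 1', 'Cluster 2', 'Centroids');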
I am looking for a clustering algorithm such as DBSCAN to deal with 3D data, in which it is possible to set different epsilons depending on the axis. So, for instance, an epsilon of 10m on the x-y plane, and an epsilon of 0.2m on the z axis.
Essentially, I am looking for large but flat clusters.
Note: I am an archaeologist; the algorithm will be used to look for potential correlations between objects scattered over large surfaces but within narrow vertical layers.
Solution 1:
Scale your data set to match your desired epsilon.
In your case, scale z by 50.
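A minimal sketch of this, assuming the points are the rows of an n-by-3 matrix P and that MATLAB's dbscan function (available since R2019a in the Statistics and Machine Learning Toolbox) is an acceptable DBSCAN implementation; the minpts value is an assumption:

Pscaled = P;
Pscaled(:,3) = Pscaled(:,3) * 50;            % 10 m / 0.2 m = 50: stretch the z axis
labels = dbscan(Pscaled, 10, 5);             % epsilon = 10 (in x-y units), minpts = 5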
Solution 2:
Use a weighted distance function.
E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis.
This may be the most convenient option, since you are already using ELKI.
Just define a custom distance metric when computing the DBSCAN core points. The standard DBSCAN uses the Euclidean distance to compute points within an epsilon. So all dimensions are treated the same.
However, you could use the Mahalanobis distance to weigh each dimension differently. You can use a diagonal covariance matrix for flat clusters. You can use a full symmetric covariance matrix for flat tilted clusters, etc.
In your case, you would use a covariance matrix like:
100    0      0
  0  100      0
  0    0   0.04
In the pseudo code provided at the Wikipedia entry for DBSCAN just use one of the distance metrics suggested above in the regionQuery function.
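For illustration, a sketch of the regionQuery step with such a Mahalanobis distance in MATLAB, assuming P holds all points as rows of an n-by-3 matrix and p is the 1-by-3 query point; an epsilon of 1 then corresponds to 10 m in x-y and 0.2 m in z:

S = diag([100 100 0.04]);                    % diagonal covariance matrix from above
d = pdist2(p, P, 'mahalanobis', S);          % Mahalanobis distance from p to every point
neighbors = find(d <= 1);                    % indices of points within the weighted epsilon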
Update
Note: scaling the data is equivalent to using an appropriate metric.
I have a data set of n = 1000 realizations of a univariate random variable X -- X = {x1, x2, ..., xn}. The data are generated by varying a parameter on which the random variable depends. For example, let the r.v. be the area of a circle. So, by varying the radius (keeping the dimension fixed -- say, a 2-dimensional circle), I generate n areas for radii in the range r = 5 to n.
By using the fitdist command I can fit distributions to the data set, choosing distributions like Normal, Kernel, Binomial, etc. Thus, the data set is fitted to k distributions, so I get k fits. How do I select the best-fit distribution and hence the pdf?
Also, do I always need to normalize (post-process) the data into the range [0, 1] before fitting?
If I understand correctly, you are asking how to decide which distribution to choose once you have a few fits.
There are three major metrics (IMO) for measuring "goodness-of-fit":
Chi-Squared
Kolmogorov-Smirnov
Anderson-Darling
Which to choose depends on a large number of factors; you can randomly pick one or read the Wiki pages to figure out which suits your need. These tests are also a part of MATLAB.
For instance, you can use kstest for the Kolmogorov-Smirnov test. You can provide the data and the hypothesized distribution to the function and evaluate the different options based on the KS test.
Alternately, you can use Anderson-Darling through adtest or Chi-Squared through chi2gof.
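For example, a minimal sketch comparing two fitdist fits with the KS test, assuming the samples are in a column vector x (note that estimating the parameters from the same data makes this only a rough comparison):

pdNormal = fitdist(x, 'Normal');
pdKernel = fitdist(x, 'Kernel');
[~, pNormal] = kstest(x, 'CDF', pdNormal);   % p-value of the KS test under the Normal fit
[~, pKernel] = kstest(x, 'CDF', pdKernel);   % p-value of the KS test under the Kernel fit
% Less evidence against a fit (a larger p-value) suggests a better-fitting distribution.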
Suppose that we have a 64-dimensional dataset to cluster; let's say the data matrix is dt = 64x150.
Using the kmeans function from the vl_feat library, I will cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what the numbers in the matrix assignments mean. I don't get it at all. Would anyone mind helping me a bit here? An example or something would be great. What do these values represent, anyway?
In k-means, the problem you are trying to solve is clustering your 150 points into 20 clusters. Each point is 64-dimensional and thus represented by a vector of size 64. So in your case dt is the set of points, and each column is a 64-dim vector.
After running the algorithm you get centers and assignments. centers are the 20 positions of the cluster's center in a 64-dim space, in case you want to visualize it, measure distances between points and clusters, etc. 'assignments' on the other hand contains the actual assignments of each 64-dim point in dt. So if assignments[7] is 15 it indicates that the 7th vector in dt belongs to the 15th cluster.
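For instance, a small sketch of how the assignments vector can be used, assuming dt is the 64x150 matrix from the question:

[centers, assignments] = vl_kmeans(dt, 20);
cluster15 = dt(:, assignments == 15);        % all 64-dim points assigned to cluster 15
nPoints15 = sum(assignments == 15);          % how many points fell into cluster 15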
For example, here you can see a clustering of lots of 2D points, say 1000 of them, into 3 clusters. In this case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000, holding numbers ranging from 1 to 3 (or 0 to 2, in case you're using OpenCV).
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
In OpenCV it is the number of the cluster that each of the input points belongs to.
I'm trying to implement k-means clustering.
I have a set of points with coordinates (x, y) and I am using Euclidean distance as the distance measure. I've computed the distance between all pairs of points in a matrix
dist[i][j] - distance between points i and j
Say I choose the point farthest from point 1, which turns out to be point 3 (i.e. a[1][3] is the maximum).
Then, when I search for the point farthest from point 3, I may get some point j for which a[3][j] is large but a[1][j] is small
(point j is far from point 3 but near point 1).
So how do I choose the k mutually farthest points using the distance matrix?
Note that the k-farthest points do not necessarily yield the best result: they clearly aren't the best cluster center estimates.
Plus, since the k-means heuristic may get stuck in a local minimum, you will want a randomized algorithm that allows you to restart the process multiple times and get potentially different results.
You may want to look at k-means++ which is a known good heuristic for k-means initialization.
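A minimal sketch of that seeding step in MATLAB, assuming X is an n-by-d matrix of points; the helper name kmeansppInit is chosen for this sketch, and pdist2 comes from the Statistics and Machine Learning Toolbox:

function centers = kmeansppInit(X, k)
    % k-means++ seeding: each new center is drawn with probability
    % proportional to the squared distance to the nearest center chosen so far.
    n = size(X, 1);
    centers = X(randi(n), :);                      % first center: uniform random point
    for c = 2:k
        d2 = min(pdist2(X, centers).^2, [], 2);    % squared distance to nearest chosen center
        probs = d2 / sum(d2);
        idx = find(rand <= cumsum(probs), 1);      % sample an index proportionally to d2
        centers(c, :) = X(idx, :);
    end
end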