I've been trying for a long time to figure out how to perform the K-medoids algorithm on paper, but I can't understand how to begin and iterate. For example:
I have the distance matrix between 6 points, as well as k, C1 and C2.
I'd be very happy if someone could show me how to perform the K-medoids algorithm on this example. How do I start and iterate?
Thanks
A bit more detail then:
1- Set K to the desired number of clusters; let's use 2.
2- Choose randomly K entities to be the medoids m_1, m_2. Let's choose X_3 (call this cluster 1) and X_5 (cluster 2).
3- Assign each entity to the cluster represented by its closest medoid. Cluster 1 will be made of entities (X_1, X_2, X_3 - just check your table, these are closer to X_3 than to X_5), and cluster 2 will be (X_4, X_5, X_6).
4- Update the medoids. The medoid of a cluster should be the entity with the smallest sum of distances to all other entities within the same cluster. X_2 will be the new medoid for cluster 1, and X_4 for cluster 2.
Now what you have to do is repeat steps 3-4 until convergence. So,
5- Assign each entity to the cluster of its closest medoid (now these are X_2 and X_4). Cluster 1 is now made of entities (X_1, X_2, X_3 and X_6), and cluster 2 will be (X_4, X_5).
(There was a change in the entities in each cluster, so the iterations must continue.)
6- The entity with the smallest sum of distances in cluster 1 is still X_2; in cluster 2 the members are unchanged, so X_4 stays.
Another iteration:
7- As there was no change in the medoids, the clusters will stay the same. This means it's time to stop the iterations.
Output: 2 clusters. Cluster 1 has entities (X_1, X_2, X_3, X_6), and cluster 2 has entities (X_4 and X_5).
Now, if I had started this using different initial medoids, maybe I'd get a different clustering... you may wish to check the BUILD algorithm of PAM for initialisation.
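If it helps to see the iteration concretely, here is a minimal MATLAB sketch of the assign/update loop above; it assumes D already holds your given 6-by-6 distance matrix and starts from medoids X_3 and X_5:

medoids = [3 5];                     % start from X_3 and X_5
while true
    % Step 3: assign each entity to its closest medoid
    [~, assign] = min(D(:, medoids), [], 2);
    % Step 4: in each cluster, pick the entity with the smallest
    % sum of distances to the other members
    newMedoids = medoids;
    for k = 1:numel(medoids)
        members = find(assign == k);
        [~, best] = min(sum(D(members, members), 2));
        newMedoids(k) = members(best);
    end
    if isequal(newMedoids, medoids), break; end   % no change: converged
    medoids = newMedoids;
end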
You have clusters C1 and C2 given.
1. Find the most central element in each cluster.
2. Compute new C1 and C2.
Repeat 1. and 2. until convergence.
I have used kmeans to cluster a population in MATLAB, and then I run a disease through the population; nodes that have the disease more than 80% of the time are excluded from the clustering, meaning the clusters change at each iteration. It then reclusters over 99 timesteps. How do I create a vector/list of which clusters a specific node has belonged to over the whole time period?
I have tried using the vector created by kmeans called 'id', but this doesn't include the nodes that are excluded from the clustering, so I cannot track one specific node as the size of id changes each time. This is what I tried, run inside the for loop so it plotted a line plot for each iteration:
nt = [nt sum(id(1,:))];
The only problem was that the first row in the vector obviously changed every timestep, so it wasn't the same person.
This is my initial simple clustering:
%Cluster the population according to these features
[id, c] = kmeans(feats', 5);
Then this is the reclustering process that excludes those who have the disease more than 80% of the time (this part sits inside a big for loop in the full code):
Lc = find(m < 0.8);
if t > 1
    [id, c, sumD, D] = kmeans(feats(:, Lc)', 5, 'Start', c);
else
    [id, c, sumD, D] = kmeans(feats(:, Lc)', 5);
end
I want to be able to plot and track the fate of specific nodes in my population which is why I want to know how their cluster groups change. Any help would be much appreciated!
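One way to do this is to keep a full-population history matrix and use Lc to map the shrunken id vector back to the original node indices. A sketch, where nNodes and nSteps are hypothetical names for your population size and number of timesteps:

history = NaN(nNodes, nSteps);       % NaN marks timesteps where a node was excluded
for t = 1:nSteps
    % ... your disease update that defines m goes here ...
    Lc = find(m < 0.8);              % nodes included in this timestep's clustering
    if t > 1
        [id, c] = kmeans(feats(:, Lc)', 5, 'Start', c);
    else
        [id, c] = kmeans(feats(:, Lc)', 5);
    end
    history(Lc, t) = id;             % map cluster labels back to full-population indices
end

Then history(k, :) is the cluster trajectory of node k over the whole period, with NaN wherever it was excluded, and you can plot it directly.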
How can I effectively evaluate the performance of the standard MATLAB k-means implementation?
For example, I have a matrix X:
X = [1 2;
3 4;
2 5;
83 76;
97 89]
For every point I have a gold-standard clustering. Let's assume that (83,76) and (97,89) form the first cluster and (1,2), (3,4), (2,5) form the second cluster. Then we run MATLAB's kmeans:
idx = kmeans(X,2)
And get the following results
idx = [1; 1; 2; 2; 2]
According to the nominal values it's a very bad clustering, because only (2,5) is correct; but we don't care about the nominal values, we care only about which points are clustered together. Therefore we somehow have to identify that only (2,5) was put in the incorrect cluster.
For me, a newbie in MATLAB, evaluating the performance of clustering is not a trivial task. I would appreciate it if you could share your ideas about how to evaluate the performance.
Evaluating the "best clustering" is somewhat ambiguous, especially if you have points in two different groups that eventually cross over with respect to their features. When you get this case, how exactly do you define which cluster those points belong to? Here's an example from the Fisher Iris dataset that comes preloaded with MATLAB. Let's specifically take the petal length and petal width, which are the third and fourth columns of the data matrix, and plot the versicolor and virginica classes:
load fisheriris;
plot(meas(101:150,3), meas(101:150,4), 'b.', meas(51:100,3), meas(51:100,4), 'r.', 'MarkerSize', 24)
This is what we get:
You can see that towards the middle there is some overlap. You are lucky in that you knew beforehand what the clusters were, so you can measure the accuracy; but if we were given data such as the above without knowing the label of each point, how would you know which cluster the middle points belong to?
Instead, what you should do is try and minimize these classification errors by running kmeans more than once. Specifically, you can override the behaviour of kmeans by doing the following:
idx = kmeans(X, 2, 'Replicates', num);
The 'Replicates' flag tells kmeans to run a total of num times, each from a different random initialisation. After the num runs, the output memberships are those of the run the algorithm deemed the best: the solution with the lowest total within-cluster sum of point-to-centroid distances.
Not setting the Replicates flag obviously defaults to running one time. As such, try increasing the total number of times kmeans runs so that you have a higher probability of getting high-quality cluster memberships. By setting num = 10, this is what we get with your data:
X = [1 2;
3 4;
2 5;
83 76;
97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
2
2
2
1
1
You'll see that the first three points belong to one cluster while the last two points belong to another. Even though the IDs are flipped, it doesn't matter as we want to be sure that there is a clear separation between the groups.
A minor note with regard to random algorithms
If you take a look at the comments above, you'll notice that several people tried running the kmeans algorithm on your data and received different clustering results. The reason is that when kmeans chooses the initial points for your cluster centres, they are chosen in a random fashion. As such, depending on the state of each person's random number generator, the initial points chosen by one person are not guaranteed to be the same as those chosen by another.
Therefore, if you want reproducible results, you should set the seed of your random number generator to the same value before running kmeans. On that note, try using rng with an integer that is known beforehand, like 123. If we did this before running the code above, everyone who runs the code would be able to reproduce the same results.
As such:
rng(123);
X = [1 2;
3 4;
2 5;
83 76;
97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
1
1
1
2
2
Here the labels are reversed, but I guarantee that if anyone else runs the above code, they will get the same labelling each time.
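Since the numeric labels are arbitrary, one simple way to score the result against your gold standard is to take the better of the two possible label assignments. A minimal sketch for the two-cluster case, where gold is a hypothetical vector holding your reference labels:

gold = [2; 2; 2; 1; 1];                % gold-standard labels for the rows of X
idx = kmeans(X, 2, 'Replicates', 10);
agree = mean(idx == gold);             % fraction agreeing with labels as-is
agreeSwapped = mean(idx == 3 - gold);  % fraction agreeing with labels 1 and 2 swapped
accuracy = max(agree, agreeSwapped)

For more than two clusters you would have to search over label permutations, or use a permutation-free measure such as the Rand index.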
So I have an input vector A, which is a row vector with 3,000 data points. Using MATLAB, I found 3 cluster centres for A.
Now that I have the 3 cluster centres, I have another row vector B with 3,000 points. The elements of B have one of three values: 1, 2 or 3. So say, for example, the first 5 elements of B are
B(1,1:5) = [ 1 , 3, 3, 2, 1]
This means that B(1,1) belongs to cluster 1, B(1,2) belongs to cluster 3, etc. What I am trying to do is this: for every data point in the row vector B, look at which cluster it belongs to by reading its value, and then replace that value with a data value from that cluster.
So after the above is done, the first 5 elements of B would look like:
B(1,1:5) = [ 2.7 , 78.4, 55.3, 19, 0.3]
Meaning that B(1,1) is a data value picked from the first cluster (that we got from A), B(1,2) is a data value picked from the third cluster (that we got from A), etc.
k-means only keeps the means; it does not model the data distribution.
You cannot sensibly generate artificial data from k-means clusters without additional statistics and distributional assumptions.
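That said, if you only want to reuse actual observed values from A rather than generate new artificial data, a minimal sketch might look like this (idxA is a hypothetical name for the cluster labels that kmeans assigned to the elements of A):

[idxA, centres] = kmeans(A', 3);      % cluster the 3000 points of A into 3 clusters
newB = zeros(size(B));
for k = 1:3
    members = A(idxA == k);           % data values of A that fall in cluster k
    slots = (B == k);                 % positions in B labelled with cluster k
    newB(slots) = members(randi(numel(members), 1, nnz(slots)));
end

Note that this only resamples values that already occur in A; it does not address the distributional problem above.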
I'm working with mean shift; this procedure calculates where every point in the data set converges. I can also calculate the Euclidean distance between the coordinates where two distinct points converged, but I have to give a threshold, to say: if (distance < threshold), then these points belong to the same cluster and I can merge them.
How can I find the correct value to use as the threshold?
(I could use any value, and the result depends on it, but I need the optimal value.)
I've implemented mean-shift clustering several times and have run into this same issue. Depending on how many iterations you're willing to shift each point for, or what your termination criterion is, there is usually some post-processing step where you have to group the shifted points into clusters. Points that theoretically shift to the same mode need not practically end up directly on top of each other.
I think the best and most general way to do this is to use a threshold based on the kernel bandwidth, as suggested in the comments. In the past my code to do this post-processing has usually looked something like this:
import numpy as np

def findExistingClusterWithinThresholdOfPoint(p, clusters, threshold):
    # return the first cluster whose closest member is within threshold of p
    # (assumes p and the stored points are numpy arrays)
    for cluster in clusters:
        if min(np.linalg.norm(p - q) for q in cluster) < threshold:
            return cluster
    return None

threshold = 0.5 * kernel_bandwidth

clusters = []
for p in shifted_points:
    cluster = findExistingClusterWithinThresholdOfPoint(p, clusters, threshold)
    if cluster is None:
        # create a new cluster with p as its first point
        clusters.append([p])
    else:
        # add p to the existing cluster
        cluster.append(p)
For the findExistingClusterWithinThresholdOfPoint function, I usually use the minimum distance from p to each currently defined cluster, as in the sketch above.
This seems to work pretty well. Hope this helps.
I have a simple 2-dimensional dataset that I wish to cluster in an agglomerative manner (not knowing the optimal number of clusters to use). The only way I've been able to cluster my data successfully is by giving the function a 'maxclust' value.
For simplicity's sake, let's say this is my dataset:
X=[ 1,1;
1,2;
2,2;
2,1;
5,4;
5,5;
6,5;
6,4 ];
Naturally, I would want this data to form 2 clusters. I understand that if I knew this, I could just say:
T = clusterdata(X,'maxclust',2);
and to find which points fall into each cluster I could say:
cluster_1 = X(T==1, :);
and
cluster_2 = X(T==2, :);
but without knowing that 2 clusters would be optimal for this dataset, how do I cluster these data?
Thanks
The whole point of this method is that it represents the clusters found in a hierarchy, and it is up to you to determine how much detail you want to get.
Think of this as having a horizontal line intersecting the dendrogram, which moves starting from 0 (each point is its own cluster) all the way to the max value (all points in one cluster). You could:
stop when you reach a predetermined number of clusters (example)
manually position it given a certain height value (example)
choose to place it where the clusters are too far apart according to the distance criterion (i.e. there's a big jump to the next level) (example)
This can be done using either the 'maxclust' or the 'cutoff' argument of the CLUSTER/CLUSTERDATA functions.
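For instance, here is a minimal sketch of cutting by distance instead of by a fixed cluster count, using the X from the question (the cutoff height of 2 is a value read off this particular dendrogram, not a universal constant):

Z = linkage(X);                  % build the hierarchy (single linkage by default)
dendrogram(Z)                    % inspect where the big jumps in height are
T = cluster(Z, 'cutoff', 2, 'criterion', 'distance');

For this dataset the two squares only merge at a height of about 3.6, while all within-square merges happen at height 1, so cutting anywhere between those values recovers the same 2 clusters as 'maxclust', 2.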
To choose the optimal number of clusters, one common approach is to make a plot similar to a Scree Plot. Then you look for the "elbow" in the plot, and that is the number of clusters you pick. For the criterion here, we will use the within-cluster sum-of-squares:
function wss = plotScree(X, n)
    % within-cluster sum-of-squares (WSS) for 1..n clusters
    wss = zeros(1, n);
    % with a single cluster, WSS is the total sum of squares of the data
    wss(1) = (size(X, 1) - 1) * sum(var(X, [], 1));
    for i = 2:n
        T = clusterdata(X, 'maxclust', i);
        wss(i) = sum((grpstats(T, T, 'numel') - 1) .* sum(grpstats(X, T, 'var'), 2));
    end
    hold on
    plot(wss)
    plot(wss, '.')
    xlabel('Number of clusters')
    ylabel('Within-cluster sum-of-squares')
end
>> plotScree(X, 5)
ans =
   54.0000    4.0000    3.3333    2.5000    2.0000
The sharp drop from 54 to 4 in going from one cluster to two, followed by only small decreases, is the elbow: for this dataset you would pick 2 clusters.