So I have an input vector A, which is a row vector with 3000 data points. Using MATLAB, I found 3 cluster centres for A.
Now that I have the 3 cluster centres, I have another row vector B with 3000 points. Each element of B has one of three values: 1, 2 or 3. So, for example, if the first 5 elements of B are
B(1,1:5) = [ 1 , 3, 3, 2, 1]
this means that B(1,1) belongs to cluster 1, B(1,2) belongs to cluster 3, and so on. What I am trying to do is, for every data point in the row vector B, look at which cluster it belongs to by reading its value, and then replace it with a data value from that cluster.
So after the above is done, the first 5 elements of B would look like:
B(1,1:5) = [ 2.7 , 78.4, 55.3, 19, 0.3]
Meaning that B(1,1) is a data value picked from the first cluster (that we got from A), B(1,2) is a data value picked from the third cluster (that we got from A), and so on.
k-means only keeps the means; it does not model the data distribution.
You cannot generate artificial data sensibly from k-means clusters without additional statistics and distribution assumptions.
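If you are content to reuse values from A rather than model its distribution, one workaround (a minimal sketch, not part of this answer; the kmeans call and variable names are assumptions) is to keep the cluster assignment of every point in A and, for each label in B, draw a random member of the corresponding cluster:
% Cluster A (a 1x3000 row vector) and collect the members of each cluster.
[idx, centres] = kmeans(A', 3);
members = cell(1, 3);
for k = 1:3
    members{k} = A(idx == k);
end
% Replace each label in B (values 1, 2 or 3) with a random draw from
% the corresponding cluster of A.
Bout = zeros(size(B));
for i = 1:numel(B)
    c = B(i);
    Bout(i) = members{c}(randi(numel(members{c})));
end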
I have used kmeans to cluster a population in MATLAB, and then I run a disease through the population; nodes that have the disease more than 80% of the time are excluded from the clustering, meaning the clusters change each iteration. It then reclusters over 99 timesteps. How do I create a vector/list of which clusters a specific node has belonged to over the whole time period?
I have tried using the vector created by kmeans, called 'id', but this doesn't include the nodes that are excluded from the clustering, so I cannot track one specific node, as the size of id changes each time. This is what I tried; I ran it in the for loop so that it plotted a line plot for each iteration:
nt = [nt sum(id(1,:))];
The only problem was that the first row in the vector obviously changed every timestep, so it wasn't the same person.
This is my initial simple clustering:
%Cluster the population according to these features
[id, c] = kmeans(feats', 5);
Then this is the reclustering process to exclude those who have the disease for more than 80% of the time (this part is in a big for loop in the whole code):
Lc = find(m < 0.8);
if t > 1
    [id, c, sumD, D] = kmeans(feats(:, Lc)', 5, 'Start', c);
else
    [id, c, sumD, D] = kmeans(feats(:, Lc)', 5);
end
I want to be able to plot and track the fate of specific nodes in my population, which is why I want to know how their cluster groups change. Any help would be much appreciated!
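One way to get that history (a sketch, not from the original post; it reuses the id, Lc and t variables from the code above and assumes the 99 timesteps mentioned) is to map id back to the full population through Lc at each timestep, storing NaN for excluded nodes:
N = size(feats, 2);       % total number of nodes in the population
history = NaN(N, 99);     % rows = nodes, columns = timesteps
% ... inside the existing for loop, after the kmeans call:
history(Lc, t) = id;      % id(k) is the cluster of node Lc(k) at time t
% Afterwards, history(n, :) is the cluster trajectory of node n,
% with NaN wherever node n was excluded.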
I have a 200 x 200 adjacency matrix called M. This is the connection strength for 200 nodes (the nodes are numbered 1 to 200 and organized in ascending order in M - i.e., M(23,45) is the connection strength of node 23 and 45). Out of these 200 nodes, I'm interested in three subsets of nodes.
subset1 = [2,34,36,42,69,102,187];
subset2 = [5,11,28,89,107,199];
subset3 = [7,55,60,188];
Using M, I would like to conduct the following operations:
Average of the connection strength within subset1, subset2, and subset3, separately. For instance, for subset1, that would be the mean connection strength over all possible pairs of nodes 2, 34, 36, ..., 187.
Find the connection strength between subset1, subset2, and subset3. This would be the average connection strength over all pairs of nodes spanning the subsets (the average of the connections between subset1 & subset2, subset2 & subset3, and subset1 & subset3). Note that this between-subset connection is not the same as pooling all the nodes from the three subsets into a single matrix: the connection between two subsets is the mean connection of each node in one subset with every node in the other subset.
What I've tried so far is indexing M with a for loop, but it gets bulky, especially when there is a large number of nodes in each subset. Can someone help?
M1 = M(subset1, subset1);       % all pairwise connections within subset1
ind = triu(true(size(M1)), 1);  % upper triangle, excluding the diagonal
M1_avg = mean(M1(ind));         % mean over each unordered pair, counted once
I will leave M2_avg and M1_M2_avg to you.
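In case it's useful, here is a sketch of the between-subset case (the answer above deliberately leaves this as an exercise). Every node of one subset pairs with every node of the other, so there is no triangle mask to apply:
M12 = M(subset1, subset2);  % rectangular block: subset1 nodes vs subset2 nodes
M1_M2_avg = mean(M12(:));   % average over all cross-subset pairs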
I asked a similar question here: what exactly is the phytree object in MATLAB?
Now, this is what I did to try to understand it.
clear;
d=[4,2,5,4,5,5];
z=seqneighjoin(d);
view(z)
get(z, 'Pointers')
This is the output:
ans =

     1     2
     3     5
     4     6
And the phytree figure is shown below. To my understanding, this matrix is the same as the tree field of the phytree object. What is the relation between this matrix and the figure?
You should interpret the array in the following way.
Firstly, you have the four nodes 1, 2, 3 and 4. In the graph you attach, node 1 is labelled Leaf 1; node 2 is labelled Leaf 3; node 3 is labelled Leaf 2; and node 4 is labelled Leaf 4.
Then take each row of the array in turn.
The first row indicates that we first join nodes 1 and 2 - we now call this node 5, as it's the smallest number greater than the four nodes we already have. On the graph, this is the node connecting Leaf 1 and Leaf 3.
Then the second row indicates that we next join nodes 3 and 5 - we now call this node 6, as again it's the smallest number after node 5. On the graph, this is the node connecting the previous join to Leaf 2.
Then the third row indicates that we lastly join nodes 4 and 6 - we don't need to call it anything as it's the final root node, but it would be node 7. On the graph, this is the node connecting the previous join to Leaf 4.
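To make the numbering concrete, here is a small sketch (not part of the original answer) that walks the Pointers array and prints each join using the scheme just described:
P = get(z, 'Pointers');       % the 3x2 array shown above
numLeaves = size(P, 1) + 1;   % a binary tree with n leaves has n-1 joins
for r = 1:size(P, 1)
    fprintf('join node %d and node %d -> new node %d\n', ...
        P(r, 1), P(r, 2), numLeaves + r);
end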
Does that make more sense?
How can I effectively evaluate the performance of the standard MATLAB k-means implementation?
For example, I have a matrix X:
X = [1  2;
     3  4;
     2  5;
     83 76;
     97 89]
For every point I have a gold-standard cluster assignment. Let's assume that (83,76), (97,89) form the first cluster and (1,2), (3,4), (2,5) the second cluster. Then we run MATLAB:
idx = kmeans(X,2)
And get the following results
idx = [1; 1; 2; 2; 2]
Going by the NOMINAL values this is a very bad clustering, because only (2,5) matches, but we don't care about the nominal values; we only care about which points are clustered together. Therefore we somehow have to identify that only (2,5) ended up in the incorrect cluster.
For me, a newbie in MATLAB, evaluating the performance of a clustering is not a trivial task. I would appreciate it if you could share your ideas about how to do it.
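For what it's worth, one way to score a clustering that ignores the nominal label values (a minimal sketch, not part of the answer below; gold is an assumed ground-truth vector) is pairwise agreement, i.e. the Rand index: for every pair of points, check whether the gold standard and the clustering agree on putting them in the same cluster.
gold = [2; 2; 2; 1; 1];   % ground-truth labels for the 5 points above
idx  = [1; 1; 2; 2; 2];   % labels returned by kmeans
n = numel(gold);
agree = 0; total = 0;
for i = 1:n-1
    for j = i+1:n
        sameGold = gold(i) == gold(j);
        samePred = idx(i) == idx(j);
        agree = agree + (sameGold == samePred);  % both same or both different
        total = total + 1;
    end
end
randIndex = agree / total   % 1 means perfect agreement up to relabelling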
To evaluate the "best clustering" is somewhat ambiguous, especially if you have points in two different groups that may eventually cross over with respect to their features. When you get this case, how exactly do you define which cluster those points get merged to? Here's an example from the Fisher Iris dataset that comes preloaded with MATLAB. Let's specifically take the petal length and petal width, which are the third and fourth columns of the data matrix, and plot the versicolor and virginica classes:
load fisheriris;
plot(meas(101:150,3), meas(101:150,4), 'b.', meas(51:100,3), meas(51:100,4), 'r.', 'MarkerSize', 24)
This is what we get:
You can see that towards the middle there is some overlap. You are lucky in that you knew the clusters beforehand, and so you can measure the accuracy; but if we were given data such as the above without knowing which label each point belonged to, how would you know which cluster the middle points belong to?
Instead, what you should do is try and minimize these classification errors by running kmeans more than once. Specifically, you can override the behaviour of kmeans by doing the following:
idx = kmeans(X, 2, 'Replicates', num);
The 'Replicates' flag tells kmeans to run a total of num times. After running kmeans num times, the output memberships are those of the run the algorithm deemed best: the replicate with the lowest total sum of point-to-centroid distances.
Not setting the Replicates flag obviously defaults to running once. As such, try increasing the total number of times kmeans runs so that you have a higher probability of getting higher-quality cluster memberships. Setting num = 10, this is what we get with your data:
X = [1  2;
     3  4;
     2  5;
     83 76;
     97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
2
2
2
1
1
You'll see that the first three points belong to one cluster while the last two points belong to another. Even though the IDs are flipped, it doesn't matter as we want to be sure that there is a clear separation between the groups.
Minor note with regard to random algorithms
If you take a look at the comments above, you'll notice that several people tried running the kmeans algorithm on your data and received different clustering results. The reason is that when kmeans chooses the initial points for your cluster centres, it chooses them in a random fashion. As such, depending on the state of each person's random number generator, the initial points chosen by one person are not guaranteed to be the same as those chosen by another.
Therefore, if you want reproducible results, you should seed your random number generator with the same value before running kmeans. On that note, try using rng with an integer that is known beforehand, like 123. If we do this before the code above, everyone who runs the code will be able to reproduce the same results.
As such:
rng(123);
X = [1  2;
     3  4;
     2  5;
     83 76;
     97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
1
1
1
2
2
Here the labels are reversed, but I guarantee that if anyone else runs the above code, they will get the same labelling as what was produced above each time.
I've been trying for a long time to figure out how to perform (on paper) the K-medoids algorithm, but I'm not able to understand how to begin and iterate. For example:
I have the distance matrix between 6 points, the value of k, and C1 and C2.
I'd be very happy if someone could show me how to perform the K-medoids algorithm on this example: how to start, and how to iterate?
Thanks
A bit more detail then:
1. Set K to the desired number of clusters; let's use 2.
2. Choose randomly K entities to be the medoids m_1, m_2. Let's choose X_3 (call this cluster 1) and X_5 (cluster 2).
3. Assign each entity to the cluster of its closest medoid. Cluster 1 will be made of entities (X_1, X_2, X_3 - just check your table; these are closer to X_3 than to X_5), cluster 2 will be (X_4, X_5, X_6).
4. Update the medoids. The medoid of a cluster should be the entity with the smallest sum of distances to all other entities within the same cluster. X_2 will be the new medoid for cluster 1, and X_4 for cluster 2.
Now what you have to do is repeat steps 3-4 until convergence. So:
5. Assign each entity to the cluster of its closest medoid (now these are X_2 and X_4). Cluster 1 is now made of entities (X_1, X_2, X_3 and X_6); cluster 2 will be (X_4, X_5).
(There was a change in the entities in each cluster, so the iterations must continue.)
6. The entity with the smallest sum of distances in cluster 1 is still X_2; in cluster 2 nothing changed, so X_4 stays.
Another iteration:
7. As there was no change in the medoids, the clusters stay the same. This means it's time to stop the iterations.
Output: 2 clusters. Cluster 1 has entities (X_1, X_2, X_3, X_6), and cluster 2 has entities (X_4, X_5).
Now, if I had started this using different initial medoids maybe I'd get a different clustering... you may wish to check the build algorithm for initialisation.
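If it helps to see the loop structure, here is a minimal sketch of the procedure above (not from the original answer; it assumes a precomputed symmetric distance matrix D and initial medoid indices such as [3 5]):
function [labels, medoids] = kmedoids_sketch(D, medoids)
% D: n-by-n symmetric distance matrix; medoids: vector of initial medoid indices.
while true
    % Step 3: assign each entity to the cluster of its closest medoid
    [~, labels] = min(D(:, medoids), [], 2);
    % Step 4: the new medoid of each cluster is the member with the
    % smallest sum of distances to the other members
    newMedoids = medoids;
    for k = 1:numel(medoids)
        members = find(labels == k);
        [~, best] = min(sum(D(members, members), 2));
        newMedoids(k) = members(best);
    end
    if isequal(newMedoids, medoids)   % no change in medoids: converged
        break
    end
    medoids = newMedoids;
end
end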
You have clusters C1 and C2 given.
1. Find the most central element in each cluster.
2. Compute the new C1 and C2.
3. Repeat 1. and 2. until convergence.