Get observations per node in cluster - matlab

After creating a cluster from some data (using an example of 6 observations), I want to get the observations from each node that the tree holds.
For the given example:
Node5 [1,2,3,4,5,6]
Node4 [1,2,3,5,6]
Node3 [2,3,5,6]
...and so on
So far I have used this code, with n being the number of observations and linkDist being an agglomerative hierarchical cluster tree:
for i=1:n-1
clusterVals = cluster(linkDist,'maxClust',i);
k = find(clusterVals==i);
end
The problem is that the cluster numbering changes from one iteration to the next. For example:
cluster(linkDist,'maxClust',2) % [2,2,1,2,2,2]
cluster(linkDist,'maxClust',3) % [2,2,3,2,1,2]
(The example refers to a tree figure that is not reproduced here.)
Is there a solution for my problem?
Thank you very much!
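One way to sidestep the changing cluster numbers is to read the members of each node straight from the merge list instead of calling cluster repeatedly. The following is a minimal Python sketch of that idea, not MATLAB code. It assumes the merges use the convention MATLAB's linkage documents (observations are numbered 1..n and the i-th merge creates cluster n+i). The function name members_per_node and the merge list are made up for illustration, chosen so that the last three nodes match the listing in the question.

```python
def members_per_node(merges, n_obs):
    """Return the sorted observations under each internal node of the tree.

    merges: list of (a, b) pairs, 1-based. Indices 1..n_obs are single
    observations; index n_obs + i is the cluster created by the i-th merge.
    """
    # Every observation starts as its own one-element cluster.
    members = {i: [i] for i in range(1, n_obs + 1)}
    result = []
    for i, (a, b) in enumerate(merges, start=1):
        merged = sorted(members[a] + members[b])
        members[n_obs + i] = merged  # remember the new cluster for later merges
        result.append(merged)
    return result


# Hypothetical merge list for 6 observations (not taken from the question's data).
merges = [(2, 3), (5, 6), (7, 8), (1, 9), (4, 10)]
nodes = members_per_node(merges, 6)
for k, obs in enumerate(nodes, start=1):
    print(f"Node{k}", obs)
```

Because each node is built from the two nodes it merges, the result does not depend on how cluster labels are numbered at any particular cut of the tree.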

Related

With networkx, how to find first node(s) in a DiGraph

Using networkx, what is the most direct way to find the first node of a directed graph?
There might be more than one, and there are no isolated nodes.
By first nodes I mean the nodes without ancestors.
Best regards and thank you in advance,
Pablo
You can look at in_degree. A node with no edges pointing to it will have an in_degree of 0.
import networkx as nx
import numpy as np

# make a dummy graph with 10 nodes and 10 random edges
nodes = np.arange(10)
edges = [np.random.choice(nodes, 2) for _ in range(10)]
G = nx.DiGraph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
# find the nodes whose in_degree is 0
[node for node, in_degree in G.in_degree if in_degree == 0]

How to get a list/vector of which clusters a node in a network has belonged to when the clusters change at each timestep?

I have used kmeans to cluster a population in Matlab. I then run a disease through the population, and nodes that have the disease more than 80% of the time are excluded from the clustering, meaning the clusters change each iteration. The population is reclustered over 99 timesteps. How do I create a vector/list of which clusters a specific node has belonged to over the whole time period?
I have tried using the vector created by kmeans called 'id', but it doesn't include the nodes that are excluded from the clustering, so I cannot track one specific node because the size of id changes each time. This is what I tried; I ran it in the for loop so that it plotted a line plot for each iteration:
nt = [nt sum(id(1,:))];
Only problem was that the first row in the vector obviously changed every timestep so it wasn't the same person.
This is my initial simple clustering:
%Cluster the population according to these features
[id, c] = kmeans(feats', 5);
Then this is the reclustering process to exclude those who have the disease for more than 80% of the time (this part is in a big for loop in the whole code):
Lc = find(m < 0.8);
if t > 1,
[id, c, sumD, D] = kmeans(feats(:, Lc)', 5, 'Start', c);
else,
[id, c, sumD, D] = kmeans(feats(:, Lc)', 5);
end;
I want to be able to plot and track the fate of specific nodes in my population which is why I want to know how their cluster groups change. Any help would be much appreciated!
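Since id only covers the nodes selected by Lc, its j-th entry belongs to node Lc(j), and that mapping is what lets a label be written back to a fixed position per node. Below is a minimal Python sketch of this bookkeeping (not MATLAB), using a full-size history table with None for timesteps where a node was excluded. The name record_step and the labels are invented for illustration. Note also that nothing here guarantees that label 2 at one timestep means the same group as label 2 at the next.

```python
def record_step(history, t, included, labels):
    """Store this timestep's cluster labels at the original node positions.

    included: indices of the nodes that were clustered (the role of Lc).
    labels:   cluster label per included node (the role of id), same order.
    """
    for node, label in zip(included, labels):
        history[node][t] = label


n_nodes, n_steps = 5, 3
# One row per node, one column per timestep; None means "excluded at that step".
history = [[None] * n_steps for _ in range(n_nodes)]

# Hypothetical output of three clustering rounds: (included nodes, labels).
steps = [
    ([0, 1, 2, 3, 4], [1, 1, 2, 3, 2]),
    ([0, 2, 3, 4], [1, 2, 3, 3]),
    ([0, 3, 4], [1, 3, 2]),
]
for t, (included, labels) in enumerate(steps):
    record_step(history, t, included, labels)

print(history[2])  # node 2 is excluded at the last step: [2, 2, None]
```

Each row of history is then the trajectory of one specific node and can be plotted directly.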

Kmeans cluster evaluation

I am a little bit confused by the SSB calculation in cluster evaluation
Where
|Ci| is the size of cluster i
ci is the centroid of cluster i
c is the centroid of the overall data
What is this "centroid of the overall data"?
Everywhere it is mentioned as centroid of overall data.
Is it the initial centroid that we take for the calculation?
EDIT
A little more clarification following anony-Mousse's answer.
Let's say we have done 1 iteration of clustering.
step 1: k = 2, select random centroids (let my random centroids be (2,1,3) and (3,1,1))
step 2: do the clustering (now 2 clusters are formed)
step 3: then find the new centroids (by averaging the data in each cluster; after averaging, let my new centroids be (2.3,1.5,3) and (6.7,1,2))
So now I need to calculate SSB.
For that I need to calculate the centroid of the whole data (the input data); let that value be (25,30.5,78)
total no of values in c1 = 20
total no of values in c2 = 30
ssbc1 = 20*(dist([2.3,1.5,3],[25,30.5,78]))^2
ssbc2 = 30*(dist([6.7,1,2],[25,30.5,78]))^2
total ssb = ssbc1+ssbc2
Is it like this?
The centroid is the average in each dimension.
"Of the overall data" means that the clustering is not used: it is the mean of all data points, regardless of which cluster they belong to.

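The calculation described in the question can be sketched as follows: sum, over the clusters, of the cluster size times the squared distance between the cluster centroid and the centroid of all data. This is a minimal Python sketch under that reading of SSB. The helper names and the five 2-D points are made up for illustration.

```python
def centroid(points):
    """Average of the points in each dimension."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ssb(clusters):
    """Between-cluster sum of squares: sum of |Ci| * dist(ci, c)^2."""
    # The overall centroid ignores the clustering: it averages every point.
    all_points = [p for cluster in clusters for p in cluster]
    overall = centroid(all_points)
    return sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)


# Hypothetical data: two clusters of 2-D points.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],
    [(5.0, 1.0), (5.0, 3.0), (8.0, 2.0)],
]
print(ssb(clusters))  # 2 * 3^2 + 3 * 2^2 = 30.0
```

When the clusters partition the data, the overall centroid equals the size-weighted mean of the cluster centroids. With the sizes and centroids given in the question that would be about (4.94, 1.2, 2.4), so a value such as (25,30.5,78) could not occur for those clusters.
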
Find the cross node for number of nodes in Gremlin?

I have a number of nodes connected through intermediate nodes of another type, as in the picture.
There can be multiple middle nodes.
I need to find all the middle nodes for a given set of nodes and sort them by the number of links to my initial nodes. In my example, given A, B, C, D, it should return node E (4 links) followed by node F (3 links). Is this possible? If not, maybe it can be done using multiple requests?
With the toy graph.
Let's assume vertex 1 and 6 are given:
g = TinkerGraphFactory.createTinkerGraph()
m=[:]
g.v(1,6).both.groupCount(m)
m.sort{-it.value}
Sorted Map m contains:
==>v[3]=2
==>v[2]=1
==>v[4]=1
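The same counting idea can be sketched outside Gremlin. The Python snippet below is an illustration only: the function name shared_neighbours is invented, and the edge list is my transcription of the TinkerPop toy graph, so treat it as an assumption. It counts, for every neighbour of the given start nodes, how many of the start nodes link to it, and sorts by that count in descending order.

```python
from collections import Counter


def shared_neighbours(edges, start_nodes):
    """Count how many start nodes are linked to each neighbouring node."""
    # Build an undirected adjacency map, mirroring Gremlin's `both` step.
    # Sets are used, so parallel edges between two nodes count only once.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    counts = Counter()
    for node in start_nodes:
        counts.update(neighbours.get(node, ()))

    # Highest count first; ties are broken by node id for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Assumed edges of the toy graph, written as (out vertex, in vertex).
edges = [(1, 2), (1, 4), (1, 3), (4, 5), (4, 3), (6, 3)]
print(shared_neighbours(edges, [1, 6]))  # [(3, 2), (2, 1), (4, 1)]
```

If only nodes of the "middle" type should be returned, the counted nodes would additionally need to be filtered by type.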

how to perform K-medoids

I've been trying for a long time to figure out how to perform the K-medoids algorithm on paper, but I can't work out how to begin and how to iterate. For example:
I have the distance matrix between 6 points, k, C1 and C2.
I'd be very happy if someone could show me how to perform the K-medoids algorithm on this example: how to start and how to iterate.
Thanks
A bit more of details then:
1- Set K to the desired number of clusters; let's use 2.
2- Randomly choose K entities to be the medoids m_1, m_2. Let's choose X_3 (call this cluster 1) and X_5 (cluster 2).
3- Assign each entity to the cluster represented by its closest medoid. Cluster 1 will be made of entities (X_1, X_2, X_3); just check your table, these are closer to X_3 than to X_5. Cluster 2 will be (X_4, X_5, X_6).
4- Update the medoids. The medoid of a cluster should be the entity with the smallest sum of distances to all other entities within the same cluster. X_2 will be the new medoid for cluster 1, and X_4 for cluster 2.
Now repeat steps 3-4 until convergence. So,
5- Assign each entity to the cluster of the closest medoid (now these are X_2 and X_4). Cluster 1 is now made of entities (X_1, X_2, X_3 and X_6), and cluster 2 will be (X_4, X_5).
(There was a change in the entities in each cluster, so the iterations must continue.)
6- The entity with the smallest sum of distances in cluster 1 is still X_2; in cluster 2 the sums are the same, so X_4 stays.
Another iteration:
7- As there was no change in the medoids, the clusters will stay the same. This means it's time to stop the iterations.
Output: 2 clusters. Cluster 1 has entities (X_1, X_2, X_3, X_6), and cluster 2 has entities (X_4 and X_5).
Now, if I had started this using different initial medoids maybe I'd get a different clustering... you may wish to check the build algorithm for initialisation.
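The assign/update loop described above can be sketched in a few lines of Python. This is a minimal illustration of the alternating scheme only, not the full PAM algorithm. The distance table from the question is not available here, so the six 1-D positions below are made up, and the function name k_medoids is hypothetical. The sketch also assumes that no two points are at distance 0 from each other, so no cluster ends up empty.

```python
def k_medoids(dist, medoids):
    """Alternate assignment and medoid update until the medoids stop changing."""
    n = len(dist)
    while True:
        # Assignment step: each point joins the cluster of its closest medoid.
        clusters = {m: [] for m in medoids}
        for p in range(n):
            closest = min(medoids, key=lambda m: dist[p][m])
            clusters[closest].append(p)

        # Update step: the new medoid of a cluster is the member with the
        # smallest sum of distances to the other members of that cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][q] for q in members))
            for members in clusters.values()
        ]

        if sorted(new_medoids) == sorted(medoids):
            return clusters  # keyed by medoid index
        medoids = new_medoids


# Hypothetical data: six points on a line; distance = absolute difference.
positions = [0, 1, 2, 6, 9, 10]
dist = [[abs(a - b) for b in positions] for a in positions]

# Start from the third and fifth points (indices 2 and 4), as in the answer.
result = k_medoids(dist, [2, 4])
print(result)  # {1: [0, 1, 2], 4: [3, 4, 5]}
```

On this data the first update moves one medoid from index 2 to index 1, and the second pass changes nothing, so the loop stops, which mirrors the stopping rule in step 7.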
You have clusters C1 and C2 given.
1. Find the most central element in each cluster.
2. Compute new C1 and C2.
Repeat 1. and 2. until convergence.