How can I cluster points which are connected in MATLAB? - matlab

Imagine we have many points, some of which are connected to each other, and we want to cluster them.
Please take a look at the following figure.
If we have the "connectivity matrix" of the points, how can we cluster them into two groups (groups of connected points)?
ConnectivityMatrix=
[1 2
1 3
2 4
2 3
2 1
3 1
3 2
3 4
4 3
4 2
5 8
5 7
5 6
6 5
6 7
7 6
7 5
7 8
8 7
8 5]
The final result should place nodes 1, 2, 3, 4 in the first group (cluster) and nodes 5, 6, 7, 8 in the second group (cluster).

Here is some code to get you started. I basically implemented Depth First Search... a very crude implementation of it anyway.
Depth First Search is an algorithm for traversing trees and, more generally, graphs. Trees are essentially a special case of graphs: a general graph can contain cycles, for example where a leaf node connects back to the root. The basic algorithm for Depth First Search is as follows:
1. Start at the root of the tree and add this to a stack
2. For each node that is connected to the root, add this onto the stack and place this in a list of visited nodes
3. While there is still a node on the stack...
   - Pop a node off the stack
   - Check the visited nodes list. If this is a node we have already visited, skip.
   - Else, visit any nodes that are connected to this node we popped off and add them to the stack
If we have disconnected graphs like what you have above, we basically run Depth First Search multiple times. Each time will be for one cluster. After one Depth First Search result, we will discover nodes that belong to one cluster. We restart Depth First Search again with any node that we have not touched yet, which will be a node from another cluster that we have not visited. As you clearly have two clusters in your graph structure, we will have to run Depth First Search two times. This is commonly referred to as finding all connected components in an overall graph.
To find the connected components, here are the steps that I followed:
1. Create a connectivity matrix
2. Initialize a Boolean list that tells us whether or not we have visited each node in your graph
3. Initialize an empty cluster list
4. Initialize an empty stack that holds the nodes we should visit
5. While there is at least one node we need to visit...
   - Find such a node
   - Initialize our stack to contain this node
   - While our stack is not empty
     - Pop a node off the stack
     - If we have visited this node, continue
     - Else mark it as visited
     - Retrieve all nodes connected to this node
     - Remove those nodes that we have already visited
     - Add the remaining nodes to the stack and the cluster list
6. Once the stack is empty, we have a list of all of the nodes contained in a single cluster. Add this cluster to a final list.
7. Repeat until we have visited all nodes
Without further ado, this is the code. Bear in mind this is not battle tested. If you have graph structures that will generate an error, that'll be on your own to fix :)
ConnectivityMatrix = [1 2
1 3
2 4
2 3
2 1
3 1
3 2
3 4
4 3
4 2
5 8
5 7
5 6
6 5
6 7
7 6
7 5
7 8
8 7
8 5];
%// Find all possible node IDs
nodeIDs = unique(ConnectivityMatrix(:));
%// Flag that tells us if there are any nodes we should visit
nodeIDList = false(1,numel(nodeIDs));
%// Stores our list of clusters
clusterList = {};
%// Keeps track of how many clusters we have
counter = 1;
%// Stack - initialize to empty
stackNodes = [];
%// While there is at least one node we need to visit
while (~all(nodeIDList))
    %// Find any unvisited node and initialize our stack to contain it
    stackNodes = find(nodeIDList == false, 1);
    %// Start this cluster's node list with that node
    nodesCluster = stackNodes;
    %// While our stack is not empty
    while (~isempty(stackNodes))
        %// Grab the node off the stack and pop off
        node = stackNodes(end);
        stackNodes(end) = [];
        %// If we have marked this node as visited, skip
        if (nodeIDList(node))
            continue;
        end
        %// Mark as visited
        nodeIDList(node) = true;
        %// Retrieve all nodes connected to this node
        connectedNodes = ConnectivityMatrix(ConnectivityMatrix(:,1) == node, :);
        nodesToVisit = unique(connectedNodes(:,2).');
        %// Remove those already visited
        visitedNodes = ~nodeIDList(nodesToVisit);
        finalNodesToVisit = nodesToVisit(visitedNodes);
        %// Add to cluster
        nodesCluster = unique([nodesCluster finalNodesToVisit]);
        %// Add to stack
        stackNodes = unique([stackNodes finalNodesToVisit]);
    end
    %// Add connected components to its own cluster
    clusterList{counter} = nodesCluster;
    counter = counter + 1;
end
Once we have run this code, we can display our clusters like so:
celldisp(clusterList)
clusterList{1} =
1 2 3 4
clusterList{2} =
5 6 7 8
As such, cluster #1 contains nodes 1,2,3,4 while cluster #2 contains nodes 5,6,7,8.
Bear in mind that this code will only work if you label your nodes sequentially, as you did in your diagram. You can't skip any label numbers (e.g. you can't use 1,2,4,6,9; this should be 1,2,3,4,5).
Good luck!
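If your labels are not sequential, one possible workaround (a sketch, not part of the code above) is to map them onto 1..N first using the third output of unique, run the clustering on the relabeled edges, and map back afterwards. The edge list and variable names below are made up for illustration:

```matlab
% Hypothetical edge list with non-sequential node labels
edges = [10 20
         20 40
         70 90];

% nodeIDs holds the sorted distinct labels; idx gives, for every entry
% of edges(:), its position in nodeIDs (i.e. its new label in 1..N)
[nodeIDs, ~, idx] = unique(edges(:));

% Same edge list, expressed with sequential labels
relabeled = reshape(idx, size(edges));

% Indexing nodeIDs with the relabeled matrix recovers the original labels
recovered = nodeIDs(relabeled);
```

Any cluster found on the relabeled edges can be translated back the same way, by indexing nodeIDs with the cluster's node numbers.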

You can use "off-the-shelf" MATLAB commands for this problem. For example, you can use graphconncomp (from the Bioinformatics Toolbox).
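As a hedged alternative to graphconncomp, releases from R2015b onwards also ship the graph class and its conncomp function in base MATLAB. The sketch below assumes such a release and treats the connectivity matrix as undirected; collapsing each edge to a single sorted row is my own precaution against duplicate edges, not something the question requires:

```matlab
ConnectivityMatrix = [1 2; 1 3; 2 4; 2 3; 2 1; 3 1; 3 2; 3 4; 4 3; 4 2;
                      5 8; 5 7; 5 6; 6 5; 6 7; 7 6; 7 5; 7 8; 8 7; 8 5];

% Keep one row per undirected edge: sort each row, then drop duplicates
E = unique(sort(ConnectivityMatrix, 2), 'rows');

% Build an undirected graph from the edge list
G = graph(E(:,1), E(:,2));

% One cell per connected component, each holding that component's node IDs
clusters = conncomp(G, 'OutputForm', 'cell');
celldisp(clusters)
```

Note that graph(s,t) creates nodes 1 up to the largest label, so any skipped label would show up as an isolated single-node component.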

The answer from rayryeng is pretty good. However, here are some details I would like to point out:
"numel(nodeIDs)" can cause errors if your node labels are not sequential (e.g., you have 10 nodes and the maximum node label is 20). You could change "numel(nodeIDs)" to "max(nodeIDs)" or a larger value.
I also believe the following code causes problems when an edge is listed in only one direction: it only looks up edges whose first column is the current node, so some connections are missed and the affected nodes end up as isolated nodes:
connectedNodes = ConnectivityMatrix(ConnectivityMatrix(:,1) == node, :);
nodesToVisit = unique(connectedNodes(:,2).');
I replaced these two simple lines with the following messy code:
connectedNodes1 = ConnectivityMatrix (ConnectivityMatrix (:,1) == node, :);
connectedNodes2 = ConnectivityMatrix (ConnectivityMatrix (:,2) == node, :);
AC=connectedNodes1(:,2).';
AD=connectedNodes2(:,1).';
ACA=reshape(AC,[],1);
ADA=reshape(AD,[],1);
AE= [ACA; ADA] ;
AEA=reshape(AE,[],1);
AEA=AEA';
nodesToVisit = unique(AEA);
After modifying these two points, there are no problems with rayryeng's initial code.
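For reference, here is one way the two points above could be folded into a single compact version. This is only a sketch under my own assumptions: the edge list is invented to exercise both problems (non-sequential labels, and each edge listed in one direction only), and the variable names are illustrative rather than taken from either answer.

```matlab
% Hypothetical edge list: labels skip values, edges appear once each
edges = [10 20; 20 30; 30 40; 50 80; 50 70; 60 70];

% Map arbitrary labels onto 1..N so they can index the visited list
[nodeIDs, ~, idx] = unique(edges(:));
E = reshape(idx, size(edges));

visited = false(1, numel(nodeIDs));
clusterList = {};

while ~all(visited)
    stack = find(~visited, 1);      % start from any unvisited node
    members = [];                   % nodes found in this component
    while ~isempty(stack)
        node = stack(end);          % pop the top of the stack
        stack(end) = [];
        if visited(node)
            continue;
        end
        visited(node) = true;
        members(end+1) = node;      %#ok<AGROW>
        % Neighbours via either column, so edge direction does not matter
        nbrs = [E(E(:,1) == node, 2); E(E(:,2) == node, 1)].';
        stack = [stack, nbrs(~visited(nbrs))]; %#ok<AGROW>
    end
    % Translate back to the original labels before storing the cluster
    clusterList{end+1} = reshape(sort(nodeIDs(members)), 1, []); %#ok<AGROW>
end

celldisp(clusterList)
```

Relabeling with unique avoids allocating entries for labels that never occur, which sizing the visited list with max(nodeIDs) would do.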

Related

Translating a kd-tree in MATLAB

I'm using a kd-tree to perform quick nearest neighbor search queries. I'm using the following piece of code to generate the kd-tree and perform queries on it:
% 3 dimensional vertex data
x = [1 2 2 1 2 5 6 3 4;
3 2 3 2 2 7 6 5 2;
1 2 9 9 7 5 8 9 3]';
% create the kd-tree
kdtree = createns(x, 'NSMethod', 'kdtree');
% perform a nearest neighbor search
nearestNeighborIndex = knnsearch(kdtree, [1 1 1]);
This works well enough for when the data is static. However, every once in a while, I need to translate every vertex on the kd-tree. I know that changing the whole data means I need to re-generate the whole tree to perform a nearest neighbor search again. Having a couple of thousand vertices for each kd-tree, re-generating the whole tree from scratch seems to me like an overkill as it takes a significant amount of time. Is there a way to translate the kd-tree without re-generating it from scratch? I tried accessing and changing the X property (which holds the actual vertex data) of the kd-tree but it seems to be read-only, and it probably wouldn't have worked even if it wasn't since there is a lot more going on behind the curtains.

Find node index of connected graph components using conncomp

Problem
I am trying to find the connected components of my undirected graph.
MATLAB's conncomp function does exactly this (MathWorks: connected graph components).
Example
Using the example given on MATLAB's webpage to keep it simple and repeatable:
G = graph([1 1 4],[2 3 5],[1 1 1],6);
plot(G)
bins = conncomp(G)
bins =
1 1 1 2 2 3
Two questions about this:
First question: using this, how can I find the original node indices, so that
cluster1 = (1 2 3); (instead of (1 1 1))
cluster2 = (4 5); (instead of (2 2))
Second question:
I am working on a big data set and I know many nodes are not connected, so is there a way to only display clusters that contain more than one value?
Thanks for your help, I am majorly stuck here.
You can use splitapply for the first part, like so:
clusters = splitapply(@(x) {x}, 1:numnodes(G), bins)
This returns a cell array where each cell contains the indices of the nodes in a group. You can filter this down in the usual way using cellfun:
discard = cellfun(@isscalar, clusters);
clusters(discard) = [];
(Note that splitapply is new in R2015b, but the OP is using graph, which is also new in R2015b, so it should be fine for them.)
Actually, the first part of the question can be answered very simply, as MATLAB's conncomp provides an option for this:
bins = conncomp(G,'OutputForm','cell');
This creates a cell array that contains the clusters, with all node names in the cells.
For the second part of the question I guess there are several ways, but this one could be used as well:
clusters = bins(cellfun(@numel,bins)>1);
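If only the numeric bins vector is available, the same grouping can also be sketched with accumarray. This is an alternative I am suggesting, not one of the answers above; the bins vector is copied from the example output so the snippet runs on its own.

```matlab
% Bin index per node, as returned by conncomp(G) in the example above
bins = [1 1 1 2 2 3];

% Group node indices by bin; each group is wrapped in a cell.
% sort() is applied because accumarray does not promise element order.
clusters = accumarray(bins(:), (1:numel(bins)).', [], @(x) {sort(x).'});

% Drop clusters that contain a single node
clusters(cellfun(@isscalar, clusters)) = [];

celldisp(clusters)
```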

to understand the phytree object in matlab

I asked a similar question here: "what exactly is the phytree object in matlab?".
This is what I did to try to understand it.
clear;
d=[4,2,5,4,5,5];
z=seqneighjoin(d);
view(z)
get(z, 'Pointers')
This is the output:
ans =
1 2
3 5
4 6
The phytree figure is shown below. As I understand it, this matrix is the same as the tree field of the phytree object. What is the relation between this matrix and the figure?
You should interpret the array in the following way.
Firstly, you have the four nodes 1, 2, 3 and 4. In the graph you attach, node 1 is labelled Leaf 1; node 2 is labelled Leaf 3; node 3 is labelled Leaf 2; and node 4 is labelled Leaf 4.
Then take each row of the array in turn.
The first row indicates that we first join nodes 1 and 2 - we now call this node 5, as it's the smallest number greater than the four nodes we already have. On the graph, this is the node connecting Leaf 1 and Leaf 3.
Then the second row indicates that we next join nodes 3 and 5 - we now call this node 6, as again it's the smallest number after node 5. On the graph, this is the node connecting the previous join to Leaf 2.
Then the third row indicates that we lastly join nodes 4 and 6 - we don't need to call it anything as it's the final root node, but it would be node 7. On the graph, this is the node connecting the previous join to Leaf 4.
Does that make more sense?
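The numbering rule described above (row i joins two existing nodes into a new node numbered just after the ones already in use) can be sketched in a few lines. The pointers matrix is copied from the question's output; the other variable names are illustrative:

```matlab
% Pointers matrix from the question: one row per join
pointers = [1 2
            3 5
            4 6];

% A binary tree with B joins has B + 1 leaves
numJoins  = size(pointers, 1);
numLeaves = numJoins + 1;

% The join in row i creates node numLeaves + i
newNode = numLeaves + (1:numJoins).';

% Each row: [first child, second child, node created by the join]
joins = [pointers newNode];
disp(joins)
```

The last row's new node (7 here) is the root.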

Nearest Neighbour Classifier for multiple features

I have a data set that looks like this:
       Feature 1   Feature 2   Feature 3   Feature 4   Feature 5   Class
Obj        2           2           2           8           5         1
Obj        2           8           3           3           4         2
Obj        1           7           4           4           8         1
Obj        4           3           5           9           7         2
The rows contain objects, each of which has a number of features. I have shown 5 features for demonstration purposes, but there are approximately 50 features per object, with the final column being the class label for each object.
I want to create and run the nearest neighbour classifier algorithm on this data set and retrieve the error rate. I have managed to get the NN algorithm working for each feature individually; a short pseudocode example is below. For each feature I loop through each object, assigning object j a class according to its nearest neighbours.
for i = 1:Number of features
    for j = 1:Number of objects
        distance between data(j,i) and values of feature i
        order by shortest distance
        sum of the class labels of the k shortest distances
        assign class with largest number of labels
    end
    error = mean(labels~=assigned)
end
The issue I have is how to work out the 1-NN algorithm for multiple features. I will have a selection of features from my data set, say features 1, 2 and 3, and I want to calculate the error rate if I add feature 5 to my set of selected features. I want to work out the error using 1-NN. Would I find the nearest value out of all of features 1-3 in my selected set?
For example, for my data set above:
Adding feature 5: for object 1 of feature 5, the closest number is object 4 of feature 3. As this has a class label of 2, I would assign object 1 of feature 5 the class 2. This is obviously a misclassification, but I would continue to classify all other objects in feature 5 and compare the assigned and actual values.
Is this the correct way to perform the 1NN against multiple features?
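For comparison, the conventional way to run 1-NN on several features (this is the standard formulation, not something stated in the question) is to treat each object's selected features as one vector and measure the distance between whole vectors, rather than comparing single values across different features. A minimal leave-one-out sketch, assuming Euclidean distance and the four sample rows above; the variable names are illustrative:

```matlab
% Sample rows from the table above: five features, then the class label
X = [2 2 2 8 5
     2 8 3 3 4
     1 7 4 4 8
     4 3 5 9 7];
labels = [1; 2; 1; 2];

% Features 1-3 plus the candidate feature 5
selected = [1 2 3 5];
Xs = X(:, selected);

n = size(Xs, 1);
assigned = zeros(n, 1);
for j = 1:n
    % Euclidean distance from object j to every object, over all
    % selected features at once
    diffs = bsxfun(@minus, Xs, Xs(j, :));
    d = sqrt(sum(diffs.^2, 2));
    d(j) = Inf;                 % leave-one-out: ignore the object itself
    [~, nearest] = min(d);
    assigned(j) = labels(nearest);
end

errorRate = mean(assigned ~= labels);
```

Running this once with and once without feature 5 in selected gives the two error rates to compare. With only four sample rows the numbers mean little, and features on very different scales would normally be standardized first.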