Kmeans matlab "Empty cluster created at iteration 1" error - matlab

I'm using this script to cluster a set of 3D points using the kmeans matlab function but I always get this error "Empty cluster created at iteration 1".
The script I'm using:
[G,C] = kmeans(XX, K, 'distance','sqEuclidean', 'start','sample');
XX can be found at this link (XX value), and K is set to 3.
So if anyone could please advise me why this is happening.

It is simply telling you that during the assign-recompute iterations, a cluster became empty (lost all assigned points). This is usually caused by an inadequate cluster initialization, or by the data having fewer inherent clusters than you specified.
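To make the "cluster became empty" condition concrete, here is a minimal sketch of a single assignment step written in base MATLAB. It is not how kmeans is implemented internally; the tiny data set and the starting centroids are made up for illustration, and the tie-breaking rule (lowest-numbered centroid wins) is an assumption of this sketch.

```matlab
% Toy data (hypothetical): three duplicate points plus one distinct point
X = [0 0 0; 0 0 0; 0 0 0; 10 10 10];
% Hypothetical 'sample' start that happened to pick two of the duplicates
C = X([1 2 4], :);
K = size(C, 1);
n = size(X, 1);

% Assignment step: squared Euclidean distance from every point to every centroid
D = zeros(n, K);
for k = 1:K
    D(:, k) = sum((X - repmat(C(k, :), n, 1)).^2, 2);
end
[~, idx] = min(D, [], 2);            % ties go to the lowest-numbered centroid

% Count members per cluster; a zero count is an empty cluster
counts = accumarray(idx, 1, [K 1]);
emptyClusters = find(counts == 0);
```

Two identical starting centroids compete for the same points, so one of them ends up with no members. That is the situation the error message reports.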
Try changing the initialization method using the start option. Kmeans provides four possible techniques to initialize clusters:
sample: sample K points randomly from the data as initial clusters (default)
uniform: select K points uniformly across the range of the data
cluster: perform preliminary clustering on a small subset
manual: manually specify initial clusters
You can also try the different values of the emptyaction option, which tells MATLAB what to do when a cluster becomes empty.
Ultimately, I think you need to reduce the number of clusters, i.e. try K=2 clusters.
I tried to visualize your data to get a feel for it:
load matlab_X.mat
figure('renderer','zbuffer')
line(XX(:,1), XX(:,2), XX(:,3), ...
'LineStyle','none', 'Marker','.', 'MarkerSize',1)
axis vis3d; view(3); grid on
After some manual zooming/panning, it looks like a silhouette of a person:
You can see that the data of 307200 points is really dense and compact, which confirms what I suspected: the data doesn't have that many clusters.
Here is the code I tried:
>> [IDX,C] = kmeans(XX, 3, 'start','uniform', 'emptyaction','singleton');
>> tabulate(IDX)
Value Count Percent
1 18023 5.87%
2 264690 86.16%
3 24487 7.97%
What's more, all the points in cluster 2 are duplicates of a single point ([0 0 0]):
>> unique(XX(IDX==2,:),'rows')
ans =
0 0 0
The other two clusters look like:
clr = lines(max(IDX));
for i=1:max(IDX)
line(XX(IDX==i,1), XX(IDX==i,2), XX(IDX==i,3), ...
'Color',clr(i,:), 'LineStyle','none', 'Marker','.', 'MarkerSize',1)
end
So you might get better clusters if you remove duplicate points first.
In addition, you have a few outliers that might affect the result of clustering. Visually, I narrowed down the range of the data to the following intervals, which encompass most of the data:
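Before removing duplicates, it can help to see how many there are and which rows are repeated most often. The following is a small sketch in base MATLAB; the matrix here is a made-up stand-in for the real XX, and the variable names are only illustrative.

```matlab
% Hypothetical stand-in for the real point cloud
XX = [0 0 0; 0 0 0; 1 2 3; 0 0 0; 4 5 6];

% U holds the distinct rows; ic maps every original row to its row in U
[U, ~, ic] = unique(XX, 'rows');

% Number of occurrences of each distinct row
rowCounts = accumarray(ic(:), 1);

% Rows that would be dropped by XX = unique(XX,'rows')
nDup = size(XX, 1) - size(U, 1);

% The most frequently repeated point
[maxCount, where] = max(rowCounts);
mostRepeated = U(where, :);
```

On the data from the question, a check like this would be expected to single out [0 0 0] as the dominant repeated point, in line with the tabulate output above.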
>> xlim([-500 100])
>> ylim([-500 100])
>> zlim([900 1500])
Here is the result after removing duplicate points (over 250K points) and outliers (around 250 data points), and clustering with K=3 (best out of 5 runs with the replicates option):
XX = unique(XX,'rows');
XX(XX(:,1) < -500 | XX(:,1) > 100, :) = [];
XX(XX(:,2) < -500 | XX(:,2) > 100, :) = [];
XX(XX(:,3) < 900 | XX(:,3) > 1500, :) = [];
[IDX,C] = kmeans(XX, 3, 'replicates',5);
with an almost equal split across the three clusters:
>> tabulate(IDX)
Value Count Percent
1 15605 36.92%
2 15048 35.60%
3 11613 27.48%
Recall that the default distance function is squared Euclidean distance, which explains the shape of the formed clusters.

If you are confident in your choice of k=3, here is the code I wrote to avoid getting an empty cluster:
[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');
while length(unique(IDX))<3 || histc(histc(IDX,[1 2 3]),1)~=0
% i.e. while one of the clusters is empty -- or -- we have one or more clusters with only one member
[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');
end
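The loop above keeps calling kmeans until every cluster has more than one member, so it can run indefinitely if the data never produce such a split. A hedged variant with an upper bound on the number of attempts might look like the sketch below. The synthetic data, the seed, and the maxAttempts limit are assumptions made for illustration, and kmeans requires the Statistics Toolbox. Note also that the cosine distance is not defined for all-zero rows, so points like [0 0 0] would have to be removed first.

```matlab
rng(1);                                    % fixed seed so the sketch is repeatable
% Synthetic stand-in for the real data: three groups along different axes
XX = [bsxfun(@plus, randn(50, 3), [8 0 0]);
      bsxfun(@plus, randn(50, 3), [0 8 0]);
      bsxfun(@plus, randn(50, 3), [0 0 8])];

K = 3;
maxAttempts = 20;                          % hypothetical retry limit
ok = false;
for attempt = 1:maxAttempts
    [IDX, C] = kmeans(XX, K, 'distance', 'cosine', ...
                      'start', 'sample', 'emptyaction', 'singleton');
    counts = accumarray(IDX, 1, [K 1]);    % members per cluster
    if all(counts > 1)                     % no empty or single-member cluster
        ok = true;
        break
    end
end
if ~ok
    warning('No acceptable clustering found in %d attempts.', maxAttempts);
end
```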

Amro described the reason clearly:
It is simply telling you that during the assign-recompute iterations,
a cluster became empty (lost all assigned points). This is usually
caused by an inadequate cluster initialization, or by the data having
fewer inherent clusters than you specified.
Another option that can help solve this problem is emptyaction:
Action to take if a cluster loses all its member observations.
error: Treat an empty cluster as an error (default).
drop: Remove any clusters that become empty. kmeans sets the corresponding return values in C and D to NaN. (For information about C and D, see the kmeans documentation page.)
singleton: Create a new cluster consisting of the one point furthest from its centroid.
An example:
Let’s run a simple example to see how this option changes the behavior and results of kmeans. This sample tries to partition 3 observations into 3 clusters while 2 of them are located at the same point:
clc;
X = [1 2; 1 2; 2 3];
[I, C] = kmeans(X, 3, 'emptyaction', 'singleton');
[I, C] = kmeans(X, 3, 'emptyaction', 'drop');
[I, C] = kmeans(X, 3, 'emptyaction', 'error')
The first call with singleton option displays a warning and returns:
I =
3
2
1
C =
2 3
1 2
1 2
As you can see, two cluster centroids are created at the same location ([1 2]), and the first two rows of X are assigned to these clusters.
The second call, with the drop option, displays the same warning message but returns different results:
I =
1
1
3
C =
1 2
NaN NaN
2 3
It returns only two cluster centers and assigns the first two rows of X to the same cluster. I think this option is the most useful one most of the time. When observations are very close together and we want as many cluster centers as possible, we can let MATLAB decide on the number. You can remove NaN rows from C like this:
C(any(isnan(C), 2), :) = [];
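One thing to keep in mind when deleting NaN rows from C is that the labels in I still refer to the original row numbers. A possible way to renumber them is sketched below, using the I and C values from the drop example; the variable names keep and newLabel are made up for this sketch.

```matlab
% Result of the 'drop' call shown above
I = [1; 1; 3];
C = [1 2; NaN NaN; 2 3];

% Centroid rows that survived (no NaN entries)
keep = ~any(isnan(C), 2);

% Map old cluster numbers to consecutive new ones (dropped clusters map to 0)
newLabel = zeros(size(C, 1), 1);
newLabel(keep) = 1:nnz(keep);

C = C(keep, :);        % compact centroid matrix
I = newLabel(I);       % labels now index the rows of the compact C
```

After this step, C(I(r), :) is again the centroid assigned to observation r.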
Finally, the third call throws an exception and halts the program, as expected:
Empty cluster created at iteration 1.

Related

In Matlab, how can I remove nodes with no edges in a directed graph?

I enter this:
plot(digraph([1 2 3 10],[2 3 1 1]))
And the figure shows
How can I remove the nodes 8, 9, 4, 5, 6 and 7? Is there a setting to not show any nodes that don't have edges?
You can manually relabel the edges using unique, so that there are no holes in the list of node numbers. To maintain the original node names, pass them as the fourth input to digraph in the form of a cell array of strings:
S = [1 2 3 10];
T = [2 3 1 1];
[u, ~, w] = unique([S T]);
plot(digraph(w(1:end/2), w(end/2+1:end), [], cellstr(num2str(u.'))))
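To see what the relabeling does, it may help to look at the intermediate values. The sketch below only uses base MATLAB and the S and T vectors from the snippet above; newS and newT are names introduced here for illustration.

```matlab
S = [1 2 3 10];
T = [2 3 1 1];

% u: sorted list of node numbers that actually occur
% w: position of each entry of [S T] within u
[u, ~, w] = unique([S T]);
w = w(:).';                     % force a row vector regardless of version

newS = w(1:end/2);              % relabeled source nodes
newT = w(end/2+1:end);          % relabeled target nodes
```

Node 10 becomes node 4 in the compact numbering, and u keeps the original numbers so they can be used as node names.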
Another way is to use the MATLAB functions designed specifically for graphs and networks: indegree, outdegree, and rmnode.
Notes:
To keep the same labels/names for the nodes, pass the labels/names to digraph.
Use the indegree and outdegree functions to find the isolated nodes, i.e. those with both indegree and outdegree equal to zero. For the sake of completeness: if you had an undirected graph/network, you could simply use the degree function to spot the isolated nodes (use this: find(degree(N) == 0)).
Use the rmnode function to remove the isolated nodes from your graph/network.
Here is the corresponding code:
names_of_nodes = string(1:10);
N = digraph([1 2 3 10],[2 3 1 1],[], names_of_nodes);
isolated_nodes = find(indegree(N) + outdegree(N) == 0);
N = rmnode(N,isolated_nodes);
plot(N)
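For the undirected case mentioned in the notes, a comparable sketch could look like this. The edge list and the five node names are made up for illustration; graph, degree, rmnode and numnodes are the standard MATLAB graph functions.

```matlab
% Hypothetical undirected graph: 5 named nodes, edges only among nodes 1-3
names = arrayfun(@num2str, 1:5, 'UniformOutput', false);
G = graph([1 2 3], [2 3 1], [], names);

% Isolated nodes have degree zero in an undirected graph
isolated = find(degree(G) == 0);

% Remove them; the remaining nodes keep their original names
G = rmnode(G, isolated);
remaining = numnodes(G);
```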

Coloring a dendrogram object in MATLAB

I have a dendrogram-like object. (In this example, for simplicity, I used phytreeread from the Bioinformatics Toolbox.)
tree.newick = '(A1:7,(B1:5,((G1:3, H1:3)C1:1, D1:4)F1:1)E1:2)O1:0';
mytree = phytreeread(tree.newick);
h = plot(mytree,'Type','square')
I also have the names of the terminal nodes as well as the internal nodes.
I can list the edges with
h.BranchLines
and change its color with
set(h.BranchLines(1),'Color',[1 1 0])
but to make it automatic, I need to know which nodes are at the two ends of h.BranchLines(1), and I can't figure it out.
Basically, I am after a function color_edge(A, B, color) that colors the edge between node A and node B.
In other words, I would like to know either:
1) for a given h.BranchLines(1), how to figure out the IDs of its two end nodes
2) for two given nodes, say A1 and O1, how to figure out which element of h.BranchLines connects them.
Code to reproduce the plot - you would need to have the bioinformatics toolbox
tree.newick = '(A1:7,(B1:5,((G1:3, H1:3)C1:1, D1:4)F1:1)E1:2)O1:0';
tree.root = {'O1'};
mytree = phytreeread(tree.newick);
phytreeviewer(mytree)
h = plot(mytree,'Type','square')
set(h.BranchLines(1),'Color',[1 1 0])
set(h.BranchLines(4),'Color',[0 1 1])
You can use the getmatrix function to extract the relationship, e.g. in your case:
>> [matrix, id] = getmatrix(mytree)
matrix =
(9,1) 1
(8,2) 1
(6,3) 1
(6,4) 1
(7,5) 1
(7,6) 1
(8,7) 1
(9,8) 1
id =
'A1'
'B1'
'G1'
'H1'
'D1'
'C1'
'F1'
'E1'
'O1'
It is not documented anywhere that the order of non-zero entries in the sparse matrix corresponds to the order of line handles in BranchLines, but it does seem to be the case (apparently, by construction). So e.g. the first handle would connect nodes 9 and 1, i.e. O1 and A1, according to id mapping.
Worth noting that plot and getmatrix are not built-in functions, so you can just examine the source code to verify this assumption (though they are not the easiest to read). Sources are located under \toolbox\bioinfo\bioinfo\@phytree\.
Alternatively, we can do a quick check by showing branch labels on the chart, e.g. for the first line:
h = plot(mytree, 'Type', 'square', 'BranchLabels', true);
set(h.BranchLines(1), 'Color', 'r')
Finally, this is how you can read those non-zero node indices programmatically and in the right order:
[i,j] = ind2sub(size(matrix), find(matrix))
i =
9
8
6
6
7
7
8
9
j =
1
2
3
4
5
6
7
8
P.S. There is, of course, no guarantee that this will not break in a future release.
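Building on this, the lookup needed for a color_edge(A, B, color) function could be sketched as follows. The sketch hardcodes the id list and the (parent, child) pairs printed above so that it runs without the Bioinformatics Toolbox. The helper names nodeIdx and findBranch are made up, and the whole idea relies on the undocumented assumption that BranchLines follows the order of the non-zero entries.

```matlab
% Values copied from the getmatrix / ind2sub output above
id     = {'A1'; 'B1'; 'G1'; 'H1'; 'D1'; 'C1'; 'F1'; 'E1'; 'O1'};
parent = [9 8 6 6 7 7 8 9].';      % i: row index of each non-zero entry
child  = [1 2 3 4 5 6 7 8].';      % j: column index of each non-zero entry

% Hypothetical helpers: node name -> index, node pair -> branch number
nodeIdx    = @(name) find(strcmp(id, name));
findBranch = @(a, b) find((parent == a & child == b) | ...
                          (parent == b & child == a));

kOA = findBranch(nodeIdx('O1'), nodeIdx('A1'));   % branch between O1 and A1
kFC = findBranch(nodeIdx('C1'), nodeIdx('F1'));   % argument order does not matter

% With the plot handle h available, the edge could then be colored like this:
% set(h.BranchLines(kOA), 'Color', [1 1 0])
```

If two nodes are not directly connected, findBranch returns an empty result, which a real color_edge function would need to handle.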

Matlab - submatrix for stiffness method

In order to use the stiffness method for trusses, I need to extract certain elements from a large global stiffness matrix.
Say I have a 9 x 9 matrix K representing a three-member truss. This means that the first 3 rows and columns correspond to the first node, the second set of three rows and columns with the second node, and the third with the third node. In the code is a vector zDisp that corresponds to each node that has zero displacement. On paper, a zero displacement of a node means you would cross out the rows and columns corresponding to that displacement, leaving you with a smaller and easier to work with K matrix. So if the first and third nodes have zero displacement, you would be left with a 3 x 3 matrix corresponding to the intersection of the middle three rows and the middle three columns.
I thought I could accomplish this one node at a time with a function like so:
function [ B ] = deleteNode( B, node )
%deleteNode removes the corresponding rows and vectors to a node that has
% zero deflection from the global stiffness matrix
% --- Problem line - this gets the first location in the matrix corresponding to the node
start = 3*node- 2;
for i = 0 : 2
B(start+i,:) = [];
B(:,start+i) = [];
end
end
So my main project would go something like
% Zero displacement nodes
zDisp = [1;
3;
];
% --- Create 9 x 9 global matrix Kg ---
% Make a copy of the global matrix
S = Kg;
for(i = 1 : length(zDisp))
S = deleteNode(S, zDisp(i));
end
This does not work because once the loop executes for node 1 and removes the first 3 rows and columns, the problem line in the function no longer works to find the correct location in the smaller matrix to find the node.
So I think this step needs to be executed all at once. I am thinking I may need to instead input which nodes are NOT zero displacement, and create a submatrix based off of that. Any tips on this? Been thinking on it awhile. Thanks all.
In your example, you want to remove rows/columns 1, 2, 3, 7, 8, and 9, so if zDisp=[1;3],
remCols=bsxfun(@plus,1:3,3*(zDisp-1))
If I understand correctly, you should just be able to first remove the rows given by remCols:
S(remCols(:),:)=[]
then remove the columns:
S(:,remCols(:))=[]
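The idea raised in the question, selecting the degrees of freedom that are NOT fixed, can be sketched in a similar way. The 9 x 9 matrix below is a placeholder rather than a real stiffness matrix, and the names remDof and keepDof are made up for illustration.

```matlab
% Placeholder for the 9 x 9 global stiffness matrix (not real truss data)
Kg = reshape(1:81, 9, 9);
zDisp = [1; 3];                      % nodes with zero displacement

% Rows/columns belonging to the fixed nodes: 3 per node
remDof = bsxfun(@plus, (1:3).', 3 * (zDisp(:).' - 1));
remDof = remDof(:).';                % 1 2 3 7 8 9

% Everything else is kept
keepDof = setdiff(1:size(Kg, 1), remDof);

% Extract the reduced matrix in a single indexing step
S = Kg(keepDof, keepDof);
```

Because the submatrix is extracted in one step, there is no index shifting to worry about, which is what broke the node-by-node deletion loop.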

Possible "Traveling Salesman" function in Matlab?

I am looking to solve a Traveling Salesman type problem using a matrix in order to find the minimum time between transitions. The matrix looks something like this:
A = [inf 4 3 5;
1 inf 3 5;
4 5 inf 3;
6 7 1 inf]
The y-axis represents the "from" node and the x-axis represents the "to" node. I am trying to find the optimal time from node 1 to node 4. I was told that there is a Matlab function called "TravellingSalesman". Is that true, and if not, how would I go about solving this matrix?
Thanks!
Here's an outline of the brute-force algorithm to solve TSP for paths from node 1 to node n:
minCost = inf
minPath = []
for each permutation P of the nodes [2..n-1]
// paths always start from node 1 and end on node n
C = A(1,P(1)) + A(P(1),P(2)) + A(P(2),P(3)) + ... +
A(P(n-3),P(n-2)) + A(P(n-2),n)
if C < minCost
minCost = C
minPath = P
elseif C == minCost // you only need this part if you want
minPath = [minPath; P] // ALL paths with the shortest distance
end
end
Note that the first and last terms in the sum are different because you know beforehand what the first and last nodes are, so you don't have to include them in the permutations. So in the example given, with n=4, there are actually only 2!=2 possible paths.
The list of permutations can be precalculated using perms(2:n-1), but that might involve storing a large matrix ((n-2)! x (n-2)). Or you can calculate the cost as you generate each permutation. There are several files on the MathWorks File Exchange with names like nextPerm that should work for you. Either way, as n grows you're going to be generating a very large number of permutations and your calculations will take a very long time.
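As a rough illustration, the outline can be turned into runnable MATLAB using perms and the matrix A from the question. This is a sketch under the same assumption as the outline (every intermediate node is visited exactly once); it keeps only the first best path, and the variable names are illustrative.

```matlab
A = [inf 4 3 5;
     1 inf 3 5;
     4 5 inf 3;
     6 7 1 inf];
n = size(A, 1);

P = perms(2:n-1);                 % every ordering of the intermediate nodes
minCost = inf;
minPath = [];
for r = 1:size(P, 1)
    route = [1, P(r, :), n];      % paths always start at node 1 and end at node n
    % Linear indices of the consecutive (from, to) entries along the route
    legs = sub2ind(size(A), route(1:end-1), route(2:end));
    c = sum(A(legs));
    if c < minCost
        minCost = c;
        minPath = route;
    end
end
```

For this 4-node matrix only two routes are compared, 1-2-3-4 and 1-3-2-4, so the loop is trivial here; the factorial growth of perms is what makes the approach impractical for larger n.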

Find median value of the largest clump of similar values in an array in the most computationally efficient manner

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one of the values removed and finding the value which corresponds to the largest reduction in STD and repeating that for the number of elements in the array, which is terribly inefficient.
for j = 1:7
G = double(H);
for i = 1:7
G(i) = NaN;
T(i) = nanstd(G);
end
best = find(T==min(T));
H(best) = NaN;
end
x = find(H==max(H));
Any thoughts?
This possibility bins your data and looks for the bin with most elements. If your distribution consists of well separated clusters this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H); % <-- set # of bins here
[v bins]=hist(H,nbins);
[vm im]=max(v); % find max in histogram
bl = bins(2)-bins(1); % bin size
bm = bins(im); % position of bin with max #
ifb =find(abs(H-bm)<bl/2) % elements within bin
median(H(ifb)) % median of the elements in that bin
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look around the most populated bin. In the example you provided, neither of these is critical: you could set the number of bins to 3 (instead of length(H)) and it would still work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice. A better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
I should also note that there are clustering methods (kmeans) that may work better, but perhaps less efficiently. For instance this is the output of [H' kmeans(H',4) ]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H',nbin);
[mv iv]=max(histc(km,[1:nbin])); % iv is the label of the largest cluster
H(km==iv)
median(H(km==iv))
Notice, however, that kmeans does not necessarily return the same result every time it is run, so you might need to run it several times and compare the results.
I timed the two methods and found that kmeans takes about 10 times longer. However, it is more robust, since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).