I have a dendrogram-like object. (In this example, for simplicity, I used phytreeread from the Bioinformatics Toolbox.)
tree.newick = '(A1:7,(B1:5,((G1:3, H1:3)C1:1, D1:4)F1:1)E1:2)O1:0';
mytree = phytreeread(tree.newick);
h = plot(mytree,'Type','square')
I also have the names of the terminal nodes as well as the internal nodes.
I can list the edges with
h.BranchLines
and change its color with
set(h.BranchLines(1),'Color',[1 1 0])
but to make this automatic, I need to know which nodes are at the two ends of h.BranchLines(1), and I can't figure that out.
Basically, I am after writing a function color_edge(A, B, color) that colors the edge between node A and node B.
In other words, I would like to know either:
1) for a given h.BranchLines(1), how to figure out the IDs of its two end nodes, or
2) for two given nodes, say A1 and O1, how to figure out which h.BranchLines element connects them.
Code to reproduce the plot (you will need the Bioinformatics Toolbox):
tree.newick = '(A1:7,(B1:5,((G1:3, H1:3)C1:1, D1:4)F1:1)E1:2)O1:0';
tree.root = {'O1'};
mytree = phytreeread(tree.newick);
phytreeviewer(mytree)
h = plot(mytree,'Type','square')
set(h.BranchLines(1),'Color',[1 1 0])
set(h.BranchLines(4),'Color',[0 1 1])
You can use the getmatrix function to extract the relationship, e.g. in your case:
>> [matrix, id] = getmatrix(mytree)
matrix =
(9,1) 1
(8,2) 1
(6,3) 1
(6,4) 1
(7,5) 1
(7,6) 1
(8,7) 1
(9,8) 1
id =
'A1'
'B1'
'G1'
'H1'
'D1'
'C1'
'F1'
'E1'
'O1'
It is not documented anywhere that the order of non-zero entries in the sparse matrix corresponds to the order of line handles in BranchLines, but it does seem to be the case (apparently, by construction). So e.g. the first handle would connect nodes 9 and 1, i.e. O1 and A1, according to id mapping.
Worth noting that plot and getmatrix are not built-in functions, so you can just examine the source code to verify this assumption (though they are not the easiest to read). Sources are located under \toolbox\bioinfo\bioinfo\@phytree\.
Alternatively, we can do a quick check by showing branch labels on the chart, e.g. for the first line:
h = plot(mytree, 'Type', 'square', 'BranchLabels', true);
set(h.BranchLines(1), 'Color', 'r')
Finally, this is how you can read those non-zero node indices programmatically and in the right order:
[i,j] = ind2sub(size(matrix), find(matrix))
i =
9
8
6
6
7
7
8
9
j =
1
2
3
4
5
6
7
8
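Putting this together, here is a sketch of the requested color_edge function. It passes h and mytree explicitly (a small deviation from the color_edge(A, B, color) signature in the question), and it relies on the undocumented correspondence described above, so treat it as an assumption rather than a guaranteed API:

```matlab
function color_edge(h, mytree, A, B, color)
% Color the branch line between nodes named A and B.
% Relies on the (undocumented) assumption that the order of nonzeros
% in the getmatrix output matches the order of h.BranchLines.
[matrix, id] = getmatrix(mytree);
[i, j] = ind2sub(size(matrix), find(matrix));  % end nodes of each line
a = find(strcmp(id, A));
b = find(strcmp(id, B));
k = find((i == a & j == b) | (i == b & j == a));
if isempty(k)
    error('color_edge:noEdge', 'No edge between %s and %s.', A, B);
end
set(h.BranchLines(k), 'Color', color);
end
```

Usage, for the example tree: color_edge(h, mytree, 'O1', 'A1', [1 1 0]).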
P.S. There is obviously no guarantee that this will not break in a future release.
EDIT TO REQUIREMENT 6 AND NEW REQUIREMENT ADDED
6) Exactly 4 columns/rows must have degree 3.
7) No two vertices of degree 3 are adjacent to each other.
My goal:
To generate and save all matrices that meet specific requirements. Then compare each matrix to additional matrices that have been manually entered previously to check for specific similarities. I can add more detail if somebody thinks it would be helpful. I believe I have the comparison aspect of the code sorted out already, so I am waiting on the matrix generation portion. I need to do this for multiple sizes but I will focus this question on the 10x10 case.
Specific requirements:
1) Must be a 10x10 matrix (representing a graph on 10 vertices).
2) Must be symmetric (representing an adjacency matrix).
3) Have a diagonal of 0s (no loops).
4) Only 1s and 0s (simple graph).
5) The entire matrix must have exactly 48 1s (the graph has 24 edges).
6) Each column/row must have either 3 or 6 1s (each node has degree 3 or 6).
Application:
I am investigating a conjecture and believe I have come up with a possible solution which could break down the conjecture into smaller pieces and possibly prove some aspects. I want to use brute force to show if my idea works for a small specific case. Also having a base code in place could allow for future modifications to test other possible cases or ideas.
Ideas and thought process:
I used the edges of a graph to manually input my comparison set. For example:
G9=graph([1 1 1 2 2 3 4 4 4 5 5 6 6 6 6 3 3 9 2 2 2 7 7 8],[2 3 4 3 4 4 5 6 7 6 7 7 3 9 10 9 10 10 7 8 9 8 9 9]);
I think this is the only graph, up to isomorphism, which meets the previously listed requirements.
My original thought was to create the possible matrices that satisfy the given conditions then compare them to my comparison set. I still think this is the best approach.
I foolishly attempted to generate random matrices, completely overlooking the massive number of possibilities. Using a while loop, I first generated a random matrix that satisfied the first four requirements. Then, in separate nested for statements, I checked for requirement 5 using numedges() and requirement 6 using all(mod(degree())). That was a bad approach for several fairly obvious reasons, but I learned a lot through the process, and it led me to the code that should do my final comparisons.
This is the first time I have used Matlab so I am learning as I go. I have been working on this one code for nearly 2 weeks and do not know if what I have come up with is "good", but I am proud of what I have been able to do by myself. I have reached the point where I feel like I need some outside advice. I am open to any suggestions and any level of help. A reference to a source, a function suggestion, another approach, or a complete solution with a "plug and play" code would be appreciated. I do not shy away from putting forth the effort to achieve my goals.
I appreciate any feedback.
If you want to brute force it, you've got 3773655750150 possible configurations to test for 3-or-6-connectedness. I think you'll probably need more powerful math (Polya Enumeration Theorem? or some other combinatoric theorem I probably forgot) to solve this one.
edit: This recursive solution is much more constrained and is likely to finish in the next century.
E = containers.Map('KeyType', 'int32', 'ValueType', 'any');
for k = 0:9
E(k) = [];
end
foo(E, 3, 0);
foo(E, 6, 0);
function E = foo(E, D, n)
% E : graph edges (map)
% D : degree (3 or 6)
% n : current node
if (n == 9)
e_degree = cellfun(@length, E.values);
if all(e_degree) && all(~mod(e_degree,3))
print_E(E)
end
return
end
e = E(n); % existing edges
m = setdiff((n+1:9), e); % candidate new edges
K = D - length(e);
% if too many edges, return early
if (K < 0)
return
end
C = combnk(m, K);
N = size(C, 1);
for k = 1:N
c = C(k,:);
E(n) = unique([e, c]);
for kv = 1:K
v = c(kv);
E(v) = unique([E(v), n]);
end
% down the rabbit hole: try both target degrees (3 and 6) for the next node
for D = 3:3:6
E = foo(E, D, n + 1);
end
% remove edges added in this loop
E(n) = setdiff(E(n), c);
for kv = 1:K
v = c(kv);
E(v) = setdiff(E(v), n);
end
end
end
function print_E(E)
for k = 0:9
fprintf('%i: ',k);
fprintf('%i ', E(k));
fprintf('\n');
end
fprintf('\n');
end
I enter this:
plot(digraph([1 2 3 10],[2 3 1 1]))
And the figure shows
How can I remove the nodes 8, 9, 4, 5, 6 and 7? Is there a setting to not show any nodes that don't have edges?
You can manually relabel the edges using unique, so that there are no holes in the list of node numbers. To maintain the original node names, pass them as the fourth input to digraph in the form of a cell array of strings:
S = [1 2 3 10];
T = [2 3 1 1];
[u, ~, w] = unique([S T]);
plot(digraph(w(1:end/2), w(end/2+1:end), [], cellstr(num2str(u.'))))
Another way is to use the MATLAB functions created exactly for graphs / networks, i.e. indegree, outdegree, and rmnode.
Notes:
to always keep the same labels / names for the nodes, pass the labels / names to the digraph function.
use the indegree and outdegree functions to find the isolated nodes, i.e. those with both indegree and outdegree equal to zero. For the sake of completeness: if you had an undirected graph / network, you could simply use the degree function to spot the isolated nodes (use find(degree(N) == 0)).
use the rmnode function to remove the isolated nodes from your graph / network.
Here is the relevant code:
names_of_nodes = string(1:10);
N = digraph([1 2 3 10],[2 3 1 1],[], names_of_nodes);
isolated_nodes = find(indegree(N) + outdegree(N) == 0);
N = rmnode(N,isolated_nodes);
plot(N)
I am familiar with Matlab but am still having trouble building intuition for vectorized methods, so I was wondering if anyone could demonstrate how they would approach this problem.
I have an array, for example A = [1 1 2 2 1 3 3 3 4 3 4 4 5].
I want to return an array B such that each element is the index of the most recent element of A (before the current position) at which the value changed, i.e. the index of the last element of the preceding run.
So for our array A, B would equal [x x 2 2 4 5 5 5 8 9 10 10 12], where the x's can be any consistent value you like, because there is no earlier index satisfying those characteristics.
I know how I would code it as a for-loop, and I bet the for-loop is probably faster, but can anyone vectorize this to faster than the for-loop?
Here's my for-loop:
prev=0;
B=zeros(length(A),1);
for i=2:length(A)
if A(i-1)~=A(i)
prev=i-1;
end
B(i)=prev;
end
Find the indices of the entries where the value changes:
ind = find(diff(A) ~= 0);
The values that should appear in B are therefore:
val = [0 ind];
Construct the diff of B: fill in the difference between the values that should appear at the right places:
Bd = zeros(1, numel(A));
Bd(ind + 1) = diff(val);
Now use cumsum to construct B:
B = cumsum(Bd)
Not sure whether this results in a speed-up though.
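Putting the steps together as a runnable script, using the example A from the question (B(1:2) come out as 0 here, matching the for-loop's initialization):

```matlab
A = [1 1 2 2 1 3 3 3 4 3 4 4 5];

ind = find(diff(A) ~= 0);   % positions where the value changes
val = [0 ind];              % values that should appear in B
Bd  = zeros(1, numel(A));   % the "diff" of B
Bd(ind + 1) = diff(val);    % place each jump right after a change
B   = cumsum(Bd)            % -> [0 0 2 2 4 5 5 5 8 9 10 10 12]
```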
I have a matrix like this:
fd =
x y z
2 5 10
2 6 10
3 5 11
3 9 11
4 3 11
4 9 12
5 4 12
5 7 13
6 1 13
6 5 13
I have two parts of my problem:
1) I want to calculate the difference of each two elements in a column.
So I tried the following code:
for i= 1:10
n=10-i;
for j=1:n
sdiff1 = diff([fd(i,1); fd(i+j,1)],1,1);
sdiff2 = diff([fd(i,2); fd(i+j,2)],1,1);
sdiff3 = diff([fd(i,3); fd(i+j,3)],1,1);
end
end
I want all the differences such as:
x1-x2, x1-x3, x1-x4....x1-x10
x2-x3, x2-x4.....x2-x10
.
.
.
.
.
x9-x10
same for y and z value differences
Then all the values should stored in sdiff1, sdiff2 and sdiff3
2) What I want next is: for the same z values, keep the original data points; for different z values, merge those points which are close to each other. By "close" I mean:
if abs(sdiff3)== 0
keep the original data
for abs(sdiff3) > 1
if abs(sdiff1) < 2 & abs(sdiff2) < 2
then I need mean x, mean y and mean z of the points.
So I tried the whole programme as:
for i= 1:10
n=10-i;
for j=1:n
sdiff1 = diff([fd(i,1); fd(i+j,1)],1,1);
sdiff2 = diff([fd(i,2); fd(i+j,2)],1,1);
sdiff3 = diff([fd(i,3); fd(i+j,3)],1,1);
if (abs(sdiff3(:,1)))> 1
continue
mask1 = (abs(sdiff1(:,1)) < 2) & (abs(sdiff2(:,1)) < 2) & (abs(sdiff3(:,1)) > 1);
subs1 = cumsum(~mask1);
xmean1 = accumarray(subs1,fd(:,1),[],@mean);
ymean1 = accumarray(subs1,fd(:,2),[],@mean);
zmean1 = accumarray(subs1,fd(:,3),[],@mean);
fd = [xmean1(subs1) ymean1(subs1) zmean1(subs1)];
end
end
end
My final output should be:
2.5 5 10.5
3.5 9 11.5
5 4 12
5 7 13
6 1 13
where points (1,2,3), (4,6) and (5,7,10) are merged to their mean positions (according to the threshold difference < 2), whereas the 8th and 9th points keep their original data.
I am stuck in finding the differences for each two elements of a column and storing them. My code is not giving me the desired output.
Can somebody please help?
Thanks in advance.
This can be greatly simplified using vectorised notation. You can do for instance
fd(:,1) - fd(:,2)
to get the difference between columns 1 and 2 (or equivalently diff(fd(:,[1 2]), 1, 2)). You can make this more elegant/harder to read and debug with pdist but if you only have three columns it's probably more trouble than it's worth.
I suspect your first problem is with the third argument to diff. If you use diff(X, 1, 1) it will do the first order diff in direction 1, which is to say between adjacent rows (downwards). diff(X, 1, 2) will do it between adjacent columns (rightwards), which is what you want. Matlab uses the opposite convention to spreadsheets in that it indexes rows first then columns.
Once you have your diffs you can then test the elements:
thesame = find(sdiff3 < 2); % for example
this will yield a vector of the row indices of sdiff3 where the value is less than 2. Then you can use
fd(thesame,:)
to select the elements of fd at those indexes. To remove matching rows you would do the opposite test
notthesame = find(sdiff3 >= 2);
to find the ones to keep, then extract those into a new array
keepers = fd(notthesame,:);
These won't give you the exact solution but it'll get you on the right track. For the syntax of these commands and lots of examples you can run e.g. doc diff in the command window.
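For part 1 of the question, the pairwise differences can also be built in one shot. A sketch, reusing the variable names sdiff1/sdiff2/sdiff3 from the question (the pairs variable is a name introduced here for illustration):

```matlab
% All pairwise differences between rows of fd, computed at once.
% pairs(k,:) = [i j] with i < j; row k of sdiff is fd(i,:) - fd(j,:).
pairs = nchoosek(1:size(fd,1), 2);
sdiff = fd(pairs(:,1), :) - fd(pairs(:,2), :);
sdiff1 = sdiff(:,1);   % x differences: x1-x2, x1-x3, ..., x9-x10
sdiff2 = sdiff(:,2);   % y differences
sdiff3 = sdiff(:,3);   % z differences
```

This keeps every pair's difference instead of overwriting it on each loop iteration, which is what loses the values in the original code.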
I'm using this script to cluster a set of 3D points with the kmeans MATLAB function, but I always get the error "Empty cluster created at iteration 1".
The script I'm using:
[G,C] = kmeans(XX, K, 'distance','sqEuclidean', 'start','sample');
XX can be found in this link XX value and the K is set to 3
So if anyone could please advise me why this is happening.
It is simply telling you that during the assign-recompute iterations, a cluster became empty (lost all assigned points). This is usually caused by an inadequate cluster initialization, or that the data has less inherent clusters than you specified.
Try changing the initialization method using the start option. kmeans provides four possible techniques to initialize clusters:
sample: sample K points randomly from the data as initial clusters (default)
uniform: select K points uniformly across the range of the data
cluster: perform preliminary clustering on a small subset
manual: manually specify initial clusters
Also, you can try different values of the emptyaction option, which tells MATLAB what to do when a cluster becomes empty.
Ultimately, I think you need to reduce the number of clusters, i.e. try K=2 clusters.
I tried to visualize your data to get a feel for it:
load matlab_X.mat
figure('renderer','zbuffer')
line(XX(:,1), XX(:,2), XX(:,3), ...
'LineStyle','none', 'Marker','.', 'MarkerSize',1)
axis vis3d; view(3); grid on
After some manual zooming/panning, it looks like a silhouette of a person:
You can see that the data of 307200 points is really dense and compact, which confirms what I suspected: the data doesn't have that many clusters.
Here is the code I tried:
>> [IDX,C] = kmeans(XX, 3, 'start','uniform', 'emptyaction','singleton');
>> tabulate(IDX)
Value Count Percent
1 18023 5.87%
2 264690 86.16%
3 24487 7.97%
What's more, the points in cluster 2 are all duplicates of the same point ([0 0 0]):
>> unique(XX(IDX==2,:),'rows')
ans =
0 0 0
The other two clusters look like:
clr = lines(max(IDX));
for i=1:max(IDX)
line(XX(IDX==i,1), XX(IDX==i,2), XX(IDX==i,3), ...
'Color',clr(i,:), 'LineStyle','none', 'Marker','.', 'MarkerSize',1)
end
So you might get better clusters if you remove the duplicate points first...
In addition, you have a few outliers that might affect the result of clustering. Visually, I narrowed down the range of the data to the following intervals which encompasses most of the data:
>> xlim([-500 100])
>> ylim([-500 100])
>> zlim([900 1500])
Here is the result after removing duplicate points (over 250K points) and outliers (around 250 data points), and clustering with K=3 (best out of 5 runs with the replicates option):
XX = unique(XX,'rows');
XX(XX(:,1) < -500 | XX(:,1) > 100, :) = [];
XX(XX(:,2) < -500 | XX(:,2) > 100, :) = [];
XX(XX(:,3) < 900 | XX(:,3) > 1500, :) = [];
[IDX,C] = kmeans(XX, 3, 'replicates',5);
with almost an equal split across the three clusters:
>> tabulate(IDX)
Value Count Percent
1 15605 36.92%
2 15048 35.60%
3 11613 27.48%
Recall that the default distance function is euclidean distance, which explains the shape of the formed clusters.
If you are confident with your choice of "k=3", here is the code I wrote for not getting an empty cluster:
[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');
while length(unique(IDX))<3 || histc(histc(IDX,[1 2 3]),1)~=0
% i.e. while one of the clusters is empty -- or -- we have one or more clusters with only one member
[IDX,C] = kmeans(XX,3,'distance','cosine','start','sample', 'emptyaction','singleton');
end
Amro described the reason clearly:
It is simply telling you that during the assign-recompute iterations,
a cluster became empty (lost all assigned points). This is usually
caused by an inadequate cluster initialization, or that the data has
less inherent clusters than you specified.
But the other option that could help to solve this problem is emptyaction:
Action to take if a cluster loses all its member observations.
error: Treat an empty cluster as an error (default).
drop: Remove any clusters that become empty. kmeans sets the corresponding return values in C and D to NaN. (For information about C and D, see the kmeans documentation page.)
singleton: Create a new cluster consisting of the one point furthest from its centroid.
An example:
Let's run a simple example to see how this option changes the behavior and results of kmeans. This sample tries to partition 3 observations into 3 clusters while 2 of them are located at the same point:
clc;
X = [1 2; 1 2; 2 3];
[I, C] = kmeans(X, 3, 'emptyaction', 'singleton');
[I, C] = kmeans(X, 3, 'emptyaction', 'drop');
[I, C] = kmeans(X, 3, 'emptyaction', 'error')
The first call, with the singleton option, displays a warning and returns:
I =
     3
     2
     1
C =
     2     3
     1     2
     1     2
As you can see, two cluster centroids are created at the same location ([1 2]), and the first two rows of X are assigned to these clusters.
The second call, with the drop option, also displays the same warning message, but returns different results:
I =
     1
     1
     3
C =
     1     2
   NaN   NaN
     2     3
It returns only two cluster centers and assigns the first two rows of X to the same cluster. I think this option would be the most useful in most cases: when observations are too close together and we need as many cluster centers as possible, we can let MATLAB decide on the number. You can remove the NaN rows from C like this:
C(any(isnan(C), 2), :) = [];
And finally, the third call throws an error and halts the program, as expected:
Empty cluster created at iteration 1.