Agglomerative Clustering in Matlab - matlab

I have a simple 2-dimensional dataset that I wish to cluster in an agglomerative manner (not knowing the optimal number of clusters to use). The only way I've been able to cluster my data successfully is by giving the function a 'maxclust' value.
For simplicity's sake, let's say this is my dataset:
X=[ 1,1;
1,2;
2,2;
2,1;
5,4;
5,5;
6,5;
6,4 ];
Naturally, I would want this data to form 2 clusters. I understand that if I knew this, I could just say:
T = clusterdata(X,'maxclust',2);
and to find which points fall into each cluster I could say:
cluster_1 = X(T==1, :);
and
cluster_2 = X(T==2, :);
but without knowing that 2 clusters would be optimal for this dataset, how do I cluster these data?
Thanks

The whole point of this method is that it represents the clusters it finds as a hierarchy, and it is up to you to decide how much detail you want.
Think of this as having a horizontal line intersecting the dendrogram, which moves starting from 0 (each point is its own cluster) all the way to the max value (all points in one cluster). You could:
stop when you reach a predetermined number of clusters
position it manually at a chosen height value
place it where the clusters become too far apart according to the distance criterion (i.e. there is a big jump to the next level)
This can be done with either the 'maxclust' or the 'cutoff' argument of the CLUSTER/CLUSTERDATA functions.
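For example, a minimal sketch of the 'cutoff' route on the toy data above (the threshold of 1.2 is just an illustrative value; any height between the within-cluster and between-cluster merge distances gives the same result):
Z = linkage(X, 'single');                               % build the hierarchy explicitly
dendrogram(Z)                                           % eyeball where the big jump is
T = cluster(Z, 'cutoff', 1.2, 'criterion', 'distance'); % cut the tree at height 1.2
or, in one call:
T = clusterdata(X, 'linkage', 'single', 'cutoff', 1.2, 'criterion', 'distance');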

To choose the optimal number of clusters, one common approach is to make a plot similar to a Scree Plot. Then you look for the "elbow" in the plot, and that is the number of clusters you pick. For the criterion here, we will use the within-cluster sum-of-squares:
function wss = plotScree(X, n)
    % within-cluster sum-of-squares for k = 1..n clusters
    wss = zeros(1, n);
    wss(1) = (size(X, 1)-1) * sum(var(X, [], 1));   % k = 1: total sum-of-squares
    for i = 2:n
        T = clusterdata(X, 'maxclust', i);
        wss(i) = sum((grpstats(T, T, 'numel')-1) .* sum(grpstats(X, T, 'var'), 2));
    end
    hold on
    plot(wss)
    plot(wss, '.')
    xlabel('Number of clusters')
    ylabel('Within-cluster sum-of-squares')
end
>> plotScree(X, 5)
ans =
54.0000 4.0000 3.3333 2.5000 2.0000
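One crude way to automate picking the elbow from that output is to take the largest drop in wss; this is only a sketch of a heuristic, but for the toy data the drop from 54 to 4 puts the elbow at k = 2:
wss = plotScree(X, 5);
[~, kElbow] = max(abs(diff(wss)));   % position of the largest drop
kElbow = kElbow + 1;                 % diff shifts the index by one
T = clusterdata(X, 'maxclust', kElbow);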


matlab k-means clustering evaluation [duplicate]

How can I effectively evaluate the performance of the standard MATLAB k-means implementation?
For example I have a matrix X
X = [1 2;
3 4;
2 5;
83 76;
97 89]
For every point I have a gold-standard clustering. Let's assume that (83,76), (97,89) is the first cluster and (1,2), (3,4), (2,5) is the second cluster. Then we run in MATLAB:
idx = kmeans(X,2)
And get the following results
idx = [1; 1; 2; 2; 2]
Judging by the nominal label values this looks like a very bad clustering, because only (2,5) matches, but we don't care about the nominal values, only about which points are clustered together. Therefore we somehow have to identify that only (2,5) ends up in the wrong cluster.
For a MATLAB newbie like me, evaluating the performance of a clustering is not a trivial task. I would appreciate it if you could share your ideas about how to evaluate it.
Evaluating the "best clustering" is somewhat ambiguous, especially if you have points in two different groups that eventually cross over with respect to their features. When that happens, how exactly do you decide which cluster those points should be merged into? Here's an example from the Fisher iris dataset that comes preloaded with MATLAB. Let's specifically take the petal length and petal width, which are the third and fourth columns of the data matrix, and plot the versicolor and virginica classes:
load fisheriris;
plot(meas(101:150,3), meas(101:150,4), 'b.', meas(51:100,3), meas(51:100,4), 'r.', 'MarkerSize', 24)
This is what we get:
You can see that towards the middle there is some overlap. You are lucky in that you knew beforehand what the clusters were, so you can measure the accuracy, but if we got data like the above without knowing which label each point belonged to, how would you know which cluster the middle points belong to?
Instead, what you should do is try and minimize these classification errors by running kmeans more than once. Specifically, you can override the behaviour of kmeans by doing the following:
idx = kmeans(X, 2, 'Replicates', num);
The 'Replicates' flag tells kmeans to run a total of num times. After those num runs, the output memberships are the ones the algorithm deemed best over all runs; specifically, the replicate with the lowest total within-cluster sum of point-to-centroid distances is the one returned.
Not setting the Replicates flag obviously defaults to running one time. As such, try increasing the total number of times kmeans runs so that you have a higher probability of getting a higher quality of cluster memberships. By setting num = 10, this is what we get with your data:
X = [1 2;
3 4;
2 5;
83 76;
97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
2
2
2
1
1
You'll see that the first three points belong to one cluster while the last two points belong to another. Even though the IDs are flipped, it doesn't matter as we want to be sure that there is a clear separation between the groups.
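If you also want a single number that quantifies agreement with your gold standard while ignoring the arbitrary label IDs, one common option is the pairwise Rand index. A minimal sketch, assuming gold holds your gold-standard labels for the five points:
gold = [2; 2; 2; 1; 1];                 % assumed gold-standard labels
idx  = kmeans(X, 2, 'Replicates', num);
sameGold = bsxfun(@eq, gold, gold');    % true where two points share a gold cluster
sameIdx  = bsxfun(@eq, idx, idx');      % true where two points share a k-means cluster
pairs = triu(true(numel(gold)), 1);     % each unordered pair counted once
randIndex = sum(sameGold(pairs) == sameIdx(pairs)) / nnz(pairs)   % 1 means perfect agreement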
Minor note with regards to random algorithms
If you look at the comments above, you'll notice that several people tried running the kmeans algorithm on your data and received different clustering results. The reason is that when kmeans chooses the initial points for your cluster centres, it chooses them randomly. As such, depending on the state of their random number generator, the initial points chosen by one person are not guaranteed to be the same as those chosen by another.
Therefore, if you want reproducible results, you should set the seed of the random number generator to the same value before running kmeans. On that note, try using rng with an integer that is known beforehand, like 123. If we do this before the code above, everyone who runs the code will be able to reproduce the same results.
As such:
rng(123);
X = [1 2;
3 4;
2 5;
83 76;
97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
1
1
1
2
2
Here the labels are reversed compared to the previous run, but I guarantee that anyone else who runs the above code will get the same labelling as what was produced above, every time.

MATLAB - histograms of equal size and histogram overlap

An issue I've come across multiple times is wanting to take two similar data sets and create histograms from them where the bins are identical, so as to easily calculate things like histogram overlap.
You can define the number of bins (obviously) using
[counts, bins] = hist(data,number_of_bins)
But there's no obvious way (as far as I can see) to make the bin sizes equal across several different data sets. When I first looked into this, I remember finding various people with the same issue, but no good solutions.
The right, easy way
As pointed out by horchler, this can easily be achieved using either histc (which lets you define your bins vector), or vectorizing your histogram input into hist.
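For instance, a minimal sketch of the histc route, assuming A and B are the two samples generated in the example further down: define one common edge vector and reuse it for both data sets, so the bins line up exactly.
edges = linspace(min([A B]), max([A B]), 501);   % 500 bins -> 501 shared edges
counts_A = histc(A, edges);
counts_B = histc(B, edges);
overlap  = sum(min(counts_A, counts_B));         % simple histogram-overlap measure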
The wrong, stupid way
I'm leaving the approach below as a reminder to others that even stupid questions can yield worthwhile answers.
I've been using the following approach for a while, so figured it might be useful for others (or, someone can very quickly point out the correct way to do this!).
The general approach relies on the fact that MATLAB's hist function defines an equally spaced number of bins between the largest and smallest value in your sample. So, if you append to each sample a start (smallest) and end (largest) value that are the min and max over all samples of interest, you force the histogram range to be the same for every data set. You can then truncate the first and last values to recreate your original data.
For example, create the following data sets:
A = randn(1,2000)+7;
B = randn(1,2000)+9;
[counts_A, bins_A] = hist(A, 500);
[counts_B, bins_B] = hist(B, 500);
Here for my specific data sets I get the following results
bins_A(1) % 3.8127 (this is also min(A) )
bins_A(500) % 10.3081 (this is also max(A) )
bins_B(1) % 5.6310 (this is also min(B) )
bins_B(500) % 13.0254 (this is also max(B) )
To create equal bins, first define min and max values that sit slightly outside the range of both data sets:
topval = max([max(A) max(B)])+0.05;
bottomval = min([min(A) min(B)])-0.05;
The addition/subtraction of 0.05 is based on knowledge of the range of values - you don't want your extra bin to sit too far from, or too close to, the actual range. That said, because the joint min/max values are used, this code will work irrespective of the A and B values generated.
Now we re-create histogram counts and bins using (note the extra 2 bins are for our new largest and smallest value)
[counts_Ae, bins_Ae] = hist([bottomval, A, topval], 502);
[counts_Be, bins_Be] = hist([bottomval, B, topval], 502);
Finally, you truncate the first and last bin and value entries to recreate your original sample exactly
bins_A = bins_Ae(2:501);
bins_B = bins_Be(2:501);
counts_A = counts_Ae(2:501);
counts_B = counts_Be(2:501);
Now
bins_A(1) % 3.7655
bins_A(500) % 13.0735
bins_B(1) % 3.7655
bins_B(500) % 13.0735
From this you can easily plot both histograms again
bar([bins_A;bins_B]', [counts_A;counts_B]')
And also plot the histogram overlap with ease
bar(bins_A,(counts_A+counts_B)-(abs(counts_A-counts_B)))
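Incidentally, (counts_A+counts_B) - abs(counts_A-counts_B) is just twice the bin-wise minimum, so an equivalent (and perhaps clearer) way to plot the overlap is:
bar(bins_A, 2*min(counts_A, counts_B))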

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data, each with 1000 attributes; 3 of them should be alike or similar in some way. Applying clustering will reveal the clusters. Since I know the number of clusters, I only need to find the similar rows. The Hamming distance tells us the similarity between rows, and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the asked-for number of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data, while my question concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything that is missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance instead of the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling the rows with the cluster group they belong to (here A is your 6-by-1000 logical matrix):
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
The Hamming distance between binary strings a and b is the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you can create a 6-by-6 matrix filled with the Hamming distances. The matrix is symmetric (the distance from a to b is the same as the distance from b to a) and its diagonal is 0 (the distance from a string to itself is zero).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i = 1:6
    for j = 1:6
        hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
    end
end
(And yes, this code is redundant given the symmetry and the zero diagonal, but the computation is minimal and optimizing it is not worth the effort.)
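For what it's worth, a vectorized equivalent of that loop (assuming the Statistics Toolbox is available) is pdist with the 'hamming' metric; it returns the fraction of differing bits, so scale by the string length:
hamming_dist = squareform(pdist(double(x), 'hamming')) * size(x, 2);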
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distance are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).
Presentation
I recommend Kohonen network or similar technique to present your data in, say, 2 dimensions. In general this area is called Dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 of the 1000 dimensions :P
scikit-learn is a good Python package to get you started; similar tools exist for MATLAB, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set though. 6 data points is a really small number. Moreover, your attributes look boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) the Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes actually form one 1000-bit-long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points there are only 2^6 = 64 possible attribute columns, which means at least 936 of your 1000 attributes are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test the significance of your clusters, run K-means several times with different initial conditions and check whether you get the same clusters. If you get different clusters every time, or even just from time to time, you cannot really trust your result.
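A rough sketch of that check (the loop below is only illustrative; it re-runs kmeans from fresh random starts and compares which rows end up grouped together, ignoring the label IDs):
nRuns = 20;
labels = zeros(size(x,1), nRuns);
for k = 1:nRuns
    labels(:,k) = kmeans(x, 3, 'Distance', 'hamming');   % new random initialisation each run
end
ref = bsxfun(@eq, labels(:,1), labels(:,1)');            % co-membership in the first run
stable = true;
for k = 2:nRuns
    stable = stable && isequal(bsxfun(@eq, labels(:,k), labels(:,k)'), ref);
end
stable   % true only if every run grouped the same rows together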
I used a barcode type visualization for my data. The code which was posted here earlier by Oleg was too heavy for my solution (image files were over 500 kb) so I used image() to make the figures
function barcode(A)
B = (A+1)*2;
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out';
end

Matlab fast neighborhood operation

I have a problem. I have a matrix with integer values between 0 and 5, for example:
x = randi(5,10,10)
Now I want to apply a 3x3 filter that returns the most common value in each neighbourhood.
I have tried 2 solutions:
fun = #(z) mode(z(:));
y1 = nlfilter(x,[3 3],fun);
which takes very long...
and
y2 = colfilt(x,[3 3],'sliding',#mode);
which also takes long.
I have some really big matrices and both solutions take a long time.
Is there any faster way?
+1 to @Floris for the excellent suggestion to use hist. It's very fast. You can do a bit better though. hist is based on histc, which can be used instead. histc is a compiled function, i.e., not written in MATLAB, which is why the solution is much faster.
Here's a small function that attempts to generalize what @Floris did (that solution also returns a vector rather than the desired matrix) and achieve what you're doing with nlfilter and colfilt. It doesn't require the input to have particular dimensions and uses im2col to efficiently rearrange the data. In fact, the first three lines and the call to im2col are virtually identical to what colfilt does in your case.
function a = intmodefilt(a,nhood)
    [ma,na] = size(a);
    % pad with zeros so every pixel has a full nhood-sized neighbourhood
    aa(ma+nhood(1)-1,na+nhood(2)-1) = 0;
    aa(floor((nhood(1)-1)/2)+(1:ma),floor((nhood(2)-1)/2)+(1:na)) = a;
    % histogram each sliding neighbourhood and keep the index of the fullest bin
    [~,a(:)] = max(histc(im2col(aa,nhood,'sliding'),min(a(:))-1:max(a(:))));
    a = a-1;   % shift bin indices back to values
end
Usage:
x = randi(5,10,10);
y3 = intmodefilt(x,[3 3]);
For large arrays, this is over 75 times faster than colfilt on my machine. Replacing hist with histc is responsible for a factor of two speedup. There is of course no input checking so the function assumes that a is all integers, etc.
Lastly, note that randi(IMAX,N,N) returns values in the range 1:IMAX, not 0:IMAX as you seem to state.
One suggestion would be to reshape your array so each 3x3 block becomes a column vector. If your initial array dimensions are divisible by 3, this is simple. If they aren't, you need to work a little harder. And you need to repeat this nine times, starting at different offsets into the matrix - I will leave that as an exercise.
Here is some code that shows the basic idea (using only functions available in FreeMat - I don't have Matlab on my machine at home...):
N = 100;
A = randi(0,5*ones(3*N,3*N));
B = reshape(permute(reshape(A,[3 N 3 N]),[1 3 2 4]), [ 9 N*N]);
hh = hist(B, 0:5); % histogram of each 3x3 block: bin with largest value is the mode
[mm mi] = max(hh); % mi will contain bin with largest value
figure; hist(B(:),0:5); title 'histogram of B'; % flat, as expected
figure; hist(mi-1, 0:5); title 'histogram of mi' % not flat?...
Here are the plots:
The strange thing, when you run this code, is that the distribution of mi is not flat, but skewed towards smaller values. When you inspect the histograms, you will see that is because you will frequently have more than one bin with the "max" value in it. In that case, you get the first bin with the max number. This is obviously going to skew your results badly; something to think about. A much better filter might be a median filter - the one that has equal numbers of neighboring pixels above and below. That has a unique solution (while mode can have up to four values, for nine pixels - namely, four bins with two values each).
Something to think about.
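If a median filter would indeed be acceptable instead of a mode filter, note that the Image Processing Toolbox already ships a fast one; a one-line sketch, assuming the toolbox is installed:
y4 = medfilt2(x, [3 3]);   % 3x3 sliding median, zero-padded at the borders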
Can't show you a mex example today (wrong computer); but there are ample good examples on the Mathworks website (and all over the web) that are quite easy to follow. See for example http://www.shawnlankton.com/2008/03/getting-started-with-mex-a-short-tutorial/

Pruning data for better viewing on loglog graph - Matlab

Just wondering if anyone has any ideas about an issue I'm having.
I have a fair amount of data that needs to be displayed on one graph. Two theoretical lines, bold and solid, are displayed on top, then 10 experimental data sets that converge to these lines are graphed, each using a different marker (e.g. +, o, a square, etc.). These graphs are on a log scale that goes up to 1e6. The first few decades of the graph (< 1e3) look fine, but as all the data sets converge (> 1e3) it's really difficult to see which data is which.
There are over 1000 data points per decade, which I can prune linearly to an extent, but if I do this too much the lower end of the graph will suffer in resolution.
What I'd like to do is prune logarithmically, strongest at the high end, working back to 0. My question is: how can I get a logarithmically scaled index vector rather than a linear one?
My initial assumption was that, as my data is linear, I could just use a linear index to prune, which led to something like this (but for all decades):
% grab indices per decade
ind12 = find(y >= 1e1 & y <= 1e2);
indlow = find(y < 1e2);
indhigh = find(y > 1e4);
ind23 = find(y > 1e2 & y <= 1e3);
ind34 = find(y > 1e3 & y <= 1e4);
% we want ind12 indices in this decade, find spacing
tot23 = round(length(ind23)/length(ind12));
tot34 = round(length(ind34)/length(ind12));
% grab ones to keep
ind23keep = ind23(1):tot23:ind23(end);
ind34keep = ind34(1):tot34:ind34(end);
indnew = [indlow' ind23keep ind34keep indhigh'];
loglog(x(indnew), y(indnew));
But this causes the prune to behave in a jumpy fashion obviously. Each decade has the number of points that I'd like, but as it's a linear distribution, the points tend to be clumped at the high end of the decade on the log scale.
Any ideas on how I can do this?
I think the easiest way to do this would be to use the LOGSPACE function to generate a set of indices into your data. For example, to create a set of 100 points logarithmically spaced from 1 to N (the number of points in your data), you can try the following:
indnew = round(logspace(0,log10(N),100)); %# Create the log-spaced index
indnew = unique(indnew); %# Remove duplicate indices
loglog(x(indnew),y(indnew)); %# Plot the indexed data
Creating a logarithmically-spaced index like this will result in fewer values being chosen from the end of the vector relative to the start, thus pruning values more severely towards the end of the vector and improving the appearance of the log plot. It would therefore be most effective with vectors that are sorted in ascending order.
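A minimal end-to-end sketch of that idea (the data below is made up purely for illustration; N is just the number of samples):
x = 1:1e5;                           %# linearly sampled abscissa (made-up data)
y = 1./x + 1e-3;                     %# something that converges on a loglog plot
N = numel(y);
indnew = unique(round(logspace(0, log10(N), 100)));   %# ~100 log-spaced indices
loglog(x(indnew), y(indnew), 'o')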
The way I understand the problem is that your x-values are linearly spaced, so that if you plot them logarithmically, there are way more data points in the 'higher' decades and the markers lie extremely close to one another. For example, if x goes from 1 to 1000, there are 10 points in the first decade, 90 in the second, and 900 in the third. You want to have, say, 3 points per decade instead.
I see two ways to solve the problem. The easier one is to use differently colored lines instead of different markers. Thus, you don't sacrifice any data points, and you can still distinguish everything.
The second solution is to create an unevenly spaced index. Here's how you can do that.
%# create some data
x = 1:1000;
y = 2.^x;
%# plot the graph and see the dots 'coalesce' very quickly
figure,loglog(x,y,'.')
%# for the example, I use a step size of 0.7, which is roughly log(2)
xx = 0.7:0.7:log(x(end)); %# this is where I want the data to be plotted
%# find the indices where we want to plot by finding the closest log(x)-values
%# run unique to avoid multiples of the same index
indnew = unique(interp1(log(x),1:length(x),xx,'nearest'));
%# plot with fewer points
figure,loglog(x(indnew),y(indnew),'.')