How to interpret k-means cluster results

How to interpret k-means cluster results - cluster-analysis

I have a normalized table (applied minmax scalar) on which k-means of 5 clusters were applied. The last column in the table shows the cluster number. How to infer this for the dimensions?

What doe this mean...
How to infer this for the dimensions?
I'm guessing you want to plot the results, right. Try it like this.
import plotly.express as px
fig = px.scatter(df, x="protein", y="fat", color="norm_cluster", size='sodium', hover_data=['norm_cluster'])
fig.show()
See the link below for more details.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb

Related

How do I plot values in an array vs the number of times those values appear in Matlab?

I have a set of ages (over 10000 of them) and I want to plot a graph with the age from 20 to 100 on the x axis and then the number of times each of those ages appears in the data on the y axis. I have tried several ways to do this and I can't figure it out. I also have some other data which requires me to plot values vs how many times they occur so any advice on how to do this would be much appreciated.
I'm quite new to Matlab so it would be great if you could explain how things in your answer work rather than just typing out some code.
Thanks.
EDIT:
So I typed histogram(Age, 80) because as I understand that will plot the values in Age on a histogram split up into 80 bars (1 for each age). Instead I get this:
The bars aren't aligned and it's clearly not 1 per age nor has it plotted the number of times each age occurs on the y axis.

You have to use histogram(), and that's correct.
Let's see with an example.
I extract 100 ages between 20 and 100:
ages=randsample([20:100],100,true);
Now I call histogram() in this manner:
h=histogram(ages,[20:100]);
where h is an histogram object and this will also show the following plot:
However, this might look easy due to the fact that my ages vector is in range 20:100, so it will not contain any other values. If your vector, as instead, contains also ages not in range 20:100, you can specify the additional option 'BinLimits' as third input in histogram() like this:
h=histogram(ages,length([20:100]),'BinLimits',[20:100]);
and this option plots a histogram using the values in ages that fall between 20 and 100 inclusive.
Note: by inspecting h you can actually see and/or edit some proprieties of your histogram. An attribute (field) of such object you might be interested to is Values. This is a vector of length 80 (in our case, since we work with 80 bins) in which the i-th element is the number of items is the i-th bin. This will help you count the occurrences (just in case you need them to go on with your analysis).

Like Luis said in comments, hist is the way to go. You should specify bin edges, rather than the number of bins:
ages = randi([20 100], [1 10000]);
hist(ages, [20:100])
Is this what you were looking for?

Figuring out where to add text in a MATLAB histogram

I have a MATLAB histogram graph produced with some data, with some 50 bins. I now need to insert a line of text into the graph, at any place where it wouldn't tangle with the histogram bars. The text is basically 'Period of data used: mmm dd to mmm dd' (I mention this to give an idea of the width required and where the text can be split if necessary).
One method I considered was finding out a series of contiguous histogram bins where the freq (y axis) remains less than 90% of the maximum of all frequencies; then, the text can be printed at the x position starting at the first of those bins near the top of the graph.
Is this a good way of going about it? If so, how do I compute this contiguous series of bins without looping around?
Or is there a better way of placing this text adaptively according to the data?
Edit: Due to other considerations, the number of histogram bins is not a fixed 50 any more, but rather xmax/20 where xmax is the maximum x-axis value. Algorithms that depend on working on aggregates of a number of bins might need to take this variability into account, when calculating that number.

I think the simplest way would be to use a multiline title, optionally along with TeX formatting to de-emphasise the additional info. To make a multiline title, pass a cell array of strings like this:
title({'\fontsize{16}Actual Title';'\fontsize{8}other info'})
Being consistent across the histograms, I think this would look tidier than having text on the graph itself that might move around.

Matlab - Dilation function alternative

I'm looking through various online sources trying to learn some new stuff with matlab.
I can across a dilation function, shown below:
function rtn = dilation(in)
h =size(in,1);
l =size(in,2);
rtn = zeros(h,l,3);
rtn(:,:,1)=[in(2:h,:); in(h,:)];
rtn(:,:,2)=in;
rtn(:,:,3)=[in(1,:); in(1:h-1,:)];
rtn_two = max(rtn,[],3);
rtn(:,:,1)=[rtn_two(:,2:l), rtn_two(:,l)];
rtn(:,:,2)=rtn_two;
rtn(:,:,3)=[rtn_two(:,1), rtn_two(:,1:l-1)];
rtn = max(rtn,[],3);
The parameter it takes is: max(img,[],3) %where img is an image
I was wondering if anyone could shed some light on what this function appears to do and if there's a better (or less confusing way) to do it? Apart from a small wiki entry, I can't seem to find any documentation, hence asking for your help.
Could this be achieved with the imdilate function maybe?

What this is doing is creating two copies of the image shifted by one pixel up/down (with the last/first row duplicated to preserve size), then taking the max value of the 3 images at each point to create a vertically dilated image. Since the shifted copies and the original are layered in a 3-d matrix, max(img,[],3) 'flattens' the 3 layers along the 3rd dimension. It then repeats this column-wise for the horizontal part of the dilation.
For a trivial image:
00100
20000
00030
Step 1:
(:,:,1) (:,:,2) (:,:,3) max
20000 00100 00100 20100
00030 20000 00100 20130
00030 00030 20000 20030
Step 2:
(:,:,1) (:,:,2) (:,:,3) max
01000 20100 22010 22110
01300 20130 22013 22333
00300 20030 22003 22333
You're absolutely correct this would be simpler with the Image Processing Toolbox:
rtn = imdilate(in, ones(3));
With the original code, dilating by more than one pixel would require multiple iterations, and because it operates one dimension at a time it's limited to square (or possibly rectangular, with a bit of modification) structuring elements.

Your function replaces each element with the maximum value among the corresponding 3*3 kernel. By creating a 3D matrix, the function align each element with two of its shift, thus equivalently achieves the 3*3 kernel. Such alignment was done twice to find the maximum value along each column and row respectively.
You can generate a simple matrix to compare the result with imdilate:
a=magic(8)
rtn = dilation(a)
b=imdilate(a,ones(3))
Besides imdilate, you can also use
c=ordfilt2(a,9,ones(3))
to get the same result ( implements a 3-by-3 maximum filter. )
EDIT
You may have a try on 3D image with imdilate as well:
a(:,:,1)=magic(8);
a(:,:,2)=magic(8);
a(:,:,3)=magic(8);
mask = true(3,3,3);
mask(2,2,2) = false;
d = imdilate(a,mask);

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters
I only need to find similar rows. Hamming distance tell us the similarity between rows and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always finds asked number
of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data while my questions concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear ask, for anything you are missing.

For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(a, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the hamming distance rather than the euclidian distance (which linkage does by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.

You can start by visualizing your data with a 'barcode' plot and then labeling rows with the cluster group they belong:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(#minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(#minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))

Definition
The Hamming distance for binary strings a and b the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, so you could create a 6 by 6 matrix filled with the Hamming distance. The matrix would be symetric (distance from a to b is the same as distance from b to a) and the diagonal is 0 (distance for a to itself is nul).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i=1:6,
for j=1:6,
hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
end
end
(And yes this code is a redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing not worth the effort).
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distance are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).

Presentation
I recommend Kohonen network or similar technique to present your data in, say, 2 dimensions. In general this area is called Dimensionality reduction.
I you can also go simpler way, e.g. Principal Component Analysis, but there's no quarantee you can effectively remove 9998 dimensions :P
scikit-learn is a good Python package to get you started, similar exist in matlab, java, ect. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set though. 6 data points is really a small number. moreover your attributes seem boolean at first glance, if that's the case, manhattan distance if what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if attributes are actually a 1000-bit long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points, you have only 2 ** 6 combinations, that means 936 out of 1000 attributes you have are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test significance of your clusters, run K-means several times with different initial conditions and check if you get same clusters. If you get different clusters every time or even from time to time, you cannot really trust your result.

I used a barcode type visualization for my data. The code which was posted here earlier by Oleg was too heavy for my solution (image files were over 500 kb) so I used image() to make the figures
function barcode(A)
B = (A+1)*2;
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out'
end

Preserving matrix columns using Matlab brush/select data tool

I'm working with matrices in Matlab which have five columns and several million rows. I'm interested in picking particular groups of this data. Currently I'm doing this using plot3() and the brush/select data tool.
I plot the first three columns of the matrix as X,Y, Z and highlight the matrix region I'm interested in. I then use the brush/select tool's "Create variable" tool to export that region as a new matrix.
The problem is that when I do that, the remaining two columns of the original, bigger matrix are dropped. I understand why- they weren't plotted and hence the figure tool doesn't know about them. I need all five columns of that subregion though in order to continue the processing pipeline.
I'm adding the appropriate 4th and 5th column values to the exported matrix using a horrible nested if loop approach- if columns 1, 2 and 3 match in both the original and exported matrix, attach columns 4/5 of the original matrix to the exported one. It's bad design and agonizingly slow. I know there has to be a Matlab function/trick for this- can anyone help?
Thanks!
This might help:
1. I start with matrix 1 with columns X,Y,Z,A,B
2. Using the brush/select tool, I create a new (subregion) matrix 2 with columns X,Y,Z
3. I then loop through all members of matrix 2 against all members of matrix 1. If X,Y,Z match for a pair of rows, I append A and B
from that row in matrix 1 to the appropriate row in matrix 2.
4. I become very sad as this takes forever and shows my ignorance of Matlab.

If I understand your situation correctly here is a simple way to do it:
Assuming you have a matrix like so: M = [A B C D E] where each letter is a Nx1 vector.
You select a range, this part is not really clear to me, but suppose you can create the following:
idxA,idxB and idxC, that are 1 if they are in the region and 0 otherwise.
Then you can simply use:
M(idxA&idxB&idxC,:)
and you will get the additional two columns as well.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse