I want to carry out hierarchical clustering in MATLAB and plot the clusters on a scatter plot. I have used the evalclusters function to first investigate what a 'good' number of clusters would be, using different criterion values, e.g. Silhouette and CalinskiHarabasz. Here is the code I used for the evaluation (x is my data with 200 observations and 10 variables):
E = evalclusters(x,'linkage','CalinskiHarabasz','KList',[1:10])
%store the optimal number of clusters
optk=E.OptimalK;
%save the outputs to a structure
clust_struc(1).Optimalk=optk;
clust_struc(1).method={'CalinskiHarabasz'}
I then used code similar to what I have found online:
gscatter(x(:,1),x(:,2),E.OptimalY,'rbgckmr','xod*s.p')
%OptimalY is a vector 200 long with the cluster numbers
and this is what I get:
My question may be silly, but I don't understand why I am only using the first two columns of data to produce the scatter plot. I realise that the clusters themselves are being incorporated through the use of OptimalY, but should I not be using all of the data in x?
Each row in x is an observation with properties in size(x,2) dimensions. All of these dimensions are used when clustering the rows of x.
However, when plotting the clusters we cannot show more than 2-3 dimensions, so we try to represent each observation by its key properties. I'm not sure that x(:,1) and x(:,2) are the best choice, but you have to pick two columns for a 2-D plot.
Usually you would have some property of interest that you want to plot. Have a look at the example in the MATLAB doc: the fisheriris data has 4 different variables - the length and width measurements from the sepals and petals of three species of iris flowers. It is up to you to decide which you want to plot against each other (in the example they chose Petal Length and Petal Width).
Here is a comparison between taking the petal measurements and the sepal measurements as the axes for plotting the grouping:
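For illustration, a minimal sketch of how such a side-by-side comparison could be produced (assuming the fisheriris sample data that ships with the Statistics Toolbox):
load fisheriris                         % provides meas (150x4) and species
subplot(1,2,1)
gscatter(meas(:,1),meas(:,2),species)   % grouping with sepal measurements as axes
xlabel('Sepal length'); ylabel('Sepal width');
subplot(1,2,2)
gscatter(meas(:,3),meas(:,4),species)   % same grouping with petal measurements as axes
xlabel('Petal length'); ylabel('Petal width');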
I want to use a GMM (Gaussian mixture model) for clustering a binary image, and I also want to plot the cluster centroids on the binary image itself.
I am using this as my reference:
http://in.mathworks.com/help/stats/gaussian-mixture-models.html
This is my initial code
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
P=normpdf(K, mu, sigma);
Z = norminv(P,mu,sigma);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
scatter(X(:,1),X(:,2),10,'ko');
options = statset('Display','final');
gm = fitgmdist(X,2,'Options',options);
idx = cluster(gm,X);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(X(cluster1,1),X(cluster1,2),10,'r+');
hold on
scatter(X(cluster2,1),X(cluster2,2),10,'bo');
hold off
legend('Cluster 1','Cluster 2','Location','NW')
P = posterior(gm,X);
scatter(X(cluster1,1),X(cluster1,2),10,P(cluster1,1),'+')
hold on
scatter(X(cluster2,1),X(cluster2,2),10,P(cluster2,1),'o')
hold off
legend('Cluster 1','Cluster 2','Location','NW')
clrmap = jet(80); colormap(clrmap(9:72,:))
ylabel(colorbar,'Component 1 Posterior Probability')
But the problem is that I am unable to plot the cluster centroids obtained from the GMM on the original binary image. How do I do this?
**Now suppose I have 10 such images in a sequence, and I want to store their mean positions in two cell arrays; how do I do that? This is my code for my new question:**
images=load('gait2go.mat');%load the matrix file
for i=1:10
I{i}=images.result{i};
I{i}=im2double(I{i});
%determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
Idims=size(I{i});
whites=true(Idims(1),Idims(2));
df=I{i};
%we add up the various color channels
for colori=1:size(df,3)
whites=whites & df(:,:,colori)>0.5;
end
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(whites);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
%get clusterwise means
meanx=zeros(K,1);
meany=zeros(K,1);
for i=1:K
meanx(i)=mean(datax(cInd==i));
meany(i)=mean(datay(cInd==i));
end
xc{i}=meanx(i); %cell array containing the positions of the means for the 10 images
xb{i}=meany(i);
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(meany,-meanx,20,'+'); %same funky coordinates
end
I am able to get the 10 images segmented, but the values of the means are not stored in the cell arrays xc and xb. They only contain [] in place of the mean values.
I decided to post an answer to your question (where your question was determined by a maximum-likelihood guess:P), but I wrote an extensive introduction. Please read carefully, as I think you have difficulties understanding the methods you want to use, and you have difficulties understanding why others can't help you with your usual approach of asking questions. There are several problems with your question, both code-related and conceptual. Let's start with the latter.
The problem with the problem
You say that you want to cluster your image with Gaussian mixture modelling. While I'm generally not familiar with clustering, after a look through your reference and the wonderful SO answer you cited elsewhere (and a quick 101 from #rayryeng) I think you are on the wrong track altogether.
Gaussian mixture modelling, as its name suggests, models your data set with a mixture of Gaussian (i.e. normal) distributions. The reason for the popularity of this method is that when you do measurements of all sorts of quantities, in many cases you will find that your data is mostly distributed like a normal distribution (which is actually the reason why it's called normal). The reason behind this is the central limit theorem, which implies that the sum of reasonably independent random variables tends to be normal in many cases.
Now, clustering, on the other hand, simply means separating your data set into disjoint smaller bunches based on some criteria. The main criterion is usually (some kind of) distance, so you want to find "close lumps of data" in your larger data set. You usually need to cluster your data before performing a GMM, because it's already hard enough to find the Gaussians underlying your data without having to guess the clusters too. I'm not familiar enough with the procedures involved to tell how well GMM algorithms can work if you just let them work on your raw data (but I expect that many implementations start with a clustering step anyway).
To get closer to your question: I guess you want to do some kind of image recognition. Looking at the picture, you want to get more strongly correlated lumps. This is clustering. If you look at a picture of a zoo, you'll see, say, an elephant and a snake. Both have their distinct shapes, and they are well separated from one another. If you cluster your image (and the snake is not riding the elephant, neither did it eat it), you'll find two lumps: one lump elephant-shaped, and one lump snake-shaped. Now, it wouldn't make sense to use GMM on these data sets: elephants, and especially snakes, are not shaped like multivariate Gaussian distributions. But you don't need this in the first place, if you just want to know where the distinct animals are located in your picture.
Still staying with the example, you should make sure that you cluster your data into an appropriate number of subsets. If you try to cluster your zoo picture into 3 clusters, you might get a second, spurious snake: the nose of the elephant. With an increasing number of clusters your partitioning might make less and less sense.
Your approach
Your code doesn't give you anything reasonable, and there's a very good reason for that: it doesn't make sense from the start. Look at the beginning:
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
You read your binary image, convert it to double, then stretch it out into a vector and compute the mean and deviation of that vector. You basically smear your entire image into 2 values: an average intensity and a deviation. And THEN you generate 111*10 standard normal points with these parameters, and try to do GMM on the first two sets of 111, which are both independently normal with the same parameters. So you probably get two overlapping Gaussians around the same mean with the same deviation.
I think the examples you found online confused you. When you do GMM, you already have your data, so no pseudo-normal numbers should be involved. But when people post examples, they also try to provide reproducible inputs (well, some of them do, nudge nudge wink wink). A simple method for this is to generate a union of simple Gaussians, which can then be fed into GMM.
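Such reproducible inputs are typically built along these lines (a sketch with arbitrary means and covariances, not your image data):
rng(1);                                    % for reproducibility
X1 = mvnrnd([1 2], [1 0; 0 1], 500);       % 500 points from one 2-D Gaussian
X2 = mvnrnd([6 5], [2 .5; .5 1], 500);     % 500 points from another one
X  = [X1; X2];                             % the union is the "data set"
gm = fitgmdist(X, 2);                      % GMM then recovers the two components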
So, my point is, that you don't have to generate random numbers, but have to use the image data itself as input to your procedure. And you probably just want to cluster your image, instead of actually using GMM to draw potatoes over your cluster, since you want to cluster body parts in an image about a human. Most body parts are not shaped like multivariate Gaussians (with a few distinct exceptions for men and women).
What I think you should do
If you really want to cluster your image, like in the figure you added to your question, then you should use a method like k-means. But then again, you already have a program that does that, don't you? So I don't really think I can answer the question saying "How can I cluster my image with GMM?". Instead, here's an answer to "How can I cluster my image?" with k-means, but at least there will be a piece of code here.
%set infile to what your image file will be
infile='sil10001.pbm';
%read file
I=im2double(imread(infile));
%determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
Idims=size(I);
whites=true(Idims(1),Idims(2));
%we add up the various color channels
for colori=1:size(I,3)
whites=whites & I(:,:,colori)>0.5;
end
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(whites);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
%get clusterwise means
meanx=zeros(K,1);
meany=zeros(K,1);
for i=1:K
meanx(i)=mean(datax(cInd==i));
meany(i)=mean(datay(cInd==i));
end
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(meany,-meanx,20,'ko'); %same funky coordinates
Here's what this does. It first reads your image and converts it to double, like your code did. Then it tries to determine "white" pixels by checking that each color channel (of which there can be either 1, 3 or 4) is brighter than 0.5. Then your input data points to the clustering will be the x and y "coordinates" (i.e. indices) of your white pixels.
Next it does the clustering via kmeans. This part of the code is loosely based on the already cited answer of Amro. I had to set a large maximal number of iterations, as the problem is ill-posed in the sense that there aren't 10 clear clusters in the picture. Then we compute the mean for each cluster, and plot the clusters with gscatter, and the means with scatter. Note that in order to have the picture facing in the right directions in a scatter plot you have to shift around the input coordinates. Alternatively you could define datax and datay correspondingly at the beginning.
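If you go for that alternative, a sketch could look like this (swapping the roles of the find outputs so nothing needs to be flipped at plot time):
[rowidx, colidx] = find(whites);   % row/column indices of the white pixels
datax = colidx;                    % image columns become the x axis
datay = -rowidx;                   % negated rows so the picture is not upside down
% ...then cluster [datax datay] and call gscatter(datax, datay, cInd) directly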
And here's my output, run with the already processed figure you provided in your question:
I believe you must have made a naive mistake in the plot, and that's why you see just a straight line: you are plotting only the x values.
In my opinion, the second argument in the scatter command should be X(cluster1,2) or X(cluster2,2) depending on which scatter command is being used in the code.
The code can also be made simpler:
%read file
I=im2double(imread('sil10340.pbm'));
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(I);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
[cInd, c] = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(c(:,2),-c(:,1),20,'ko'); %same funky coordinates
I don't think there is any need for the looping, as c itself returns a 10x2 double array which contains the positions of the means.
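If you also want the per-image means stored in cell arrays (your follow-up question), a minimal sketch along the same lines (assuming gait2go.mat contains the result cell array of 10 images, and using a loop index that is not reused inside the loop):
images = load('gait2go.mat');          % assumed to contain images.result{1..10}
xc = cell(1,10);                       % mean x (row) positions, one K-by-1 vector per image
xb = cell(1,10);                       % mean y (column) positions, one K-by-1 vector per image
for imi = 1:10
    I = im2double(images.result{imi});
    [datax, datay] = find(I > 0.5);    % coordinates of 'white' pixels
    [cInd, c] = kmeans([datax datay], 10, 'EmptyAction','singleton', ...
        'maxiter',1000, 'start','cluster');
    xc{imi} = c(:,1);                  % kmeans already returns the 10 cluster means
    xb{imi} = c(:,2);
end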
So I was trying to spread the elements of one matrix, which were generated with poissrnd, into another one using some bigger (wider?) probability function (for example 100 different possibilities with different weights), to plot both of them and see if the fluctuations went down after the spread. After seeing that it doesn't work right (the fluctuations got bigger), I tried to identify what I did wrong on a really simple example. After testing it for a really long time I still can't understand what's wrong. The example goes like this:
1. I generate a vector with poissrnd and a vector for spreading (filled with zeros at the start).
2. Each element of the poiss vector tells me how many numbers (0.1 of the element's value) to generate from the possible options, which are [1,2,3] with corresponding weights [0.2,0.5,0.2].
3. I spread what I got into my other vector over 3 elements: the corresponding one (the k-th), the one before it and the one after it (so for example if k=3, most should go into the 3rd element of the other vector, and the rest should go to the 2nd and 4th elements).
4. Plot both the 0.1*poiss vector and the vector after spreading to compare whether the fluctuations went down.
The way I generate weighted numbers is from this thread: Weighted random numbers in MATLAB
and this is the code I'm using:
clear all
clc
eta=0.1;
N=200;
fot=10000000;
ix=linspace(-100,100,N);
mn =poissrnd(fot/N, 1, N);
dataw=zeros(1,N);
a=1:3;
w=[.25,.5,.25];
for k=1:N
    [~,R] = histc(rand(1,eta*mn(1,k)),cumsum([0;w(:)./sum(w)]));
    R = a(R);
    przydz=histc(R,a);
    if (k>1) && (k<N)
        dataw(1,k)=dataw(1,k)+przydz(1,2);
        dataw(1,k-1)=dataw(1,k-1)+przydz(1,1);
        dataw(1,k+1)=dataw(1,k+1)+przydz(1,3);
    elseif k==1
        dataw(1,k)=dataw(1,k)+przydz(1,2);
        dataw(1,N)=dataw(1,N)+przydz(1,1);
        dataw(1,k+1)=dataw(1,k+1)+przydz(1,3);
    else
        dataw(1,k)=dataw(1,k)+przydz(1,2);
        dataw(1,k-1)=dataw(1,k-1)+przydz(1,1);
        dataw(1,1)=dataw(1,1)+przydz(1,3);
    end
end
plot(ix,eta*mn,'g',ix,dataw,'r')
The fluctuations are still bigger, and I can't identify what's wrong... Is the method for generating weighted numbers wrong in this situation? Because it doesn't seem so. The way I'm accumulating data from the first vector seems fine too. Is there another way I could do it (so I could then optimize it for 'bigger' probability functions)?
Sorry for my terrible English.
[EDIT]:
Here is simple pic to show what I meant (I hope it's understandable)
How about trying the negative binomial distribution? It is often used as a hyper-dispersed analogue of the Poisson distribution. Additional links can be found in this paper, as well as some apparatus in its supplement.
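For example, a sketch of drawing hyper-dispersed counts with the same mean as your Poisson draw (assuming the Statistics Toolbox; r is a dispersion parameter you would tune, smaller r = more over-dispersion):
mu = fot/N;               % same target mean as in the poissrnd call
r  = 5;                   % dispersion parameter (assumed value, adjust as needed)
p  = r/(r + mu);          % with this p, nbinrnd(r,p) has mean mu
mn = nbinrnd(r, p, 1, N); % variance mu*(1 + mu/r) > mu, i.e. hyper-dispersed counts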
I have a huge sparse matrix (1,000 x 1,000,000) that I cannot load into MATLAB (not enough RAM).
I want to visualize this matrix to have an idea of its sparsity and of the differences of the values.
Because of the memory constraints, I want to proceed as follows:
1- Divide the matrix into 4 matrices
2- Load each matrix into MATLAB and visualize it so that the colors give an idea of the values (and of the zeros particularly)
3- "Stick" the 4 images I will get together in order to have a global idea of the original matrix
(i) Is it possible to load "part of a matrix" in MATLAB?
(ii) For the visualization tool, I read about spy (and daspect). However, this function only lets you visualize the locations of the non-zero values, regardless of their scale. Is there a way to add a color code?
(iii) How can I "stick" plots in order to make one?
If your matrix is sparse, then it seems that the current method of storing it (as a full matrix in a text file) is very inefficient, and certainly makes loading it into MATLAB very hard. However, I suspect that as long as it is sparse enough, it can still be loaded into MATLAB as a sparse matrix.
The traditional way of doing this would be to load it all in at once, then convert to sparse representation. In your case, however, it would make sense to read in the text file, one line at a time, and convert to a MATLAB sparse matrix on-the-fly.
You can find out if this is possible by estimating the sparsity of your matrix, and using this to see if the whole thing could be loaded into MATLAB's memory as a sparse matrix.
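A rough back-of-the-envelope estimate might look like this (a sketch; the ~16 bytes per non-zero is an approximation for sparse doubles on a 64-bit system, and the density value is an assumption you would replace with your own estimate):
num_rows = 1000;  num_cols = 1e6;        % dimensions from the question
density  = 0.01;                         % assumed fraction of non-zero entries
nnz_est  = num_rows*num_cols*density;    % expected number of non-zeros
bytes    = 16*nnz_est + 8*(num_cols+1);  % ~8 bytes per value + ~8 per row index, plus column pointers
fprintf('estimated sparse storage: %.0f MB\n', bytes/2^20);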
Try something like: (untested code!)
% matrix dimensions and file name (fill in your own values)
num_rows = 1000;
num_cols = 1000000;
filename = 'your_matrix.txt';   % hypothetical name - replace with the path to your file
% initialise sparse matrix
sparse_matrix = sparse(num_rows, num_cols);
row_num = 1;
fid = fopen(filename);
% read each line of the text file in turn
while ~feof(fid)
    % fgetl reads a single line; sscanf converts it to a numeric column vector
    this_line = sscanf(fgetl(fid), '%f');
    % add row to sparse matrix (note transpose, which I think is required)
    sparse_matrix(row_num, :) = this_line';
    row_num = row_num + 1;
end
fclose(fid);
% visualise using spy
spy(sparse_matrix)
Visualisation
With regards to visualisation: visualising a sparse matrix like this via a tool like imagesc is possible, but I believe it may internally create the full matrix – maybe someone can confirm if this is true or not. If it does, then it's going to cause you memory problems.
All spy is really doing is plotting in 2D the locations of the non-zero elements. You can fairly easily write your own spy function, which can have different coloured or sized points depending on the values at each location. See this answer for some examples.
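As a minimal sketch of such a home-made coloured spy (assuming the matrix is available in memory as sparse_matrix):
[r, c, v] = find(sparse_matrix);   % row, column and value of every non-zero entry
scatter(c, r, 4, v, 'filled');     % one point per non-zero, coloured by its value
set(gca,'YDir','reverse');         % match spy's top-to-bottom row ordering
axis tight
colorbar                           % colour scale for the values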
Saving sparse matrices
As I say above, the format your matrix is saved in is pretty inefficient – for a matrix with only 10% non-zero entries, around 95% of your text file will be a zero or a space. I don't know where this data has come from, but if you have any control over its creation (e.g. it comes from another program you have written) it would make much more sense to save only the non-zero elements in the format row_idx, col_idx, value.
You can then use spconvert to import the sparse matrix directly.
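That workflow might look like this (the file name is hypothetical; each line of the file is expected to hold row index, column index and value):
triplets = load('matrix_triplets.txt');   % hypothetical file with one "row col value" triplet per line
S = spconvert(triplets);                  % build the sparse matrix directly from the triplets
spy(S)                                    % quick look at the sparsity pattern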
One of the simplest methods (if you can actually store the full sparse matrix in RAM) is to use gnuplot to visualize the sparsity pattern.
I was able to spy matrices of size 10-20GB using gnuplot without problems, but make sure you use png or jpeg formats for the output image. Note that you don't need the values of the non-zero entries, only the integers (row, col), and plot them with: plot "row_col.dat" using 1:2 with points.
This uses the rows as the x axis and the columns as the y axis and starts plotting the non-zero entries. It is very easy to do, and it is the most scalable solution I know. Gnuplot works at decent speed even for very large datasets (>10GB of [row, col] pairs), but MATLAB just hangs (with due respect).
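If you do manage to build the sparse matrix in MATLAB (e.g. as in the answer above), a sketch of dumping the (row, col) pairs for gnuplot could be:
[r, c] = find(S);                 % only the positions of the non-zeros are needed
fid = fopen('row_col.dat', 'w');
fprintf(fid, '%d %d\n', [r c]');  % one "row col" pair per line
fclose(fid);
% then in gnuplot: plot "row_col.dat" using 1:2 with points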
I use imagesc() to visualise arrays. It scales the values in the array to the full range of the current colormap, then plots the array like a bitmap image (of course you can change the colormap to make it easier to see detail).
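For example (a small sketch; A stands for whichever block of the matrix you have loaded):
imagesc(A)        % map the values of A onto the current colormap
colormap(gray)    % or parula, jet, ... whatever shows the detail best
colorbar          % legend mapping colours back to values
axis image        % keep the matrix aspect ratio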