Gaussian Mixture Model Plot for Soft Assignment by Color - cluster-analysis

enter image description here)
I want to show soft assignment by probabilities from GMM looking similar to the image posted. I want to show a data points membership between two (or more clusters).
Any help would be appreciated.
I tried to multiply probabilities from maximum of predict_proba(X) but it seems not working as it is not applied to labels=0.
ax.scatter(X[:, 0], X[:, 1], c=labels*probability)

Related

Matlab plot scatterplot with intermediate coordinates

This is an exam task that I have. Lets say I have a 200x6 matrix where 200 people voted a movie with respect to 6 questions, each on a continous [0, 1]-scale (0: disagree, 1: agree).
To get a useful overview of the 6-dimensional dataset I want to plot the rank-2 approximation of the data. First I do the rank 2 approximation:
A = (200, 6); %some data
[U, S, V] = svd(A);
Ak = U(:, 1:2) * S(1:2, 1:2) * V(:, 1:2)';
I want to plot this approximation as a 2D scatterplot with a "*"-mark per survey participant using either U or V coordinates as intermediate coordinates depending on how my data is organized.. The problem is that I don't know what intermediate coordinates mean, and I can't find a good explanation anywhere. Wonder if someone could help, eventually providing a small code example. Any help appreciated, thank you.
Formally, intermediate axes are (ortogonal) linear combinations of your data (along maximun explained variance, a.k.a. principal components).
If most of data have similar shape (e.g. [5 4 3 2 1 0] pattern), then the first component will be similar to this shape/vector, since the variance arount it is minimal (or: variance along it is maximal). Next components as well minimize the rest of variance in ortogonal planes.
So, the answer is: principal components 1 and 2.
And more precicely: first intermediate coordinate value can be understood as magnitude of that "first main pattern" in a single data sample.

How to compute distance and estimate quality of heterogeneous grids in Matlab?

I want to evaluate the grid quality where all coordinates differ in the real case.
Signal is of a ECG signal where average life-time is 75 years.
My task is to evaluate its age at the moment of measurement, which is an inverse problem.
I think 2D approximation of the 3D case is hard (done here by Abo-Zahhad) with with 3-leads (2 on chest and one at left leg - MIT-BIT arrhythmia database):
where f is a piecewise continuous function in R^2, \epsilon is the error matrix and A is a 2D matrix.
Now, I evaluate the average grid distance in x-axis (time) and average grid distance in y-axis (energy).
I think this can be done by Matlab's Image Analysis toolbox.
However, I am not sure how complete the toolbox's approaches are.
I think a transform approach must be used in the setting of uneven and noncontinuous grids. One approach is exact linear time euclidean distance transforms of grid line sampled shapes by Joakim Lindblad et all.
The method presents a distance transform (DT) which assigns to each image point its smallest distance to a selected subset of image points.
This kind of approach is often a basis of algorithms for many methods in image analysis.
I tested unsuccessfully the case with bwdist (Distance transform of binary image) with chessboard (returns empty square matrix), cityblock, euclidean and quasi-euclidean where the last three options return full matrix.
Another pseudocode
% https://stackoverflow.com/a/29956008/54964
%// retrieve picture
imgRGB = imread('dummy.png');
%// detect lines
imgHSV = rgb2hsv(imgRGB);
BW = (imgHSV(:,:,3) < 1);
BW = imclose(imclose(BW, strel('line',40,0)), strel('line',10,90));
%// clear those masked pixels by setting them to background white color
imgRGB2 = imgRGB;
imgRGB2(repmat(BW,[1 1 3])) = 255;
%// show extracted signal
imshow(imgRGB2)
where I think the approach will not work here because the grids are not necessarily continuous and not necessary ideal.
pdist based on the Lumbreras' answer
In the real examples, all coordinates differ such that pdist hamming and jaccard are always 1 with real data.
The options euclidean, cytoblock, minkowski, chebychev, mahalanobis, cosine, correlation, and spearman offer some descriptions of the data.
However, these options make me now little sense in such full matrices.
I want to estimate how long the signal can live.
Sources
J. Müller, and S. Siltanen. Linear and nonlinear inverse problems with practical applications.
EIT with the D-bar method: discontinuous heart-and-lungs phantom. http://wiki.helsinki.fi/display/mathstatHenkilokunta/EIT+with+the+D-bar+method%3A+discontinuous+heart-and-lungs+phantom Visited 29-Feb 2016.
There is a function in Matlab defined as pdist which computes the pairwisedistance between all row elements in a matrix and enables you to choose the type of distance you want to use (Euclidean, cityblock, correlation). Are you after something like this? Not sure I understood your question!
cheers!
Simply, do not do it in the post-processing. Those artifacts of the body can be about about raster images, about the viewer and/or ... Do quality assurance in the signal generation/processing step.
It is much easier to evaluate the original signal than its views.

Clustering an image using Gaussian mixture models

I want to use GMM(Gaussian mixture models for clustering a binary image and also want to plot the cluster centroids on the binary image itself.
I am using this as my reference:
http://in.mathworks.com/help/stats/gaussian-mixture-models.html
This is my initial code
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
P=normpdf(K, mu, sigma);
Z = norminv(P,mu,sigma);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
scatter(X(:,1),X(:,2),10,'ko');
options = statset('Display','final');
gm = fitgmdist(X,2,'Options',options);
idx = cluster(gm,X);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(X(cluster1,1),X(cluster1,2),10,'r+');
hold on
scatter(X(cluster2,1),X(cluster2,2),10,'bo');
hold off
legend('Cluster 1','Cluster 2','Location','NW')
P = posterior(gm,X);
scatter(X(cluster1,1),X(cluster1,2),10,P(cluster1,1),'+')
hold on
scatter(X(cluster2,1),X(cluster2,2),10,P(cluster2,1),'o')
hold off
legend('Cluster 1','Cluster 2','Location','NW')
clrmap = jet(80); colormap(clrmap(9:72,:))
ylabel(colorbar,'Component 1 Posterior Probability')
But the problem is that I am unable to plot the cluster centroids received from GMM in the primary binary image.How do i do this?
**Now suppose i have 10 such images in a sequence And i want to store the information of their mean position in two cell array then how do i do that.This is my code foe my new question **
images=load('gait2go.mat');%load the matrix file
for i=1:10
I{i}=images.result{i};
I{i}=im2double(I{i});
%determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
Idims=size(I{i});
whites=true(Idims(1),Idims(2));
df=I{i};
%we add up the various color channels
for colori=1:size(df,3)
whites=whites & df(:,:,colori)>0.5;
end
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(whites);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
%get clusterwise means
meanx=zeros(K,1);
meany=zeros(K,1);
for i=1:K
meanx(i)=mean(datax(cInd==i));
meany(i)=mean(datay(cInd==i));
end
xc{i}=meanx(i);%cell array contaning the position of the mean for the 10
images
xb{i}=meany(i);
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to
image
axis equal;
hold on;
scatter(meany,-meanx,20,'+'); %same funky coordinates
end
I am able to get 10 images segmented but no the values of themean stored in the cell arrays xc and xb.They r only storing [] in place of the values of means
I decided to post an answer to your question (where your question was determined by a maximum-likelihood guess:P), but I wrote an extensive introduction. Please read carefully, as I think you have difficulties understanding the methods you want to use, and you have difficulties understanding why others can't help you with your usual approach of asking questions. There are several problems with your question, both code-related and conceptual. Let's start with the latter.
The problem with the problem
You say that you want to cluster your image with Gaussian mixture modelling. While I'm generally not familiar with clustering, after a look through your reference and the wonderful SO answer you cited elsewhere (and a quick 101 from #rayryeng) I think you are on the wrong track altogether.
Gaussian mixture modelling, as its name suggests, models your data set with a mixture of Gaussian (i.e. normal) distributions. The reason for the popularity of this method is that when you do measurements of all sorts of quantities, in many cases you will find that your data is mostly distributed like a normal distribution (which is actually the reason why it's called normal). The reason behind this is the central limit theorem, which implies that the sum of reasonably independent random variables tends to be normal in many cases.
Now, clustering, on the other hand, simply means separating your data set into disjoint smaller bunches based on some criteria. The main criterion is usually (some kind of) distance, so you want to find "close lumps of data" in your larger data set. You usually need to cluster your data before performing a GMM, because it's already hard enough to find the Gaussians underlying your data without having to guess the clusters too. I'm not familiar enough with the procedures involved to tell how well GMM algorithms can work if you just let them work on your raw data (but I expect that many implementations start with a clustering step anyway).
To get closer to your question: I guess you want to do some kind of image recognition. Looking at the picture, you want to get more strongly correlated lumps. This is clustering. If you look at a picture of a zoo, you'll see, say, an elephant and a snake. Both have their distinct shapes, and they are well separated from one another. If you cluster your image (and the snake is not riding the elephant, neither did it eat it), you'll find two lumps: one lump elephant-shaped, and one lump snake-shaped. Now, it wouldn't make sense to use GMM on these data sets: elephants, and especially snakes, are not shaped like multivariate Gaussian distributions. But you don't need this in the first place, if you just want to know where the distinct animals are located in your picture.
Still staying with the example, you should make sure that you cluster your data into an appropriate number of subsets. If you try to cluster your zoo picture into 3 clusters, you might get a second, spurious snake: the nose of the elephant. With an increasing number of clusters your partitioning might make less and less sense.
Your approach
Your code doesn't give you anything reasonable, and there's a very good reason for that: it doesn't make sense from the start. Look at the beginning:
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
You read your binary image, convert it to double, then stretch it out into a vector and compute the mean and deviation of that vector. You basically smear your intire image into 2 values: an average intensity and a deviation. And THEN you generate 111*10 standard normal points with these parameters, and try to do GMM on the first two sets of 111. Which are both independently normal with the same parameter. So you probably get two overlapping Gaussians around the same mean with the same deviation.
I think the examples you found online confused you. When you do GMM, you already have your data, so no pseudo-normal numbers should be involved. But when people post examples, they also try to provide reproducible inputs (well, some of them do, nudge nudge wink wink). A simple method for this is to generate a union of simple Gaussians, which can then be fed into GMM.
So, my point is, that you don't have to generate random numbers, but have to use the image data itself as input to your procedure. And you probably just want to cluster your image, instead of actually using GMM to draw potatoes over your cluster, since you want to cluster body parts in an image about a human. Most body parts are not shaped like multivariate Gaussians (with a few distinct exceptions for men and women).
What I think you should do
If you really want to cluster your image, like in the figure you added to your question, then you should use a method like k-means. But then again, you already have a program that does that, don't you? So I don't really think I can answer the question saying "How can I cluster my image with GMM?". Instead, here's an answer to "How can I cluster my image?" with k-means, but at least there will be a piece of code here.
%set infile to what your image file will be
infile='sil10001.pbm';
%read file
I=im2double(imread(infile));
%determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
Idims=size(I);
whites=true(Idims(1),Idims(2));
%we add up the various color channels
for colori=1:Idims(3)
whites=whites & I(:,:,colori)>0.5;
end
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(whites);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
%get clusterwise means
meanx=zeros(K,1);
meany=zeros(K,1);
for i=1:K
meanx(i)=mean(datax(cInd==i));
meany(i)=mean(datay(cInd==i));
end
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(meany,-meanx,20,'ko'); %same funky coordinates
Here's what this does. It first reads your image as double like yours did. Then it tries to determine "white" pixels by checking that each color channel (of which can be either 1, 3 or 4) is brighter than 0.5. Then your input data points to the clustering will be the x and y "coordinates" (i.e. indices) of your white pixels.
Next it does the clustering via kmeans. This part of the code is loosely based on the already cited answer of Amro. I had to set a large maximal number of iterations, as the problem is ill-posed in the sense that there aren't 10 clear clusters in the picture. Then we compute the mean for each cluster, and plot the clusters with gscatter, and the means with scatter. Note that in order to have the picture facing in the right directions in a scatter plot you have to shift around the input coordinates. Alternatively you could define datax and datay correspondingly at the beginning.
And here's my output, run with the already processed figure you provided in your question:
I do believe you must had made a naive mistake in the plot and that's why you see just a straight line: You are plotting only the x values.
In my opinion, the second argument in the scatter command should be X(cluster1,2) or X(cluster2,2) depending on which scatter command is being used in the code.
The code can be made more simple:
%read file
I=im2double(imread('sil10340.pbm'));
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(I);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
[cInd, c] = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to
image
axis equal;
hold on;
scatter(c(:,2),-c(:,1),20,'ko'); %same funky coordinates
I don't think there is nay need for the looping as the c itself return a 10x2 double array which contains the position of the means

Finding defined peaks with Clusters in MATLAB

this is my problem:
I have the next data "A", which looks like:
As you can see, I have drawn with red circles the apparently peaks, the most defined are 2 and 7, I say that they are defined because its standard deviation is low in comparison with the other peaks (especially the second one).
What I need is a way (anyway) to get the values and the standard deviation of n peaks in a numeric array.
I have tried with "clusters", but I got no good results:
First of all, I used "kmeans" MATLAB function, and I realize that this algorithm doesn't group peaks as I need. As you can see in the picture above, in the red circle, that cluster has at less 3 or 4 peaks. And kmeans need that you set the number of clusters, and I need to identify it automatically.
I hope that anyone can give me some ideas, or a way to get better results, thanks.
Pd: I leave the data "A" in the next link.
https://drive.google.com/file/d/0B4WGV21GqSL5a2EyQ2l0SHZURzA/edit?usp=sharing
The problem is that your axes have very different meaning.
K-means optimizes variance. But variance in X is something entirely different than variance in Y, isn't it? Furthermore, each of these methods will split your data in both X and Y, whereas I assume you want the data to be partitioned on the X axis only.
I suggest the following: consider the Y axis to be a weight, and X axis to be a position.
Then perform weighted density estimation, and look for low density to separate your clusters.
I can't help you with MATLAB. I don't use it.
Mathematically, what you want to do is place a Gaussian at each point, with area Y and center X. Then find minima and maxima on the sum of these Gaussians. See Wikipedia, Kernel Density Estimation for details; except that you want to use the Y axis as weights. You could maybe also use 1/Y as standard deviation, if you don't want to use weights.

feature selection using kernel PCA (KPCA)

I have tried Principal component analysis (PCA) for feature selection which gave me 4 optimal features from set of nine features (Mean of Green, Variance of Green, Std. div. of Green , Mean of Red, Variance of Red, Std. div. of Red, Mean of Hue, Variance of Hue, Std. div. of Hue, i.e. [ MGcorr,VarGcorr, stdGcorr,MRcorr,VarRcorr,stdRcorr,MHcorr,VarHcorr,stdHcorr ]) for classification of data into two clusters. From Literature, it seems that PCA is not very good method but rather is better to apply kernel PCA (KPCA) for the feature selection. I want to apply KPCA for feature selection and I have tried following:
d=4; % number of features to be selected, or d: reduced dimension
[Y2 eigVector para ]=kPCA(feature,d); % feature is 300X9 matrix with 300 as number of
% observation and 9 features
% Y: dimensionanlity-reduced data
Above kPCA.m function can be downloaded from :
http://www.mathworks.com/matlabcentral/fileexchange/39715-kernel-pca-and-pre-image-reconstruction/content/kPCA_v1.0/code/kPCA.m
In above implementation I want to know how to find which 4 features from 9 features to select (i.e. which top features are optimal) for clustering.
Alternatively, I also tried following function for KPCA implementation:
options.KernelType = 'Gaussian';
options.t = 1;
options.ReducedDim = 4;
[eigvector, eigvalue] = KPCA(feature', options);
In above implementation also I have same problem in determining the 4 top /optimal features from set of 9 features.
Above KPCA.m function can be downloaded from :
http://www.cad.zju.edu.cn/home/dengcai/Data/code/KPCA.m
That will be great if someone will help me in implementing kernel PCA for my problem.
Thanks
PCA doesn't provide optimal features per se. What it provides is a new set of features that are uncorrelated. When you select the "best" 4 features, you are picking the ones that have the greatest variance (largest eigenvalues). So for "normal" PCA, you simply select the 4 eigenvectors corresponding to the 4 largest eigenvalues, then you project the original 9 features onto those eigenvectors via matrix multiplication.
From the link you provided for the kernel PCA function, the return value Y2 appears to be the original data transformed to the top d features of the kernel-PCA space, so the transformation is already done for you.