Clustering based on k-means centroids - MATLAB

I have lung images from 1500 patients, and I am trying to apply k-means to them. I want to run k-means on one patient (who has 230 images), save the centroids for that patient, and then cluster the other patients' images based on those centroids. This is the MATLAB code:
[idx,C] = kmeans(data,80)
Now I have C, but how do I use these centroids to cluster the other images as well?
Here's what my data looks like; I am clustering based on the histograms of these images:
Img1: histogram with 16 bins
Img2: histogram with 16 bins
Img3: histogram with 16 bins
Img4: histogram with 16 bins
...
Any tutorial or anything else that might help would be appreciated. Thank you.

In k-means, the membership of each point is determined by the closest center. Therefore, once you have the centers, you can keep assigning new points to clusters by checking their distance to each center. In MATLAB you can easily do this with pdist2:
dim = 2;
n = 100;
% generate two data sets
data1 = rand(n,dim);
data2 = rand(n,dim);
% computing membership & clusters using kmeans on data1
k = 5;
[idx1,C] = kmeans(data1,k);
% computing membership using pairwise distance on data2
D = pdist2(data2,C);
[~,idx2] = min(D,[],2);
% plot centers
scatter(C(:,1),C(:,2),100,1:k,'*')
hold on
% plot data1
scatter(data1(:,1),data1(:,2),30,idx1,'filled')
% plot data2
scatter(data2(:,1),data2(:,2),30,idx2)
legend('centers','data1','data2')
If you want, you can even plot the membership boundaries using a Voronoi diagram:
voronoi(C(:,1),C(:,2));
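Mapped back to the original question, a minimal sketch might look like this (hist1 and hist2 are hypothetical 230-by-16 matrices holding the per-image histograms of two patients):
% cluster the first patient's histograms and keep the centroids
[idx1, C] = kmeans(hist1, 80);
% assign each of the second patient's images to the nearest saved centroid
D = pdist2(hist2, C);
[~, idx2] = min(D, [], 2);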

Related

Creating clusters in MATLAB

Suppose that I have generated some data in matlab as follows:
n = 100;
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
plot(x,y,'rx')
axis([0 100 0 1])
Now I want an algorithm that classifies these data into some clusters (of arbitrary number), such that a point is a member of a cluster only if the distance between this point and at least one member of the cluster is less than 10. How could I write this code?
The clustering method you are describing is DBSCAN. Note that this algorithm will find only one cluster in the provided data, since it is very unlikely that there is a point in the dataset whose distance to all other points is more than 10.
If this is really what you want, you can use MATLAB's dbscan, or the DBSCAN implementation posted on the File Exchange if you are using a version older than R2019a.
% Generating random points, almost similar to the data provided by OP
data = bsxfun(@times, rand(100, 2), [100 1]);
% Adding more random points
for i=1:5
    mu = rand(1, 2)*100 - 50;
    A = rand(2)*5;
    sigma = A*A' + eye(2)*(1+rand*2); %[1,1.5;1.5,3];
    data = [data; mvnrnd(mu, sigma, 20)];
end
% clustering using DBSCAN, with epsilon = 10 and min-points = 1
idx = DBSCAN(data, 10, 1);
% plotting clusters
numCluster = max(idx);
colors = lines(numCluster);
scatter(data(:, 1), data(:, 2), 30, colors(idx, :), 'filled')
title(['No. of Clusters: ' num2str(numCluster)])
axis equal
The numbers in the above figure show the distance between the closest pairs of points in any two different clusters.
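If you are on R2019a or newer, the same clustering can also be done with the built-in dbscan function; a minimal sketch, reusing the data generated above:
% built-in DBSCAN: epsilon = 10, minimum of 1 point per neighborhood
idx = dbscan(data, 10, 1);
numCluster = max(idx); % dbscan labels noise as -1, but with min-points = 1 there is none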
The MATLAB built-in function clusterdata() works well for what you're asking.
Here is how to apply it to your example:
% number of points
n = 100;
% create the data
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
% the number of clusters you want to create
num_clusters = 5;
T1 = clusterdata(data,'Criterion','distance',...
                 'Distance','euclidean',...
                 'MaxClust', num_clusters)
scatter(x, y, 100, T1,'filled')
In this case, I used 5 clusters and the Euclidean distance as the metric to group the data points, but you can always change that (see the documentation of clusterdata()).
See the result below for 5 clusters with some random data.
Note that the data is skewed (x-values range from 0 to 100, while y-values range from 0 to 1), so the results are also skewed, but you could always normalize your data.
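For example, a minimal sketch of one way to normalize first, using zscore from the Statistics Toolbox (T2 is just a hypothetical name for the new assignment vector):
% standardize each column to zero mean and unit variance before clustering
data_norm = zscore(data);
T2 = clusterdata(data_norm,'Criterion','distance','Distance','euclidean','MaxClust',num_clusters);
scatter(x, y, 100, T2, 'filled')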
Here is a way using the connected components of a graph:
D = pdist2(data, data) < 10;
D(1:size(D,1)+1:end) = 0;
G = graph(D);
C = conncomp(G);
C is a vector giving the cluster number of each point. The steps are:
1. Use pdist2 to compute the pairwise distance matrix of the data points.
2. Use the distance matrix to create a logical adjacency matrix in which two points are neighbors if the distance between them is less than 10.
3. Set the diagonal elements of the adjacency matrix to 0 to eliminate self-loops.
4. Create a graph from the adjacency matrix.
5. Compute the connected components of the graph.
Note that using pdist2 may not be feasible for large datasets, in which case you need other methods to form a sparse adjacency matrix.
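As a sketch of one such method (assuming the Statistics Toolbox's rangesearch, which returns only the neighbors within a given radius):
% indices of all points within distance 10 of each point
nbrs = rangesearch(data, data, 10);
n = size(data, 1);
% build a sparse adjacency matrix from the neighbor lists
i = repelem((1:n)', cellfun(@numel, nbrs));
j = [nbrs{:}]';
A = sparse(i, j, 1, n, n);
A = A - speye(n); % rangesearch includes each point as its own neighbor
C = conncomp(graph(A)); % same cluster vector as before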
After posting my answer, I noticed that the answer provided by @saastn suggests the DBSCAN algorithm, which follows nearly the same approach.

Pixel subtraction

I am working on code that selects a set of pixels randomly from grayscale images, then compares the intensity of each pair of pixels by subtracting the intensity of the pixel at one location from that at a different location.
I have code that does the random selection, but I am not sure about it, and I do not know how to do the pixel subtraction.
Thank you in advance.
N = 100; % number of random pixels
im = imread('image.bmp');
[nRow,nCol,c] = size(im);
randRow = randi(nRow,[N,1]);
randCol = randi(nCol,[N,1]);
subplot(2,1,1)
imagesc(im(randRow,randCol,:))
subplot(2,1,2)
imagesc(im)
Parag basically gave you the answer: in order to achieve this vectorized, you need to use sub2ind. However, what I would do is generate two sets of rows and columns, because you need one set for the first group of pixels and another set for the second group, so that you can subtract the two sets of intensities. Therefore, do something like this:
N = 100; % number of random pixels
im = imread('image.bmp');
[nRow,nCol,c] = size(im);
%// Generate two sets of locations
randRow1 = randi(nRow,[N,1]);
randCol1 = randi(nCol,[N,1]);
randRow2 = randi(nRow,[N,1]);
randCol2 = randi(nCol,[N,1]);
%// Convert each 2D location into a single linear index
%// for vectorization, then subtract
locs1 = sub2ind([nRow, nCol], randRow1, randCol1);
locs2 = sub2ind([nRow, nCol], randRow2, randCol2);
im_subtract = im(locs1) - im(locs2);
subplot(2,1,1)
imagesc(im_subtract);
subplot(2,1,2)
imagesc(im);
However, the above code assumes that your image is grayscale. If you want to do this for colour, you'll have to do a bit more work: you need to access each channel and subtract on a per-channel basis. The linear indices defined above are for a single channel only, so you'll need to offset by nRow*nCol for each channel to access the corresponding locations in the other channels. As such, I would use sub2ind in combination with bsxfun to generate the right indices for vectorizing the subtraction. This requires just a slight modification of the above code:
N = 100; % number of random pixels
im = imread('image.bmp');
[nRow,nCol,c] = size(im);
%// Generate two sets of locations
randRow1 = randi(nRow,[N,1]);
randCol1 = randi(nCol,[N,1]);
randRow2 = randi(nRow,[N,1]);
randCol2 = randi(nCol,[N,1]);
%// Convert each 2D location into a single linear index
%// for vectorization
locs1 = sub2ind([nRow, nCol], randRow1, randCol1);
locs2 = sub2ind([nRow, nCol], randRow2, randCol2);
%// Extend to as many channels as we have
skip_ind = permute(0:nRow*nCol:(c-1)*nRow*nCol, [1 3 2]);
%// Create 3D linear indices
locs1 = bsxfun(@plus, locs1, skip_ind);
locs2 = bsxfun(@plus, locs2, skip_ind);
%// Now subtract the locations
im_subtract = im(locs1) - im(locs2);
subplot(2,1,1)
imagesc(im_subtract);
subplot(2,1,2)
imagesc(im);
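One caveat, which is my addition rather than part of the original answer: imread typically returns uint8 data, and unsigned integer subtraction in MATLAB saturates at 0, so negative differences are silently clipped. If you need signed differences, cast before subtracting:
% cast to double (or int16) so negative differences survive the subtraction
im_subtract = double(im(locs1)) - double(im(locs2));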

KNN algorithm in MATLAB

I am working on a thumb recognition system and need to implement the KNN algorithm to classify my images. According to this, it has only 2 measurements, through which it calculates the distance to find the nearest neighbour, but in my case I have 400 images of 25 x 42 pixels, of which 200 are for training and 200 for testing. I have been searching for a few hours but cannot find a way to compute the distance between the points.
EDIT:
I have reshaped the first 200 images into 1 x 1050 vectors and stored them in a matrix trainingData of size 200 x 1050. Similarly, I made testingData.
Here is an illustration of k-nearest-neighbor classification (some of the functions used require the Statistics Toolbox):
%# image size
sz = [25,42];
%# training images
numTrain = 200;
trainData = zeros(numTrain,prod(sz));
for i=1:numTrain
    img = imread( sprintf('train/image_%03d.jpg',i) );
    trainData(i,:) = img(:);
end
%# testing images
numTest = 200;
testData = zeros(numTest,prod(sz));
for i=1:numTest
    img = imread( sprintf('test/image_%03d.jpg',i) );
    testData(i,:) = img(:);
end
%# target class (I'm just using random values. Load your actual values instead)
trainClass = randi([1 5], [numTrain 1]);
testClass = randi([1 5], [numTest 1]);
%# compute pairwise distances between each test instance vs. all training data
D = pdist2(testData, trainData, 'euclidean');
[D,idx] = sort(D, 2, 'ascend');
%# K nearest neighbors
K = 5;
D = D(:,1:K);
idx = idx(:,1:K);
%# majority vote
prediction = mode(trainClass(idx),2);
%# performance (confusion matrix and classification error)
C = confusionmat(testClass, prediction);
err = sum(C(:)) - sum(diag(C))
If you want to compute the Euclidean distance between vectors a and b, just use Pythagoras. In MATLAB:
dist = sqrt(sum((a-b).^2));
However, you might want to use pdist to compute it for all combinations of vectors in your matrix at once.
dist = squareform(pdist(myVectors, 'euclidean'));
I'm interpreting columns as instances to classify and rows as potential neighbors. This is arbitrary though and you could switch them around.
If have a separate test set, you can compute the distance to the instances in the training set with pdist2:
dist = pdist2(trainingSet, testSet, 'euclidean')
You can use this distance matrix to knn-classify your vectors as follows. I'll generate some random data to serve as example, which will result in low (around chance level) accuracy. But of course you should plug in your actual data and results will probably be better.
nrOfVectors = 100; nrOfFeatures = 10; nrOfClasses = 5; % example sizes (my assumption)
m = rand(nrOfVectors,nrOfFeatures); % random example data
classes = randi(nrOfClasses, 1, nrOfVectors); % random true classes
k = 3; % number of neighbors to consider, 3 is a common value
d = squareform(pdist(m, 'euclidean')); % distance matrix
[neighborvals, neighborindex] = sort(d,1); % get sorted distances
Take a look at the neighborvals and neighborindex matrices and see if they make sense to you. The first is a sorted version of the earlier d matrix, and the latter gives the corresponding instance numbers. Note that the self-distances (on the diagonal in d) have floated to the top. We're not interested in this (always zero), so we'll skip the top row in the next step.
neighborclasses = classes(neighborindex); % look up the class of each sorted neighbor
assignedClasses = mode(neighborclasses(2:1+k,:),1);
So we assign the most common class among the k nearest neighbors!
You can compare the assigned classes with the actual classes to get an accuracy score:
accuracy = sum(classes == assignedClasses)/length(classes);
fprintf('KNN Classifier Accuracy: %.2f%%\n', 100*accuracy)
Or make a confusion matrix to see the distribution of classifications:
confusionmat(classes, assignedClasses)
Yes, there is a function for KNN: knnclassify.
Play around with the number of neighbors you want to keep in order to get the best result (use a confusion matrix). This function takes care of the distance, of course.
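Note that knnclassify has since been removed from current MATLAB releases; the replacement is fitcknn in the Statistics and Machine Learning Toolbox. A minimal sketch, assuming the trainingData/testingData matrices described in the question and a hypothetical 200-by-1 label vector trainLabels:
mdl = fitcknn(trainingData, trainLabels, 'NumNeighbors', 5); % train a 5-NN model
predicted = predict(mdl, testingData); % classify the test images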

Measuring weighted mean length from an electrophoresis gel image

Background:
My question relates to extracting features from an electrophoresis gel (see below). In this gel, DNA is loaded from the top and allowed to migrate under a voltage gradient. The gel has sieves, so smaller molecules migrate further than longer molecules, resulting in the separation of DNA by length. So the higher up a molecule is, the longer it is.
Question:
In this image there are 9 lanes, each with a separate source of DNA. I am interested in measuring the mean location (value on the y-axis) of each lane.
I am really new to image processing, but I do know MATLAB, and I can get by in R with some difficulty. I would really appreciate it if someone could show me how to go about finding the mean of each lane.
Here's my try. It requires that the gels are nice (i.e. straight lanes, and the gel should not be rotated), but otherwise it should work fairly generically. Note that there are two image-size-dependent parameters that will need to be adjusted to make this work on images of a different size.
%# first size-dependent parameter: should be about 1/4th-1/5th
%# of the lane width in pixels.
minFilterWidth = 10;
%# second size-dependent parameter for filtering the
%# lane profiles
gaussWidth = 5;
%# read the image, normalize to 0...1
img = imread('http://img823.imageshack.us/img823/588/gele.png');
img = rgb2gray(img);
img = double(img)/255;
%# Otsu thresholding to (roughly) find lanes
thMsk = img < graythresh(img);
%# count the mask-pixels in each columns. Due to
%# lane separation, there will be fewer pixels
%# between lanes
cts = sum(thMsk,1);
%# widen the local minima, so that we get a nice
%# separation between lanes
ctsEroded = imerode(cts,ones(1,minFilterWidth));
%# use imregionalmin to identify the separation
%# between lanes. Invert to get a positive mask
laneMsk = ~repmat(imregionalmin(ctsEroded),size(img,1),1);
[Figure: image with the lanes that will be used for analysis]
%# for each lane, create an averaged profile
lblMsk = bwlabel(laneMsk);
nLanes = max(lblMsk(:));
profiles = zeros(size(img,1),nLanes);
midLane = zeros(1,nLanes);
for i = 1:nLanes
    profiles(:,i) = mean(img.*(lblMsk==i),2);
    midLane(:,i) = mean(find(lblMsk(1,:)==i));
end
%# Gauss-filter the profiles (each column is an
%# averaged intensity profile)
G = exp(-(-gaussWidth*5:gaussWidth*5).^2/(2*gaussWidth^2));
G = G./sum(G);
profiles = imfilter(profiles,G','replicate');
%# find the minima
[~,idx] = min(profiles,[],1);
%# plot
figure,imshow(img,[])
hold on, plot(midLane,idx,'.r')
Here's my stab at a simple template for an interactive way to do this:
% Load image
img = imread('gel.png');
img = rgb2gray(img);
% Identify lanes
imshow(img)
[x,y] = ginput;
% Invert image
img = max(img(:)) - img;
% Subtract background
[xn,yn] = ginput(1);
xn = round(xn); yn = round(yn); % ginput returns non-integer coordinates
noise = img((yn-2):(yn+2), (xn-2):(xn+2));
noise = mean(noise(:));
img = img - noise;
% Calculate means
means = (1:size(img,1)) * double(img(:,round(x))) ./ sum(double(img(:,round(x))), 1);
% Plot
hold on
plot(x, means, 'r.')
The first thing to do is convert your RGB image to grayscale:
gr = rgb2gray(imread('gelk.png'));
Then, take a look at the image intensity histogram using imhist. Notice anything funny about it? Use imcontrast(imshow(gr)) to pull up the contrast adjustment tool. I found that eliminating the weird stuff after the major intensity peak was beneficial.
The image processing task itself can be divided into several steps.
1. Separate each lane
2. Identify ('segment') the band in each lane
3. Calculate the location of the bands
Step 1 can be done "by hand" if the lane widths are guaranteed. If not, the line detection offered by the Hough transform is probably the way to go. The documentation for the Image Processing Toolbox has a really nice tutorial on this topic. My code recapitulates that tutorial with better parameters for your image. I only spent a few minutes on them; I'm sure you can improve the results by tuning the parameters further.
Step 2 can be done in a few ways. The easiest technique to use is Otsu's method for thresholding grayscale images. This method works by determining a threshold that minimizes the intra-class variance, or, equivalently, maximizes the inter-class variance. Otsu's method is present in MATLAB as the graythresh function. If Otsu's method isn't working well you can try multi-level Otsu or a number of other histogram based threshold determination methods.
Step 3 can be done as you suggest, by calculating the mean y value of the segmented band pixels. This is what my code does, though I've restricted the check to just the center column of each lane, in case the separation was off. I'm worried that the result may not be as good as calculating the band centroid and using its location.
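If you want to try that centroid-based variant, here is a minimal sketch (my addition; bandMask stands for a hypothetical binary mask of a single segmented band):
stats = regionprops(bandMask, 'Centroid'); % centroid of each connected region
bandY = stats(1).Centroid(2); % y-coordinate (migration distance) of the first band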
Here is my solution:
function [locations, lanesBW, lanes, cols] = segmentGel(gr)
%%# Detect lane boundaries
unsharp = fspecial('unsharp'); %# Sharpening filter
I = imfilter(gr,unsharp); %# Apply filter
bw = edge(I,'canny',[0.01 0.3],0.5); %# Canny edges with parameters
[H,T,R] = hough(bw); %# Hough transform of edges
P = houghpeaks(H,20,'threshold',ceil(0.5*max(H(:)))); %# Find peaks of Hough transform
lines = houghlines(bw,T,R,P,'FillGap',30,'MinLength',20); %# Use peaks to identify lines
%%# Plot detected lines above image, for quality control
max_len = 0;
imshow(I);
hold on;
for k = 1:length(lines)
    xy = [lines(k).point1; lines(k).point2];
    plot(xy(:,1),xy(:,2),'LineWidth',2,'Color','green');
    %# Plot beginnings and ends of lines
    plot(xy(1,1),xy(1,2),'x','LineWidth',2,'Color','yellow');
    plot(xy(2,1),xy(2,2),'x','LineWidth',2,'Color','red');
    %# Determine the endpoints of the longest line segment
    len = norm(lines(k).point1 - lines(k).point2);
    if ( len > max_len)
        max_len = len;
    end
end
hold off;
%%# Use first endpoint of each line to separate lanes
cols = zeros(length(lines),1);
for k = 1:length(lines)
    cols(k) = lines(k).point1(1);
end
cols = sort(cols); %# The lines are in no particular order
lanes = cell(length(cols)-1,1);
for k = 2:length(cols)
    lanes{k-1} = im2double( gr(:,cols(k-1):cols(k)) ); %# im2double for compatibility with graythresh
end
otsu = cellfun(@graythresh,lanes); %# Calculate threshold for each lane
lanesBW = cell(size(lanes));
for k = 1:length(lanes)
    lanesBW{k} = lanes{k} < otsu(k); %# Apply thresholds
end
%%# Use segmented bands to determine migration distance
locations = zeros(size(lanesBW));
for k = 1:length(lanesBW)
    width = size(lanesBW{k},2);
    [y,~] = find(lanesBW{k}(:,round(width/2))); %# Only use center of lane
    locations(k) = mean(y);
end
I suggest you carefully examine not only each output value, but also the results from each step of the function, before using it for actual research purposes. In order to get really good results, you will have to read a bit about Hough transforms, Canny edge detection and Otsu's method, and then tune the parameters. You may also have to alter how the lanes are split; this code assumes that lines will be detected on either side of the image.
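For reference, a hypothetical driver call, reusing the grayscale conversion from the start of this answer:
gr = rgb2gray(imread('gelk.png'));
[locations, lanesBW, lanes, cols] = segmentGel(gr);
disp(locations) % mean y-location of the band in each lane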
Let me add another implementation, similar in concept to @JohnColby's, only without the manual user interaction:
%# read image
I = rgb2gray(imread('gele.png'));
%# middle position of each lane
%# (assuming lanes are somewhat evenly spread and of similar width)
x = linspace(1,size(I,2),10);
x = round( (x(1:end-1)+x(2:end))./2 );
%# compute the mean value across those columns
m = mean(I(:,x));
%# find the y-indices of the mean values
[~,idx] = min( bsxfun(@minus, double(I(:,x)), m) );
%# show the result
figure(1)
imshow(I, 'InitialMagnification',100, 'Border','tight')
hold on, plot(x, idx, ...
'Color','r', 'LineStyle','none', 'Marker','.', 'MarkerSize',10)
and applied to the smaller image:

MATLAB: draw centroids

My main question is: given a feature centroid, how can I draw it in MATLAB?
In more detail, I have an NxNx3 image (an RGB image), from which I take 4x4 blocks and compute a 6-dimensional feature vector for each block. I store these feature vectors in an Mx6 matrix, on which I run the kmeans function and obtain the centroids in a kx6 matrix, where k is the number of clusters and 6 is the number of features per block.
How can I draw these cluster centers on my image in order to visualize whether the algorithm is performing the way I wish? If anyone has any other way/suggestions for how I can visualize the centroids on my image, I'd greatly appreciate it.
Here's one way you can visualize the clusters:
As you described: first I extract the blocks, compute a feature vector for each, and cluster this feature matrix.
Next we can visualize the clusters assigned to each block. Note that I am assuming the 4x4 blocks are distinct; this is important so that we can map the blocks back to their locations in the original image.
Finally, in order to display the cluster centroids on the image, I simply find the closest block to each cluster and display it as a representative of that cluster.
Here's a complete example to show the above idea (in your case, you would want to replace the function that computes the features of each block by your own implementation; I am simply taking the min/max/mean/median/Q1/Q3 as my feature vector for each 4x4 block):
%# params
NUM_CLUSTERS = 3;
BLOCK_SIZE = 4;
featureFunc = @(X) [min(X); max(X); mean(X); prctile(X, [25 50 75])];
%# read image
I = imread('peppers.png');
I = double( rgb2gray(I) );
%# extract blocks as column
J = im2col(I, [BLOCK_SIZE BLOCK_SIZE], 'distinct'); %# 16-by-NumBlocks
%# compute features for each block
JJ = featureFunc(J)'; %# NumBlocks-by-6
%# cluster blocks according to the features extracted
[clustIDX, ~, ~, Dist] = kmeans(JJ, NUM_CLUSTERS);
%# display the cluster index assigned for each block as an image
cc = reshape(clustIDX, ceil(size(I)/BLOCK_SIZE));
RGB = label2rgb(cc);
imshow(RGB), hold on
%# find and display the closest block to each cluster
[~,idx] = min(Dist);
[r c] = ind2sub(ceil(size(I)/BLOCK_SIZE), idx);
for i=1:NUM_CLUSTERS
    text(c(i)+2, r(i), num2str(i), 'fontsize',20)
end
plot(c, r, 'k.', 'markersize',30)
legend('Centroids')
The centroids do not correspond to coordinates in the image, but to coordinates in the feature space. There are two ways you can test how well kmeans performed. For both, you first want to associate the points with their closest cluster. You get this information from the first output of kmeans.
(1) You can visualize the clustering result by reducing the 6-dimensional space to 2 or 3-dimensional space and then plotting the differently classified coordinates in different colors.
Assuming that the feature vectors are collected in an array called featureArray, and that you asked for nClusters clusters, you'd make the plot as follows, using mdscale to transform the data to, say, 3D space:
%# kmeans clustering
[idx,centroids6D] = kmeans(featureArray,nClusters);
%# find the dissimilarity between features in the array for mdscale.
%# Add the cluster centroids to the points, so that they get transformed by mdscale as well.
%# I assume that you use Euclidean distance.
dissimilarities = pdist([featureArray;centroids6D]);
%# transform onto 3D space
transformedCoords = mdscale(dissimilarities,3);
%# create colormap with nClusters colors
cmap = hsv(nClusters);
%# loop to plot
figure
hold on,
for c = 1:nClusters
    %# plot the coordinates
    currentIdx = find(idx==c);
    plot3(transformedCoords(currentIdx,1),transformedCoords(currentIdx,2),...
        transformedCoords(currentIdx,3),'.','Color',cmap(c,:));
    %# plot the cluster centroid with a black-edged square
    plot3(transformedCoords(end-nClusters+c,1),transformedCoords(end-nClusters+c,2),...
        transformedCoords(end-nClusters+c,3),'s','MarkerFaceColor',cmap(c,:),...
        'MarkerEdgeColor','k');
end
(2) Alternatively, you can create a pseudo-colored image that shows which part of the image belongs to which cluster.
Assuming that you have nRows by nCols blocks, you write
%# kmeans clustering
[idx,centroids6D] = kmeans(featureArray,nClusters);
%# create image
img = reshape(idx,nRows,nCols);
%# create colormap
cmap = hsv(nClusters);
%# show the image and color according to clusters
figure
imshow(img,[])
colormap(cmap)