Output of IDX in Kmeans? - matlab

I have a 1000x6 dataset, and the kmeans script below runs fine, but when I try to output one of the clusters it comes out as a single column.
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',6);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
%d = sum((x-y).^2).^0.5
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find for all instances the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
% Output cluster data to K datasets
K1 = data(clustIDX==1)
K2 = data(clustIDX==2)... etc
Shouldn't K1 = data(clustIDX==1) output the full row information, i.e. six columns like the original dataset rather than just one? Or is this just outputting the distances?

Replace
K1 = data(clustIDX==1)
K2 = data(clustIDX==2)
with
K1 = data(clustIDX==1,:)
K2 = data(clustIDX==2,:)
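The first form uses a single logical index, which MATLAB treats as linear indexing, so it returns only the first-column elements of the matching rows. The second form adds a colon as the column subscript and returns every column of those rows. I've tried it and it works.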
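To make the difference concrete, here is a minimal sketch on a tiny stand-in matrix. The 4x6 `data` and the hand-written `clustIDX` are made up for illustration and are not the asker's dataset or real kmeans output.

```matlab
% Toy 4x6 matrix; column 1 is [1;2;3;4] (values are illustrative only)
data = reshape(1:24, 4, 6);
% Hand-written cluster labels standing in for the kmeans output
clustIDX = [1; 2; 1; 2];

% One logical index: linear indexing, picks elements from column 1 only
K1_wrong = data(clustIDX == 1);

% Logical row index plus ':' for columns: full rows are returned
K1_right = data(clustIDX == 1, :);

disp(size(K1_wrong));   % 2 1
disp(size(K1_right));   % 2 6
```

The same pattern applies to any per-row mask: supply the mask as the row subscript and `:` as the column subscript.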

Related

PCA of Ovarian Cancer Data via SVD

I want to analyze the Ovarian Cancer Data provided by MATLAB with the PCA. Specifically, I want to visualize the two largest Principal Components, and draw the two corresponding left singular vectors. As I understand, those vectors should be able to serve as a new coordinate-system, aligned towards the largest variance in the data. What I ultimately want to examine is if the cancer patients are distinguishable from the non-cancer patients.
Something that is still wrong in my script are the left singular vectors. They are not in a 90 degree angle to each other, and if I scale them by the respective eigenvalues, they explode in length. What am I doing wrong?
%% PCA - Ovarian Cancer Data
close all;
clear all;
% obs is an NxM matrix, where ...
% N = patients (216)
% M = features - genes in this case (4000)
load ovariancancer.mat;
% Turn obs matrix, such that the rows represent the features
X = obs.';
[U, S, V] = svd(X, 'econ');
% Crop U, S and V, to visualize two largest principal components
U_crop = U(:, 1:2);
S_crop = S(1:2, 1:2);
V_crop = V(:, 1:2);
X_crop = U_crop * S_crop * V_crop.';
% Average over cancer patients
xC = mean(X_crop, 2);
% Visualize two largest principal components as a data cloud
figure;
hold on;
for i = 1 : size(X, 2)
if grp{i} == 'Cancer'
plot(X_crop(1, i), X_crop(2, i), 'rx', 'LineWidth', 2);
else
plot(X_crop(1, i), X_crop(2, i), 'bo', 'LineWidth', 2);
end
end
%scatter(X_crop(1, :), X_crop(2, :), 'k.', 'LineWidth', 2)
set(gca,'DataAspectRatio',[1 1 1])
xlabel('PC1')
ylabel('PC2')
grid on;
Xstd = U_crop; % * S_crop?
quiver([xC(1) xC(1)], [xC(2) xC(2)], Xstd(1, :), Xstd(2, :), 'green', 'LineWidth', 3);
So there were multiple mistakes in my script. In case anyone is interested, I am posting the corrected code (I am plotting three PCs now). This post was very helpful.
% obs is an NxM matrix, where ...
% N = patients (216)
% M = features - genes in this case (4000)
load ovariancancer.mat;
% Let the data matrix X be of n×p size, where n is the number of samples and p is the number of variables
X = obs;
% Center the data: subtract each column (feature) mean so that every column has zero mean
Xavg = mean(X, 1);
X = X - ones(size(X, 1), 1) * Xavg;
[U, S, V] = svd(X, 'econ');
PC = U * S;
% Visualize three largest principal components as a data cloud
% The j-th principal component is given by j-th column of XV. The coordinates of the i-th data point in the new PC space are given by the i-th row of XV
figure;
for i = 1 : size(PC, 1) % loop over patients (rows of PC)
if strcmp(grp{i}, 'Cancer')
plot3(PC(i, 1), PC(i, 2), PC(i, 3), 'rx', 'LineWidth', 2);
else
plot3(PC(i, 1), PC(i, 2), PC(i, 3), 'bo', 'LineWidth', 2);
end
hold on;
end
set(gca,'DataAspectRatio',[1 1 1])
xlabel('PC1')
ylabel('PC2')
zlabel('PC3')
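As a rough check of the ideas in the corrected script, the sketch below uses synthetic data (not the ovarian cancer set) and assumes rows are samples and columns are variables. It centers the columns, confirms that the right singular vectors are orthonormal and that `U*S` equals the projection `Xc*V`, and computes the standard deviation along each principal direction, which is a more sensible arrow length than the raw singular value.

```matlab
rng(0);                                        % reproducible synthetic data
X  = randn(50, 5) * diag([5 3 1 0.5 0.1]);     % 50 samples, 5 variables
Xc = bsxfun(@minus, X, mean(X, 1));            % subtract column means

[U, S, V] = svd(Xc, 'econ');
PC = U * S;                                    % scores: coordinates in PC space

% The scores are the centered data projected onto the columns of V
projErr  = norm(PC - Xc * V, 'fro');
% Columns of V are orthonormal, i.e. at 90 degrees to each other
orthoErr = norm(V' * V - eye(size(V, 2)), 'fro');

% Standard deviation along each principal direction
sigma     = diag(S) / sqrt(size(Xc, 1) - 1);
% Fraction of total variance captured by each component
explained = diag(S).^2 / sum(diag(S).^2);
```

If arrows are drawn in the original variable space, `V(:, j) * sigma(j)` gives a direction with a length on the scale of the data, whereas multiplying by `S(j, j)` directly grows with the number of samples.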

Convex hull / concave hull for multiple clusters in data

I have done a lot of reading on drawing polygons around clusters and realized convhull may be the best way forward. Basically I am looking for an elastic-like polygon to wrap around my cluster points.
My data is a matrix consisting of x (1st column) and y (2nd column) points which are grouped in clusters (3rd column). I have 700 such clusters, so it is not feasible to plot each one separately.
Is there a way to perform convhull for each cluster separately and then plot all of them on a single chart?
EDIT
Code I have written until now which isn't able to run convex hull on each individual cluster...
[ndata, text, alldata] = xlsread(fullfile(source_dir));
[~, y] = sort(ndata(:,end));
As = ndata(y,:);
lon = As(:,1);
lat = As(:,2);
cluster = As(:,3);
%% To find number of points in a cluster (repetitions)
rep = zeros(size(cluster));
for j = 1:length(cluster)
rep(j) = sum(cluster==cluster(j));
end
%% Less than 3 points in a cluster are filtered out
x = lon (rep>3);
y = lat (rep>3);
z = cluster (rep>3);
%% convex hull for each cluster plotted ....hold....then display all.
figure
hold on
clusters = unique(z);
for i = 1:length(z)
k=convhull(x(z==clusters(i)), y(z==clusters(i)));
plot(x, y, 'b.'); %# plot cluster points
plot(x(k),y(k),'r-'); %# plots only k indices, giving the convex hull
end
Below is an image of what is being displayed;
If this question has already been asked I apologize for repetition but please do direct me to the answer you'll see fit.
Please can anyone help with this, however trivial I'm really struggling!
I would iterate through all the clusters and do what you have already written, using hold on to accumulate all the plots in the same figure. Something like this:
% Generate three clouds of points in 2D:
c1 = bsxfun(@plus, 0.5 * randn(50,2), [1 3]);
c2 = bsxfun(@plus, 0.6 * randn(20,2), [0 0]);
c3 = bsxfun(@plus, 0.4 * randn(20,2), [1 1]);
data = [c1, ones(50,1); ...
c2, 2*ones(20,1); ...
c3, 3*ones(20,1)];
% Plot the data points with different colors
clf
plot(c1(:,1), c1(:,2),'r+', 'LineWidth', 2);
hold on
plot(c2(:,1), c2(:,2),'k+', 'LineWidth', 2);
plot(c3(:,1), c3(:,2),'b+', 'LineWidth', 2);
x = data(:,1);
y = data(:,2);
cluster = data(:,3);
clusters = unique(cluster);
for i = 1:length(clusters)
px = x(cluster == clusters(i));
py = y(cluster == clusters(i));
if length(px) > 2
k = convhull(px, py);
plot(px(k), py(k), '-');
end
end
It gives the following result:
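One detail worth spelling out: convhull returns indices into the points it was given, so when it is called on a subset the indices must be mapped back before indexing the full vectors. The sketch below is one way to do that on made-up coordinates; the variable `hulls` is a hypothetical name, and clusters with fewer than three points are skipped because no polygon can be drawn around them.

```matlab
% Made-up points: a square with an interior point, a second square, a singleton
x = [0 1 1 0 0.5  5 6 6 5  9]';
y = [0 0 1 1 0.5  5 5 6 6  9]';
cluster = [1 1 1 1 1  2 2 2 2  3]';

clusters = unique(cluster);
hulls = cell(numel(clusters), 1);    % hull vertex indices into the full x, y
for i = 1:numel(clusters)
    idx = find(cluster == clusters(i));   % rows belonging to this cluster
    if numel(idx) >= 3
        k = convhull(x(idx), y(idx));     % indices relative to the subset
        hulls{i} = idx(k);                % map back to the full vectors
    end
end
```

Each non-empty `hulls{i}` can then be drawn with `plot(x(hulls{i}), y(hulls{i}), 'r-')` after `hold on`.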

match data sample matlab

OK, this is going to sound really confusing, but I will try my best to make it clear. I have a full dataset called fulldata, which is 494021x6.
I use svds (singular value decomposition) on it like so:
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
I then randomly select 1000 rows from the fulldata:
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
%# pick columns in a set order (2,4,5,3,6,1)
indY = [2 4 5 3 6 1];
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
I then apply normalization to this randomly selected 1000 rows:
% apply normalization method to every cell
maxData = max(max(data));
minData = min(min(data));
data = (data - minData) ./ (maxData - minData); % min-max scaling into [0, 1]
I then output a datasample from the original fulldata set which matches the 1000 selected rows:
% output matching data
dataSample = fulldata(indX, :)
Also note that when I picked "random rows" I also output the indX rows which match the rows in the fulldata.
So dataSample holds the 1000 random rows taken from the original fulldata, and indX holds the corresponding row numbers in fulldata.
The problem I'm running into is when I use K-Means to cluster the 1000 random rows and output the data of each cluster like so:
%% generate sample data
K = 6;
numObservarations = size(data, 1);
dimensions = 3;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
% output the contents of each cluster
K1 = data(clustIDX==1,:)
K2 = data(clustIDX==2,:)
K3 = data(clustIDX==3,:)
K4 = data(clustIDX==4,:)
K5 = data(clustIDX==5,:)
K6 = data(clustIDX==6,:)
How can I match K1, K2... K6 to the corresponding indX row numbers? For instance K1's output looks like this:
I was hoping to have extra files like K1-indX, which is just a list of the row numbers from indX that match the cluster data in K1, K2, etc. Or, preferably, append the indX row number to the K1, K2 output as column 7.
For instance:
K1 cluster data | Belongs to fulldata row number
0.4 0.5 0.6 0.4 | 456456 etc
An example to illustrate:
%# lets use an example data of size 150x4
load fisheriris
fulldata = meas;
%# pick 100 rows at random
rIdx = randperm(size(fulldata,1));
rIdx = rIdx(1:100)'; %#'
data = fulldata(rIdx,:);
%# cluster the subset data
K = 3;
clustIDX = kmeans(data, K);
%# divide the data according to which cluster instances were assigned to
groupedIdx = cell(K,1);
groupedData = cell(K,1);
for i=1:K
%# instances
groupedData{i} = data(clustIDX==i,:);
%# corresponding row indices into the original fulldata
groupedIdx{i} = rIdx(clustIDX==i);
end
%# check: these two should be equal
groupedData{1}(1,:)
fulldata(groupedIdx{1}(1),:)
Unless I am mis-interpreting something above, you already have (in indX) the fulldata row numbers... All you need to do to see, for example, the rows from fulldata in cluster 1 is:
fulldata(indX(clustIDX == 1), :)
kmeans does not re-order the data, so each row 1:1000 of clustIDX still corresponds to the same row 1:1000 of data / datasample that you started with.
Said another way, clustIDX is going to be a vector of length 1000 where each element is the (integer) cluster assignment for that row. Thus you can use this for logical indexing anywhere you have 1000 rows in an order corresponding to the sample data you used for clustering.
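For the preferred option in the question (the original row number as an extra last column), a minimal sketch follows. The small `fulldata`, the hand-picked `indX` and the hard-coded `clustIDX` are stand-ins for illustration; in the real script `clustIDX` would come from kmeans.

```matlab
fulldata = magic(8);          % stand-in for the 494021x6 dataset
indX = [7; 2; 5; 3];          % stand-in for the randomly picked row numbers
data = fulldata(indX, 1:3);   % the sampled rows that get clustered
clustIDX = [1; 2; 1; 2];      % stand-in for the kmeans assignments

% Force indX into a column so it can be concatenated next to the data
indX = indX(:);

% Cluster 1 rows with the fulldata row number appended as the last column
K1 = [data(clustIDX == 1, :), indX(clustIDX == 1)];
```

Here the last column of `K1` holds 7 and 5, the fulldata rows that ended up in cluster 1. Note that if `data` has been normalized, the values in `K1` are the normalized ones, while the appended index still points at the raw rows.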

Show rows on clustered kmeans data

Hi, I was wondering: when you plot clustered data in a figure, is there a way to show which row each data point belongs to when you hover over it?
From the picture above, I was hoping there would be a way to select or hover over a point and tell which row it belongs to.
Here is the code:
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows);
%# pick random columns
indY = randperm( size(fulldata,2) );
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
%% apply normalization method to every cell
data = data./repmat(sqrt(sum(data.^2)),size(data,1),1);
%% generate sample data
K = 6;
numObservarations = 1000;
dimensions = 6;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
%d = sum((x-y).^2).^0.5
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find for all instances the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
Or possibly a way to output the cluster data, normalized and re-organized into its original format, with an extra end column giving the row of the original "fulldata" that each point came from.
You could use the data cursors feature which displays a tooltip when you select a point from the plot. You can use a modified update function to display all sorts of information about the point selected.
Here is a working example:
function customCusrorModeDemo()
%# data
D = load('fisheriris');
data = D.meas;
[clustIdx,labels] = grp2idx(D.species);
K = numel(labels);
clr = hsv(K);
%# instance indices grouped according to class
ind = accumarray(clustIdx, 1:size(data,1), [K 1], @(x){x});
%# plot
%#gscatter(data(:,1), data(:,2), clustIdx, clr)
hLine = zeros(K,1);
for k=1:K
hLine(k) = line(data(ind{k},1), data(ind{k},2), data(ind{k},3), ...
'LineStyle','none', 'Color',clr(k,:), ...
'Marker','.', 'MarkerSize',15);
end
xlabel('SL'), ylabel('SW'), zlabel('PL')
legend(hLine, labels)
view(3), box on, grid on
%# data cursor
hDCM = datacursormode(gcf);
set(hDCM, 'UpdateFcn',@updateFcn, 'DisplayStyle','window')
set(hDCM, 'Enable','on')
%# callback function
function txt = updateFcn(~,evt)
hObj = get(evt,'Target'); %# line object handle
idx = get(evt,'DataIndex'); %# index of nearest point
%# class index of data point
cIdx = find(hLine==hObj, 1, 'first');
%# instance index (index into the entire data matrix)
idx = ind{cIdx}(idx);
%# output text
txt = {
sprintf('SL: %g', data(idx,1)) ;
sprintf('SW: %g', data(idx,2)) ;
sprintf('PL: %g', data(idx,3)) ;
sprintf('PW: %g', data(idx,4)) ;
sprintf('Index: %d', idx) ;
sprintf('Class: %s', labels{clustIdx(idx)}) ;
};
end
end
Here is how it looks in both 2D and 3D views (with different display styles):
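The key step in the callback is translating "point number within one plotted line" back to "row number in the full data matrix". That bookkeeping can be sketched without any graphics, using a short made-up label vector. The `sort` inside the anonymous function is an addition here, since accumarray does not promise the order of the grouped values when the subscripts are unsorted.

```matlab
clustIdx = [2; 1; 2; 3; 1];    % made-up class label for each data row
K = 3;

% ind{k} lists the rows of the data matrix that belong to class k
ind = accumarray(clustIdx, (1:numel(clustIdx))', [K 1], @(x){sort(x)});

% Suppose the cursor reports the 2nd point of the line drawn for class 2
cIdx = 2;                      % which line (class) was clicked
localIdx = 2;                  % DataIndex within that line
rowInData = ind{cIdx}(localIdx);
```

With these labels, `rowInData` is 3, because rows 1 and 3 are the members of class 2.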

MATLAB - Classification output

My program uses K-means clustering with a number of clusters set by the user. Here k=4, and I would like to run the clustered data through MATLAB's naive Bayes classifier afterwards.
Is there a way to split the clusters up and feed them into different naive Bayes classifiers in MATLAB?
Naive Bayes:
class = classify(test,training, target_class, 'diaglinear');
K-means:
%% generate sample data
K = 4;
numObservarations = 5000;
dimensions = 42;
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
%d = sum((x-y).^2).^0.5
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find for all instances the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
Something like outputting the k clusters in a format k1, k2, k3 and then having the naive Bayes classifier pick those up, so that instead of test it would be k1, k2, etc.:
class = classify(k1,training, target_class, 'diaglinear');
But I just don't know how to get the output of the k clusters in MATLAB into that kind of format (I'm really new to this program).
EDIT
training = [1;0;-1;-2;4;0]; % this is the sample data.
target_class = ['posi';'zero';'negi';'negi';'posi';'zero'];% This should have the same number of rows as training data. The elements and the class on the same row should correspond.
% target_class holds the class labels for the training data ('posi', 'zero' and 'negi' here)
% Training and Testing the classifier (between positive and negative)
test = 10*randn(10,1) % this is for testing. I am generating random numbers.
class = classify(test,training, target_class, 'diaglinear') % This command classifies the test data depending on the given training data using a Naive Bayes classifier
% diaglinear is for naive bayes classifier; there is also diagquadratic
Try this:
% create 100 random points (this is the training data)
X = rand(100,3);
% cluster into 5 clusters
K = 5;
[IDX, C] = kmeans(X, K);
% now let us say you have new data and you want
% to classify it based on the training:
SAMPLE = rand(10,3);
CLASS = classify(SAMPLE,X,IDX);
And if you just want to extract one of the clusters from the data, you can do something like this (the colon keeps all columns of the matching rows):
K1 = X(IDX==1,:)
Hope that was helpful.
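To come closer to what the question describes (each cluster passed separately to the naive Bayes style classifier), the loop below is a rough sketch. It assumes the Statistics Toolbox functions kmeans and classify already used in this thread, and the synthetic `training`, `target_class` and `X` are invented purely for illustration.

```matlab
rng(1);                                          % reproducible toy data
% Hypothetical labelled training set: two well-separated classes
training     = [randn(20, 2) + 2; randn(20, 2) - 2];
target_class = [repmat({'posi'}, 20, 1); repmat({'negi'}, 20, 1)];

% Hypothetical unlabelled data to be clustered first
X = [randn(15, 2) + 2; randn(15, 2) - 2];
K = 2;
IDX = kmeans(X, K);

% Classify the members of each cluster separately
predicted = cell(K, 1);
for k = 1:K
    Xk = X(IDX == k, :);                         % all columns of cluster k
    predicted{k} = classify(Xk, training, target_class, 'diaglinear');
end
```

Each `predicted{k}` holds one label per row of cluster k, so the clusters stay separate while sharing the same training set. Using a different training set per cluster would only require indexing `training` and `target_class` inside the loop.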