Hi, I was wondering: when you cluster data and plot it in a figure window, is there a way to show which row a data point belongs to when you hover over it?
From the picture above, I was hoping there would be a way to tell which row a point belongs to when I select or hover over it.
Here is the code:
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows);
%# pick random columns
indY = randperm( size(fulldata,2) );
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
%% apply normalization method to every cell
data = data./repmat(sqrt(sum(data.^2)),size(data,1),1);
%% generate sample data
K = 6;
numObservarations = 1000;
dimensions = 6;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
%d = sum((x-y).^2).^0.5
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find for all instances the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
Or possibly a way to output the clustered data, normalized and re-organized into its original format, with an extra final column giving the row of the original "fulldata" that each point came from.
You could use the data cursors feature which displays a tooltip when you select a point from the plot. You can use a modified update function to display all sorts of information about the point selected.
Here is a working example:
function customCusrorModeDemo()
%# data
D = load('fisheriris');
data = D.meas;
[clustIdx,labels] = grp2idx(D.species);
K = numel(labels);
clr = hsv(K);
%# instance indices grouped according to class
ind = accumarray(clustIdx, 1:size(data,1), [K 1], @(x){x});
%# plot
%#gscatter(data(:,1), data(:,2), clustIdx, clr)
hLine = zeros(K,1);
for k=1:K
hLine(k) = line(data(ind{k},1), data(ind{k},2), data(ind{k},3), ...
'LineStyle','none', 'Color',clr(k,:), ...
'Marker','.', 'MarkerSize',15);
end
xlabel('SL'), ylabel('SW'), zlabel('PL')
legend(hLine, labels)
view(3), box on, grid on
%# data cursor
hDCM = datacursormode(gcf);
set(hDCM, 'UpdateFcn',@updateFcn, 'DisplayStyle','window')
set(hDCM, 'Enable','on')
%# callback function
function txt = updateFcn(~,evt)
hObj = get(evt,'Target'); %# line object handle
idx = get(evt,'DataIndex'); %# index of nearest point
%# class index of data point
cIdx = find(hLine==hObj, 1, 'first');
%# instance index (index into the entire data matrix)
idx = ind{cIdx}(idx);
%# output text
txt = {
sprintf('SL: %g', data(idx,1)) ;
sprintf('SW: %g', data(idx,2)) ;
sprintf('PL: %g', data(idx,3)) ;
sprintf('PW: %g', data(idx,4)) ;
sprintf('Index: %d', idx) ;
sprintf('Class: %s', labels{clustIdx(idx)}) ;
};
end
end
Here is how it looks in both 2D and 3D views (with different display styles):
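To tie this back to the question's own variables, the tooltip can report the fulldata row by looking up the plotted point nearest to the cursor position and indexing into indX. This is only a rough sketch: the data below are made-up stand-ins, clustIDX is a random placeholder for the kmeans output, and argmin and rowOfPoint are hypothetical helper names.

```matlab
% Made-up stand-ins for the question's variables
fulldata = rand(5000, 6);
indX     = randperm(size(fulldata, 1));
indX     = indX(1:1000);                 % sampled row numbers
data     = fulldata(indX, :);
clustIDX = randi(6, size(data, 1), 1);   % stand-in for the kmeans output

% Hypothetical helpers: index of the smallest value, and the fulldata row
% of the plotted point closest to a clicked position
argmin     = @(v) find(v == min(v), 1);
rowOfPoint = @(pos) indX(argmin(sum(bsxfun(@minus, data(:, 1:numel(pos)), pos).^2, 2)));

figure
scatter3(data(:,1), data(:,2), data(:,3), 5, clustIDX, 'filled')
view(3), grid on

% Tooltip text: the original row number of the selected point
hDCM = datacursormode(gcf);
set(hDCM, 'UpdateFcn', @(~, evt) sprintf('fulldata row: %d', rowOfPoint(get(evt, 'Position'))), ...
    'Enable', 'on')
```

Searching by position rather than relying on a data index keeps the sketch independent of how the scatter object reports the selected point.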
Related
I have written a 2D histogram algorithm for 2 matlab vectors. Unfortunately, I cannot figure out how to vectorize it, and it is about an order of magnitude too slow for my needs. Here is what I have:
function [ result ] = Hist2D( vec0, vec1 )
%Hist2D takes two vectors, and computes the two dimensional histogram
% of those images. It assumes vectors are non-negative, and bins
% are the integers.
%
% OUTPUTS
% result -
% size(result) = 1 + [max(vec0) max(vec1)]
% result(i,j) = number of pixels that have value
% i-1 in vec0 and value j-1 in vec1.
result = zeros(max(vec0)+1, max(vec1)+1);
fvec0 = floor(vec0)+1;
fvec1 = floor(vec1)+1;
% UGH, This is gross, there has to be a better way...
for i = 1 : numel(fvec0)
result(fvec0(i), fvec1(i)) = 1 + result(fvec0(i), fvec1(i));
end
end
Thoughts?
Thanks!!
John
Here is my version for a 2D histogram:
%# some random data
X = randn(2500,1);
Y = randn(2500,1)*2;
%# bin centers (integers)
xbins = floor(min(X)):1:ceil(max(X));
ybins = floor(min(Y)):1:ceil(max(Y));
xNumBins = numel(xbins); yNumBins = numel(ybins);
%# map X/Y values to bin indices
Xi = round( interp1(xbins, 1:xNumBins, X, 'linear', 'extrap') );
Yi = round( interp1(ybins, 1:yNumBins, Y, 'linear', 'extrap') );
%# limit indices to the range [1,numBins]
Xi = max( min(Xi,xNumBins), 1);
Yi = max( min(Yi,yNumBins), 1);
%# count number of elements in each bin
H = accumarray([Yi(:) Xi(:)], 1, [yNumBins xNumBins]);
%# plot 2D histogram
imagesc(xbins, ybins, H), axis on %# axis image
colormap hot; colorbar
hold on, plot(X, Y, 'b.', 'MarkerSize',1), hold off
Note that I removed the "non-negative" restriction but kept integer bin centers (this could easily be changed to divide the range into a specified number of equally sized bins instead).
This was mainly inspired by @SteveEddins blog post.
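For the question's original case (non-negative values, integer bins), the same accumarray idea reduces to a single call. A minimal sketch, assuming vec0 and vec1 are equal-length vectors; the sample values are made up for illustration:

```matlab
% Made-up sample input: two equal-length vectors of non-negative values
vec0 = [0 1 1 2 2 2];
vec1 = [0 0 3 1 1 0];

% Shift to 1-based bin indices, as in the question
i0 = floor(vec0(:)) + 1;
i1 = floor(vec1(:)) + 1;

% accumarray adds 1 at every (i0, i1) subscript pair; the third argument
% fixes the output size at 1 + [max(vec0) max(vec1)]
result = accumarray([i0 i1], 1, [max(i0) max(i1)]);
```

Here result(i,j) counts the positions where vec0 equals i-1 and vec1 equals j-1, which matches the contract in the Hist2D header comment.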
You could do something like:
max0 = max(fvec0);
max1 = max(fvec1);
% Combine the vectors into linear indices (fvec0 and fvec1 are already 1-based)
combined = fvec0 + (fvec1 - 1) * max0;
% Generate a 1D histogram with one bin centered on each linear index
hist_1d = hist(combined, 1:max0*max1);
% Convert back to a 2D histogram
hist_2d = reshape(hist_1d, [max0 max1]);
(Note: untested)
This is a continuation of the question already posted here. I used the method that @Andrey suggested, but there seems to be a limitation: the set(handle, 'XData', x) command works only as long as x is a vector. What if x is a matrix?
Let me explain with an example.
Say we want to draw 3 rectangles whose vertices are given by the matrices x_vals (a 5x3 matrix) and y_vals (a 5x3 matrix). The command used to plot them is simply plot(x,y).
Now we want to update the above plot. This time we want to draw 4 rectangles, whose vertices are in the matrices x_new (a 5x4 matrix) and y_new (a 5x4 matrix) that we obtain after some calculations. Using the command set(handle, 'XData', x, 'YData', y) after updating x and y with the new values results in an error that states
Error using set
Value must be a column or row vector
Any way to solve this problem?
function [] = visualizeXYZ_struct_v3(super_struct, start_frame, end_frame)
% create first instance
no_objs = length(super_struct(1).result);
x = zeros(1,3000);
y = zeros(1,3000);
box_x = zeros(5, no_objs);
box_y = zeros(5, no_objs);
fp = 1;
% cascade values across structures in a frame so it can be plot at once;
for i = 1:1:no_objs
XYZ = super_struct(1).result(i).point_xyz;
[r,~] = size(XYZ);
x(fp:fp+r-1) = XYZ(:,1);
y(fp:fp+r-1) = XYZ(:,2);
% z(fp:fp+r-1) = XYZ(:,3);
fp = fp + r;
c = super_struct(1).result(i).box;
box_x(:,i) = c(:,1);
box_y(:,i) = c(:,2);
end
x(fp:end) = [];
y(fp:end) = [];
fig = figure('position', [50 50 1280 720]);
hScatter = scatter(x,y,1);
hold all
hPlot = plot(box_x,box_y,'r');
axis([-10000, 10000, -10000, 10000])
xlabel('X axis');
ylabel('Y axis');
hold off
grid off
title('Filtered Frame');
tic
for num = start_frame:1:end_frame
no_objs = length(super_struct(num).result);
x = zeros(1,3000);
y = zeros(1,3000);
box_x = zeros(5, no_objs);
box_y = zeros(5, no_objs);
fp = 1;
% cascade values across structures in a frame so it can be plot at once;
for i = 1:1:no_objs
XYZ = super_struct(num).result(i).point_xyz;
[r,~] = size(XYZ);
x(fp:fp+r-1) = XYZ(:,1);
y(fp:fp+r-1) = XYZ(:,2);
fp = fp + r;
c = super_struct(num).result(i).box;
box_x(:,i) = c(:,1);
box_y(:,i) = c(:,2);
end
x(fp:end) = [];
y(fp:end) = [];
set(hScatter, 'XData', x, 'YData', y);
set(hPlot, 'XData', box_x, 'YData', box_y); % This is where the error occurs
end
toc
end
Each line on the plot has its own XData and YData properties, and each can be set to a vector individually. See the reference. I am not at a Matlab console right now, but as I recall...
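kidnum = 1;
h_axis = gca; % current axes - lines are children of the axes
kids = get(h_axis,'Children'); % column vector of handles, most recently created first
for kid = kids' % transpose: a for loop iterates over columns
kid_type = get(kid,'Type');
if strcmp(kid_type, 'line') % compare strings with strcmp, not ==
set(kid,'XData',x_new(:,kidnum))
set(kid,'YData',y_new(:,kidnum))
kidnum = kidnum+1;
end
end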
Hope that helps! See also the overall reference to graphics objects and properties.
To add a series, say
hold on % so each "plot" won't touch the lines that are already there
plot(x_new(:,end), y_new(:,end)) % or whatever parameters you want to plot
After that, the new series will be a child of h_axis and can be modified.
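An alternative that keeps a single handle, however many rectangles each frame has, is to string all rectangles into one NaN-separated vector, since plot breaks the line wherever it meets a NaN. A rough sketch: the vertex matrices below are made up, and nanJoin is a hypothetical helper name.

```matlab
% Made-up vertices for 4 closed rectangles (5 points each, one per column)
bx = [0 1 1 0 0]';
by = [0 0 1 1 0]';
x_new = [bx, bx+2, bx+4, bx+6];
y_new = [by, by,   by+2, by+2];

% Hypothetical helper: append a NaN row, then flatten column by column
nanJoin = @(M) reshape([M; nan(1, size(M,2))], [], 1);

xv = nanJoin(x_new);
yv = nanJoin(y_new);

% One line object draws every rectangle...
hPlot = plot(xv, yv, 'r');
% ...and later frames only need vector-valued XData/YData,
% even if the number of rectangles changes
set(hPlot, 'XData', nanJoin(x_new + 1), 'YData', nanJoin(y_new + 1));
```

Because XData and YData are always plain vectors here, the "Value must be a column or row vector" error does not arise.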
OK, this may sound confusing, but I will try my best to make it clear. I have a full dataset called fulldata, which is 494021x6.
I use svds (singular value decomposition) on it like so:
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
I then randomly select 1000 rows from the fulldata:
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
%# pick columns in a set order (2,4,5,3,6,1)
indY = [2 4 5 3 6 1];
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
I then apply normalization to this randomly selected 1000 rows:
% apply normalization method to every cell
maxData = max(max(data));
minData = min(min(data));
data = ((data-minData)./(maxData));
I then output a datasample from the original fulldata set which matches the 1000 selected rows:
% output matching data
dataSample = fulldata(indX, :)
Also note that when I picked "random rows" I also output the indX rows which match the rows in the fulldata.
So datasample looks like this:
Which is the 1000 random rows which match the original fulldata.
And indX looks like this:
Which is the corresponding row number from fulldata.
The problem I'm running into comes when I use K-Means to cluster the 1000 random rows and output the data of each cluster like so:
%% generate sample data
K = 6;
numObservarations = size(data, 1);
dimensions = 3;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
% output the contents of each cluster
K1 = data(clustIDX==1,:)
K2 = data(clustIDX==2,:)
K3 = data(clustIDX==3,:)
K4 = data(clustIDX==4,:)
K5 = data(clustIDX==5,:)
K6 = data(clustIDX==6,:)
How can I match K1, K2... K6 to the corresponding indX row numbers? For instance, K1's output looks like this:
I was hoping to have extra outputs such as K1-indX, which is just a list of the row numbers from indX that correspond to the cluster data in K1, K2, etc. Or, preferably, to append the indX row number to the K1, K2 output as column 7.
For instance:
K1 cluster data | Belongs to fulldata row number
0.4 0.5 0.6 0.4 | 456456 etc
An example to illustrate:
%# lets use an example data of size 150x4
load fisheriris
fulldata = meas;
%# pick 100 rows at random
rIdx = randperm(size(fulldata,1));
rIdx = rIdx(1:100)'; %#'
data = fulldata(rIdx,:);
%# cluster the subset data
K = 3;
clustIDX = kmeans(data, K);
%# divide the data according to which cluster instances were assigned to
groupedIdx = cell(K,1);
groupedData = cell(K,1);
for i=1:K
%# instances
groupedData{i} = data(clustIDX==i,:);
%# corresponding row indices into the original fulldata
groupedIdx{i} = rIdx(clustIDX==i);
end
%# check: these two should be equal
groupedData{1}(1,:)
fulldata(groupedIdx{1}(1),:)
Unless I am mis-interpreting something above, you already have (in indX) the fulldata row numbers... All you need to do to see, for example, the rows from fulldata in cluster 1 is:
fulldata(indX(clustIDX == 1), :)
kmeans does not re-order the data, so each row 1:1000 of clustIDX still corresponds to the same row 1:1000 of data / datasample that you started with.
Said another way, clustIDX is going to be a vector of length 1000 where each element is the (integer) cluster assignment for that row. Thus you can use this for logical indexing anywhere you have 1000 rows in an order corresponding to the sample data you used for clustering.
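The same logical indexing also covers the "column 7" request from the question. A small sketch, with made-up stand-ins for fulldata, indX, and the cluster assignments (a hand-written clustIDX in place of the kmeans output); note that indX must be a column vector for the concatenation to work:

```matlab
% Made-up stand-ins: 8 sampled rows out of a 20x6 "fulldata"
fulldata = reshape(1:120, 20, 6);
indX     = [3 17 5 11 20 1 8 14]';     % column vector of sampled row numbers
data     = fulldata(indX, :);
clustIDX = [1 2 1 3 2 1 3 2]';         % pretend output of kmeans(data, 3)

K = 3;
clusterOut = cell(K, 1);
for k = 1:K
    inK = (clustIDX == k);
    % cluster rows in columns 1-6, originating fulldata row in column 7
    clusterOut{k} = [data(inK, :), indX(inK)];
end
```

Each clusterOut{k} then plays the role of K1, K2, and so on, with the fulldata row number carried along in its last column.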
I have a 1000x6 dataset, and the kmeans script below works fine, but when I output one of the clusters it comes out as only one column?
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',6);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
%d = sum((x-y).^2).^0.5
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find for all instances the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
% Output cluster data to K datasets
K1 = data(clustIDX==1)
K2 = data(clustIDX==2)... etc
Shouldn't K1 = data(clustIDX==1) output the full row information? Not just one column, but six like the original dataset? Or is this just outputting the distances?
Replace
K1 = data(clustIDX==1)
K2 = data(clustIDX==2)
with
K1 = data(clustIDX==1,:)
K2 = data(clustIDX==2,:)
The first form uses linear indexing, so it retrieves only the first column of the matching rows. The second form returns every column of those rows; I've tried it and it works.
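The difference is easy to see on a tiny matrix. A minimal illustration with made-up values, where mask stands in for clustIDX==1:

```matlab
A = [ 1  2  3;
      4  5  6;
      7  8  9;
     10 11 12];
mask = logical([1; 0; 1; 0]);   % one flag per row

colOnly  = A(mask);      % linear indexing: flags map to elements 1..4, i.e. column 1
fullRows = A(mask, :);   % row indexing: every column of the flagged rows
```

colOnly contains just the first-column entries of rows 1 and 3, while fullRows contains those two rows in full.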
I have two clusters of data. Each point has x,y coordinates and a value giving its type (1 for class 1, 2 for class 2). I have plotted the data, but I would like to separate the classes visually with a boundary. What function does this? I tried contour, but it did not help!
Consider this classification problem (using the Iris dataset):
As you can see, except for easily separable clusters for which you know the equation of the boundary beforehand, finding the boundary is not a trivial task...
One idea is to use the discriminant analysis function classify to find the boundary (you have a choice between linear and quadratic boundary).
The following is a complete example to illustrate the procedure. The code requires the Statistics Toolbox:
%# load Iris dataset (make it binary-class with 2 features)
load fisheriris
data = meas(:,1:2);
labels = species;
labels(~strcmp(labels,'versicolor')) = {'non-versicolor'};
NUM_K = numel(unique(labels)); %# number of classes
numInst = size(data,1); %# number of instances
%# visualize data
figure(1)
gscatter(data(:,1), data(:,2), labels, 'rb', '*o', ...
10, 'on', 'sepal length', 'sepal width')
title('Iris dataset'), box on, axis tight
%# params
classifierType = 'quadratic'; %# 'quadratic', 'linear'
npoints = 100;
clrLite = [1 0.6 0.6 ; 0.6 1 0.6 ; 0.6 0.6 1];
clrDark = [0.7 0 0 ; 0 0.7 0 ; 0 0 0.7];
%# discriminant analysis
%# classify the grid space of these two dimensions
mn = min(data); mx = max(data);
[X,Y] = meshgrid( linspace(mn(1),mx(1),npoints) , linspace(mn(2),mx(2),npoints) );
X = X(:); Y = Y(:);
[C,err,P,logp,coeff] = classify([X Y], data, labels, classifierType);
%# find incorrectly classified training data
[CPred,err] = classify(data, data, labels, classifierType);
bad = ~strcmp(CPred,labels);
%# plot grid classification color-coded
figure(2), hold on
image(X, Y, reshape(grp2idx(C),npoints,npoints))
axis xy, colormap(clrLite)
%# plot data points (correctly and incorrectly classified)
gscatter(data(:,1), data(:,2), labels, clrDark, '.', 20, 'on');
%# mark incorrectly classified data
plot(data(bad,1), data(bad,2), 'kx', 'MarkerSize',10)
axis([mn(1) mx(1) mn(2) mx(2)])
%# draw decision boundaries between pairs of clusters
for i=1:NUM_K
for j=i+1:NUM_K
if strcmp(coeff(i,j).type, 'quadratic')
K = coeff(i,j).const;
L = coeff(i,j).linear;
Q = coeff(i,j).quadratic;
f = sprintf('0 = %g + %g*x + %g*y + %g*x^2 + %g*x.*y + %g*y.^2',...
K,L,Q(1,1),Q(1,2)+Q(2,1),Q(2,2));
else
K = coeff(i,j).const;
L = coeff(i,j).linear;
f = sprintf('0 = %g + %g*x + %g*y', K,L(1),L(2));
end
h2 = ezplot(f, [mn(1) mx(1) mn(2) mx(2)]);
set(h2, 'Color','k', 'LineWidth',2)
end
end
xlabel('sepal length'), ylabel('sepal width')
title( sprintf('accuracy = %.2f%%', 100*(1-sum(bad)/numInst)) )
hold off