MATLAB: K-means clustering

I have a matrix A (369x10) that I want to cluster into 19 clusters.
I use this call:
[idx, ctrs] = kmeans(A,19)
which yields
idx (369x1) and ctrs (19x10).
I understand it up to here: every row of A is assigned to one of the 19 clusters.
Now I have an array B (49x10). I want to know which of these 19 clusters each row of B corresponds to.
How can I do this in MATLAB?
Thank you in advance.

The following is a complete example of clustering:
%% generate sample data
K = 3;
numObservarations = 100;
dimensions = 3;
data = rand([numObservarations dimensions]);
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
% squared Euclidean distance: d = sum((x-y).^2)
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find for all instances the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)

I can't think of a better way to do it than what you described. A built-in function would save one line, but I couldn't find one. Here's the code I would use:
[ids ctrs]=kmeans(A,19);
D = dist([testpoint;ctrs]); %testpoint is 1x10 and D will be 20x20
[distance testpointID] = min(D(1,2:end));

I don't know if I understand you correctly, but if you want to know which cluster your points belong to, you can use the knnsearch function. Called with two arguments, it searches the rows of the first argument for the one closest to each row of the second.
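A minimal sketch of that idea, assuming the Statistics Toolbox is available. The random matrices are stand-ins for A and B, and the names clusterOfB and distToCtr are made up for illustration; they are not from the question.

```matlab
% Assign each row of B to the nearest k-means centroid with knnsearch.
rng(0);                        % reproducible stand-in data (illustrative only)
A = rand(369, 10);             % stand-in for the clustered matrix
B = rand(49, 10);              % stand-in for the new rows
[idx, ctrs] = kmeans(A, 19);   % ctrs is 19x10, one centroid per row

% For every row of B, find the index of the closest row of ctrs
% (Euclidean distance by default) and the distance to it.
[clusterOfB, distToCtr] = knnsearch(ctrs, B);
```

clusterOfB should then be a 49x1 vector of cluster numbers between 1 and 19, in the same numbering that kmeans used for idx.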

Assuming you're using the squared Euclidean distance metric, try this:
for i = 1:size(ctrs,1) % loop over the centroids (rows of ctrs)
d(:,i) = sum((B-ctrs(repmat(i,size(B,1),1),:)).^2,2);
end
[distances,predicted] = min(d,[],2)
predicted should then contain the index of the closest centroid, and distances should contain the distances to the closest centroid.
Take a look inside the kmeans function, at the subfunction 'distfun'. This shows you how to do the above, and also contains the equivalents for other distance metrics.
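For what it's worth, here is a small sketch that compares the loop with a single pdist2 call. It assumes the Statistics Toolbox, and ctrs and B are random stand-ins rather than the asker's data.

```matlab
rng(0);
ctrs = rand(19, 10);   % hypothetical centroids (19 clusters, 10 dimensions)
B    = rand(49, 10);   % hypothetical new observations

% Loop version: one column of squared distances per centroid.
d = zeros(size(B,1), size(ctrs,1));
for i = 1:size(ctrs,1)
    d(:,i) = sum((B - repmat(ctrs(i,:), size(B,1), 1)).^2, 2);
end
[distances, predicted] = min(d, [], 2);

% Same computation in one call (requires the Statistics Toolbox).
d2 = pdist2(B, ctrs, 'squaredeuclidean');   % 49x19
[distances2, predicted2] = min(d2, [], 2);
```

Both versions should give the same assignments; the pdist2 form is mostly a matter of readability.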

For a small amount of data, you could do
[testpointID,dum] = find(permute(all(bsxfun(@eq,B,permute(ctrs,[3,2,1])),2),[3,1,2]))
but this is somewhat obscure: bsxfun with the permuted ctrs creates a 49 x 10 x 19 array of booleans, which is then reduced with all across the second dimension and permuted back, after which the row ids are found. Again, this is probably not practical for large amounts of data.

Related

Creating Clusters in matlab

Suppose that I have generated some data in matlab as follows:
n = 100;
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
plot(x,y,'rx')
axis([0 100 0 1])
Now I want to write an algorithm that groups these data into clusters (the number of clusters is arbitrary) such that a point is a member of a cluster only if its distance to at least one other member of that cluster is less than 10. How could I write the code?

The clustering method you are describing is DBSCAN. Note that this algorithm will find only one cluster in the provided data, since it is very unlikely that the dataset contains a point whose distance to all other points is more than 10.
If this is really what you want, you can use DBSCAN, or the implementation posted on the File Exchange (FE) if you are using a version older than 2019a.
% Generating random points, almost similar to the data provided by OP
data = bsxfun(@times, rand(100, 2), [100 1]);
% Adding more random points
for i=1:5
mu = rand(1, 2)*100 -50;
A = rand(2)*5;
sigma = A*A'+eye(2)*(1+rand*2);%[1,1.5;1.5,3];
data = [data;mvnrnd(mu,sigma,20)];
end
% clustering using DBSCAN, with epsilon = 10, and min-points = 1 as
idx = DBSCAN(data, 10, 1);
% plotting clusters
numCluster = max(idx);
colors = lines(numCluster);
scatter(data(:, 1), data(:, 2), 30, colors(idx, :), 'filled')
title(['No. of Clusters: ' num2str(numCluster)])
axis equal
The numbers in the figure above show the distance between the closest pair of points in any two different clusters.

The Matlab built-in function clusterdata() works well for what you're asking.
Here is how to apply it to your example:
% number of points
n = 100;
% create the data
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
% the number of clusters you want to create
num_clusters = 5;
T1 = clusterdata(data,'Criterion','distance',...
'Distance','euclidean',...
'MaxClust', num_clusters)
scatter(x, y, 100, T1,'filled')
In this case, I used 5 clusters and used the Euclidean distance to be the metric to group the data points, but you can always change that (see documentation of clusterdata())
See the result below for 5 clusters with some random data.
Note that the data is skewed (x-values are from 0 to 100, and y-values are from 0 to 1), so the results are also skewed, but you could always normalize your data.
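As a rough sketch of the normalization mentioned above (one option among several, assuming the Statistics Toolbox for clusterdata; dataN and T2 are names made up here), each column can be standardized before clustering:

```matlab
rng(0);
n = 100;
data = [randi(n, [n 1]), rand(n, 1)];   % same kind of data as in the question

% Standardize each column to zero mean and unit standard deviation
% so that x (0..100) and y (0..1) contribute on a comparable scale.
mu = mean(data, 1);
sd = std(data, 0, 1);
dataN = (data - repmat(mu, n, 1)) ./ repmat(sd, n, 1);

num_clusters = 5;
T2 = clusterdata(dataN, 'MaxClust', num_clusters);

% Plot in the original coordinates, colored by the new labels.
scatter(data(:,1), data(:,2), 100, T2, 'filled')
```

Whether standardizing is appropriate depends on what a distance of 1 along y should mean relative to x, so treat this as a starting point.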

Here is a way using the connected components of a graph:
D = pdist2(data, data) < 10;
D(1:size(D,1)+1:end) = 0;
G = graph(D);
C = conncomp(G);
The connected components vector C holds the cluster number of each point.
Use pdist2 to compute the pairwise distance matrix of the points (the rows of data = [x y]).
Use the distance matrix to create a logical adjacency matrix in which two points are neighbors if the distance between them is less than 10.
Set the diagonal elements of the adjacency matrix to 0 to eliminate self loops.
Create a graph from the adjacency matrix.
Compute the connected components of the graph.
Note that pdist2 may not be practical for large datasets; in that case you need another method that forms a sparse adjacency matrix.
After posting my answer I noticed that the answer provided by @saastn suggests the DBSCAN algorithm, which follows nearly the same approach.
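For the large-dataset case, one possible way to build the sparse adjacency matrix is rangesearch from the Statistics Toolbox. This is only a sketch under that assumption; note that rangesearch includes points at exactly the given radius, so it implements "at most 10" rather than "less than 10". The variable names are illustrative.

```matlab
rng(0);
n = 100;
data = [randi(n, [n 1]), rand(n, 1)];

% For every point, list the indices of all points within distance 10.
nbrs = rangesearch(data, data, 10);   % n-by-1 cell array of row vectors

% Turn the neighbor lists into (row, column) pairs of a sparse matrix.
counts = cellfun(@numel, nbrs);       % number of neighbors per point
rows = repelem((1:n).', counts);      % source index, repeated once per neighbor
cols = [nbrs{:}].';                   % neighbor indices
adj = sparse(rows, cols, 1, n, n);
adj = double((adj + adj.') > 0);      % force exact symmetry
adj(1:n+1:end) = 0;                   % remove self loops

C = conncomp(graph(adj));             % cluster label per point
numClusters = max(C);
```

Only the neighbor pairs are stored, so memory grows with the number of close pairs rather than with n squared.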

Matlab: Fit of histogram data with many Gaussians and AIC evaluation

Consider this example code, which finds the best fit to the data by varying the number of fitted Gaussians and selecting by the Akaike information criterion (AIC):
MU1 = [1];
SIGMA1 = [2];
MU2 = [-3];
SIGMA2 = [1 ];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
AIC = zeros(1,4);
obj = cell(1,4);
options = statset('Display','final');
for k = 1:4
obj{k} = gmdistribution.fit(X,k,'Options',options);
AIC(k)= obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC)
I want to do the same thing but with data that are given in a form of a histogram (consider for example the data http://pastebin.com/embed_js.php?i=1mNRuEHZ).
What is the most direct way to implement the same procedure in matlab in this case?

If I'm getting you right, then your problem is to convert between data that is already compiled as a histogram (so numbers of observations paired with the actual value of an observation) and the original individual observations. Of course, when compiling the histogram, you have lost two things:
Order. You don't know the order of the observations in the original data, which is probably not important provided your observations are independent. Also, as far as I understand gmdistribution.fit(), it doesn't take order into account anyway.
Resolution. When you create a histogram, you need to bin your data, which makes you lose precision, so to speak, because it is impossible to recover the precise values of your observations from the bins.
Once you are aware of that, you can create a 'vector of observations' from your histogram data. Say X1 is your histogram data (an Nx2 matrix). If you do
invX = cell2mat(arrayfun(@(x,y) repmat(y,1,x), abs(int16(1000*X1(:, 2)))', X1(:, 1)', ...
'UniformOutput', false))';
you get a vector that contains individual observations, just like X in your example.
Note that you have to convert the bin counts to integers first. At this step, because the given data's precision is quite high, I had to round to make the computation possible for my machine. However, the final result seems fairly reasonable.
Also note that I used absolute values: there are some cases in your histogram data where the count is actually negative, which obviously doesn't make sense for a histogram.
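To make the expansion step concrete, here is a tiny toy example with made-up bin values and integer counts. It uses repelem (available since R2015a) instead of the arrayfun/repmat combination; binCenters, binCounts and obs are hypothetical names.

```matlab
binCenters = [-1 0 1 2];   % value represented by each bin (toy data)
binCounts  = [ 2 0 3 1];   % integer number of observations per bin

% Repeat each bin value as many times as its count says.
obs = repelem(binCenters, binCounts).';
% obs is the column vector [-1; -1; 1; 1; 1; 2]
```

The resulting column vector can be passed to the fitting routine in the same way as X in the question.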
Last but not least you have to change the number of iterations for the fit procedure to 1000. The final code to produce the below figure reads
MU1 = [1];
SIGMA1 = [2];
MU2 = [-3];
SIGMA2 = [1 ];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
X = X1(:, 2);
invX = cell2mat(arrayfun(@(x,y) repmat(y,1,x), abs(int16(1000*X1(:, 2)))', X1(:, 1)', ...
'UniformOutput', false))'; %'
X = invX;
AIC = zeros(1,4);
obj = cell(1,4);
options = statset('Display','final', 'MaxIter', 1000);
for k = 1:4
obj{k} = gmdistribution.fit(X,k,'Options',options);
AIC(k)= obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC);
hold on;
plot(linspace(-1, 2, length(X1(:, 2))), abs(X1(:, 2)), 'LineWidth', 2)
plot(x, pd/max(pd)*double(max(abs(X1(:, 2)))), 'LineWidth', 5);
h = legend('Original data', 'PDF');
set(h,'FontSize',32);
Output looks like this:

MATLAB: How to make 2 histograms have the same bin width?

I am plotting 2 histograms of 2 distributions in 1 figure by Matlab. However, the result shows that 2 histograms do not have the same bin width although I use the same number for bins. How can we make 2 histograms have the same bin width?
My code is simple like this:
% a     : samples of distribution one
% b     : samples of distribution two
% nbins : number of bins
[c,d] = hist(a,nbins);
[e,f] = hist(b,nbins);
%Plotting
bar(d,c);hold on;
bar(f,e);hold off;

This can be done by simply using the bin centers from one call to hist as the bins for the other,
for example:
[aCounts,aBins] = hist(a,nBins);
[bCounts,bBins] = hist(b,aBins);
note that all(aBins==bBins) = 1
However, this method will lose information when the min and max values of the two data sets are not similar*. One simple solution is to create the bins from the combined data:
[~ , bins] = hist( [a(:);b(:)] ,nBins);
aCounts = hist( a , bins );
bCounts = hist( b , bins );
*If the ranges are vastly different, it may be better to create the vector of bin centers manually.
(After re-reading the question:) if it is the bin width you want to control, rather than using identical bins, creating the bin centers manually is probably best.
To do this, create a vector of bin centers to pass to hist.
For example (note that the number of bins is only enforced for one set of data here):
aBins = linspace( min(a(:)) ,max(a(:)) , nBins);
binWidth = aBins(2)-aBins(1);
bBins = min(a):binWidth:max(b)+binWidth/2
and then use
aCounts = hist( a , aBins );
bCounts = hist( b , bBins );

Use histcounts with the 'BinWidth' option:
https://www.mathworks.com/help/matlab/ref/histcounts.html
For example:
data1 = randn(1000,1)*10;
data2 = randn(1000,1);
[hist1,~] = histcounts(data1, 'BinWidth', 10);
[hist2,~] = histcounts(data2, 'BinWidth', 10);
bar(hist1)
bar(hist2)
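If the two histograms should also line up on the same x-axis, one option is to pass the same edges vector to both histcounts calls. This is only a sketch; the bin width of 2 and the names loEdge, hiEdge, edges and centers are arbitrary choices for illustration.

```matlab
rng(0);
data1 = randn(1000,1)*10;
data2 = randn(1000,1);

binWidth = 2;                                   % arbitrary choice for this sketch
allData  = [data1; data2];
loEdge = floor(min(allData)/binWidth)*binWidth; % round down to a multiple of binWidth
hiEdge = ceil(max(allData)/binWidth)*binWidth;  % round up to a multiple of binWidth
edges  = loEdge:binWidth:hiEdge;                % shared bin edges

counts1 = histcounts(data1, edges);
counts2 = histcounts(data2, edges);

centers = edges(1:end-1) + binWidth/2;          % bin centers for plotting
bar(centers, [counts1; counts2].');             % grouped bars, one group per bin
legend('data1', 'data2');
```

Because both count vectors refer to the same bins, they can be compared bin by bin or plotted as grouped bars.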

The behavior of hist is different when the 2nd argument is a vector instead of a scalar.
Instead of specifying a number of bins, specify the bin centers using a vector, as demonstrated in the documentation (see "Specify Bin Intervals"):
rng(0,'twister')
data1 = randn(1000,1)*10;
rng(1,'twister')
data2 = randn(1000,1);
figure
xvalues1 = -40:40;
[c,d] = hist(data1,xvalues1);
[e,f] = hist(data2,xvalues1);
%Plotting
bar(d,c,'b');hold on;
bar(f,e,'r');hold off;
This results in:

EEG in matlab - Graph theory segmentation

I have an EEG dataset and I want to examine it further with Laplacian Eigenmaps. For now, however, I want to find the local maxima and save into a new matrix all the vectors that lie between two local maxima (see picture; I am looking for the black lines). I use the findpeaks function in MATLAB and get a matrix with the peaks, but I do not know how to proceed from there. Thanks in advance!

I am guessing a lot, but are you looking for something like:
%% some data
N = 4; % number of peaks
peakPositions = rand(N,2); % peak positions
%% difference vector matrix
diffMat = zeros(N*(N-1)/2,2);
actPos = 1;
for n = 1:N
diffMat(actPos:actPos+N-n-1,:) = ...
bsxfun(@minus, peakPositions(n+1:end,:), peakPositions(n));
actPos = actPos+N-n;
end
Example:
peakPositions =
0.2630 0.4505
0.6541 0.0838
0.6892 0.2290
0.7482 0.9133
diffMat =
0.3911 -0.1791
0.4262 -0.0340
0.4852 0.6504
0.0351 -0.4251
0.0941 0.2593
0.0589 0.2241
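Another possible reading of the question is that the signal itself should be cut into pieces between consecutive local maxima. A rough sketch of that, using a synthetic sine wave as a stand-in for one EEG channel and assuming the Signal Processing Toolbox for findpeaks (the names sig and segments are made up):

```matlab
t   = linspace(0, 1, 500);
sig = sin(2*pi*5*t);               % synthetic stand-in for one EEG channel

% Sample indices of the local maxima.
[pks, locs] = findpeaks(sig);

% One cell per stretch of signal between two consecutive peaks.
% A cell array is used because the stretches can differ in length.
segments = cell(numel(locs)-1, 1);
for k = 1:numel(locs)-1
    segments{k} = sig(locs(k):locs(k+1));
end
```

If all segments must go into one matrix, they would first have to be padded or resampled to a common length.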

K-means Plotting for 3 Dimensional Data

I'm working with k-means in MATLAB. I am trying to create the plot, but my data is three-dimensional. Here is my k-means code:
clc
clear all
close all
load cobat.txt; % read the file
k=input('Enter a number: '); % determine the number of cluster
isRand=0; % 0 -> sequential initialization
% 1 -> random initialization
[maxRow, maxCol]=size(cobat);
if maxRow<=k,
y=[m, 1:maxRow];
elseif k>7
h=msgbox('cant more than 7');
else
% initial value of centroid
if isRand,
p = randperm(size(cobat,1)); % random initialization
for i=1:k
c(i,:)=cobat(p(i),:);
end
else
for i=1:k
c(i,:)=cobat(i,:); % sequential initialization
end
end
temp=zeros(maxRow,1); % initialize as zero vector
u=0;
while 1,
d=DistMatrix3(cobat,c); % calculate the distance
[z,g]=min(d,[],2); % set the matrix g group
if g==temp, % if the iteration doesn't change anymore
break; % stop the iteration
else
temp=g; % copy the matrix to the temporary variable
end
for i=1:k
f=find(g==i);
if f % calculate the new centroid
c(i,:)=mean(cobat(find(g==i),:),1);
end
end
c
[B,index] = sortrows( c ); % sort the centroids
g = index(g); % arrange the labels based on centroids
end
y=[cobat,g]
hold off;
%This plot is actually placed in plot 3D code (last line), but I put it into here, because I think this is the plotting line
f = PlotClusters(cobat,g,y,Colors) %Here is the error
if Dimensions==2
for i=1:NumOfDataPoints %plot data points
plot(cobat(i,1),cobat(i,2),'.','Color',Colors(g(i),:))
hold on
end
for i=1:NumOfCenters %plot the centers
plot(y(i,1),y(i,2),'s','Color',Colors(i,:))
end
else
for i=1:NumOfDataPoints %plot data points
plot3(cobat(i,1),cobat(i,2),cobat(i,3),'.','Color',Colors(g(i),:))
hold on
end
for i=1:NumOfCenters %plot the centers
plot3(y(i,1),y(i,2),y(i,3),'s','Color',Colors(i,:))
end
end
end
And here is the plot 3D code:
%This function plots clustering data, for example the one provided by
%kmeans. To be able to plot, the number of dimensions has to be either 2 or
%3.
%Inputs:
% Data - an m-by-d matrix, where m is the number of data points to
% cluster and d is the number of dimensions. In my code, it is cobat
% IDX - an m-by-1 indices vector, where each element gives the
% cluster to which the corresponding data point in Data belongs. In my file, it is 'g'
% Centers y - an optional c-by-d matrix, where c is the number of
% clusters and d is the dimensions of the problem. The matrix
% gives the location of the cluster centers. If this is not
% given, the centers will be calculated. In my file, I think, it is 'y'
% Colors - an optional color scheme generated by hsv. If this is not
% given, a color scheme will be generated.
%
function f = PlotClusters(cobat,g,y,Colors)
%Checking inputs
switch nargin
case 1 %Not enough inputs
error('Clustering data is required to plot clusters. Usage: PlotClusters(Data,IDX,Centers,Colors)')
case 2 %Need to calculate cluster centers and color scheme
[NumOfDataPoints,Dimensions]=size(cobat);
if Dimensions~=2 && Dimensions~=3 %Check ability to plot
error('It is only possible to plot in 2 or 3 dimensions.')
end
if length(g)~=NumOfDataPoints %Check that each data point is assigned to a cluster
error('The number of data points in Data must be equal to the number of indices in IDX.')
end
NumOfClusters=max(g);
Centers=zeros(NumOfClusters,Dimensions);
NumOfCenters=NumOfClusters;
NumOfPointsInCluster=zeros(NumOfClusters,1);
for i=1:NumOfDataPoints
Centers(g(i),:)=y(g(i),:)+cobat(i,:);
NumOfPointsInCluster(g(i))=NumOfPointsInCluster(g(i))+1;
end
for i=1:NumOfClusters
y(i,:)=y(i,:)/NumOfPointsInCluster(i);
end
Colors=hsv(NumOfClusters);
case 3 %Need to calculate color scheme
[NumOfDataPoints,Dimensions]=size(cobat);
if Dimensions~=2 && Dimensions~=3 %Check ability to plot
error('It is only possible to plot in 2 or 3 dimensions.')
end
if length(g)~=NumOfDataPoints %Check that each data point is assigned to a cluster
error('The number of data points in Data must be equal to the number of indices in IDX.')
end
NumOfClusters=max(g);
[NumOfCenters,Dims]=size(y);
if Dims~=Dimensions
error('The number of dimensions in Data should be equal to the number of dimensions in Centers')
end
if NumOfCenters<NumOfClusters %Check that each cluster has a center
error('The number of cluster centers is smaller than the number of clusters.')
elseif NumOfCenters>NumOfClusters %Check that each cluster has a center
disp('There are more centers than clusters, all will be plotted')
end
Colors=hsv(NumOfCenters);
case 4 %All data is given just need to check consistency
[NumOfDataPoints,Dimensions]=size(cobat);
if Dimensions~=2 && Dimensions~=3 %Check ability to plot
error('It is only possible to plot in 2 or 3 dimensions.')
end
if length(g)~=NumOfDataPoints %Check that each data point is assigned to a cluster
error('The number of data points in Data must be equal to the number of indices in IDX.')
end
NumOfClusters=max(g);
[NumOfCenters,Dims]=size(y);
if Dims~=Dimensions
error('The number of dimensions in Data should be equal to the number of dimensions in Centers')
end
if NumOfCenters<NumOfClusters %Check that each cluster has a center
error('The number of cluster centers is smaller than the number of clusters.')
elseif NumOfCenters>NumOfClusters %Check that each cluster has a center
disp('There are more centers than clusters, all will be plotted')
end
[NumOfColors,RGB]=size(Colors);
if RGB~=3 || NumOfColors<NumOfCenters
error('Colors should have at least the same number of rows as number of clusters and 3 columns')
end
end
%Data is ready. Now plotting
end
Here is the error:
??? Undefined function or variable 'Colors'.
Error in ==> clustere at 69
f = PlotClusters(cobat,g,y,Colors)
Am I wrong to call the function like that? What should I do? Your help would be greatly appreciated.

Your code is very messy and unnecessarily long.
Here is a smaller example that does the same thing. You'll need the Statistics Toolbox to run it (for the kmeans function and the Iris dataset):
%# load dataset of 150 instances and 3 dimensions
load fisheriris
X = meas(:,1:3);
[numInst,numDims] = size(X);
%# K-means clustering
%# (K: number of clusters, G: assigned groups, C: cluster centers)
K = 3;
[G,C] = kmeans(X, K, 'distance','sqEuclidean', 'start','sample');
%# show points and clusters (color-coded)
clr = lines(K);
figure, hold on
scatter3(X(:,1), X(:,2), X(:,3), 36, clr(G,:), 'Marker','.')
scatter3(C(:,1), C(:,2), C(:,3), 100, clr, 'Marker','o', 'LineWidth',3)
hold off
view(3), axis vis3d, box on, rotate3d on
xlabel('x'), ylabel('y'), zlabel('z')

You could simply go for scatter():
As you can see from the image, you can differentiate the clusters by color and marker size. For more details, check out the examples in the documentation.

Here is sample code (Python with matplotlib) showing how to get a 3-D scatter plot.
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x =[1,2,3,4,5,6,7,8,9,10]
y =[5,6,2,3,13,4,1,2,4,8]
z =[2,3,3,3,5,7,9,11,9,10]
ax.scatter(x, y, z, c='r', marker='o')
ax.set_xlabel('X Label')
ax.set_ylabel('Y Label')
ax.set_zlabel('Z Label')
plt.show()