I am trying to learn the k-means clustering algorithm in MATLAB without using the built-in kmeans function. Say I have data of size 1x100 and I want to group it into two clusters. How can I do this? I want to visualize the two centroids and the data together on one plot in MATLAB.
Note : When I plot in MATLAB, I am able to see only data but not the data and two centroids simultaneously.
Any help in this regard is highly appreciated.
A minimal K-means clustering algorithm in matlab could be:
p = rand(100,2); % rand(number_of_points,number_of_dimension)
c = p(1:3,:); % We create 3 centroids
% We run this minimal k-means algorithm:
for ii = 1:10
    % Which centroid is the closest to each point? min(Euclidean distance):
    [~,idx] = min(sum((permute(p,[3,2,1])-c).^2,2),[],1);
    % We calculate the new centroids (the center of mass of the corresponding points)
    % mean(x,1) keeps a row vector even when a cluster holds a single point
    c = splitapply(@(x) mean(x,1),p,idx(:));
end
And we can plot the result if needed:
hold on
scatter(p(:,1),p(:,2),[],idx(:))
scatter(c(:,1),c(:,2),[],'red')
The resulting plot shows the 3 centroids in red and each cluster in a distinct color.
Note that in this example the data are 2-dimensional, but the code also works with any other number of dimensions.
The 3 initial centroids are 3 points of the dataset (the first three rows, which are random because p is random). This ensures that every centroid starts out as the closest centroid for at least one point.
This example runs 10 iterations, but it is better to define a tolerance and stop iterating once the centroids have converged.
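The tolerance-based stop suggested above could look roughly like the sketch below. The names tol and maxIter and their values are assumptions chosen for illustration, not part of the original answer, and the per-cluster loop replaces splitapply only so that an empty cluster keeps its previous centroid instead of raising an error. The subtraction relies on implicit expansion (R2016b or later).

```matlab
rng(0);                      % fixed seed so the run is repeatable
p = rand(100, 2);            % 100 points in 2 dimensions
c = p(1:3, :);               % 3 initial centroids taken from the data
tol = 1e-6;                  % assumed tolerance on centroid movement
maxIter = 100;               % assumed safety cap on the number of iterations

for iter = 1:maxIter
    % Assign each point to its closest centroid (squared Euclidean distance)
    [~, idx] = min(sum((permute(p, [3, 2, 1]) - c).^2, 2), [], 1);
    idx = idx(:);

    % Recompute each centroid; an empty cluster keeps its old centroid
    cNew = c;
    for k = 1:size(c, 1)
        members = p(idx == k, :);
        if ~isempty(members)
            cNew(k, :) = mean(members, 1);
        end
    end

    % Largest distance moved by any centroid in this iteration
    shift = max(sqrt(sum((cNew - c).^2, 2)));
    c = cNew;
    if shift < tol
        break;               % centroids have converged
    end
end
```

After the loop, iter holds the number of iterations actually used, and idx and c can be passed to the same scatter calls as above.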
I have a temporal dataset (1000000x70) containing information about the activities of 20 subjects. I need to subsample the dataset, as it has more than a million rows. How can I select a representative set of observations for each subject? Later, I need to apply PCA and k-means to it. Kindly help me with the steps to follow. I'm working in MATLAB.
I'm not really clear on what you're looking for. If you just want to subsample a matrix in MATLAB, here is a way to do it:
myData; % 70 x 1000000 data, one observation per column (transpose first if yours is 1000000 x 70)
nbDataPts = size(myData, 2); % Get the number of points in the data
subsampleRatio = 0.1; % Ratio of data you want to keep
nbSamples = round(subsampleRatio * nbDataPts); % How many points to keep
sampleIdx = round(linspace(1, nbDataPts, nbSamples)); % Evenly spaced indices of the points to keep
sampledData = myData(:, sampleIdx); % Sampling data
Then if you want to apply PCA and K means I suggest you take a look at the relevant documentation:
PCA
K means
Try to work with it, and open a new question if a specific problem arises.
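As a rough sketch of how the steps might fit together, the code below subsamples evenly within each subject and then runs pca and kmeans on the result. It stores one observation per row, which is the layout pca and kmeans expect. The data here is a small random stand-in, and subjectId, nComponents and k are hypothetical names and values chosen only for illustration.

```matlab
% Hypothetical stand-in data: 2000 observations x 70 features, 20 subjects
rng(0);
nObs = 2000;
myData = randn(nObs, 70);
subjectId = repmat((1:20).', nObs / 20, 1);   % assumed: one subject label per row

subsampleRatio = 0.1;                          % fraction of rows kept per subject
keepIdx = [];
for s = unique(subjectId).'
    rows = find(subjectId == s);               % rows of this subject, in original order
    nKeep = max(1, round(subsampleRatio * numel(rows)));
    pick = round(linspace(1, numel(rows), nKeep));   % evenly spaced within the subject
    keepIdx = [keepIdx; rows(pick)];           %#ok<AGROW>
end
sampledData = myData(keepIdx, :);

% PCA on the subsample, then k-means on the leading component scores
[~, score] = pca(sampledData);
nComponents = 5;                               % assumed number of components to keep
k = 4;                                         % assumed number of clusters
idx = kmeans(score(:, 1:nComponents), k);
```

In practice the number of components would be picked from the explained variance that pca returns, and k from what the activities are expected to look like.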
I am having trouble making predictions with a KNN classifier without using a built-in function. I got stuck here and have no idea how to take the next step. Here is my code:
% calculate Euclidean distance
dist = pdist2(test, train, 'euclidean');
for k = [1 3 5 7]
[~, nearest] = sort(dist, 2);
nearst = nearest(:, 1:k);
end % for loop
Here test is a 297x64 matrix and train is a 1500x64 matrix, so the dist matrix is 297x1500. Any help would be appreciated!
You already have the indices sorted by distance in nearst, so all you have to do is look up the labels of the original training data. Somewhere you have a variable labels that holds the true label of each training point. Use the indices stored in nearst to read the labels of the k nearest neighbours and simply report the most common value.
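A minimal sketch of that last step, assuming numeric class labels stored in a column vector (called trainLabels here, a hypothetical name) and using random stand-in data with the sizes from the question. The sort is done once before the loop because it does not depend on k. Note that mode breaks ties by returning the smallest label.

```matlab
rng(1);
train = randn(1500, 64);              % stand-in training data
trainLabels = randi(3, 1500, 1);      % hypothetical labels: classes 1..3
test = randn(297, 64);                % stand-in test data

dist = pdist2(test, train, 'euclidean');
[~, nearest] = sort(dist, 2);         % column j holds each test point's j-th nearest neighbour

ks = [1 3 5 7];
predicted = zeros(size(test, 1), numel(ks));
for j = 1:numel(ks)
    nearst = nearest(:, 1:ks(j));                               % indices of the k nearest
    neighbourLabels = reshape(trainLabels(nearst), size(nearst)); % 297 x k labels
    predicted(:, j) = mode(neighbourLabels, 2);                 % majority vote per test point
end
```

Column j of predicted then holds the predicted class of every test point for k = ks(j).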
I have data that is not linearly separable. I want to cluster it using the k-means implementation in MATLAB and get the cluster label of every data point, to use the labels in another classification problem.
The problem is that k-means is not giving the results I expected. I'm attaching the cluster plot I obtained.
I expected k-means to produce clusters shaped like the concentric circles in the data, but the output was arcs. I don't understand why this is happening.
Can you suggest another clustering method to achieve my goal?
Before using an algorithm, you should try to understand it: what is the goal of an algorithm, and how does it achieve it. For k-means, Wikipedia tells us the following:
k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean
Three concentric circles would have the exact same mean, so k-means is not suitable to separate them. The result is really what you should expect from k-means here.
Now, if you know that your clusters will always be concentric circles, you can simply convert your cartesian (x-y) coordinates to polar coordinates, and use only the radius rho for clustering - as you know that the angle theta doesn't matter:
% Create random data
[x1,y1] = pol2cart(2*pi*rand(1000,1),rand(1000,1));
[x2,y2] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+2);
[x3,y3] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+4);
X = [x1,y1; x2,y2; x3,y3];
% Transform to polar
[theta,rho] = cart2pol(X(:,1),X(:,2));
% k-means clustering
idx = kmeans(rho,3);
% Plot results
hold on
plot(X(idx==1,1), X(idx==1,2), 'r.')
plot(X(idx==2,1), X(idx==2,2), 'g.')
plot(X(idx==3,1), X(idx==3,2), 'b.')
Or more generally: use a suitable kernel for k-means clustering, or use another algorithm.
I have this matrix:
x = [2+2*i 2-2*i -2+2*i -2-2*i];
I want to simulate transmitting it and adding noise to it. I represented the components of the complex number as below:
A = randn(150, 2) + 2*ones(150, 2); C = randn(150, 2) - 2*ones(150, 2);
At the receiver, I received the vector below, where the components are ordered based on what I sent originally (i.e., the components of x).
X = [A A A C C A C C];
Now I want to apply the kmeans(X) to have four clusters, so kmeans(X, 4). I am experiencing the following problems:
I am not sure if I can represent the complex numbers as shown in X above.
I can't plot the result of the kmeans to show the clusters.
I could not understand the clusters centroid results.
How can I find the best error rate, if this example was to represent a communication system and at the receiver, k-means clustering was used in order to decide what the transmitted signal was?
If you don't "understand" the cluster centroid results, then you don't understand how k-means works. I'll present a small summary here.
k-means groups your data into k groups. You start by choosing k random points from your data and labelling them 1, 2, ..., k; these are the initial centroids. You then measure how close every other data point is to each centroid and assign each point to the group of its closest centroid. Next, you update the centroids: for each of the k groups, you compute the average of all points assigned to it, and that average becomes the group's new centroid (its representative point) for the next iteration. In the next iteration, you again determine how close each data point is to each centroid and reassign the points. You keep repeating this until the centroids don't move anymore, or they move very little.
Now, let's answer your questions one-by-one.
1. Complex number representation
k-means in MATLAB doesn't define how complex data is handled. A common way for people to deal with complex numbered data is to split up the real and imaginary parts into separate dimensions as you have done. This is a perfectly valid way to use k-means for complex valued data.
See this post on the MathWorks MATLAB forum for more details: https://www.mathworks.com/matlabcentral/newsreader/view_thread/78306
2. Plot the results
You aren't constructing your matrix X properly. Note that A and C are both 150 x 2 matrices. You need to structure X such that each row is a point and each column is a variable, so you need to concatenate A and C row-wise:
X = [A; A; A; C; C; A; C; C];
Note that you have duplicate points. This is actually no different than doing X = [A; C]; as far as kmeans is concerned. Perhaps you should generate X, then add the noise in rather than taking A and C, adding noise, then constructing your signal.
Now, if you want to plot the results as well as the centroids, what you need to do is use the two output version of kmeans like so:
[idx, centroids] = kmeans(X, 4);
idx will contain the cluster number that each point in X belongs to, and centroids will be a 4 x 2 matrix where each row is the mean of one cluster found in the data. If you want to plot the data as well as the clusters, you simply need to do the following. I'm going to loop over each cluster membership and plot the results on a figure. I'm also going to mark where the mean of each cluster is located:
x = X(:,1);
y = X(:,2);
figure;
hold on;
colors = 'rgbk';
for num = 1 : 4
plot(x(idx == num), y(idx == num), [colors(num) '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
grid;
The above code goes through each cluster, plots each one in a different colour, then plots the centroids in cyan with a larger marker size so that they stand out in the graph.
This is what I get:
3. Understanding centroid results
This is probably because you didn't construct X properly. This is what I get for my centroids:
centroids =
-1.9176 -2.0759
1.5980 2.8071
2.7486 1.6147
0.8202 0.8025
This is pretty self-explanatory and I talked about how this is structured earlier.
4. Best representation of the signal
What you can do is repeat the clustering a number of times, then the algorithm will decide what the best clustering was out of these times. You would simply use the Replicates flag and denote how many times you want this run. Obviously, the more times you run this, the better your results may be. Therefore, do something like:
[idx, centroids] = kmeans(X, 4, 'Replicates', 5);
This will run kmeans 5 times and give you the best centroids of these 5 times.
Now, if you want to determine the best estimate of the sequence that was transmitted, you'd have to split X into blocks of 150 rows each (as your random sequence was 150 elements), then run a separate kmeans on each block. You can try to find the best representation of each part of the sequence by using the Replicates flag each time, so you can do something like:
for num = 1 : 8
%// Look at 150 points at a time
[idx, centroids] = kmeans(X((num-1)*150 + 1 : num*150, :), 4, 'Replicates', 5);
%// Do your analysis
%//...
%//...
end
idx and centroids would be the results for each portion of your transmitted signal. You probably want to look at centroids at each iteration to determine what symbol was transmitted at a particular time.
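One way to turn centroids into symbol decisions is to map each centroid to the nearest ideal constellation point. The sketch below is only an illustration of that idea: the ideal points come from x in the question, the centroids are the values printed earlier in this answer, and the names ideal, d2 and symbolOfCentroid are hypothetical.

```matlab
% Ideal constellation from the question, as [real, imag] rows
x = [2+2i, 2-2i, -2+2i, -2-2i];
ideal = [real(x(:)), imag(x(:))];

% Centroids copied from the output printed earlier in this answer
centroids = [-1.9176 -2.0759;
              1.5980  2.8071;
              2.7486  1.6147;
              0.8202  0.8025];

% Squared distance from every centroid (rows) to every ideal symbol (columns)
d2 = zeros(size(centroids, 1), size(ideal, 1));
for s = 1:size(ideal, 1)
    d2(:, s) = sum(bsxfun(@minus, centroids, ideal(s, :)).^2, 2);
end

% symbolOfCentroid(i) is the index into x of the symbol closest to centroid i
[~, symbolOfCentroid] = min(d2, [], 2);
decidedSymbols = x(symbolOfCentroid);
```

With these centroids, three of the four map to 2+2i and one to -2-2i, which matches the observation that the data only contains the A and C clouds. Given the true transmitted symbols, an error rate would then be the fraction of received points whose decided symbol differs from the one that was sent.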
If you want to plot the decision regions, then you're probably looking for a Voronoi diagram. Given a set of points within the domain of your problem, you determine which cluster each point belongs to, i.e. which centroid it is closest to. Given that our data spans -5 <= (x,y) <= 5, let's go through each point of a grid over that range, determine which cluster it belongs to, and colour it accordingly.
Something like:
colors = 'rgbk';
[X,Y] = meshgrid(-5:0.05:5, -5:0.05:5);
X = X(:);
Y = Y(:);
figure;
hold on;
for idx = 1 : numel(X)
[~,ind] = min(sum(bsxfun(@minus, [X(idx) Y(idx)], centroids).^2, 2));
plot(X(idx), Y(idx), [colors(ind), '.']);
end
plot(centroids(:,1), centroids(:,2), 'c.', 'MarkerSize', 14);
The above code will plot the decision regions / Voronoi diagram of the particular configuration, as well as where the cluster centres are located. Note that the code is rather unoptimized and it'll take a while for the graph to generate, but I wanted to write something quick to illustrate my point.
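A faster variant, sketched here as one possible approach, labels all grid points at once and then issues a single plot call per cluster instead of one per grid point. The centroids are the values printed earlier in this answer; gridPts, d2 and region are hypothetical names.

```matlab
% Centroids copied from the output printed earlier in this answer
centroids = [-1.9176 -2.0759;
              1.5980  2.8071;
              2.7486  1.6147;
              0.8202  0.8025];

% Grid over the same domain as above, stored as one [x, y] row per grid point
[gx, gy] = meshgrid(-5:0.05:5, -5:0.05:5);
gridPts = [gx(:), gy(:)];

% Squared distance from every grid point to every centroid
d2 = zeros(size(gridPts, 1), size(centroids, 1));
for k = 1:size(centroids, 1)
    d2(:, k) = sum(bsxfun(@minus, gridPts, centroids(k, :)).^2, 2);
end
[~, region] = min(d2, [], 2);     % closest centroid for each grid point

% One plot call per region instead of one per grid point
colors = 'rgbk';
figure;
hold on;
for k = 1:size(centroids, 1)
    plot(gridPts(region == k, 1), gridPts(region == k, 2), [colors(k) '.']);
end
plot(centroids(:, 1), centroids(:, 2), 'c.', 'MarkerSize', 14);
```

The loop now runs over the four centroids rather than over every grid point, which is where most of the time goes in the version above.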
Here's what the decision regions look like:
Hope this helps! Good luck!
I am doing some clustering using K-means in MATLAB. As you might know the usage is as below:
[IDX,C] = kmeans(X,k)
where IDX gives the cluster number for each data point in X, and C gives the centroid of each cluster. I need to get the index (row number in the actual data set X) of the data point closest to each centroid. Does anyone know how I can do that?
Thanks
The "brute-force approach", as mentioned by @Dima, would go as follows:
%# loop through all clusters
for iCluster = 1:max(IDX)
%# find the points that are part of the current cluster
currentPointIdx = find(IDX==iCluster);
%# find the index (among points in the cluster)
%# of the point that has the smallest Euclidean distance from the centroid
%# bsxfun subtracts coordinates, then you sum the squares of
%# the distance vectors, then you take the minimum
[~,minIdx] = min(sum(bsxfun(@minus,X(currentPointIdx,:),C(iCluster,:)).^2,2));
%# store the index into X (among all the points)
closestIdx(iCluster) = currentPointIdx(minIdx);
end
To get the coordinates of the point that is closest to the cluster center k, use
X(closestIdx(k),:)
The brute force approach would be to run k-means, and then compare each data point in the cluster to the centroid, and find the one closest to it. This is easy to do in matlab.
On the other hand, you may want to try the k-medoids clustering algorithm, which gives you a data point as the "center" of each cluster. Here is a matlab implementation.
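If your release includes the Statistics and Machine Learning Toolbox function kmedoids, a sketch of that route could look like this. The data is a random stand-in, and the assumption here is that the fifth output (called midx below) holds the row indices into X of the chosen medoids; check the documentation of your release before relying on it.

```matlab
rng(0);
X = [randn(50, 2) + 3; randn(50, 2) - 3];   % hypothetical data: two separated clouds
k = 2;

% idx: cluster of each point, C: medoid coordinates, midx: row of each medoid in X
[idx, C, ~, ~, midx] = kmedoids(X, k);

% Each "center" is an actual data point, so it can be looked up directly in X
medoidPoints = X(midx, :);
```

Because every medoid is itself a row of X, no separate search for the closest data point is needed.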
Actually, kmeans already gives you the answer, if I understand you right:
[IDX,C, ~, D] = kmeans(X,k); % D is n-by-k: D(j,i) is the distance from point j to centroid i
[minD, indMinD] = min(D, [], 1); % indMinD(i) is the index (in X) of closest point to the i-th centroid