How to apply the k-means algorithm to a multidimensional array? - matlab

I have a matrix A = (a1, a2, a3, ..., an)', where a1, a2, ..., an are row vectors. I want to apply the k-means algorithm to A to cluster the row vectors ai (i = 1, 2, ..., n) into k (or more) clusters. The procedure: suppose b1, b2, ..., bk are the centers of the k clusters; k samples are randomly selected as the initial centers. All the samples (a1, a2, ..., an) are assigned, according to their cosine distance to the centers bi (i = 1, 2, ..., k), into k classes, that is, k clusters. The centers of the k clusters are then recalculated and all samples are reassigned, until the centers no longer change, at which point the final centers b1, b2, ..., bk are obtained. For each cluster, only the vector closest to the cluster's center should be retained. How can I implement this?

The kmeans function (in the Statistics and Machine Learning Toolbox) performs exactly this iteration. Simply use:
[idx, C] = kmeans(A, k, 'Distance', 'cosine');
where idx gives the cluster index of each row of A, and the rows of C are the final centers b1, ..., bk.
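For the last part of the question (keeping only the vector closest to each center), a minimal sketch using the fourth output of kmeans, which holds the distance from every point to every center:
[idx, C, ~, D] = kmeans(A, k, 'Distance', 'cosine');  % D is n-by-k
[~, closest] = min(D, [], 1);   % closest(j) = row of A nearest to center bj
retained = A(closest, :);       % one retained vector per cluster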

Related

Clustering of 1-dimensional data

I am trying to learn the k-means clustering algorithm in MATLAB without using the built-in kmeans function. Say I have data of size 1x100 and I want to group it into two clusters. How can I do this? I want to visualize the two centroids and the data together on a plot in MATLAB.
Note: when I plot in MATLAB, I am able to see only the data, not the data and the two centroids simultaneously.
Any help in this regard is highly appreciated.
A minimal k-means clustering algorithm in MATLAB could be:
p = rand(100,2);  % rand(number_of_points, number_of_dimensions)
c = p(1:3,:);     % use the first 3 points as the initial centroids
% Run this minimal k-means loop:
for ii = 1:10
    % Which centroid is closest to each point? min(squared Euclidean distance):
    [~,idx] = min(sum((permute(p,[3,2,1])-c).^2,2),[],1);
    % Compute the new centroids (the mean of the points assigned to each one):
    c = splitapply(@mean,p,idx(:));
end
And we can plot the result if needed:
hold on
scatter(p(:,1),p(:,2),[],idx(:))
scatter(c(:,1),c(:,2),[],'red')
And we obtain a plot with the 3 centroids in red and each cluster in a distinct color.
Notice that in this example the data are 2-dimensional, but the same code works for any other dimension.
The 3 initial centroids correspond to 3 points of the dataset (randomly selected, since p itself is random), which ensures that every centroid is the closest centroid to at least one point.
In this example there are 10 iterations, but it is certainly better to define a tolerance and stop iterating once the centroids have converged.
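For example, a sketch of the same loop with such a stopping test (tol here is an assumed tolerance value):
tol = 1e-6;             % assumed tolerance on centroid movement
c_old = c + 2*tol;      % offset to force at least one iteration
while norm(c - c_old,'fro') > tol
    c_old = c;
    [~,idx] = min(sum((permute(p,[3,2,1])-c).^2,2),[],1);
    c = splitapply(@mean,p,idx(:));
end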

MATLAB kmeans clustering for non-linearly separable data

I have non-linearly separable data. I want to cluster it using the k-means implementation in MATLAB, and I want to get the cluster label of each data point to use for another classification problem.
The problem is that k-means is not giving the results I expected. I'm attaching the cluster plot I obtained.
I expected k-means to give clusters shaped like the concentric circles the data forms, but the output was arcs. I don't understand why this is happening.
Can you suggest any other clustering method to achieve my goal?
Before using an algorithm, you should try to understand it: what is its goal, and how does it achieve it? For k-means, Wikipedia tells us the following:
"k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean"
Three concentric circles would have the exact same mean, so k-means is not suitable to separate them. The result is really what you should expect from k-means here.
Now, if you know that your clusters will always be concentric circles, you can simply convert your Cartesian (x-y) coordinates to polar coordinates, and use only the radius rho for clustering, as you know that the angle theta doesn't matter:
% Create random data
[x1,y1] = pol2cart(2*pi*rand(1000,1),rand(1000,1));
[x2,y2] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+2);
[x3,y3] = pol2cart(2*pi*rand(1000,1),rand(1000,1)+4);
X = [x1,y1; x2,y2; x3,y3];
% Transform to polar
[theta,rho] = cart2pol(X(:,1),X(:,2));
% k-means clustering
idx = kmeans(rho,3);
% Plot results
hold on
plot(X(idx==1,1), X(idx==1,2), 'r.')
plot(X(idx==2,1), X(idx==2,2), 'g.')
plot(X(idx==3,1), X(idx==3,2), 'b.')
Or more generally: use a suitable kernel for k-means clustering, or use another algorithm.
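For instance, a density-based method separates these rings with no coordinate transform at all. A hedged sketch, assuming MATLAB R2019a or newer for dbscan (the epsilon and minpts values below are guesses that would need tuning on real data):
idx = dbscan(X, 0.5, 10);      % epsilon = 0.5, minpts = 10 (assumed values)
gscatter(X(:,1), X(:,2), idx)  % points labelled -1 are noise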

How to understand the MATLAB built-in function "kmeans"?

Suppose I have a matrix A of size 2000x1000 double. Then I apply the MATLAB built-in function kmeans to the matrix A:
k = 8;
[idx,C] = kmeans(A, k, 'Distance', 'cosine');
I get C = 8x1000 double and idx = 2000x1 double, with values from 1 to 8.
According to the documentation, C contains the k cluster centroid locations in a k-by-p (8-by-1000) matrix, and idx is an n-by-1 vector containing the cluster index of each observation.
My question is:
1) I do not know how to interpret C, the centroid locations. Shouldn't locations be represented as (x, y)? How do I read the matrix C correctly?
2) What are the final centers c1, c2, ..., ck? Are they just values, or are they locations?
3) For each cluster, if I only want the vector closest to the center of that cluster, how do I compute and retrieve it?
Before I answer the three parts, I'll just explain the syntax that is used in MATLAB's explanation of k-means (http://www.mathworks.com/help/stats/kmeans.html).
A is your data matrix (it's represented as X in the link). There are n rows (in this case, 2000), which represent the number of observations / data points that you have. There are also p columns (in this case, 1000), which represent the number of "features" that each data point has. For example, if your data consisted of 2D points, then p would equal 2.
k is the number of clusters that you want to group the data into. Based on the dimensions of C that you gave, k must be 8.
Now I will answer the three parts:
The C matrix has dimensions k x p. Each row represents a centroid. Centroid locations DO NOT have to be (x, y) at all. The dimensions of the centroid locations are equal to p. In other words, if you have 2D points, you could graph the centroids as (x, y). If you have 3D points, you could graph the centroids as (x, y, z). Since each data point in A has 1000 features, your centroids therefore have 1000 dimensions.
This is sort of difficult to explain without knowing exactly what your data is. Centroids are certainly not just single values, but they are not necessarily "locations" either. If your data A were coordinate points, you could certainly represent the centroids as locations. More generally, the centroid of a cluster is the mean of the data points grouped with it, so it acts as the point most representative of that cluster; it need not itself be one of your data points. Hopefully that makes sense, and I can give a clearer explanation if necessary.
The k-means method actually gives us a good way to accomplish this. The function actually has 4 possible outputs, but I will focus on the 4th, which I will call D:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine');
D has dimensions n x k. For a data point i, row i of the D matrix gives the distance from that point to every centroid. Therefore, for each centroid, you simply need to find the data point closest to it and return it. I can supply the short code for this if you need it.
Also, just a tip: you should probably use the kmeans++ method of initializing the centroids. It's faster and generally better. You can call it using this:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine', 'Start', 'plus');
Edit:
Here is the code necessary for part 3:
[~, min_idxs] = min(D, [], 1);   % for each centroid (column of D), the row index of the nearest point
closest_vecs = A(min_idxs, :);   % the corresponding rows of A
Each row i of closest_vecs is the vector that is closest to centroid i.
OK, before we actually get into the details, let's give a brief overview of what k-means clustering is first.
k-means clustering works as follows: for some data that you have, you want to group it into k groups. You initially choose k random points in your data, labelled 1, 2, ..., k; these are what we call the centroids. Then you determine how close the rest of the data is to each of these points, and assign each point to the group of whichever centroid it is closest to. After that, for each group you update its centroid, which is the representative point of the group, by computing the average of all of the points assigned to it. These averages become the centroids for the next iteration, in which you again determine how close each point is to each centroid. You keep repeating this until the centroids don't move anymore, or move very little.
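As a rough illustration of that loop, here is a sketch only (it assumes the Statistics and Machine Learning Toolbox for pdist2, and it omits empty-cluster handling):
C = X(randperm(size(X,1), k), :);   % k random data points as initial centroids
for iter = 1:100
    % assignment step: nearest centroid for every point
    [~, labels] = pdist2(C, X, 'euclidean', 'Smallest', 1);
    C_new = zeros(size(C));
    for j = 1:k
        C_new(j,:) = mean(X(labels == j, :), 1);  % update step: group mean
    end
    if isequal(C_new, C), break, end   % centroids stopped moving
    C = C_new;
end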
To use the kmeans function in MATLAB, your data matrix (A in your case) must be arranged such that each row is a sample and each column is a feature / dimension of a sample. For example, we could have N x 2 or N x 3 arrays of Cartesian coordinates, in 2D or 3D respectively. For colour images, we could have N x 3 arrays where each column is a colour component of the image: red, green or blue.
You invoke kmeans in MATLAB the following way:
[IDX, C] = kmeans(X, K);
X is the data matrix we talked about, K is the total number of clusters / groups you would like, and the outputs IDX and C are respectively an index vector and a centroid matrix. IDX is an N x 1 array, where N is the total number of samples you put into the function. Each value in IDX tells you which centroid the corresponding sample / row in X best matched with. You can also override the distance measure used between points. By default this is the Euclidean distance, but you used the cosine distance in your invocation.
C has K rows, where each row is a centroid. Therefore, in the case of Cartesian coordinates, this would be a K x 2 or K x 3 array. You would interpret IDX as telling you which group / centroid each point is closest to. As such, if IDX=1 for a point, that point best matched the first centroid, which is the first row of C. Similarly, if IDX=3 for a point, that point best matched the third centroid, which is the third row of C.
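As a small usage example of those two outputs:
cluster1_pts = X(IDX == 1, :);   % all samples assigned to the first centroid
cluster1_ctr = C(1, :);          % the first centroid itself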
Now to answer your questions:
We just talked about C and IDX so this should be clear.
The final centres are stored in C. Each row gives you a centroid / centre that is representative of a group.
It sounds like you want to find, for each cluster, the data point closest to that cluster's centroid. That's easy to do with knnsearch, which performs a K-nearest-neighbour search: given a set of reference points and a query point, it returns the K points in the reference set that are closest to the query. Here you supply your data as the reference set and the centroids as the queries. Note that a k-means centroid is the mean of its cluster, so in general it is not itself one of your data points; K=1 therefore already returns the closest actual data point.
You can do that by the following, assuming you already ran kmeans:
out = knnsearch(A, C, 'K', 1);  % out(i) = row of A closest to centroid C(i,:)
out gives you, for each centroid, which point in your data matrix A is closest to it. To get the actual points, do this:
pts = A(out,:);
Hope this helps!

Plotting multivariate data along eigenvectors

I have a data matrix containing 18 samples, each with 12 variables: D(18,12). I performed k-means clustering on the data to get 3 clusters. I want to visualize this data in 2 dimensions, specifically along the 2 eigenvectors corresponding to the largest eigenvalues of a specific matrix, B. So I create the plane spanned by the two eigenvectors corresponding to the two largest eigenvalues:
[V,EA] = eig(B);
e1 = V(:,11);
e2 = V(:,12);
for i = 1:12
    E(i,1) = e1(i);
    E(i,2) = e2(i);
end
Eproj = E*E';
where e1 and e2 are the eigenvectors, and E is a matrix containing those column vectors. At this point, I'm kind of stuck.
I recognize that e1 and e2 are orthogonal in this 12-d space, but I have no idea how this can reduce to two dimensions so I can plot it.
I believe that the projection of a data sample onto the plane would be:
Eproj*D(i,:)'
for i = 1...18, but I'm not sure where to go from here to plot my clusters. When I do the projection, it's still in 12 dimensions.
Principal component analysis can help you transform the data into 2D using the eigenvectors (note that princomp has since been replaced by pca, and that the decomposition should be of your data matrix D):
coeff = pca(D);            % principal component coefficients (12x12)
Dproj = D * coeff(:,1:2);  % project the samples onto the first two components
figure
plot(Dproj(:,1),Dproj(:,2),'*')
If you have the cluster labels you can use the scatter function for a better visual, or you can reduce the dimensionality to 3 and use scatter3.
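If you specifically want the plane spanned by your two eigenvectors of B rather than the principal components, you can also project directly. A minimal sketch, assuming E = [e1, e2] from the question, and that idx holds your k-means cluster labels (idx is an assumed variable name):
coords = D * E;   % 18x2: coordinates of each sample in the e1/e2 plane
figure
scatter(coords(:,1), coords(:,2), [], idx, 'filled')  % one colour per cluster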

Getting the index of the closest data point to the centroids in k-means clustering in MATLAB

I am doing some clustering using K-means in MATLAB. As you might know the usage is as below:
[IDX,C] = kmeans(X,k)
where IDX gives the cluster number for each data point in X, and C gives the centroid of each cluster. I need to get the index (row number in the actual data set X) of the data point closest to each centroid. Does anyone know how I can do that?
The "brute-force approach", as mentioned by #Dima would go as follows
closestIdx = zeros(1, max(IDX));  % preallocate
% loop through all clusters
for iCluster = 1:max(IDX)
    % find the points that are part of the current cluster
    currentPointIdx = find(IDX==iCluster);
    % find the index (among points in the cluster) of the point with the
    % smallest Euclidean distance from the centroid: bsxfun subtracts the
    % coordinates, then we sum the squares of the difference vectors and
    % take the minimum
    [~,minIdx] = min(sum(bsxfun(@minus,X(currentPointIdx,:),C(iCluster,:)).^2,2));
    % store the index into X (among all the points)
    closestIdx(iCluster) = currentPointIdx(minIdx);
end
To get the coordinates of the point that is closest to the cluster center k, use
X(closestIdx(k),:)
The brute-force approach would be to run k-means, and then compare each data point in the cluster to the centroid and find the one closest to it. This is easy to do in MATLAB.
On the other hand, you may want to try the k-medoids clustering algorithm, which gives you an actual data point as the "center" of each cluster. Here is a MATLAB implementation.
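Recent MATLAB versions also ship this as kmedoids (Statistics and Machine Learning Toolbox, R2014b and later); a minimal sketch:
% kmedoids returns actual data points as cluster centers, so their
% row indices in X can be recovered with a simple row lookup:
[idx, Cmed] = kmedoids(X, k);
[~, medoidRows] = ismember(Cmed, X, 'rows');  % row numbers in X of the medoids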
Actually, kmeans already gives you the answer, if I understand you right:
[IDX,C,~,D] = kmeans(X,k); % D is the n-by-k distance of each data point to each of the centroids
[minD, indMinD] = min(D);  % indMinD(i) is the index (in X) of the point closest to the i-th centroid