scipy kmeans iteration meaning?

I am using the kmeans2 algorithm from scipy to cluster pixel colors in an image to get the top average colors in the image.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html#scipy.cluster.vq.kmeans2
I am confused about the meaning of this parameter:
iter : int
Number of iterations of the k-means algorithm to run. Note that this differs in meaning from the iters parameter to the kmeans function.
If I want the kmeans algorithm to run until the clusters don't change, would I set the iter value high? Is there a way to find a best iter value?

The k-means algorithm works by initializing K points and clustering your data by their distance from those points. It then iterates: it computes the centroid of each cluster and reassigns points to clusters by their distance from the centroids. This isn't guaranteed to converge quickly, though it often does, so it's asking for a maximum iteration value.
edit: "maximum iteration value" is incorrect, I think; it is literally going to iterate iter times. The default of 10 is a common iter value, though.
In general, a higher iter value gives the algorithm more chances to refine the clusters, but with diminishing returns. You could try running k-means on some of your data with various iter values and see where the extra computation time outweighs the gain in cluster quality for your needs.
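To see that trade-off in practice, here is a minimal Python sketch; the random "pixel" data and the choice of k=4 are made up for illustration:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
pixels = rng.random((500, 3))  # stand-in for 500 RGB pixel values in [0, 1]

for iters in (1, 5, 10, 50):
    centroids, labels = kmeans2(pixels, 4, iter=iters, minit='++', seed=0)
    # distortion: mean squared distance from each pixel to its assigned centroid
    distortion = np.mean(np.sum((pixels - centroids[labels]) ** 2, axis=1))
    print(iters, distortion)
```

Plotting distortion against iters (or timing each run) shows where more iterations stop paying off for your data.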

Related

In k-means clustering, is the marginal sum of squared distances decreasing?

Suppose I have a set of data, and let SSD(n) be the sum of squared distances when we assume n clusters. My question is the following: is the marginal SSD always decreasing in n? In other words, is the function f(n) defined as
f(n) = SSD(n) - SSD(n+1)
decreasing in n? This would mean the benefit of adding each additional cluster is decreasing. I am trying to find either a proof or a simple counterexample.
I have done some simulations with random data, and it always seems to be true.
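A simulation of the kind described might look like the following Python sketch, using scipy's kmeans2 to estimate SSD(n); the data, the restart count, and the range of n are arbitrary choices:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(42)
data = rng.normal(size=(200, 2))

def ssd(n):
    # best of several restarts, since a single run can land in a poor local minimum
    best = np.inf
    for s in range(5):
        centroids, labels = kmeans2(data, n, iter=50, minit='++', seed=s)
        best = min(best, np.sum((data - centroids[labels]) ** 2))
    return best

vals = [ssd(n) for n in range(1, 8)]
gains = [vals[i] - vals[i + 1] for i in range(6)]  # f(n) = SSD(n) - SSD(n+1)
print(gains)
```

Note that k-means only finds a local minimum, so the estimated SSD(n) is an upper bound on the true optimum; a noisy estimate can make f(n) look non-monotone even if the true function is not.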

Why does k-means clustering give different results every time?

I am using k-means clustering for segmentation of a retinal image. However, every time I run my code the segmentation yields different results for the same image. What is the reason for this change? Following are three segmentation results for the same image.
Below is the code used for this segmentation.
idx = kmeans(double(imreslt1(:)), 2);
classimage = reshape(idx, size(imreslt1));
minD = min(classimage(:));
maxD = max(classimage(:));
g = (double(classimage) - minD) ./ (maxD - minD);
imshow(g);
This is the initialization problem for kmeans: when kmeans starts, it picks random initial points to seed the clusters. MATLAB selects k random points, computes the distance from every point in your data to these locations, and then finds new centroids to further minimize that distance. Because of these random initial points you get different centroid locations on each run, although the overall clustering is similar.
If you read the MATLAB help file for the kmeans function, you'll see that the initial points for the k-means clustering algorithm are chosen randomly according to the k-means++ algorithm. To make this reproducible, you can either pass in your own initial points as follows:
kmeans(...,'Start',[random_points_matrix])
or, you could try seeding the MATLAB internal random number generator using the following:
rng(seed); % where seed is some constant you choose
idx = kmeans(...);
However, I'm not clear on the internals of the kmeans function, so I can't guarantee that this will necessarily produce reproducible results.
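The same seeding idea applies outside MATLAB. For instance, scipy's kmeans2 accepts a seed argument; this Python sketch (an illustration, not the MATLAB API) shows that a fixed seed makes the clustering reproducible:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

data = np.random.default_rng(1).normal(size=(300, 2))

# Two runs with the same seed use the same random initial centroids,
# so the resulting clustering is identical; with the default seed=None
# the labels can differ between runs, as in the question.
c1, l1 = kmeans2(data, 2, minit='random', seed=7)
c2, l2 = kmeans2(data, 2, minit='random', seed=7)
print(np.array_equal(l1, l2))  # True
```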

Iterative running of K mean clustering algorithm inside clusters

I am using the k-means clustering function of MATLAB. I need to run it with K=2 each time until all the clusters match a defined condition.
If my data was
all_data=5*[3*rand(1000,1)+5,3*rand(1000,1)+5];
and I run
[IDX C]=kmeans(all_data,2);
I need to check the size of all_data(IDX==1) and all_data(IDX==2) on every iteration and continue clustering with k=2 until the size of each resulting cluster reaches a specific value.
How can this be implemented in an iterative loop in MATLAB?
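One way to structure such a loop is to keep a work list of clusters and keep splitting any that are still too large. This sketch uses Python with scipy's kmeans2 rather than MATLAB, and max_size is a hypothetical stand-in for the "defined condition" in the question:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
all_data = 5 * np.column_stack([3 * rng.random(1000) + 5,
                                3 * rng.random(1000) + 5])

max_size = 200          # hypothetical condition: no cluster may exceed this size
queue = [all_data]      # clusters still to be checked
finished = []           # clusters that already satisfy the condition

while queue:
    chunk = queue.pop()
    if len(chunk) <= max_size:
        finished.append(chunk)
        continue
    _, idx = kmeans2(chunk, 2, iter=20, minit='++', seed=0)
    a, b = chunk[idx == 0], chunk[idx == 1]
    if len(a) == 0 or len(b) == 0:  # degenerate split; stop subdividing
        finished.append(chunk)
    else:
        queue.extend([a, b])

print(len(finished), max(len(c) for c in finished))
```

The MATLAB version is the same shape: a while loop over a cell array of clusters, calling kmeans(chunk, 2) and re-queueing any sub-cluster that fails the size check.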

K-means Clustering, major understanding issue

Suppose that we have a 64-dimensional matrix to cluster; let's say the dataset matrix is dt=64x150.
Using the kmeans function from the vl_feat library, I will cluster my dataset into 20 centers:
[centers, assignments] = vl_kmeans(dt, 20);
centers is a 64x20 matrix.
assignments is a 1x150 matrix with values inside it.
According to manual: The vector assignments contains the (hard) assignments of the input data to the clusters.
I still cannot understand what those numbers in the assignments matrix mean. I don't get it at all. Would anyone mind helping me a bit here? An example or something would be great. What do these values represent, anyway?
In k-means the problem you are trying to solve is clustering your 150 points into 20 clusters. Each point is 64-dimensional and thus represented by a vector of size 64, so in your case dt is the set of points and each column is a 64-dim vector.
After running the algorithm you get centers and assignments. centers holds the 20 positions of the cluster centers in 64-dim space, in case you want to visualize them, measure distances between points and clusters, and so on. assignments, on the other hand, contains the actual assignment of each 64-dim point in dt: if assignments[7] is 15, the 7th vector in dt belongs to the 15th cluster.
For example, consider clustering a large number of 2-D points, say 1000, into 3 clusters. In that case dt would be 2x1000, centers would be 2x3, and assignments would be 1x1000, holding numbers ranging from 1 to 3 (or 0 to 2 if you're using OpenCV).
EDIT:
The code to produce this image is located here: http://pypr.sourceforge.net/kmeans.html#k-means-example along with a tutorial on kmeans for pyPR.
In OpenCV it is the number of the cluster that each of the input points belongs to.
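To make the meaning of assignments concrete, here is a small sketch using scipy's kmeans2 in place of vl_kmeans; note the transposed data layout (points as rows here, columns in vl_feat) and the 0-based cluster indices:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(3)
dt = rng.normal(size=(150, 64))   # 150 points in 64 dimensions
                                  # (rows are points here; vl_kmeans uses columns)

centers, assignments = kmeans2(dt, 20, iter=20, minit='++', seed=3)
# centers: one 64-dim cluster centre per row, 20 in total
# assignments[i]: index of the cluster that point i belongs to
# (0-based here; vl_feat/MATLAB number clusters from 1)
print(centers.shape, assignments.shape, assignments[7])
```

So assignments[7] tells you which of the 20 centers the 8th point was assigned to, exactly as in the vl_feat case.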

kmeans in MATLAB : number of clusters > number of rows?

I'm using the Statistics Toolbox function kmeans in MATLAB for the first time. I want to get the total Euclidean distance to the nearest centroid as an indicator of the optimal k.
Here is my code :
clear all
N=10;
opts=statset('MaxIter',1000);
X=dlmread(['data.txt']);
crit=zeros(1,N);
for j=1:N
[a,b,c]=kmeans(X,j,'Start','cluster','EmptyAction','drop','Options',opts);
clear a b
crit(j)=sum(c);
end
save(['crit_',VF,'_',num2str(i),'_limswvl1.mat'],'crit')
Everything should go well, except that I get this error for j = 6:
X must have more rows than the number of clusters.
I do not understand the problem since X has 54 rows, and no NaNs.
I tried using different EmptyAction options but it still won't work.
Any idea ? :)
The problem occurs since you use the cluster method to get initial centroids. From MATLAB documentation:
'cluster' - Perform preliminary clustering phase on random 10%
subsample of X. This preliminary phase is itself
initialized using 'sample'.
So when j=6, it tries to divide 10% of data into 6 clusters, i.e. 10% of 54 ~ 5. Therefore, you get the error X must have more rows than the number of clusters.
To get around this problem, either choose the points randomly (sample method) or choose points uniformly (uniform method).
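To illustrate the arithmetic, here is a small Python sketch with scipy's kmeans2 standing in for MATLAB's kmeans; the random data merely mimics the 54-row matrix in the question:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
X = rng.normal(size=(54, 2))           # 54 rows, as in the question

# MATLAB's 'cluster' start pre-clusters a random 10% subsample:
subsample_size = round(0.10 * len(X))  # only ~5 rows, fewer than the 6 clusters wanted
print(subsample_size)

# Seeding from the full data instead (the 'sample'-style fix) works fine:
centroids, labels = kmeans2(X, 6, minit='++', seed=0)
print(centroids.shape)  # (6, 2)
```

The preliminary phase simply has fewer rows than clusters once j exceeds the subsample size, which is exactly the error message reported.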