K-Means centroids getting marginalized to having no data points [Matlab]

So I have a sort of strange problem. I have a dataset with 240 points and I'm trying to use k-means to cluster it into 100 clusters. I'm using Matlab but I don't have access to the statistics toolbox, so I had to write my own k-means function. It's pretty simple, so that shouldn't be too hard, right? Well, it seems something is wrong with my code:
function result=Kmeans(X,c)
[N,n]=size(X);
index=randperm(N);
ctrs = X(index(1:c),:);
old_label = zeros(1,N);
label = ones(1,N);
iter = 0;
while ~isequal(old_label, label)
old_label = label;
label = assign_labels(X, ctrs);
for i = 1:c
ctrs(i,:) = mean(X(label == i,:));
if sum(isnan(ctrs(i,:))) ~= 0
ctrs(i,:) = zeros(1,n);
end
end
iter = iter + 1;
end
result = ctrs;
function label = assign_labels(X, ctrs)
[N,~]=size(X);
[c,~]=size(ctrs);
dist = zeros(N,c);
for i = 1:c
dist(:,i) = sum((X - repmat(ctrs(i,:),[N,1])).^2,2);
end
[~,label] = min(dist,[],2);
It seems what happens is that when I go to recompute the centroids, some centroids have no datapoints assigned to them, so I'm not really sure what to do with that. After doing some research on this, I found that this can happen if you supply arbitrary initial centroids, but in this case the initial centroids are taken from the datapoints themselves, so this doesn't really make sense. I've tried re-assigning these centroids to random datapoints, but that causes the code to not converge (or at least after letting it run all night, the code never converged). Basically they get re-assigned, but that causes other centroids to get marginalized, and repeat. I'm not really sure what's wrong with my code, but I ran this same dataset through R's k-means function for k=100 for 1000 iterations and it managed to converge. Does anyone know what I'm messing up here? Thank you.

Let's step through your code one piece at a time and discuss what you're doing with respect to what I know about the k-means algorithm.
function result=Kmeans(X,c)
[N,n]=size(X);
index=randperm(N);
ctrs = X(index(1:c),:);
old_label = zeros(1,N);
label = ones(1,N);
This looks like a function that takes in a data matrix of size N x n, where N is the number of points in your dataset and n is the dimension of each point. The function also takes in c, the desired number of output clusters. index holds a random permutation of the integers from 1 to N, and the first c entries of that permutation select the data points used to initialize your cluster centres.
iter = 0;
while ~isequal(old_label, label)
old_label = label;
label = assign_labels(X, ctrs);
for i = 1:c
ctrs(i,:) = mean(X(label == i,:));
if sum(isnan(ctrs(i,:))) ~= 0
ctrs(i,:) = zeros(1,n);
end
end
iter = iter + 1;
end
result = ctrs;
For k-means, we keep iterating until the cluster membership of each point in the current iteration matches that of the previous iteration, which is what your while loop does. label holds the cluster membership of each point in your dataset. For each cluster, you compute the mean of its member points and assign that mean as the new cluster centre. Should any dimension of a cluster centre come out as NaN, you set that centre to all zeroes instead. This looks very abnormal to me, and I'll provide a suggestion later. Edit: Now I understand why you did this. If a cluster is empty, you can't compute the mean of its members, so you simply make that cluster centre all zeroes. This can be solved with my suggestion for duplicate initial clusters towards the end of this post.
function label = assign_labels(X, ctrs)
[N,~]=size(X);
[c,~]=size(ctrs);
dist = zeros(N,c);
for i = 1:c
dist(:,i) = sum((X - repmat(ctrs(i,:),[N,1])).^2,2);
end
[~,label] = min(dist,[],2);
This function takes in a dataset X and the current cluster centres for this iteration, and it should return a label list of where each point belongs to each cluster. This also looks correct because for each column of dist, you are calculating the distance between each point to each cluster, where those distances are in the ith column for the ith cluster. One optimization trick that I would use is to avoid using repmat here and use bsxfun which handles the replication internally. Therefore, do this instead:
function label = assign_labels(X, ctrs)
[N,~]=size(X);
[c,~]=size(ctrs);
dist = zeros(N,c);
for i = 1:c
dist(:,i) = sum(bsxfun(@minus, X, ctrs(i,:)).^2, 2);
end
[~,label] = min(dist,[],2);
Now, this all looks correct. I also ran some tests myself and it all seems to work out, provided that the initial cluster centres are unique. One small problem with k-means is that we implicitly assume that all cluster centres are unique. Should they not be unique, then you'll run into a problem where two clusters (or more) have the exact same initial cluster centres.... so which cluster should the data point be assigned to? When you're doing the min in your assign_labels function, should you have two identical cluster centres, the cluster label that the point gets assigned to will be the minimum of these two numbers. This is why you will have a cluster with no points in it, as all of the points that should have been assigned to this cluster get assigned to the other.
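The tie-breaking behaviour described above can be reproduced on a toy example. This is only a sketch: the data and centres below are made up for illustration, with two of the three centres deliberately identical. When several entries tie for the minimum, min returns the index of the first one, so the second of the duplicate centres never receives a point.

```matlab
% Two identical centres (rows 1 and 2) and one distinct centre (row 3).
ctrs = [0 0; 0 0; 5 5];
X    = [0 1; 1 0; 5 4; -1 0];   % made-up data points

% Squared Euclidean distance from every point to every centre.
dist = zeros(size(X, 1), size(ctrs, 1));
for i = 1:size(ctrs, 1)
    dist(:, i) = sum(bsxfun(@minus, X, ctrs(i, :)).^2, 2);
end

% On ties, min returns the first index, so centre 2 never wins.
[~, label] = min(dist, [], 2);

% Number of points per cluster; here counts is [3; 0; 1].
counts = accumarray(label, 1, [size(ctrs, 1) 1]);
```

Cluster 2 ends up empty even though its centre sits exactly on top of centre 1.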
As such, you may have two (or more) initial cluster centres that are the same upon randomization. Even though the permutation of the indices to select are unique, the actual data points themselves may not be unique upon selection. One thing that I can impose is to loop over the permutation until you get a unique set of initial clusters without repeats. As such, try doing this at the beginning of your code instead.
[N,n]=size(X);
index=randperm(N);
ctrs = X(index(1:c),:);
while size(unique(ctrs, 'rows'), 1) ~= c
index=randperm(N);
ctrs = X(index(1:c),:);
end
old_label = zeros(1,N);
label = ones(1,N);
iter = 0;
%// While loop appears here
This will ensure that you have a unique set of initial clusters before you continue on in your code. Now, going back to your NaN stuff inside the for loop. I honestly don't see how any dimension could result in NaN after you compute the mean if your data doesn't have any NaN to begin with. I would suggest you get rid of this in your code as (to me) it doesn't look very useful. Edit: You can now remove the NaN check as the initial cluster centres should now be unique.
This should hopefully fix your problems you're experiencing. Good luck!

"Losing" a cluster is not half as special as one may think, due to the nature of k-means.
Consider duplicates. Lets assume that all your first k points are identical, what would happen in your code? There is a reason you need to carefully handle this case. The simplest solution would be to leave the centroid as it was before, and live with degenerate clusters.
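Leaving the centroid as it was can be sketched as a small change to the update step. The helper name update_centres is hypothetical, and the explicit dimension argument to mean is my own addition: without it, the mean of a cluster that contains exactly one point (a single row) is taken across the columns and collapses to a scalar, which is likely to matter when k is close to the number of points.

```matlab
function ctrs = update_centres(X, label, ctrs)
% Recompute each centre as the mean of its members. An empty cluster
% keeps the centre it had in the previous iteration.
for i = 1:size(ctrs, 1)
    members = X(label == i, :);
    if ~isempty(members)
        % Average down the rows, even when there is only one member.
        ctrs(i, :) = mean(members, 1);
    end
end
end
```

Calling ctrs = update_centres(X, label, ctrs) inside the while loop would replace the inner for loop and the NaN check.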
Given that you only have 240 points, but want to use k=100, don't expect too good results. Most objects will be on their own... choosing a much too large k is probably a reason why you do see this degeneration effect a lot. Let's assume out of these 240, fewer than 100 are unique... Then you cannot have 100 non-empty clusters... Plus, I would consider this kind of result "overfitting", anyway.
If you don't have the toolboxes you need in Matlab, maybe you should move on to free software. Octave, R, Weka, ELKI, ... there is plenty of software, some of which is much more powerful when it comes to clustering than pure Matlab (in particular, if you don't have the toolboxes).
Also benchmark. You will be surprised of the performance differences.

Related

Encode each training image as a histogram of the number of times each vocabulary element shows up for Bag of Visual Words

I want to implement bag of visual words in MATLAB. I used SURF features to extract features from the images and k-means to cluster those features into k clusters. I now have k centroids and I want to know how many times each cluster is used by assigning each image feature to its closest neighbor. Finally, I'd like to create a histogram of this for each image.
I tried to use knnsearch function but it doesn't work in this case.
Here is my MATLAB code:
clc;
clear;
close all;
folder = 'CarData/TrainImages/cars';
filePattern = fullfile(folder, '*.pgm');
f=dir(filePattern);
files={f.name};
for k=1:numel(files)
fullFileName = fullfile(folder, files{k});
H = fspecial('log');
image=imfilter(imread(fullFileName),H);
temp = detectSURFFeatures(image);
[im_features, temp] = extractFeatures(image, temp);
features{k}= im_features;
end
features = vertcat(features{:});
image_feats = [];
[assignments,centers] = kmeans(double(features),500);
vocab = centers';
I have all the image features in the features array and the cluster centers in the centers array.
You're almost there. You don't even need to use knnsearch at all. The assignments variable tells you which input feature mapped to which cluster. assignments is an N x 1 vector, where N is the total number of examples you have, i.e. the total number of rows in the input matrix features. Each value assignments(i) tells you which cluster example i (row i of features) maps to. The cluster centroid for that example is centers(assignments(i), :).
Therefore, given how you've called kmeans, assignments is an N x 1 vector where each element is between 1 and 500, with 500 being the total number of clusters desired.
Let's do the simple case where we only have one image in your codebook. If this is the case, all you have to do is create a histogram of the assignments variable. The output histogram h will be a 500 x 1 vector with each element h(i) being the number of times an example used centroid i as its representation in your codebook.
Just use the histcounts function and make sure that you specify the bin ranges so that they coincide with each cluster ID. You must make sure that you account for the ending bin, as the bin ranges are exclusive on the right edge so just add an additional bin to the end.
Something like this will work:
h = histcounts(assignments, 1 : 501);
If you want something simpler and you don't want to worry about specifying the end bin, you can use accumarray to achieve the same result:
h = accumarray(assignments, 1);
The idea behind accumarray is that we supply key-value pairs, where the key is the centroid that the example mapped to and the value is simply 1 for all keys. accumarray groups all values that share the same key and then applies an operation to each group. The default behaviour of accumarray is to sum the values, which effectively computes the histogram.
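As a quick check that the two approaches agree, here is a toy sketch with made-up assignments and only four clusters. Passing an explicit output size to accumarray is my own addition: without it the result is only as long as the largest ID that actually occurs, so an unused last cluster would be dropped.

```matlab
assignments  = [1; 3; 3; 2; 3];   % hypothetical cluster IDs for 5 features
num_clusters = 4;                 % cluster 4 is unused in this toy example

% Bin edges 1..5 give one bin per cluster ID; the result is a 1 x 4 row.
h1 = histcounts(assignments, 1 : num_clusters + 1);

% The size argument pads unused trailing clusters with zeros.
h2 = accumarray(assignments, 1, [num_clusters 1]).';   % transpose to a row
```

Both h1 and h2 come out as [1 1 3 0].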
However, you want to do this for multiple images, not just a single image.
For Bag of Visual Words problems, we will certainly have more than one training image in our database. Therefore, you want to find the histogram of the features for each image. We can still use the above concept, but one thing I can suggest is you maintain a separate variable that tells you how many features were detected per image, then you can index into the assignments variable to help extract out the correct assigned centroid IDs, then build a histogram of those individually. We can build a 2D matrix where each row delineates the histogram of each image. Remember that in kmeans, each row tells you what cluster each example was assigned to independently of the other examples in your data. Using that, you would use kmeans on the entire training dataset, then be smart about how you're accessing the assignments variable to extract out the assigned clusters for each input image.
Therefore, modify your code so that it looks something like this:
clc;
clear;
close all;
folder = 'CarData/TrainImages/cars';
filePattern = fullfile(folder, '*.pgm');
f=dir(filePattern);
files={f.name};
num_features = zeros(numel(files), 1); % New - for keeping track of # of features per image
for k=1:numel(files)
fullFileName = fullfile(folder, files{k});
H = fspecial('log');
image=imfilter(imread(fullFileName),H);
temp = detectSURFFeatures(image);
[im_features, temp] = extractFeatures(image, temp);
num_features(k) = size(im_features, 1); % New - # of features per image
features{k}= im_features;
end
features = vertcat(features{:});
num_clusters = 500; % Added to make the code adaptive
[assignments,centers] = kmeans(double(features), num_clusters);
counter = 1; % Keeps track of where we need to slice in assignments
% Go through each image and find their histograms
features_hist = zeros(numel(files), num_clusters); % Records the per image histograms
for k = 1 : numel(files)
a = assignments(counter : counter + num_features(k) - 1); % Get the assignments
h = histcounts(a, 1 : num_clusters + 1);
% Or:
% h = accumarray(a, 1, [num_clusters 1]).'; % Fix the length, then transpose to make it a row
% Place in final output
features_hist(k, :) = h;
% Increment counter
counter = counter + num_features(k);
end
features_hist will now be a numel(files) x 500 matrix where each row is the histogram for one image. The final job would be to train a supervised machine learning algorithm (SVM, neural network, etc.) where the histogram of each image is the input feature vector and the label you have assigned to that image is the expected output. The result is a learned model: when you have a new image, calculate its SURF features, represent them as a histogram like we did above, then feed that histogram into the classification model to get the predicted class or label for the image.
P.S. Deep Learning / CNNs do a much better job at this, but require much more time to train. If accuracy is your main concern, don't use Bag of Visual Words. It is, however, very quick to implement and is known to perform moderately well, though that of course depends on the kinds of images you want to classify.
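For a new image there is no assignments vector, so each descriptor first has to be matched to its nearest centre. A minimal sketch of that step follows, assuming the descriptors have already been extracted into an M x d matrix. The function name encode_image and the argument name new_feats are hypothetical, and the nearest centre is found by plain squared Euclidean distance.

```matlab
function h = encode_image(new_feats, centers)
% new_feats : M x d matrix of descriptors from one image (hypothetical name)
% centers   : K x d matrix of cluster centres returned by kmeans
K = size(centers, 1);
dist = zeros(size(new_feats, 1), K);
for i = 1:K
    % Squared distance from every descriptor to centre i.
    dist(:, i) = sum(bsxfun(@minus, double(new_feats), centers(i, :)).^2, 2);
end
[~, nearest] = min(dist, [], 2);        % closest centre per descriptor
h = accumarray(nearest, 1, [K 1]).';    % 1 x K histogram of word counts
end
```

The returned row has the same layout as a row of features_hist, so it can be passed to whatever classifier was trained on those rows.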

Matlab: How to customize a clustering code to be a multistage clustering?

I want to cluster a huge amount of data records. The data that I'm dealing with are of string type. The clustering process takes a long time.
Let us assume that I want to cluster a set of email data records into cluster, where emails written by the same person are allocated to the same cluster (taking into account that a person might write his/her name in different ways).
I want to perform a multi stage clustering:
First stage clustering based on name, if the name distance between two records is less than a threshold we consider these clusters otherwise...
The data records enters the second stage of clustering based on other attributes (other than name).
The pairwise distance is calculated. Now I'm in the clustering phase. I want to use the following code for dbscan clustering:
function [IDX, isnoise] = dbscan_strings(X,epsilon,MinPts)
C = 0;
n = size(X,1);
IDX = zeros(n,1);
D = pdist2(X,X,@intersection);
visited = false(n,1);
isnoise = false(n,1);
for i = 1:n
if ~visited(i)
visited(i) = true;
Neighbors = RegionQuery(i);
if numel(Neighbors)<MinPts
% X(i,:) is NOISE
isnoise(i) = true;
else
C = C+1;
ExpandCluster(i,Neighbors,C);
end
end
end
function ExpandCluster(i,Neighbors,C)
IDX(i) = C;
k = 1;
while true
j = Neighbors(k);
if ~visited(j)
visited(j) = true;
Neighbors2 = RegionQuery(j);
if numel(Neighbors2)>=MinPts
Neighbors = [Neighbors Neighbors2]; %#ok
end
end
if IDX(j)==0
IDX(j) = C;
end
k = k + 1;
if k > numel(Neighbors)
break;
end
end
end
function Neighbors = RegionQuery(i)
Neighbors = find(D(i,:)<=epsilon);
end
end
I need help in making the following clustering process into a multistage process where X contains data records with all attributes. Let us assume that X{:,1} is the data records with the name attribute, since the name is contained in the first column.
NOTE: I will give a bounty of 50 points for the one who helps me.
Don't do everything at once!
You are computing a lot of things that you never need, which makes things slow. For example, a good DBSCAN does not use a distance function, but an index.
For the names, only work on unique names! You supposedly have many exact same names, but you end up computing the same distances again and again.
So first of all, build a set of unique names only. Perform your similarity matching on this (I would however suggest to use OpenRefine for this rather than Matlab!).
Once you have identified names to merge, build a new data matrix for every name group. Then run whatever clustering you want. Good candidates are probably HDBSCAN, and OPTICSXi (have a look at the clustering algorithms available in ELKI, which probably has the widest selection to choose from). Maybe start only with an average common name, to get a feeling of the parameters for the algorithm. Don't cluster all subsets at once.
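The "unique names only" step can be sketched as follows. The names are toy data, and name_dist is a hypothetical stand-in for whatever string distance you actually use (here it is just 0 for identical strings and 1 otherwise). The point is that the expensive pairwise loop runs over the distinct names, and the third output of unique maps every original record back to its distinct name.

```matlab
names = {'John Smith'; 'J. Smith'; 'John Smith'; 'Jane Doe'; 'J. Smith'};  % toy data

% uniq holds each distinct name once; ic maps every record to its row in uniq.
[uniq, ~, ic] = unique(names);

% Hypothetical placeholder for a real string distance.
name_dist = @(a, b) double(~strcmp(a, b));

% Pairwise distances on the distinct names only.
nU = numel(uniq);
D = zeros(nU);
for p = 1:nU
    for q = p+1:nU
        D(p, q) = name_dist(uniq{p}, uniq{q});
        D(q, p) = D(p, q);
    end
end

% The distance between original records r and s is then D(ic(r), ic(s)).
```

With five records but only three distinct names, the loop evaluates 3 pairs instead of 10.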

Merge close centroids

suppose I want to cluster data with 3 features. After running a clustering algorithm as a result I got the following 6 cluster centers:
246.844727524039 250.149069392025 94.0942587475951
121.988259016632 162.247917376091 100.033277638728
246.832071340390 250.114555535282 94.0640197467370
247.069762690783 237.380529249185 176.069941183101
57.6643682370364 59.8647220036974 44.0150398556124
253.248727658092 254.655572229735 71.2948414962619
Anyone can notice that centers 1 and 3 are very close to each other. Is there a way to merge them as one center? I'm looking something like a function that returns the merged cluster centers. Any ideas?
I suggest the following approach:
Define a threshold which represents the minimal allowed Euclidean distance between two centroids.
Iterate over all possible pairs, and if the distance between a pair is lower than the threshold, unite them.
You can perform this calculation as follows:
[m,n] = size(centers);
threshold = 1; %defines a threshold
centroidsToMerge = [];
for i=1:m
for j=(i+1):m
if norm(centers(i,:)-centers(j,:))<threshold
centroidsToMerge = [centroidsToMerge;[i,j]];
end
end
end
results for threshold=1:
centroidsToMerge = [1, 3]
results for threshold=30:
centroidsToMerge = [ 1,3 ; 1,6 ; 3,6 ]
If you have the Statistics and Machine Learning Toolbox, you can use MATLAB's pdist function to calculate all the pairwise distances automatically, and thus possibly avoid the for loops. Unfortunately, I don't have this toolbox at the moment, so I wasn't able to test it. However, I still believe it is a good place to start.
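A sketch of the pdist variant, followed by one possible way to actually merge the centres. It assumes the Statistics and Machine Learning Toolbox is available for pdist and squareform. The merging rule is my own assumption: centres linked by any chain of close pairs are grouped together and replaced by their plain average. A weighted average using the cluster sizes would arguably be better, but those sizes are not given in the question.

```matlab
centers = [246.844727524039 250.149069392025 94.0942587475951
           121.988259016632 162.247917376091 100.033277638728
           246.832071340390 250.114555535282 94.0640197467370
           247.069762690783 237.380529249185 176.069941183101
           57.6643682370364 59.8647220036974 44.0150398556124
           253.248727658092 254.655572229735 71.2948414962619];
[m, n] = size(centers);
threshold = 1;

% Full m x m distance matrix; keep only pairs with i < j below the threshold.
D = squareform(pdist(centers));
[i, j] = find(triu(D < threshold, 1));
centroidsToMerge = sortrows([i, j]);

% Give every centre a group ID, then fuse the groups of each close pair.
group = (1:m).';
for p = 1:size(centroidsToMerge, 1)
    a = group(centroidsToMerge(p, 1));
    b = group(centroidsToMerge(p, 2));
    group(group == max(a, b)) = min(a, b);
end

% Renumber the groups 1..G and average the centres within each group.
[~, ~, g] = unique(group);
mergedCenters = zeros(max(g), n);
for d = 1:n
    mergedCenters(:, d) = accumarray(g, centers(:, d), [], @mean);
end
```

For threshold = 1 this finds the single pair [1, 3] and returns five centres, the first being the average of the original centres 1 and 3.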

K-means Stopping Criteria in Matlab?

I'm implementing the k-means algorithm in MATLAB without using the built-in kmeans function. The stopping criterion is that the new centroids don't change between iterations, but I cannot implement it in MATLAB. Can anybody help?
Thanks
Using "no change" as the stopping criterion is a bad idea. There are a few main reasons you shouldn't use a zero-change condition:
Even for a well-behaved function, the difference between zero change and a very small change (say 1e-5) could be 1000+ iterations, so you are wasting time trying to get the values to match exactly. Computers usually keep far more digits than we are interested in. If you only need one digit of accuracy, why wait for the computer to find an answer within 1e-31?
Computers have floating point errors everywhere. Try some easily reversible matrix operations like a = rand(3,3); b = a*a*inv(a); a-b. Theoretically the result should be 0, but you will see it isn't. These errors alone could prevent your program from ever stopping.
Dithering. Let's say we have a 1D k-means problem with 3 numbers and we want to split them into 2 groups. In one iteration the grouping can be a,b vs c, in the next a vs b,c, in the next a,b vs c, and so on. This is of course a simplified example, but there can be instances where a few data points dither between clusters, and you end up with a never-ending algorithm. Since those few points keep getting reassigned, the change will never be 0.
The solution is to use a delta threshold. Basically, you subtract the current values from the previous ones, and if the difference is less than a threshold you are done. This on its own is powerful, but as with any loop, you need a backup escape plan, and that is a max_iterations variable. Look at MATLAB's documentation for kmeans: even it has a MaxIter parameter (default is 100), so even if your k-means doesn't converge, at least it won't run endlessly. Something like this might work:
%problem specific
max_iter = 100;
%choose a small number appropriate to your problem
thresh = 1e-3;
%ensures it runs the first time
delta_mu = thresh + 1;
num_iter = 0;
%do your kmeans in the loop
while (delta_mu > thresh && num_iter < max_iter)
%save these right away
old_mu = curr_mu;
%calculate new means and variances, this is the standard kmeans iteration
%then store the values in a variable called curr_mu
curr_mu = newly_calculate_values;
%use the two norm to find the delta as a single number. no matter what
%the original dimensionality of mu was. If old_mu -new_mu was
% 0 the norm is still 0. so it behaves well as a distance measure.
delta_mu = norm(old_mu - curr_mu,2);
num_iter = num_iter + 1;
end
edit
If you don't know, the 2-norm is essentially the Euclidean distance.
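Filling in the placeholder steps gives a complete loop that can be run as is. This is only a sketch on made-up data (two well-separated blobs) with an arbitrary seed. Two choices here are mine rather than part of the answer above: an empty cluster keeps its previous centre, and the change is measured with the Frobenius norm, which sums the squared movement of every centre coordinate (for a matrix, norm(A, 2) is the largest singular value instead).

```matlab
rng(0);                                   % fixed seed, for repeatability only
X = [randn(20, 2); randn(20, 2) + 8];     % toy data: two separated blobs
k = 2;
max_iter = 100;
thresh = 1e-3;

idx = randperm(size(X, 1));
curr_mu = X(idx(1:k), :);                 % initial centres drawn from the data
delta_mu = thresh + 1;                    % ensures the loop runs at least once
num_iter = 0;
while delta_mu > thresh && num_iter < max_iter
    old_mu = curr_mu;
    % Assignment step: nearest centre by squared Euclidean distance.
    dist = zeros(size(X, 1), k);
    for i = 1:k
        dist(:, i) = sum(bsxfun(@minus, X, old_mu(i, :)).^2, 2);
    end
    [~, label] = min(dist, [], 2);
    % Update step: an empty cluster keeps its previous centre.
    for i = 1:k
        if any(label == i)
            curr_mu(i, :) = mean(X(label == i, :), 1);
        end
    end
    delta_mu = norm(old_mu - curr_mu, 'fro');   % total movement of all centres
    num_iter = num_iter + 1;
end
```

When the loop exits, either the centres moved by no more than thresh in the last pass or num_iter has reached max_iter, so it cannot run forever.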

Details in sparse indexing

I have some code which uses sparse indexing (and there's no way that I can get around that). I run this in a function, and use it for two problems, where the sizes of all the variables involved do not change. However, for one problem, the sparse indexing part takes 5 seconds, and for the other, takes 25 seconds.
I checked the size of every variable involved, and they are the same for both problems. I also checked that xv is a full matrix for both problem types.
So, anyone else ever run into something weird like this? Any ideas as to why this would happen? Mainly I am trying to make the code more efficient, and while 5 seconds is ok for my particular application, 25 seconds (especially when I can't explain it) is very bad.
Edit: Here is a link to a photo that profiles this weird behavior. The runtime values were recorded on the third run to ensure that the size of X is also not changing. And I did check that xv is a dense (not sparse) matrix both times.
https://www.dropbox.com/s/i41j6afanzbjdyg/weird_bcd_thing.png?dl=0
Thanks so much for any help!
Code below (runs in a for loop). If I use ptype = 1, then it's 5 seconds, ptype = 3 is 25 seconds.
clvec = cliques{k};
xcurr = full(X(clvec));
xv = reshape(xcurr - Z(offset_index(k) + 1 : offset_index(k) + ncl^2),ncl,ncl);
%these two functions both take a dense symmetric matrix and return a dense symmetric matrix, and in both cases the size is the same for a given k.
if ptype == 1
xv = proj_PSD(xv,0,0);
elseif ptype == 3
xv = proj_Schoenberg(xv,0);
end
Xd = vec(xv) - xcurr;
%THIS IS THE WEIRD LINE
tic
X(clvec) = xv;
toc;
In the 'WEIRD LINE', X(clvec) = xv;
you are using random access into a sparse matrix.
Access time in a sparse matrix is not constant. It may depend on the values stored in the matrix and on the indices you are trying to access.
This is not the case for a regular (full) matrix, where access time is usually stable, and faster.
To get a stable access time, try to change the implementation based on how you use this specific matrix, and try to avoid assigning values by random access.
See the following code as a reference:
X = sparse(randi(100,50,1),randi(100,50,1),randn(1),100,100);
for i=1:10000
rand_inds{i} = randperm(10000,100);
end
for i=1:100
ti = tic;
X(rand_inds{i}) = 3;
to_X(i) = toc(ti);
end
Xf = full(X);
for i=1:100
ti = tic;
Xf(rand_inds{i}) = 3;
to_Xf(i) = toc(ti);
end
figure;plot(to_X);hold on;plot(to_Xf,'r');
I solved my problem! I'm posting the answer because I think it's interesting.
One thing I didn't mention in the question is that the loop goes from k = 1 to k = L, and for ptype = 3, we add one more step, and that's assigning all the diagonal indices to 0:
X(diag_index) = 0
where diag_index is computed ahead of time.
The problem is, instead of just assigning the values to 0, MATLAB will automatically discard these indices, and the next loop, when accessing diagonal indices, it has to re-allocate for X. So, I changed that line to
X(diag_index) = eps;
and now they both run equally fast! (It's not the best solution, since that's going to be a source of error later, but there's no more mystery!)
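The storage behaviour behind this can be seen on a tiny sparse matrix by watching nnz, the number of stored entries. This is only an illustrative sketch with made-up values. It shows that assigning 0 removes an entry from storage while assigning eps keeps it, and it does not measure timing.

```matlab
S = sparse([1 2 3], [1 2 3], [5 7 9], 3, 3);   % three stored entries
before = nnz(S);

S(2, 2) = 0;          % the entry is removed from storage, not kept as a zero
after_zero = nnz(S);

S(2, 2) = eps;        % a tiny nonzero value keeps the slot allocated
after_eps = nnz(S);
```

Here before is 3, after_zero is 2, and after_eps is 3 again. Writing a nonzero value back into a removed position means inserting a new entry into the sparse storage, which is consistent with the slowdown described above.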
The answer is never what you think it would be...