K-means Stopping Criteria in Matlab?

I'm implementing the k-means algorithm in MATLAB without using the built-in kmeans function. The stopping criterion is that the centroids no longer change between iterations, but I cannot work out how to implement it in MATLAB. Can anybody help?
Thanks

Setting "no change" as a stopping criterion is a bad idea. There are a few main reasons you shouldn't use a zero-change condition:
Even for a well-behaved function, the difference between zero change and a very small change (say 1e-5) could be 1000+ iterations, so you waste time trying to get the values to be exactly the same, especially because computers usually keep far more digits than we are interested in. If you only need 1 digit of accuracy, why wait for the computer to find an answer that agrees to within 1e-31?
Computers have floating-point errors everywhere. Try some easily reversible matrix operations, like a = rand(3,3); b = a*a*inv(a); a - b. Theoretically this should be 0, but you will see it isn't. These errors alone could prevent your program from ever stopping.
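For instance, a minimal demonstration of that round-off (the exact magnitude varies from run to run):
a = rand(3,3);
b = a*a*inv(a);           % algebraically, b equals a
max(abs(a(:) - b(:)))     % typically around 1e-16, not exactly 0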
Dithering. Let's say we have a 1-D k-means problem with 3 numbers that we want to split into 2 groups. On one iteration the grouping can be {a,b} vs {c}; on the next it could be {a} vs {b,c}; then {a,b} vs {c} again, and so on. This is of course a simplified example, but there are instances where a few data points dither between clusters, and you end up with a never-ending algorithm: since those few points keep getting reassigned, the change is never 0.
The solution is to use a delta threshold: subtract the current values from the previous ones, and if the difference is less than a threshold, you are done. This on its own is powerful, but as with any loop, you need a backup escape plan, and that is setting a max_iterations variable. Look at MATLAB's documentation for kmeans: even they have a MaxIter option (default is 100), so even if your kmeans doesn't converge, at least it won't run endlessly. Something like this might work:
% problem specific
max_iter = 100;
% choose a small number appropriate to your problem
thresh = 1e-3;
% ensures the loop runs the first time
delta_mu = thresh + 1;
num_iter = 0;
% curr_mu must hold your initial centroids before the loop starts
curr_mu = initial_centroids;
% do your kmeans inside the loop
while (delta_mu > thresh && num_iter < max_iter)
    % save the previous centroids right away
    old_mu = curr_mu;
    % calculate the new means (the standard kmeans iteration)
    % and store the result in curr_mu
    curr_mu = newly_calculated_values;
    % use the 2-norm to collapse the delta to a single number, no matter
    % what the original dimensionality of mu was; if old_mu - curr_mu is
    % 0 the norm is 0, so it behaves well as a distance measure
    delta_mu = norm(old_mu - curr_mu, 2);
    num_iter = num_iter + 1;
end
Edit: in case you don't know, the 2-norm is essentially the Euclidean distance.
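As a quick sanity check of that equivalence in plain MATLAB:
v = [3; 4];
norm(v, 2)                % 5
sqrt(sum(v.^2))           % 5, the Euclidean length of v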

Related

Iteration for convergence in Matlab without using a while loop

I have to iterate a process where I have an initial guess for the Mach number (M0). This initial guess yields another guess for the Mach number (Mn) via two equations. Eventually, I want to iterate this process until the error between M0 and Mn is small. I have the following piece of code, and it actually works well with a while loop.
However, I am afraid that the while loop will take many iterations and a lot of computational time for certain inputs, since this will be part of a bigger code that will most likely feed it unfeasible inputs.
Therefore my question is the following: how can I iterate this process in MATLAB without resorting to a while loop? The code that I am implementing now is the following:
%% Input
gamma = 1.4;
theta = atan(0.315);
cpi = -0.732;
%% Loop
M0 = 0.2; % initial guess
Err = 100;
iterations = 0;
while Err > 0.5E-3
    B = (1-(M0^2)*(1-M0*cpi))^0.5;
    Mn = (((gamma+1)/2) * ((B+((1-cpi)^0.5)*sec(theta)-1)^2/(B^2 + (tan(theta))^2)) - ((gamma-1)/2) )^-0.5;
    Err = abs(M0 - Mn);
    M0 = Mn;
    iterations = iterations + 1;
end
disp(iterations)
disp(Mn)
Many thanks
Since each new M0 is computed from the previous one, the iteration is an inherently sequential recurrence, so there is no way around an iteration structure (i.e. while).
If M0 changed by a known, fixed increment each step, you could initialize a vector of M0 values and do vectorized calculations for B and Err.
But because every M0 depends on the Mn of the previous iteration, that is not possible here.
Another way would be parallel processing. But since you change M0 at each iteration, you cannot use a parfor loop either.
As for a for loop, MATLAB requires an array as the for "command" argument (e.g. 1:10, 1:length(x), or i = A, where A = 1:10 or A = [1:10;11:20]). Since you evaluate a condition and, depending on the result, decide whether to continue execution, the while loop (or do-while in another language) is the only way to go.
I think you need to clarify the issue. If the problem you want to solve is that some inputs take a long time to compute, it is not the while loop itself that takes the time; it is executing the block of code multiple times. Any looping method is bounded by the time one execution of the block takes, multiplied by the number of iterations required to converge.
You can introduce something to stop at a certain number of iterations, conceptually:
while ((err > tolerance) && (numIterations < limit))
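As a concrete version of that conceptual line, here is the question's loop with an iteration cap added (the limit value is illustrative; gamma, theta and cpi come from the question's %% Input section):
tol = 0.5E-3;             % the question's tolerance
limit = 100;              % illustrative iteration cap
M0 = 0.2;                 % initial guess
Err = 100;
iterations = 0;
while (Err > tol) && (iterations < limit)
    B = (1-(M0^2)*(1-M0*cpi))^0.5;
    Mn = (((gamma+1)/2) * ((B+((1-cpi)^0.5)*sec(theta)-1)^2/(B^2 + (tan(theta))^2)) - ((gamma-1)/2) )^-0.5;
    Err = abs(M0 - Mn);
    M0 = Mn;
    iterations = iterations + 1;
end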
If you want an answer that does not require iterating over the code, that is akin to finding a closed-form solution, and I suspect one does not exist.
Edit to add: by "not exist" I mean in a practical form that can be implemented more efficiently than iterating to a solution.

Details in sparse indexing

I have some code which uses sparse indexing (and there's no way that I can get around that). I run this in a function, and use it for two problems, where the sizes of all the variables involved do not change. However, for one problem, the sparse indexing part takes 5 seconds, and for the other, takes 25 seconds.
I checked the size of every variable involved, and they are the same for both problems. I also checked that xv is a full matrix for both problem types.
So, anyone else ever run into something weird like this? Any ideas as to why this would happen? Mainly I am trying to make the code more efficient, and while 5 seconds is ok for my particular application, 25 seconds (especially when I can't explain it) is very bad.
Edit: Here is a link to a photo that profiles this weird behavior. The runtime values were recorded on the third run to ensure that the size of X is also not changing. And I did check that xv is a dense (not sparse) matrix both times.
https://www.dropbox.com/s/i41j6afanzbjdyg/weird_bcd_thing.png?dl=0
Thanks so much for any help!
Code below (runs in a for loop). If I use ptype = 1, then it's 5 seconds, ptype = 3 is 25 seconds.
clvec = cliques{k};
xcurr = full(X(clvec));
xv = reshape(xcurr - Z(offset_index(k) + 1 : offset_index(k) + ncl^2),ncl,ncl);
% these two functions both take a dense symmetric matrix and return a dense
% symmetric matrix, and in both cases the size is the same for a given k
if ptype == 1
    xv = proj_PSD(xv,0,0);
elseif ptype == 3
    xv = proj_Schoenberg(xv,0);
end
Xd = vec(xv) - xcurr;
% THIS IS THE WEIRD LINE
tic
X(clvec) = xv;
toc;
In the 'WEIRD LINE', X(clvec) = xv;, you are doing random access into a sparse matrix.
Access into a sparse matrix is not constant-time and depends on its data: MATLAB stores sparse matrices in compressed column format, so inserting or removing a stored entry can mean shifting the underlying arrays, and the time depends on the matrix values and the indices you are accessing.
This is not the case for a regular (full) matrix, where access time is stable and faster.
To get stable, constant-time access, try to change the implementation based on your specific matrix usage, and avoid assigning values by random access.
See the following code as a reference:
% 100x100 sparse matrix with 50 random nonzeros
X = sparse(randi(100,50,1),randi(100,50,1),randn(1),100,100);
% random sets of 100 linear indices into the 100x100 = 10000 elements
for i = 1:100
    rand_inds{i} = randperm(10000,100);
end
% time random-access assignment into the sparse matrix
for i = 1:100
    ti = tic;
    X(rand_inds{i}) = 3;
    to_X(i) = toc(ti);
end
% time the same assignments into a full copy
Xf = full(X);
for i = 1:100
    ti = tic;
    Xf(rand_inds{i}) = 3;
    to_Xf(i) = toc(ti);
end
figure; plot(to_X); hold on; plot(to_Xf,'r');
I solved my problem! I'm posting the answer because I think it's interesting.
One thing I didn't mention in the question is that the loop runs from k = 1 to k = L, and for ptype = 3 we add one more step: assigning all the diagonal indices to 0:
X(diag_index) = 0
where diag_index is computed ahead of time.
The problem is that instead of just storing the zeros, MATLAB automatically discards those entries from the sparse structure, so on the next pass through the loop, when the diagonal indices are assigned again, X has to be re-allocated. So I changed that line to
X(diag_index) = eps;
and now they both run equally fast! (It's not the best solution, since that's going to be a source of error later, but there's no more mystery!)
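A minimal demonstration of that effect (standard sparse behavior; the size is arbitrary):
S = speye(1000);
nnz(S)                 % 1000 stored entries
S(1:1001:end) = 0;     % zero out the diagonal via linear indexing
nnz(S)                 % 0 -- the explicit zeros were discarded
S(1:1001:end) = eps;   % assigning nonzeros again forces re-allocation
nnz(S)                 % back to 1000 stored entries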
The answer is never what you think it would be...

K-Means centroids getting marginalized to having no data points [Matlab]

So I have a sort of strange problem. I have a dataset with 240 points and I'm trying to use k-means to cluster it into 100 clusters. I'm using Matlab but I don't have access to the statistics toolbox, so I had to write my own k-means function. It's pretty simple, so that shouldn't be too hard, right? Well, it seems something is wrong with my code:
function result = Kmeans(X, c)
[N,n] = size(X);
index = randperm(N);
ctrs = X(index(1:c),:);
old_label = zeros(1,N);
label = ones(1,N);
iter = 0;
while ~isequal(old_label, label)
    old_label = label;
    label = assign_labels(X, ctrs);
    for i = 1:c
        ctrs(i,:) = mean(X(label == i,:));
        if sum(isnan(ctrs(i,:))) ~= 0
            ctrs(i,:) = zeros(1,n);
        end
    end
    iter = iter + 1;
end
result = ctrs;

function label = assign_labels(X, ctrs)
[N,~] = size(X);
[c,~] = size(ctrs);
dist = zeros(N,c);
for i = 1:c
    dist(:,i) = sum((X - repmat(ctrs(i,:),[N,1])).^2,2);
end
[~,label] = min(dist,[],2);
It seems what happens is that when I go to recompute the centroids, some centroids have no datapoints assigned to them, so I'm not really sure what to do with that. After doing some research on this, I found that this can happen if you supply arbitrary initial centroids, but in this case the initial centroids are taken from the datapoints themselves, so this doesn't really make sense. I've tried re-assigning these centroids to random datapoints, but that causes the code to not converge (or at least after letting it run all night, the code never converged). Basically they get re-assigned, but that causes other centroids to get marginalized, and repeat. I'm not really sure what's wrong with my code, but I ran this same dataset through R's k-means function for k=100 for 1000 iterations and it managed to converge. Does anyone know what I'm messing up here? Thank you.
Let's step through your code one piece at a time and discuss what you're doing with respect to what I know about the k-means algorithm.
function result=Kmeans(X,c)
[N,n]=size(X);
index=randperm(N);
ctrs = X(index(1:c),:);
old_label = zeros(1,N);
label = ones(1,N);
This looks like a function that takes in a data matrix of size N x n, where N is the number of points in your dataset and n is the dimension of a point in your dataset. This function also takes in c: the desired number of output clusters. index provides a random permutation between 1 and as many data points as you have, and the first c points of this permutation are selected at random to initialize your cluster centres.
iter = 0;
while ~isequal(old_label, label)
    old_label = label;
    label = assign_labels(X, ctrs);
    for i = 1:c
        ctrs(i,:) = mean(X(label == i,:));
        if sum(isnan(ctrs(i,:))) ~= 0
            ctrs(i,:) = zeros(1,n);
        end
    end
    iter = iter + 1;
end
result = ctrs;
For k-means, we basically keep iterating until the cluster membership of each point from the previous iteration matches with the current iteration, which is what you have going with your while loop. Now, label determines the cluster membership of each point in your dataset. Now, for each cluster that exists, you determine what the mean data point is, then assign this mean data point as the new cluster centre for each cluster. For some reason, should you experience any NaN for any dimension of your cluster centre, you set your new cluster centre to all zeroes instead. This looks very abnormal to me, and I'll provide a suggestion later. Edit: Now I understand why you did this. This is because should you have any clusters that are empty, you would simply make this cluster centre all zeroes as you wouldn't be able to find the mean of empty clusters. This can be solved with my suggestion for duplicate initial clusters towards the end of this post.
function label = assign_labels(X, ctrs)
[N,~] = size(X);
[c,~] = size(ctrs);
dist = zeros(N,c);
for i = 1:c
    dist(:,i) = sum((X - repmat(ctrs(i,:),[N,1])).^2,2);
end
[~,label] = min(dist,[],2);
This function takes in a dataset X and the current cluster centres for this iteration, and it should return a label list of where each point belongs to each cluster. This also looks correct because for each column of dist, you are calculating the distance between each point to each cluster, where those distances are in the ith column for the ith cluster. One optimization trick that I would use is to avoid using repmat here and use bsxfun which handles the replication internally. Therefore, do this instead:
function label = assign_labels(X, ctrs)
[N,~] = size(X);
[c,~] = size(ctrs);
dist = zeros(N,c);
for i = 1:c
    dist(:,i) = sum(bsxfun(@minus, X, ctrs(i,:)).^2, 2);
end
[~,label] = min(dist,[],2);
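As a side note, on MATLAB R2016b or newer, implicit expansion makes bsxfun unnecessary and the distance line can be written directly as:
dist(:,i) = sum((X - ctrs(i,:)).^2, 2);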
Now, this all looks correct. I also ran some tests myself and it all seems to work out, provided that the initial cluster centres are unique. One small problem with k-means is that we implicitly assume that all cluster centres are unique. Should they not be unique, then you'll run into a problem where two clusters (or more) have the exact same initial cluster centres.... so which cluster should the data point be assigned to? When you're doing the min in your assign_labels function, should you have two identical cluster centres, the cluster label that the point gets assigned to will be the minimum of these two numbers. This is why you will have a cluster with no points in it, as all of the points that should have been assigned to this cluster get assigned to the other.
As such, you may have two (or more) initial cluster centres that are identical after randomization: even though the permutation of indices is unique, the actual data points they select may not be. One thing you can do is loop over the permutation until you get a unique set of initial clusters without repeats. As such, try doing this at the beginning of your code instead:
[N,n] = size(X);
index = randperm(N);
ctrs = X(index(1:c),:);
while size(unique(ctrs, 'rows'), 1) ~= c
    index = randperm(N);
    ctrs = X(index(1:c),:);
end
old_label = zeros(1,N);
label = ones(1,N);
iter = 0;
%// While loop appears here
This will ensure that you have a unique set of initial clusters before you continue on in your code. Now, going back to your NaN stuff inside the for loop. I honestly don't see how any dimension could result in NaN after you compute the mean if your data doesn't have any NaN to begin with. I would suggest you get rid of this in your code as (to me) it doesn't look very useful. Edit: You can now remove the NaN check as the initial cluster centres should now be unique.
This should hopefully fix your problems you're experiencing. Good luck!
"Losing" a cluster is not half as special as one may think, due to the nature of k-means.
Consider duplicates. Lets assume that all your first k points are identical, what would happen in your code? There is a reason you need to carefully handle this case. The simplest solution would be to leave the centroid as it was before, and live with degenerate clusters.
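A minimal sketch of that fix, using the variable names from the question's code (if a cluster ends up empty, its previous centroid is simply kept):
for i = 1:c
    members = X(label == i, :);
    if ~isempty(members)
        ctrs(i,:) = mean(members, 1);
    end
    % else: cluster i is empty -- keep its previous centroid
end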
Given that you only have 240 points but want k=100 clusters, don't expect great results. Most objects will end up in clusters of their own... choosing a much too large k is probably why you see this degeneration effect so often. Suppose that fewer than 100 of your 240 points are unique: then you cannot have 100 non-empty clusters at all. Besides, I would consider this kind of result "overfitting" anyway.
If you don't have the toolboxes you need in Matlab, maybe you should move on to free software. Octave, R, Weka, ELKI, ... there is plenty of software, some of which is much more powerful when it comes to clustering than pure Matlab (in particular, if you don't have the toolboxes).
Also, benchmark. You will be surprised by the performance differences.

Matlab vectorization of multiple embedded for loops

Suppose you have 5 vectors: v_1, v_2, v_3, v_4 and v_5. These vectors each contain a range of values from a minimum to a maximum. So for example:
v_1 = minimum_value:step:maximum_value;
Each of these vectors uses the same step size but has a different minimum and maximum value. Thus they are each of a different length.
A function F(v_1, v_2, v_3, v_4, v_5) is dependent on these vectors and can use any combination of the elements within them. (Apologies for the poor explanation.) I am trying to find the maximum value of F and record the values that produced it. My current approach has been to use multiple nested for loops, as shown, to evaluate the function for every combination of the vectors' elements:
% Initialize the best value found so far
temp = 0;
% For every combination of the five vectors, evaluate the function. If the
% result is greater than the previous best, store it along with the
% positions of the elements within the vectors
for a = 1:length(v_1)
    for b = 1:length(v_2)
        for c = 1:length(v_3)
            for d = 1:length(v_4)
                for e = 1:length(v_5)
                    % The function is a combination of trigonometrics,
                    % summations, multiplications etc.
                    Result = F(v_1(a), v_2(b), v_3(c), v_4(d), v_5(e));
                    % If Result is greater than the previous best, store it
                    % and record the values of 'a','b','c','d' and 'e'
                    if Result > temp
                        temp = Result;
                        f = a;
                        g = b;
                        h = c;
                        i = d;
                        j = e;
                    end
                end
            end
        end
    end
end
This gets incredibly slow for small step sizes. If there are around 100 elements in each vector, the number of combinations is 100^5. This is a problem, as I need small step values to get a suitably converged answer.
I was wondering whether it is possible to speed this up using vectorization or any other method. I also looked at generating the combinations prior to the calculation, but this seemed even slower than my current method. I haven't used Matlab for a long time, but just looking at the number of nested for loops makes me think this can definitely be sped up. Thank you for the suggestions.
No matter how you generate your parameter combination, you will end up calling your function F 100^5 times. The easiest solution would be to use parfor instead in order to exploit multi-core calculation. If you do that, you should store the calculation results and find the maximum after the loop, because your current approach would not be thread-safe.
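A hedged sketch of that idea, reusing the names from the question (F and v_1 ... v_5 are assumed to exist; parfor needs the Parallel Computing Toolbox and falls back to a serial for without it). Each worker keeps a local best, written into sliced outputs, and the overall maximum is reduced after the loop:
n1 = length(v_1);
bestVal = -inf(n1, 1);          % best result for each value of v_1
bestIdx = zeros(n1, 4);         % the (b,c,d,e) indices that achieved it
parfor a = 1:n1
    localBest = -inf;
    localIdx = [0 0 0 0];
    for b = 1:length(v_2)
        for c = 1:length(v_3)
            for d = 1:length(v_4)
                for e = 1:length(v_5)
                    Result = F(v_1(a), v_2(b), v_3(c), v_4(d), v_5(e));
                    if Result > localBest
                        localBest = Result;
                        localIdx = [b c d e];
                    end
                end
            end
        end
    end
    bestVal(a) = localBest;     % sliced assignment keeps this thread-safe
    bestIdx(a,:) = localIdx;
end
[temp, f] = max(bestVal);       % overall maximum and the winning v_1 index
g = bestIdx(f,1); h = bestIdx(f,2); i = bestIdx(f,3); j = bestIdx(f,4);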
Having said that, and not knowing anything about your actual problem, I would advise a more structured approach: first find a coarse solution with a bigger step size, then narrow it down successively by shrinking the min/max values of your parameter intervals. What you have currently is the absolute brute-force method, which will never be very effective.
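For illustration, a minimal sketch of that coarse-to-fine idea in one dimension; the hypothetical objective G stands in for F restricted to a single parameter, and the same interval-shrinking applies to each of the five:
G = @(x) -(x - 0.37).^2;          % hypothetical 1-D objective
lo = 0; hi = 1;                   % initial parameter interval
for level = 1:4
    v = linspace(lo, hi, 21);     % 21 coarse samples per level
    [~, ib] = max(G(v));          % best sample on this grid
    step = v(2) - v(1);
    lo = max(0, v(ib) - step);    % shrink the interval around the best
    hi = min(1, v(ib) + step);
end
best_x = v(ib);                   % refined estimate of the maximizer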

Jacobi iteration doesn't end

I'm trying to implement the Jacobi iteration in MATLAB but am unable to get it to converge. I have looked online and elsewhere for working code for comparison but am unable to find any that is something similar to my code and still works. Here is what I have:
function x = Jacobi(A, b, tol, maxiter)
n = size(A,1);
xp = zeros(n,1);
x = zeros(n,1);
k = 0; % number of steps
while (k <= maxiter)
    k = k + 1;
    for i = 1:n
        xp(i) = 1/A(i,i)*(b(i) - A(i,1:i-1)*x(1:i-1) - A(i,i+1:n)*x(i+1:n));
    end
    err = norm(A*xp - b);
    if (err < tol)
        x = xp;
        break;
    end
    x = xp;
end
This just blows up no matter what A and b I use. It's probably a small error I'm overlooking, but I would be very grateful if anyone could explain what's wrong, because this should be correct but isn't in practice.
Your code is correct. The reason it may not seem to work is that you are specifying systems that may not converge under Jacobi iterations.
To be specific (thanks to @Saraubh), this method will converge if your matrix A is strictly diagonally dominant. In other words, for each row i of your matrix, the sum of the absolute values of the off-diagonal entries in that row must be less than the absolute value of the diagonal entry itself. In other words:
abs(A(i,i)) > sum(abs(A(i,:))) - abs(A(i,i))   % for every row i
However, there are some systems that will converge with Jacobi even if this condition isn't satisfied, but you should use it as a general rule before trying Jacobi on your system. It's actually more stable to use Gauss-Seidel. The only difference is that you re-use the solution x as you progress down the rows, feeding each freshly updated component into the remaining equations. To make this Gauss-Seidel, all you have to do is change one term within your for loop. Change it from this:
xp(i) = 1/A(i,i)*(b(i) - A(i,1:i-1)*x(1:i-1) - A(i,i+1:n)*x(i+1:n));
To this:
xp(i) = 1/A(i,i)*(b(i) - A(i,1:i-1)*xp(1:i-1) - A(i,i+1:n)*x(i+1:n));
The change is in the first term: x(1:i-1) becomes xp(1:i-1), so the components already updated in the current sweep are used immediately.
Here are two examples that I will show you:
A system that does not converge with Jacobi, even though a solution exists; this system is not diagonally dominant.
A system that does converge with Jacobi; specifically, this system is diagonally dominant.
Example #1
A = [1 2 2 3; -1 4 2 7; 3 1 6 0; 1 0 3 4];
b = [0;1;-1;2];
x = Jacobi(A, b, 0.001, 40)
xtrue = A \ b
x =
1.0e+09 *
4.1567
0.8382
1.2380
1.0983
xtrue =
-0.1979
-0.7187
0.0521
0.5104
Now, if I used the Gauss-Seidel solution, this is what I get:
x =
-0.1988
-0.7190
0.0526
0.5103
Whoa! It converged for Gauss-Seidel but not Jacobi, even though the system isn't diagonally dominant. I may have an explanation for that, which I'll provide later.
Example #2
A = [10 -1 2 0; -1 -11 -1 3; 2 -1 10 -1; 0 3 -1 8];
b = [6;25;-11;15];
x = Jacobi(A, b, 0.001, 40);
xtrue = A \ b
x =
0.6729
-1.5936
-1.1612
2.3275
xtrue =
0.6729
-1.5936
-1.1612
2.3274
This is what I get with Gauss-Seidel:
x =
0.6729
-1.5936
-1.1612
2.3274
This certainly converged for both, and the system is diagonally dominant.
As such, there is nothing wrong with your code; you are just specifying a system that can't be solved using Jacobi. It's better to use Gauss-Seidel for iterative methods of this kind. The reason is that Gauss-Seidel immediately uses information from the current iteration and propagates it to the rest of the variables; Jacobi does not, which is why it diverges more readily. You can see that Jacobi failed to converge for Example #1 but succeeded for Example #2, while Gauss-Seidel converged for both. In fact, when they both converge, they're quite close to the true solution.
Again, you need to make sure that your systems are diagonally dominant so you are guaranteed to have convergence. Not enforcing this rule... well... you'll be taking a risk as it may or may not converge.
Good luck!
Though this does not point out the problem in your code, I believe that you are looking for the Numerical Methods: Jacobi File Exchange Submission.
%JACOBI Jacobi iteration for solving a linear system.
% Sample call
% [X,dX] = jacobi(A,B,P,delta,max1)
% [X,dX,Z] = jacobi(A,B,P,delta,max1)
It seems to do exactly what you describe.
Others have pointed out that not all systems converge with the Jacobi method, but they do not say why. In fact, only a small subset of systems converge with Jacobi.
The convergence criterion is that, in every row, the sum of the absolute values of the non-diagonal coefficients must be less than the absolute value of the diagonal coefficient of that row. This criterion must be satisfied by all rows. You can read more at: Jacobi Method Convergence
Before you decide to use the Jacobi method, you must check whether this criterion is satisfied by your system. The Gauss-Seidel method has a slightly more relaxed convergence criterion, which allows you to use it for most Finite Difference type numerical methods.
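A quick way to test that criterion in MATLAB before committing to Jacobi (a minimal sketch; A is the system matrix):
% strict row diagonal dominance: |A(i,i)| > sum of the off-diagonal
% magnitudes in row i, i.e. 2*|A(i,i)| > sum of |A(i,:)| for every row
isDiagDominant = all(2*abs(diag(A)) > sum(abs(A), 2));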