Why doesn't k-means give the global minimum? - cluster-analysis

I read that the k-means algorithm only converges to a local minimum and not necessarily to the global minimum. Why is this? I can see intuitively how initialization could affect the final clustering and lead to a sub-optimal result, but I did not find anything that proves this mathematically.
Also, why is k-means an iterative process?
Can't we just take the partial derivatives of the objective function w.r.t. the centroids and set them to zero to find the centroids that minimize it? Why do we have to use gradient descent to reach the minimum step by step?

Consider:
. c .
. c .
Where c is a cluster centroid. The algorithm will stop, but a better solution is:
. .
c c
. .
Regarding a proof: you don't need a mathematical proof to show that something isn't always true; a single counter-example, such as the one above, is enough. You could probably turn the above into a formal proof, but that is unnecessary and generally a lot of work; even in academia it is accepted to disprove a claim with a counter-example.
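The counter-example can be checked numerically. Below is a minimal Python sketch, assuming the four points sit on the corners of a 4-by-1 rectangle (the coordinates are my own choice; the diagram only fixes the layout) and using a plain Lloyd-style loop. The function name `lloyd` is hypothetical.

```python
import numpy as np

def lloyd(points, centroids, max_iter=100):
    """Plain Lloyd iterations: assign to nearest centroid, then recompute means."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Squared distances between every point and every centroid, shape (n, k)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members):          # keep the old centroid if a cluster is empty
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break                     # nothing moved: the algorithm has stopped
        centroids = new
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.min(axis=1).sum()

# Assumed coordinates: a rectangle that is wider (4) than it is tall (1)
points = [(0, 0), (4, 0), (0, 1), (4, 1)]

# Start as in the first diagram: one centroid in the middle of each row
_, stuck_cost = lloyd(points, [(2, 0), (2, 1)])
# Start as in the second diagram: one centroid between each left/right pair
_, better_cost = lloyd(points, [(0, 0.5), (4, 0.5)])
```

With these coordinates both starts are fixed points of the iteration, but the first one costs 16 (sum of squared distances) while the second costs 1.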
The k-means algorithm is by definition an iterative process; that is simply how it works. The k-means problem is NP-hard, so an exact algorithm for computing the optimal centroids would take prohibitively long on anything but small inputs.

Don't mix the problem and the algorithm.
The k-means problem is finding the least-squares assignment to centroids.
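To make the least-squares objective concrete, here is a short derivation in standard notation (the symbols x_i, c_i and \mu_j are assumed labels, not taken from the question). It suggests why setting derivatives to zero is not enough on its own: the centroids have a closed form only once the assignments c_i are fixed, and the assignments are discrete, so they cannot be differentiated.

```latex
J(c, \mu) = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2,
\qquad c_i \in \{1, \dots, k\}

\frac{\partial J}{\partial \mu_j}
  = -2 \sum_{i \,:\, c_i = j} (x_i - \mu_j) = 0
\;\Longrightarrow\;
\mu_j = \frac{1}{\lvert \{ i : c_i = j \} \rvert} \sum_{i \,:\, c_i = j} x_i
```

So for a fixed assignment the optimal centroid is just the cluster mean; the hard part is the search over assignments.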
There are multiple algorithms for finding a solution.
There is an obvious approach to find the global optimum: enumerating all k^n possible assignments - that will yield a global minimum, but in exponential runtime.
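As a rough illustration of that enumeration, here is a minimal Python sketch that tries all k^n label vectors and keeps the cheapest one. The function name `kmeans_brute_force` and the four test points are assumptions for the example; it is only usable for tiny n.

```python
from itertools import product

import numpy as np

def kmeans_brute_force(points, k):
    """Try every one of the k**n assignments; return the lowest cost found."""
    points = np.asarray(points, dtype=float)
    best_cost, best_labels = float("inf"), None
    for labels in product(range(k), repeat=len(points)):
        labels = np.array(labels)
        cost = 0.0
        for j in range(k):
            members = points[labels == j]
            if len(members):
                # For a fixed assignment the best centroid is the cluster mean
                cost += ((members - members.mean(axis=0)) ** 2).sum()
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels

# Hypothetical data: corners of a 4-by-1 rectangle
points = [(0, 0), (4, 0), (0, 1), (4, 1)]
best_cost, best_labels = kmeans_brute_force(points, k=2)
```

For these four points the loop visits 2^4 = 16 assignments and ends up pairing the points that are one unit apart.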
Much more attention has been paid to finding approximate solutions in less time.
The Lloyd/Forgy algorithm is an EM-style iterative model refinement approach that is guaranteed to converge to a local minimum, simply because there is a finite number of states and the objective function must decrease in every step. This algorithm runs in O(n*k*i), where usually i << n, but it may find only a local minimum.
MacQueen's method is technically not iterative. It's a single-pass, one-element-at-a-time algorithm that will not even find a local minimum in the Lloyd sense. (You can, however, run multiple passes over the data set until convergence to get a local minimum, too!) A single pass is in O(n*k); for multiple passes, add a factor i. It may or may not take more passes than Lloyd.
Then there is Hartigan and Wong. I don't remember the details; IIRC it was a clever, lazier variant of Lloyd/Forgy, so probably in O(n*k*i), too (although probably not recomputing all n*k distances in later iterations?).
You could also use a randomized algorithm that just tests l random assignments. It probably won't find a minimum at all, but it runs in "linear" time O(n*l).
Oh, and you can try different random initializations, to improve your chances of finding the global minimum. Add a factor t for the number of trials...
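A minimal Python sketch of the restart idea, assuming initial centroids are drawn from the data points and a plain Lloyd-style loop is used for each trial. The names `lloyd_cost` and `best_of_restarts`, the seed, and the data are all hypothetical.

```python
import numpy as np

def lloyd_cost(points, centroids, max_iter=100):
    """Run Lloyd-style iterations from the given centroids; return the final cost."""
    for _ in range(max_iter):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid as its cluster mean (unchanged if the cluster is empty)
        new = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def best_of_restarts(points, k, trials, seed=0):
    """Run t = trials random initializations and keep the lowest cost."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    costs = []
    for _ in range(trials):
        # Initial centroids: k distinct data points chosen at random
        start = points[rng.choice(len(points), size=k, replace=False)]
        costs.append(lloyd_cost(points, start))
    return min(costs), costs

points = [(0, 0), (4, 0), (0, 1), (4, 1)]
best, costs = best_of_restarts(points, k=2, trials=20)
```

On this toy data every trial ends at a cost of either 1 or 16, so taking the minimum over trials only fails if every single start is unlucky. The total work is the cost of one run times t.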

Related

Iteration in k means clustering

I am implementing k-means clustering in TensorFlow and have successfully written the function that randomly selects centroids from the sample points. These centroids are then to be updated based on their distance from the sample points.
Is it always guaranteed that the more I iterate, the better the cluster prediction gets, or is there some point after which the predictions start getting wrong/anomalous?
Usually, a k-means solver behaves as expected, in that it always converges to a local minimum. (I assume you're talking about the Lloyd/Forgy method.) This is a statistical method used to find a local minimum. It may stall at a saddle point where one of the dimensions is optimized but the others are not.
Skipping the rigorous proof: it will always converge, albeit slowly if there are many saddle points in your function.
There is no point at which your prediction gets more "wrong". It will get closer to the minimum it is heading for, but that minimum may not be the global one. This may be your source of concern, because random initialization of k-means does not guarantee that the global minimum is reached.
One way to alleviate this is to actually run K-means on subgroups of your data, and then take those final points and average them to find a good initializer for your final clustering on the whole dataset.
Hope this helps.

Solving non-convex optimization with global optimization algorithm using MATLAB

I have a simple unconstrained non-convex optimization problem. Since problems of this type have multiple local minima, I am looking for a global optimization algorithm that yields the unique/global minimum. On the internet I came across global optimization algorithms like genetic algorithms, simulated annealing, etc., but for a simple one-variable unconstrained non-convex optimization problem, using such high-level algorithms doesn't seem like a good idea. Could anyone recommend a simple global algorithm for such a problem? I would highly appreciate ideas on this.
"Since problems of this type have multiple local minima". That's not necessarily true. The real situation is one of the following:
Maybe you have one local minimum
Maybe you have an infinite set of local minima
Maybe you have a finite number of local minima
Maybe the minimum is not attained
Maybe the problem is unbounded below
Also, the big picture is that there are methods that truly solve such problems (numerically, and they are slow), but in common slang, methods that do not necessarily find the minimum value of the function are also said to "solve" the problem.
In fact, M^n ~ M (same cardinality) for any finite n and any infinite set M. So the fact that your problem has one dimension buys you nothing: from a theoretical point of view it is still as hard as a problem with 1000000 parameters drawn from the set M.
If you are interested in approximately solving the problem with a known precision epsilon in the domain, then split your domain into 1/epsilon regions, evaluate the function at the midpoint of each, and select the minimum.
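A minimal Python sketch of this midpoint sampling, assuming a bounded interval [a, b] that is cut into cells of width at most eps. The test function f is a made-up double-well example, not the asker's problem.

```python
import math

def grid_minimize(f, a, b, eps):
    """Evaluate f at the midpoint of each cell of width <= eps; keep the smallest."""
    cells = max(1, math.ceil((b - a) / eps))
    width = (b - a) / cells
    best_x, best_f = None, float("inf")
    for i in range(cells):
        x = a + (i + 0.5) * width      # midpoint of cell i
        fx = f(x)
        if fx < best_f:
            best_x, best_f = x, fx
    return best_x, best_f

def f(x):
    # Hypothetical non-convex example with two local minima on [-2, 2]
    return (x * x - 1) ** 2 + 0.3 * x

x_best, f_best = grid_minimize(f, -2.0, 2.0, 1e-3)
```

Note that the guarantee is on the location grid, not on the function value: without something like a Lipschitz bound, f could dip arbitrarily low between two sample points.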
The method I describe below is an exact method. Other methods (particle estimation, sequential convex programming, alternating direction, particle swarm, the Nelder-Mead simplex method, multistart gradient/subgradient descent, or any descent algorithm like Newton's method or coordinate descent) all have no guarantees for non-convex problems, and some of them cannot even be applied if the function is non-convex.
If you are interested in really solving the problem to some precision in the function value, then look at the method called branch-and-bound, which truly finds the minimum. I don't think the algorithms you mentioned solve the problem and find the minimum in the strong sense.
The basic idea of branch-and-bound: partition the domain into convex sets and improve the lower/upper bounds. In your case the sets are intervals.
You need a routine to find an upper bound on the optimal (min.) value: you can get one, e.g., just by sampling the subdomain and taking the smallest value, or by running a local optimization method from a random point.
But you also need a lower bound on the optimal (min.) value, obtained by some principle, and this is the hard part:
convex relaxation of integer variables to make them real variables
use the Lagrange dual function
use a Lipschitz constant of the function, etc.
This is the sophisticated step.
If these two values are close, we're done; otherwise, partition or refine the partition.
Get the lower and upper bounds of the child subproblems, then take the min. of the children's upper bounds and the min. of their lower bounds. If a child returns a worse lower bound than its parent's, it can be upgraded to the parent's bound.
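The steps above can be sketched for the one-dimensional case roughly as follows. This is a minimal Python sketch, assuming a known Lipschitz constant is used for the lower bound and the interval midpoint value for the upper bound. The function name `branch_and_bound_1d`, the test function and the constant 25 are my own choices for illustration.

```python
import heapq

def branch_and_bound_1d(f, a, b, lipschitz, tol=1e-3, max_iter=100_000):
    """Minimize f on [a, b], assuming |f(x) - f(y)| <= lipschitz * |x - y|."""
    def node(lo, hi, parent_lower=float("-inf")):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        # Lipschitz lower bound on this interval, upgraded by the parent's bound
        lower = max(f_mid - lipschitz * (hi - lo) / 2, parent_lower)
        return (lower, lo, hi, mid, f_mid)

    root = node(a, b)
    heap = [root]                       # intervals ordered by lower bound
    best_x, best_f = root[3], root[4]   # best_f is the global upper bound
    for _ in range(max_iter):
        lower, lo, hi, mid, _ = heap[0]
        if best_f - lower <= tol:       # upper and lower bounds are close: done
            break
        heapq.heappop(heap)
        # Branch: split the most promising interval into two halves
        for child in (node(lo, mid, lower), node(mid, hi, lower)):
            if child[4] < best_f:
                best_x, best_f = child[3], child[4]
            heapq.heappush(heap, child)
    return best_x, best_f, heap[0][0]

def f(x):
    # Hypothetical non-convex example; |f'(x)| <= 24.3 on [-2, 2]
    return (x * x - 1) ** 2 + 0.3 * x

x_best, upper, lower = branch_and_bound_1d(f, -2.0, 2.0, lipschitz=25.0)
```

The returned pair (upper, lower) brackets the true minimum value, which is what separates this from the sampling and local-descent heuristics: if the loop stops early because of max_iter, the gap tells you how far off you might still be.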
References:
For a great explanation, please look at:
EE364B, Lecture 18, Prof. Stephen Boyd, Stanford University. It's available on YouTube and in iTunes University. If you are new to this area, I recommend the EE263, EE364A, and EE364B courses by Stephen P. Boyd. You will love them.
Since this is a one dimensional problem, things are easier.
A simple steepest descent procedure may be used as follows.
Suppose the search interval is a<x<b.
Start the SD from a, minimizing your function, say f(x). You recover the first minimum Xm1. You should use a fine step, not too large.
Shift this point by adding a small positive constant: Xm1+ε. Then maximize f (or minimize -f) starting from this point. You get a maximum of f; you perturb it by ε and start a minimization from there, and so on.
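A rough Python sketch of this alternating descent/ascent walk. As a simplifying assumption, a fixed-step left-to-right scan stands in for the individual steepest descent runs: it walks downhill until f starts rising (a local minimum), then uphill until f starts falling (a local maximum), and so on. The function name and the test function are hypothetical.

```python
def scan_local_minima(f, a, b, step=1e-3):
    """Walk from a to b with a fixed step, alternating descent and ascent phases."""
    minima = []
    x, fx = a, f(a)
    descending = True                  # start in the minimization phase
    while x + step <= b:
        x_next = x + step
        f_next = f(x_next)
        if descending and f_next > fx:
            minima.append((x, fx))     # f turned upward: x is a local minimum
            descending = False         # switch to the maximization phase
        elif not descending and f_next < fx:
            descending = True          # passed a local maximum: minimize again
        x, fx = x_next, f_next
    if descending:
        minima.append((x, fx))         # still going down at b: boundary minimum
    best = min(minima, key=lambda m: m[1])
    return best, minima

def f(x):
    # Hypothetical non-convex example with two local minima on [-2, 2]
    return (x * x - 1) ** 2 + 0.3 * x

(best_x, best_f), minima = scan_local_minima(f, -2.0, 2.0)
```

The step has to be fine enough that no bump of f fits between two consecutive samples; otherwise a minimum can be skipped silently.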

Who knows the computational complexity of the function quadprog in MATLAB?

The QP problem is convex. According to Wikipedia, the problem can be solved in polynomial time.
But what exactly is the order?
That is an interesting question with (in my opinion) no clear answer. I am going to assume your problem is convex and you are interested in run-time complexity (as opposed to iteration complexity).
As you may know, quadprog is not one algorithm but rather a generic name for something that solves quadratic problems. It uses a set of algorithms underneath, viz. Interior Point (default), Trust-Region and Active-Set. Source.
Depending upon what you choose, each of these algorithms will have its own complexity analysis. For Trust-Region and Active-Set methods, the complexity analysis is extremely hard. In fact, Active-Set methods are not polynomial to begin with. Counterexamples exist where Active-Set methods take exponential "time" to converge (This is true also for the Simplex Method for Linear Programs). Source.
Now, assuming that you choose Interior Point methods, the answer is still not straightforward because there are various flavours of these methods. When Karmarkar first proposed this method, it was the first known polynomial algorithm for solving Linear Programs and it had a complexity of O(n^3.5). Source. These bounds were improved quite a lot later. However, this is for Linear Programs.
Finally, to answer your question, Ye and Tse proved in 1989 that we can have an Interior Point method with complexity O(n^3). However, whether MATLAB uses this exact flavor of Interior Point method is a little tricky to know but O(n^3) would be my best guess.
Of course, my answer is rather theoretical; if you want to empirically test it out, you can do so by gradually increasing the number of variables and plotting the CPU time required to get an estimate.
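A minimal sketch of such an empirical test, written in Python rather than MATLAB and assuming you wrap your own solver in a callable `solver(n)` that builds and solves a problem with n variables (that callable is hypothetical and not shown). The exponent is estimated as the slope of a straight-line fit in log-log space. The numbers used below are synthetic, only to show the fit; they are not measurements of quadprog.

```python
import time

import numpy as np

def measure(solver, sizes, repeats=3):
    """Best-of-`repeats` wall-clock time of solver(n) for each problem size n."""
    times = []
    for n in sizes:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            solver(n)
            best = min(best, time.perf_counter() - start)
        times.append(best)
    return times

def estimate_exponent(sizes, times):
    """Slope of log(time) against log(n), i.e. p in time ~ c * n**p."""
    slope, _ = np.polyfit(np.log(sizes), np.log(times), 1)
    return slope

# Synthetic check, not a measurement: exactly cubic timings give p = 3
sizes = [100, 200, 400, 800]
synthetic_times = [1e-9 * n**3 for n in sizes]
p = estimate_exponent(sizes, synthetic_times)
```

In practice you would call `estimate_exponent(sizes, measure(solver, sizes))` and expect a noisy slope, especially for small n where overhead dominates.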

Naive bayes classifier calculation

I'm trying to use a naive Bayes classifier to classify my dataset. My questions are:
1- Usually when we try to calculate the likelihood we use the formula:
P(c|x)= P(c|x1) * P(c|x2)*...P(c|xn)*P(c) . But some examples say that, in order to avoid getting very small results, we use P(c|x)= exp(log(c|x1) + log(c|x2)+...log(c|xn) + logP(c)). Can anyone explain the difference between these two formulas? Are they both used to calculate the "likelihood", or is the second one used to calculate something called "information gain"?
2- In some cases when we try to classify our datasets, some joint probabilities are null. Some people use the "Laplace smoothing" technique in order to avoid null joint probabilities. Doesn't this technique influence the accuracy of our classification?
Thanks in advance for all your time. I'm just new to this algorithm and trying to learn more about it. So are there any recommended papers I should read? Thanks a lot.
I'll take a stab at your first question, assuming you lost most of the P's in your second equation. I think the equation you are ultimately driving towards is:
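log P(c|x) = log P(x1|c) + log P(x2|c) + ... + log P(xn|c) + log P(c) + const
(the factors in naive Bayes are the class-conditional likelihoods P(xi|c), and the constant, -log P(x), is the same for every class c)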
If so, the examples are pointing out that in many statistical calculations, it's often easier to work with the logarithm of a distribution function, as opposed to the distribution function itself.
Practically speaking, it's related to the fact that many statistical distributions involve an exponential function. For example, you can find where the maximum of a Gaussian distribution K*exp(-s_0*(x-x_0)^2) occurs by solving the mathematically less complex problem (if we're going through the whole formal process of taking derivatives and finding equation roots) of finding where the maximum of its logarithm log(K) - s_0*(x-x_0)^2 occurs.
This leads to many places where "take the logarithm of both sides" is a standard step in an optimization calculation.
Also, computationally, when you are optimizing likelihood functions that may involve many multiplicative terms, adding logarithms of small floating-point numbers is less likely to cause numerical problems than multiplying small floating point numbers together is.
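The numerical point is easy to reproduce. Here is a minimal Python sketch, assuming a standard naive Bayes score built from a prior P(c) and per-feature likelihoods P(x_i|c). The probabilities are made up, and chosen small enough that the plain product underflows in double precision.

```python
import math

def product_score(prior, likelihoods):
    """P(c) * prod_i P(x_i | c), computed directly."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

def log_score(prior, likelihoods):
    """log P(c) + sum_i log P(x_i | c), the same quantity in log space."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Made-up numbers: 100 features, each with a tiny class-conditional probability
prior = 0.5
likelihoods_a = [1e-5] * 100
likelihoods_b = [2e-5] * 100

prod_a = product_score(prior, likelihoods_a)   # underflows to 0.0
prod_b = product_score(prior, likelihoods_b)   # underflows to 0.0
log_a = log_score(prior, likelihoods_a)
log_b = log_score(prior, likelihoods_b)
```

Both products collapse to zero, so the two classes can no longer be compared, while the log scores stay finite and still rank class b above class a. Since the logarithm is monotonic, comparing log scores picks the same class as comparing the products would, so there is no need to exponentiate at the end.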

maximum of a polynomial

I have a polynomial of order N (where N is even). This polynomial tends to minus infinity as x goes to minus/plus infinity (thus it has a maximum). What I am doing right now is taking the derivative of the polynomial using polyder, then finding the roots of the resulting (N-1)th-order polynomial using the roots function in Matlab, which returns N-1 solutions. Then I pick the real root that actually maximizes the polynomial. The problem is that I update my polynomial a lot, and at each time step I use the above procedure to find the maximizer. As a result, the roots function takes too much computation time, making my application slow. Is there a way, either in Matlab or as a proposed algorithm, to do this maximization in a computationally efficient fashion (i.e. just finding one solution instead of N-1 solutions)? Thanks.
Edit: I would also like to know whether there is a routine in Matlab that returns only the real roots, instead of roots, which returns all real/complex ones.
I think that you are probably out of luck. If the coefficients of the polynomial change at every time step in an arbitrary fashion, then ultimately you are faced with a distinct and unrelated optimisation problem at every stage. There is insufficient information available to consider calculating just a subset of roots of the derivative polynomial: how could you know which derivative root gives the maximizing stationary point of the polynomial without comparing the function value at ALL of the derivative roots? If your polynomial coefficients were being perturbed at each step by only a (bounded) small amount or in a predictable manner, then it is conceivable that you could try something iterative to refine the solution at each step (for example, something crude such as using your previous roots as the starting points of a new set of Newton iterations to identify the updated derivative roots), but the question does not suggest that this is in fact the case, so I am just guessing. I could be completely wrong here, but you might just be out of luck in getting something faster unless you can provide more information or have some kind of relationship between the polynomials generated at each step.
There is a file exchange submission by Steve Morris which finds all real roots of functions on a given interval. It does so by interpolating the polynomial by a Chebychev polynomial, and finding its roots.
You can modify the eig evaluation of the companion matrix in there, to eigs. This allows you to find only one (or a few) roots and save time (there's a fair chance it's also possible to compute the roots or extrema of a Chebychev analytically, although I could not find a good reference for that (or even a bad one for that matter...)).
Another attempt that you can make in speeding things up, is to note that polyder does nothing more than
Pprime = (numel(P)-1:-1:1) .* P(1:end-1);
for your polynomial P. Also, roots does nothing more than find the eigenvalues of the companion matrix, so you could find these eigenvalues yourself, which avoids a call to roots. Both changes could be beneficial, because calls to non-builtin functions inside a loop prevent Matlab's JIT compiler from translating the loop to machine language. That translation could otherwise give you a large speed gain (factors of 100 or more are not uncommon).
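For readers who want to see the two steps spelled out, here is a minimal NumPy sketch (a Python analogue, not the MATLAB code itself): the derivative coefficients are computed by hand as in the one-liner above, the critical points are taken as eigenvalues of the companion matrix, and the real one with the largest polynomial value is returned. The function name `argmax_polynomial`, the tolerance on the imaginary part, and the test polynomial are assumptions for the example.

```python
import numpy as np

def argmax_polynomial(p, imag_tol=1e-9):
    """Maximizer of a polynomial (coefficients highest power first, as in polyder)."""
    p = np.asarray(p, dtype=float)
    n = len(p) - 1
    # Derivative coefficients, same idea as (numel(P)-1:-1:1) .* P(1:end-1)
    dp = np.arange(n, 0, -1) * p[:-1]
    dp = dp / dp[0]                            # make the derivative monic
    m = len(dp) - 1
    # Companion matrix: first row holds -coefficients, ones on the subdiagonal
    companion = np.zeros((m, m))
    companion[0, :] = -dp[1:]
    companion[np.arange(1, m), np.arange(m - 1)] = 1.0
    critical = np.linalg.eigvals(companion)    # roots of the derivative
    # Keep real roots only (assumed tolerance; repeated roots may need a looser one)
    real = critical[np.abs(critical.imag) < imag_tol].real
    values = np.polyval(p, real)
    i = np.argmax(values)
    return real[i], values[i]

# Hypothetical example: p(x) = -x^4 + (4/3)x^3 + 4x^2, so p'(x) = -4(x+1)x(x-2)
x_max, p_max = argmax_polynomial([-1.0, 4.0 / 3.0, 4.0, 0.0, 0.0])
```

This still computes all N-1 derivative roots, so it only removes function-call overhead; it does not change the asymptotic cost of the eigenvalue computation.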