In k-means clustering, is the marginal sum of squared distances decreasing? - cluster-analysis

Suppose I have a set of data, and let SSD(n) be the sum of squared distances when we assume n clusters. My question is the following: is the marginal SSD always decreasing in n? In other words, is the function f(n) defined as
f(n)=SSD(n)-SSD(n+1)
decreasing in n? This would mean the benefit of adding each additional cluster is decreasing. I am trying to find either a proof or a simple counterexample.
I have done some simulations with random data, and it always seems to be true.
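For anyone who wants to repeat this kind of check without depending on a heuristic k-means run, here is a minimal Python sketch restricted to one-dimensional data. It assumes that SSD(n) means the best achievable sum of squared distances for n clusters, and it uses the fact that optimal 1-D clusters are contiguous runs of the sorted points, so a small dynamic program finds the exact optimum. The function names optimal_ssd_1d and marginal_gains are made up for this illustration. It is only a way to probe f(n) on small examples, not a proof either way.

```python
def optimal_ssd_1d(points, n_clusters):
    """Exact minimal SSD for 1-D data split into n_clusters groups."""
    xs = sorted(points)
    m = len(xs)
    # Prefix sums of x and x^2 give the SSD of any contiguous run in O(1).
    sum1, sum2 = [0.0], [0.0]
    for x in xs:
        sum1.append(sum1[-1] + x)
        sum2.append(sum2[-1] + x * x)

    def run_ssd(i, j):
        # SSD of xs[i:j] around its own mean (requires j > i).
        s = sum1[j] - sum1[i]
        return (sum2[j] - sum2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[k][j]: minimal SSD of the first j sorted points using k clusters.
    best = [[inf] * (m + 1) for _ in range(n_clusters + 1)]
    best[0][0] = 0.0
    for k in range(1, n_clusters + 1):
        for j in range(k, m + 1):
            best[k][j] = min(best[k - 1][i] + run_ssd(i, j) for i in range(k - 1, j))
    return best[n_clusters][m]


def marginal_gains(points):
    """Return [f(1), ..., f(N-1)] with f(n) = SSD(n) - SSD(n+1)."""
    ssd = [optimal_ssd_1d(points, n) for n in range(1, len(points) + 1)]
    return [ssd[n] - ssd[n + 1] for n in range(len(ssd) - 1)]
```

Calling marginal_gains on a small list of numbers returns f(1), f(2), ... so you can inspect whether the sequence is decreasing for that particular data set.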

Related

MatLab:Generate N pseudo-random numbers with a Poisson distribution having mean M and total T where N,M, and T are user defined

I’d like to be able to generate in MatLab a sequence of N pseudo-random numbers with a Poisson distribution having mean M. The sum of the N numbers should be T. N, M, and T are always positive or zero and would be user specified parameters to any function.
Obviously, if T is small relative to N it is likely that there will be problems achieving a total of T. In that case the function could just return the values T and then N-1 zeros or an error code. However, it is highly likely that in most cases T>>N.
I have been trying variations based on the method of generating random numbers with a given distribution provided at http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution and trying various normalizations at each step but have not been successful.
You could try to approximate what you want by using a multinomial distribution.
If you use Wikipedia notation, then k=N, n=T and p_i=M/T. The Poisson distribution has the distinctive property that its mean equals its variance, but if your parameters are such that p_i is small, then the mean n*p_i would be quite close to the variance n*p_i*(1-p_i). The sum would automatically (by a property of the multinomial) be equal to T.
Multinomial sampling in Matlab is done using the mnrnd function.
UPDATE
Regarding the comment, let's consider N sampled values v_i, and write their sum
Sum(i=1...N) v_i = T
Let's compute the mean value of the left and right sides of this equation.
Sum(i=1...N) E(v_i) = E(T) = T
On the right side, the mean value of a constant is the constant itself. On the left side we have
Sum(i=1...N) E(v_i) = Sum(i=1...N) M = N*M = T
Therefore, M=T/N and p_i=M/T=1/N.
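To make the idea concrete, here is a rough sketch in Python rather than MATLAB, using only the standard library. It assumes equal probabilities 1/N as derived above, and draws one multinomial sample by assigning each of the T units to one of N bins at random. The function name fixed_total_counts is invented for this example, and the result is only approximately Poisson, as discussed.

```python
import random
from collections import Counter


def fixed_total_counts(n_values, total, seed=None):
    """Draw n_values non-negative integers that sum exactly to total.

    Each of the `total` units is dropped into one of `n_values` bins with
    equal probability 1/n_values, which is one multinomial sample.
    """
    if n_values <= 0:
        raise ValueError("n_values must be positive")
    rng = random.Random(seed)
    bins = rng.choices(range(n_values), k=total)
    counts = Counter(bins)
    return [counts.get(i, 0) for i in range(n_values)]


sample = fixed_total_counts(n_values=10, total=500, seed=1)
```

Each entry has mean T/N, and the variance is (T/N)*(1 - 1/N), which is close to the mean when N is large.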

Clustering with a Distance Matrix via Mahalanobis distance

I have a set of pairwise distances (in a matrix) between objects that I would like to cluster. I currently use k-means clustering (computing distance from the centroid as the average distance to all members of the given cluster, since I do not have coordinates), with k chosen by the best Davies-Bouldin index over an interval.
However, I have three separate metrics (more in the future, potentially) describing the difference between the data, each fairly different in terms of magnitude and spread. Currently, I compute the distance matrix with the Euclidean distance across the three metrics, but I am fairly certain that the difference between the metrics is messing it up (e.g. the largest one is overpowering the other ones).
I thought a good way to deal with this is to use the Mahalanobis distance to combine the metrics. However, I obviously cannot compute the covariance matrix between the coordinates, but I can compute it for the distance metrics. Does this make sense? That is, if I get the distance between two objects i and j as:
D(i,j) = sqrt( dt S^-1 d )
where d is the 3-vector of the different distance metrics between i and j, dt is the transpose of d, and S is the covariance matrix of the distances, would D be a good, normalized metric for clustering?
I have also thought of normalizing the metrics (i.e. subtracting the mean and dividing by the standard deviation) and then simply staying with the Euclidean distance (in fact it would seem that this essentially is the Mahalanobis distance, at least in some cases), or of switching to something like DBSCAN or EM, and have not ruled them out (though MDS then clustering might be a bit excessive). As a side note, any packages able to do all of this would be greatly appreciated. Thanks!
Consider using k-medoids (PAM) instead of a hacked k-means. K-medoids can work with arbitrary distance functions, whereas k-means is designed to minimize variances, not arbitrary distances.
EM will have the same problem - it needs to be able to compute meaningful centers.
You can also use hierarchical linkage clustering. It only needs a distance matrix.
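As an illustration of clustering straight from a distance matrix, here is a small Python sketch of the simple alternating variant of k-medoids (assign to nearest medoid, then re-pick each medoid). Note that this is not the full PAM swap procedure, the initialisation with the first k objects is a naive assumption, and the function name k_medoids is made up for the example.

```python
def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a square distance matrix (list of lists).

    Returns (medoids, labels): medoid indices and, for each object,
    the position of its medoid in that list.
    """
    n = len(dist)

    def assign(medoids):
        # Label every object with its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = list(range(k))  # naive initialisation: first k objects
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                # Keep the old medoid if a cluster ended up empty.
                new_medoids.append(medoids[c])
                continue
            # New medoid: member with the smallest total distance to the rest.
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Toy distance matrix for four objects forming two obvious pairs.
toy = [[0, 1, 10, 11],
       [1, 0, 9, 10],
       [10, 9, 0, 1],
       [11, 10, 1, 0]]
medoids, labels = k_medoids(toy, k=2)
```

Only entries of the distance matrix are used, so no coordinates or centroids are ever needed.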

scipy kmeans iteration meaning?

I am using the kmeans2 algorithm from scipy to cluster pixel colors in an image to get the top average colors in the image.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html#scipy.cluster.vq.kmeans2
I am confused on the meaning for this parameter:
iter : int
Number of iterations of the k-means algorithm to run. Note that this differs in meaning from the iters parameter to the kmeans function.
If I want the kmeans algorithm to run until the clusters don't change, would I set the iter value high? Is there a way to find a best iter value?
The K-means algorithm works by initializing some K points and clustering your data by their distance from those points. Then it iterates by calculating the centroid of each cluster and redefining clusters by distance from the centroid. This isn't guaranteed to converge quickly, though it often does, so it's asking for a maximum iteration value.
Edit: "maximum iteration value" is incorrect, I think; it is literally going to iterate iter times. The default of 10 is a common iter value, though.
Up to the point of convergence, a higher iter value generally gives a better clustering; once the assignments stop changing, extra iterations gain nothing. You could try running K-means on some of your data with various iter values and see where the extra computing time stops being worth the gain in cluster quality for your needs.
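To show what "run until the clusters don't change" looks like, here is a bare-bones Python sketch of the iteration described above. This is not scipy's kmeans2 implementation; the function name lloyd_kmeans, the max_iter cap and the choice of starting centroids are assumptions made for the illustration.

```python
def lloyd_kmeans(points, init_centroids, max_iter=100):
    """Plain k-means loop that stops when assignments stop changing.

    points and init_centroids are lists of equal-length tuples.
    Returns (centroids, labels, iterations_used).
    """
    centroids = list(init_centroids)
    labels = None

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    for iteration in range(1, max_iter + 1):
        # Assignment step: nearest centroid for every point.
        new_labels = [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            return centroids, labels, iteration  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels, max_iter


pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels, used = lloyd_kmeans(pts, [(0, 0), (10, 10)])
```

If the returned iteration count is well below the cap, a larger iter value would not have changed the result for that starting point.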

Minimization of L1-Regularized system, converging on non-minimum location?

This is my first post to stackoverflow, so if this isn't the correct area I apologize. I am working on minimizing a L1-Regularized System.
This weekend is my first dive into optimization. I have a basic linear system Y = X*B, where X is an n-by-p matrix, B is a p-by-1 vector of model coefficients and Y is an n-by-1 output vector.
I am trying to find the model coefficients, and I have implemented both gradient descent and coordinate descent algorithms to minimize the L1-regularized system. To find my step size I am using the backtracking algorithm. I terminate the algorithm by looking at the norm-2 of the gradient and stopping if it is 'close enough' to zero (for now I'm using 0.001).
The function I am trying to minimize is the following (0.5)*(norm((Y - X*B),2)^2) + lambda*norm(B,1). (Note: By norm(Y,2) I mean the norm-2 value of the vector Y) My X matrix is 150-by-5 and is not sparse.
If I set the regularization parameter lambda to zero I should converge on the least squares solution, I can verify that both my algorithms do this pretty well and fairly quickly.
If I start to increase lambda, my model coefficients all tend towards zero, which is what I expect. However, my algorithms never terminate, because the norm-2 of the gradient is always a positive number. For example, a lambda of 1000 gives me coefficients in the 10^(-19) range, but the norm-2 of my gradient is ~1.5 after several thousand iterations. While my gradient values all converge to something in the 0 to 1 range, my step size becomes extremely small (10^(-37) range). If I let the algorithm run for longer the situation does not improve; it appears to have gotten stuck somehow.
Both my gradient and coordinate descent algorithms converge on the same point and give the same norm2(gradient) number for the termination condition. They also work quite well with lambda of 0. If I use a very small lambda(say 0.001) I get convergence, a lambda of 0.1 looks like it would converge if I ran it for an hour or two, a lambda any greater and the convergence rate is so small it's useless.
I had a few questions that I think might relate to the problem?
In calculating the gradient I am using a finite difference method (f(x+h) - f(x-h))/(2h)) with an h of 10^(-5). Any thoughts on this value of h?
Another thought was that at these very tiny steps it is traveling back and forth in a direction nearly orthogonal to the minimum, making the convergence rate so slow it is useless.
My last thought was that perhaps I should be using a different termination method, perhaps looking at the rate of convergence, if the convergence rate is extremely slow then terminate. Is this a common termination method?
The 1-norm isn't differentiable. This will cause fundamental problems with a lot of things, notably the termination test you chose; the gradient will change drastically around your minimum and fail to exist on a set of measure zero.
The termination test you really want will be along the lines of "there is a very short vector in the subgradient."
It is fairly easy to find the shortest vector in the subgradient of ||Ax-b||_2^2 + lambda ||x||_1. Choose, wisely, a tolerance eps and do the following steps:
Compute v = grad(||Ax-b||_2^2).
If x[i] < -eps, then subtract lambda from v[i]. If x[i] > eps, then add lambda to v[i]. If -eps <= x[i] <= eps, then add to v[i] the number in [-lambda, lambda] that minimises |v[i]|.
You can do your termination test here, treating v as the gradient. I'd also recommend using v for the gradient when choosing where your next iterate should be.
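The steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: smooth_grad stands for the gradient of the smooth least-squares term, which you compute yourself, and the function name min_norm_subgradient and the default eps are assumptions for the example.

```python
import math


def min_norm_subgradient(smooth_grad, x, lam, eps=1e-8):
    """Shortest subgradient of smooth_part(x) + lam * ||x||_1.

    smooth_grad: gradient of the smooth part at x (list of floats)
    x:           current iterate (list of floats)
    """
    v = []
    for g, xi in zip(smooth_grad, x):
        if xi > eps:
            v.append(g + lam)   # |x_i| has derivative +1 here
        elif xi < -eps:
            v.append(g - lam)   # |x_i| has derivative -1 here
        else:
            # x_i is (numerically) zero: any value in [-lam, lam] may be
            # added, so pull g toward zero by up to lam.
            v.append(math.copysign(max(abs(g) - lam, 0.0), g))
    return v


def subgradient_norm(smooth_grad, x, lam, eps=1e-8):
    """2-norm of the shortest subgradient, usable as a termination test."""
    return math.sqrt(sum(c * c for c in min_norm_subgradient(smooth_grad, x, lam, eps)))
```

With a large lambda and all coefficients at zero, every component with |g| <= lambda contributes nothing, so this norm can be zero even though the plain gradient norm stays well away from zero.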

Calculating an inverse matrix in Matlab

I'm running an optimization algorithm that requires calculation of the inverse of a matrix. The goal of the algorithm is to eliminate negative values from the matrix A and obtain the new matrix B. Basically, I start with known square matrices B and C of the same size.
I start by calculating the matrix A which is equal to:
A = B^-1 * C
Or in Matlab:
A = B\C;
I use this because Matlab told me B\C is more accurate than inv(B)*C.
The negative values in A are then divided by two and A is then normalised so that its rows have length 1. Using this new A, I calculate a new B with:
(1/N) * A * C' = B^-1
where N is just a scaling factor (# of columns in A). This new B would then be used again in the first step and these iterations continue until the negatives in A are gone.
My problem is I have to calculate B from the second equation and then normalise it.
invB = (1/N)*A*C';
B = inv(invB);
I've been calculating B using inv(B^-1) but after a few iterations I start getting messages that B^-1 is "close to singular or badly scaled."
This algorithm actually works for smaller matrices (around 70x70) but when it gets up to about 500x500 I start getting these messages.
Are there any better ways to calculate inv(B^-1)?
You should definitely heed warnings about singular matrices. Results in numerical linear algebra tend to break down as you move toward matrices with high condition numbers. The underlying idea is if
A*b_1 = c
and we're actually solving the problem (because we are using approximate numbers when we use computers)
(A + matrix error)*b_2 = (c + vector error)
how close are b_1 and b_2 as a function of the matrix and vector errors? When A has small condition number b_1 and b_2 are close. When A has large condition number b_1 and b_2 are not close.
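A tiny hand-made example may help here. The Python sketch below uses a nearly singular 2x2 matrix chosen purely for illustration and solves it with Cramer's rule; the helper names solve_2x2 and cond_inf_2x2 are invented for this example (in practice you would call a library routine such as cond in Matlab or numpy.linalg.cond).

```python
def solve_2x2(a, c):
    """Solve a*b = c for a 2x2 matrix a by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    b1 = (c[0] * a[1][1] - a[0][1] * c[1]) / det
    b2 = (a[0][0] * c[1] - a[1][0] * c[0]) / det
    return [b1, b2]


def cond_inf_2x2(a):
    """Condition number ||a|| * ||a^-1|| in the infinity (max row sum) norm."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    inv = [[a[1][1] / det, -a[0][1] / det],
           [-a[1][0] / det, a[0][0] / det]]

    def row_sum_norm(m):
        return max(abs(r[0]) + abs(r[1]) for r in m)

    return row_sum_norm(a) * row_sum_norm(inv)


A = [[1.0, 1.0], [1.0, 1.0001]]       # nearly singular matrix
b_1 = solve_2x2(A, [2.0, 2.0001])     # original right-hand side
b_2 = solve_2x2(A, [2.0, 2.0002])     # right-hand side perturbed by 1e-4
kappa = cond_inf_2x2(A)
```

Here b_1 is approximately [1, 1] while b_2 is approximately [0, 2]: a change of 1e-4 in the data moves the solution by order one, and the condition number of roughly 4e4 reflects that amplification.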
There is an informative piece of analysis you could do on your algorithm. At each iteration, after you've found B, use Matlab to find its condition number. This is
cond(B)
You will likely see the number climb rapidly. This indicates that every time you iterate your algorithm, you should trust your result for B less and less.
Problems like this crop up all the time in numerical mathematics. If you'll be working with numerical algorithms frequently you should take some time to familiarize yourself with the role of condition numbers in the field and with preconditioning techniques. My preferred text for this is "Numerical Linear Algebra" by Lloyd Trefethen and David Bau, but any text on numerical linear algebra should address some of these issues.
Best of luck,
Andrew
The main issue is that your matrix has a high condition number (i.e. really small rcond(B) in your case). This is due to the iterative structure in your algorithm, I guess. As you do each iteration your small singular values get smaller and smaller so your condition number grows exponentially. You should check preconditioning to avoid this kind of behavior.