I am seeing an increase in the total within-cluster sum of squares when I use the code below. Is this even possible, or am I making some mistake in the code?
v <- foreach(i = 1:30, .combine = c) %dopar% {
  iter <- kmeans(clustering_data, centers = i, iter.max = 1000)
  iter$tot.withinss
}
K-means is a randomized algorithm. It is not guaranteed to find the optimum.
So you simply had a bad random initialization.
Yes. See Anony-Mousse's answer.
If you used the nstart = 25 argument of the kmeans() function, R would run the algorithm 25 times for each value of centers and keep only the best solution (the one with the lowest total within-cluster sum of squares). This way a single unlucky run inside your foreach loop cannot inflate tot.withinss.
From the documentation of R's kmeans():
## random starts do help here with too many clusters
## (and are often recommended anyway!):
(cl <- kmeans(x, 5, nstart = 25))
You have to choose a reasonable value for nstart. Then a bad random initialization is much less likely to slip through. (But even the best of nstart runs is not guaranteed to reach the global minimum of tot.withinss.)
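For example, the scan from the question could be rewritten like this (a sketch, assuming the same clustering_data and registered parallel backend as in the question):

v <- foreach(i = 1:30, .combine = c) %dopar% {
  # keep the best of 25 random starts for each number of centers
  kmeans(clustering_data, centers = i, iter.max = 1000, nstart = 25)$tot.withinss
}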
Related
Given a matrix, it's easy to compute the value and index of the min value:
A = rand(10);
[value, index] = min(A(:));
However I would also like to recover the second min value (idem for max).
I can of course take either of these two approaches:
Converting A to a vector and sorting it.
PROS: I can then recover the second, third... n minimum value
CONS: If A is large, sorting is expensive
Once the location of the min of A is found, I can replace that value by a large one (e.g. Inf) and then run min again.
PROS: Cheaper than sort
CONS: I must modify my matrix (and save the modified value in an aux variable). Also re-running min is costly on a large matrix.
I'm wondering if there is a better solution:
When computing min, the algorithm has to keep track of the minimum value found so far, updating it whenever a new value is lower.
If we instead keep track of the n smallest values found so far, we can recover the n minimum values in a single pass.
I can implement this, but I'm wondering if it's the best approach or if it's already implemented.
I don't know in which cases it would be less expensive than sorting, but an easy (though not especially fast) way is to use the following code. I may be wrong, but I don't think you can get faster with built-in functions if you just want the first and the second min.
A = rand(10);
[firstMin, firstMinIndex] = min(A(:));
secondMin = min(A(A~=firstMin));
secondMinIndex = find(A==secondMin); % slow, but use only if you need the index
Here you go through the matrix twice more: once for the boolean operation and once for the second min.
After some testing on 2000x2000 and 4000x4000 random matrices, it seems that this code snippet is around 3.5 times faster than the sort function applied to the same matrix.
If you really need more efficiency, you'd have to write your own mex routine, with which you can theoretically get the two values in n + ceil(log2(n)) - 2 comparisons, as explained in the link provided by @luismendotomas.
Hope this helps!
In a single pass:
a = [53 53 49 49 97 75 4 22 4 37];
first = Inf;
second = Inf;
for i = 1:numel(a)
    if (a(i) < first)
        second = first;
        first = a(i);
    elseif (a(i) < second && a(i) ~= first)
        second = a(i);
    end
end
fprintf('First smallest %d\n', first);
fprintf('Second smallest %d\n', second);
You can remove the a(i) ~= first condition if you would rather have 4, 4 as output instead of 4, 22.
Also, see this SO question
As already mentioned, I suppose the best (read: "most efficient") method is to implement the method from @luismendotomas' link.
However, if you want to avoid doing too much programming yourself, you could apply some k-nearest-neighbours algorithm, given that you have a lower bound on your data; e.g. if all your data points are positive, you can find the 2 nearest neighbours to 0. I am not sure whether this is faster than your initial suggestions or not.
For one k-nearest neighbour algorithm see e.g. this
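For instance, with the Statistics Toolbox a sketch along these lines would find the two values closest to 0 (knnsearch and the query point 0 are my own choices, not from the answer):

A = rand(10);
idx = knnsearch(A(:), 0, 'K', 2);  % indices of the two values closest to 0
twoSmallest = A(idx);              % for positive data these are the two smallest values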
beesleep has already pointed out that method 2 (computing the minimum twice) is more efficient than method 1 (sorting). However the implementation provided in that answer to compute the index of the second minimum via find is, as mentioned, very inefficient.
In fact, to get the index of the second minimum, it is ca. 10x faster to set the first minimum value to inf (as suggested in the question) and then get the index of the second minimum from the min function (as opposed to using find)
[firstMin, firstMinIndex] = min(A(:));
A(firstMinIndex) = inf;
[secondMin, secondMinIndex] = min(A(:));
Here is the code which I used to compare this implementation to the one suggested by beesleep:
for i = 1:10
    A = rand(10000);

    tic
    [firstMin, firstMinIndex] = min(A(:));
    secondMin = min(A(A~=firstMin));
    secondMinIndex = find(A==secondMin); % slow, but use only if you need the index
    t1(i) = toc;

    tic
    [firstMin, firstMinIndex] = min(A(:));
    A(firstMinIndex) = inf;
    [secondMin, secondMinIndex] = min(A(:));
    t2(i) = toc;
end

disp(mean(t1) / mean(t2))
I am modelling the diffusion of movies through a contact network (based on telephone data) using a zero inflated negative binomial model (package: pscl)
m1 <- zeroinfl(LENGTH_OF_DIFF ~ ., data = trainData, dist = "negbin")
(variables described below.)
The next step is to evaluate the performance of the model.
My attempt has been to do multiple out-of-sample predictions and calculate the MSE.
Using
predict(m1, newdata = testData)
I received a prediction for the mean length of a diffusion chain for each datapoint, and using
predict(m1, newdata = testData, type = "prob")
I received a matrix containing the probability of each datapoint being a certain length.
Problem with the evaluation: Since I have a 0 (and 1) inflated dataset, the model would be correct most of the time if it predicted 0 for all the values. The predictions I receive are good for chains of length zero (according to the MSE), but the deviation between the predicted and the true value for chains of length 1 or larger is substantial.
My question is:
How can we assess how well our model predicts chains of non-zero length?
Is this approach the correct way to make predictions from a zero inflated negative binomial model?
If yes: how do I interpret these results?
If no: what alternative can I use?
My variables are:
Dependent variable:
length of the diffusion chain (count [0,36])
Independent variables:
movie characteristics (both dummies and continuous variables).
Thanks!
It is straightforward to evaluate RMSPE (root mean square predictive error), but it is probably best to transform your counts beforehand, to ensure that the really big counts do not dominate the sum.
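For example (a sketch, using the obs and preds.count vectors constructed further below; log1p is one choice of transform that keeps big counts from dominating):

rmspe.log <- sqrt(mean((log1p(preds.count) - log1p(obs))^2))
rmspe.log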
You may find false negative and false positive error rates (FNR and FPR) to be useful here. FNR is the chance that a chain of actual non-zero length is predicted to have zero length (i.e. absence, also known as negative). FPR is the chance that a chain of actual zero length is falsely predicted to have non-zero (i.e. positive) length. I suggest doing a Google on these terms to find a paper in your favourite quantitative journals or a chapter in a book that helps explain these simply. For ecologists I tend to go back to Fielding & Bell (1997, Environmental Conservation).
First, let's define a repeatable example, that anyone can use (not sure where your trainData comes from). This is from help on zeroinfl function in the pscl library:
# an example from help on zeroinfl function in pscl library
library(pscl)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
There are several packages in R that calculate these, but here's the by-hand approach. First calculate the observed and predicted values.
# store observed values, and determine how many are nonzero
obs <- bioChemists$art
obs.nonzero <- obs > 0
table(obs)
table(obs.nonzero)
# calculate predicted counts, and check their distribution
preds.count <- predict(fm_zinb2, type="response")
plot(density(preds.count))
# also the predicted probability that each item is nonzero
preds <- 1-predict(fm_zinb2, type = "prob")[,1]
preds.nonzero <- preds > 0.5
plot(density(preds))
table(preds.nonzero)
Then get the confusion matrix (basis of FNR, FPR)
# the confusion matrix is obtained by tabulating the dichotomized observations and predictions
confusion.matrix <- table(preds.nonzero, obs.nonzero)
FNR <- confusion.matrix[1,2] / sum(confusion.matrix[,2]) # actual non-zero predicted as zero
FPR <- confusion.matrix[2,1] / sum(confusion.matrix[,1]) # actual zero predicted as non-zero
FNR
FPR
In terms of calibration, we can assess it visually or via a calibration regression.
# let's look at how well the counts are being predicted
library(ggplot2)
output <- as.data.frame(list(preds.count=preds.count, obs=obs))
ggplot(aes(x=obs, y=preds.count), data=output) + geom_point(alpha=0.3) + geom_smooth(col="blue")
Transforming the counts to "see" what is going on:
output$log.obs <- log(output$obs)
output$log.preds.count <- log(output$preds.count)
ggplot(aes(x=log.obs, y=log.preds.count), data=output[is.finite(output$log.obs) & is.finite(output$log.preds.count),]) + geom_jitter(alpha=0.3, width=.15, size=2) + geom_smooth(col="blue") + labs(x="Observed count (non-zero, natural logarithm)", y="Predicted count (non-zero, natural logarithm)")
In your case you could also evaluate the correlations between the predicted counts and the actual counts, either including or excluding the zeros.
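For example (a sketch, reusing obs and preds.count from above):

cor(preds.count, obs)                          # including the zero-length chains
cor(preds.count[obs > 0], obs[obs > 0])        # excluding them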
So you could fit a regression as a kind of calibration to evaluate this! However, since the predictions are not necessarily counts, we can't use a Poisson regression; instead we can use a lognormal model, regressing the log prediction against the log observed value and assuming a Normal response.
calibrate <- lm(log(preds.count) ~ log(obs), data=output[output$obs!=0 & output$preds.count!=0,])
summary(calibrate)
sigma <- summary(calibrate)$sigma
sigma
There are more fancy ways of assessing calibration I suppose, as in any modelling exercise ... but this is a start.
For a more advanced assessment of zero-inflated models, check out the ways in which the log likelihood can be used, in the references provided for the zeroinfl function. This requires a bit of finesse.
I'm trying to calculate a perplexity value for a language model, and the calculation uses a lot of large powers. I have tried converting my calculation to log space using BigDecimal, but I'm not having any luck.
var sum = 0.0
for (ngram <- testNGrams) {
  var prob = Math.log(lm.prob(ngram.last, ngram.slice(0, ngram.size - 1)))
  if (prob == 0.0) sum = sum
  else sum = sum + prob
}
Math.pow(Math.log(Math.exp(sum)), -1.0 / wordSize.toDouble)
How can I perform such a calculation in Scala without losing my large/small values to zero/Infinity? It seems like a trivial question but I haven't managed to do it.
In the above, you can assume that the method lm.prob issues the correct probabilities between 0 and 1, this has been amply tested.
Write everything in terms of log probabilities, not probabilities.
For instance, things like log(exp(sum)) just warm up your CPU while throwing away useful information. Avoid!
If you must convert to actual probabilities, do so at the very last step you can.
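A minimal sketch of that idea, reusing the names from the question (testNGrams, lm.prob, wordSize) and assuming perplexity = exp(-(1/N) * sum of log probabilities):

// accumulate log-probabilities; never materialise the product of probabilities
val logSum = testNGrams
  .map(ng => math.log(lm.prob(ng.last, ng.slice(0, ng.size - 1))))
  .filter(lp => !lp.isNegInfinity)   // guard against zero-probability n-grams
  .sum
val perplexity = math.exp(-logSum / wordSize.toDouble)   // exponentiate only once, at the end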
Having read carefully the previous question
Random numbers that add to 100: Matlab
I am struggling to solve a similar but slightly more complex problem.
I would like to create an array of n elements that sums to 1; however, I want the added constraint that the minimum increment (or, if you like, the number of significant figures) for each element is fixed.
For example if I want 10 numbers that sum to 1 without any constraint the following works perfectly:
num_stocks=10;
num_simulations=100000;
temp = [zeros(num_simulations,1),sort(rand(num_simulations,num_stocks-1),2),ones(num_simulations,1)];
weights = diff(temp,[],2);
I foolishly thought that by scaling this I could add the constraint as follows
num_stocks=10;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp2 = [zeros(num_simulations,1),sort(round(rand(num_simulations,num_stocks-1)*scaling)/scaling,2),ones(num_simulations,1)];
weights2 = diff(temp2,[],2);
However, though this works for small values of n and small increments, if for example n = 1,000 and the increment is 0.1%, then over a large number of trials the first and last numbers have a mean that is consistently below 0.1%.
I am sure there is a logical explanation/solution to this, but I have been tearing my hair out trying to find it, and wondered whether anybody would be so kind as to point me in the right direction. To put the problem into context, I am creating random stock portfolios (hence the sum to 1).
Thanks in advance
Thank you for the responses so far, just to clarify (as I think my initial question was perhaps badly phrased), it is the weights that have a fixed increment of 0.1% so 0%, 0.1%, 0.2% etc.
I did try using integers initially
num_stocks=1000;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp = [zeros(num_simulations,1),sort(randi([0 scaling],num_simulations,num_stocks-1),2),ones(num_simulations,1)*scaling];
weights = (diff(temp,[],2)/scaling);
test=mean(weights);
but this was worse: the mean for the 1st and last weights is well below 0.1%.
Edit to reflect excellent answer by Floris & clarify
The original code I was using to solve this problem (before finding this forum) was
function x = monkey_weights_original(simulations,stocks)
stockmatrix=1:stocks;
base_weight=1/stocks;
r=randi(stocks,stocks,simulations);
x=histc(r,stockmatrix)*base_weight;
end
This runs very fast, which was important considering I want to run a total of 10,000,000 simulations: 10,000 simulations on 1,000 stocks take just over 2 seconds on a single core, and I am running the whole code on an 8-core machine using the Parallel Computing Toolbox.
It also gives exactly the distribution I was looking for in terms of means, and I think that it is just as likely to get a portfolio that is 100% in 1 stock as it is to get a portfolio that is 0.1% in every stock (though I'm happy to be corrected).
My issue is that although it works for 1,000 stocks and an increment of 0.1% (and I guess it works for 100 stocks and an increment of 1%), as the number of stocks decreases each pick becomes a very large percentage of the portfolio (in the extreme, with 2 stocks you will always get a 50/50 portfolio).
In effect I think this solution is like the binomial solution Floris suggests (but more limited)
However my question has arisen because I would like to make my approach more flexible and allow, say, 3 stocks with an increment of 1%, which my current code will not handle correctly; hence how I stumbled across the original question on Stack Overflow.
Floris's recursive approach will get to the right answer, but the speed will be a major issue considering the scale of the problem.
An example of the original research is here
http://www.huffingtonpost.com/2013/04/05/monkeys-stocks-study_n_3021285.html
I am currently working on extending it with more flexibility on portfolio weights & numbers of stock in the index, but it appears my programming & probability theory ability are a limiting factor.......
One problem I can see is that your formula allows numbers to be zero - when the rounding operation results in two consecutive numbers being the same after sorting. Not sure if you consider that a problem, but I suggest you think about it (it would mean your model portfolio has fewer than N stocks in it, since the contribution of one of the stocks would be zero).
The other thing to note is that the probability of getting the extreme values in your distribution is half of what you want it to be: if you have uniformly distributed numbers from 0 to 1000 and you round them, the numbers that round to 0 came from the interval [0, 0.5); the ones that round to 1 came from [0.5, 1.5) - twice as big. The last number (rounding to 1000) is again from a smaller interval: [999.5, 1000]. Thus you will not get the first and last number as often as you think. If instead of round you use floor I think you will get the answer you expect.
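A quick numerical check of that edge effect (my own illustration, using a scaling of 10 for brevity):

r = round(rand(1, 1e6) * 10);
histc(r, 0:10) / 1e6   % roughly 0.05 for the end values 0 and 10, roughly 0.10 for 1 to 9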
EDIT
I thought about this some more, and came up with a slow but (I think) accurate method for doing this. The basic idea is this:
Think in terms of integers; rather than dividing the interval 0 - 1 in steps of 0.001, divide the interval 0 - 1000 in integer steps
If we try to divide N into m intervals, the mean size of a step should be N / m; but being integer, we would expect the intervals to be binomially distributed
This suggests an algorithm in which we choose the first interval as a binomially distributed variate with mean (N/m) - call the first value v1; then divide the remaining interval N - v1 into m-1 steps; we can do so recursively.
The following code implements this:
% random integers adding up to a definite sum
function r = randomInt(n, limit)
    % returns an array of n random integers whose sum is limit
    % calls itself recursively; slow but accurate
    if n > 1
        v = binomialRandom(limit, 1/n);
        r = [v randomInt(n-1, limit - v)];
    else
        r = limit;
    end
end

function b = binomialRandom(N, p)
    b = sum(rand(1,N) < p); % slow but direct
end
To get 10000 instances, you run this as follows:
tic
portfolio = zeros(10000, 10);
for ii = 1:10000
    portfolio(ii,:) = randomInt(10, 1000);
end
toc
This ran in 3.8 seconds on a modest machine (single thread) - of course the method for obtaining a binomially distributed random variate is the thing slowing it down; there are statistical toolboxes with more efficient functions but I don't have one. If you increase the granularity (for example, by setting limit=10000) it will slow down more since you increase the number of random number samples that are generated; with limit = 10000 the above loop took 13.3 seconds to complete.
As a test, I found mean(portfolio)' and std(portfolio)' as follows (with limit=1000):
100.20 9.446
99.90 9.547
100.09 9.456
100.00 9.548
100.01 9.356
100.00 9.484
99.69 9.639
100.06 9.493
99.94 9.599
100.11 9.453
This looks like a pretty convincing "flat" distribution to me. We would expect the numbers to be binomially distributed with a mean of 100 and a standard deviation of sqrt(p*(1-p)*n); in this case n = 1000 and p = 0.1, so we expect s = 9.4868. The values I actually got were quite close to that.
I realize that this is inefficient for large values of limit, and I made no attempt at efficiency. I find that clarity trumps speed when you develop something new. But for instance you could pre-compute the cumulative binomial distributions for p=1./(1:10), then do a random lookup; but if you are just going to do this once, for 100,000 instances, it will run in under a minute; unless you intend to do it many times, I wouldn't bother. But if anyone wants to improve this code I'd be happy to hear from them.
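A sketch of that lookup idea for a single (N, p) pair (the variable names and the gammaln construction are mine, not part of the answer; no toolbox required):

N = 1000; p = 0.1;
k = 0:N;
logpmf = gammaln(N+1) - gammaln(k+1) - gammaln(N-k+1) + k*log(p) + (N-k)*log(1-p);
cdf = cumsum(exp(logpmf));          % pre-compute once per (N, p)
cdf(end) = 1;                       % guard against rounding error
draw = find(rand <= cdf, 1) - 1;    % one binomial variate via inverse-CDF lookup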
Eventually I have solved this problem!
I found a paper by 2 academics at Johns Hopkins University, "Sampling Uniformly From The Unit Simplex":
http://www.cs.cmu.edu/~nasmith/papers/smith+tromble.tr04.pdf
In the paper they outline how naive algorithms don't work, in a way very similar to woodchips' answer to the Random numbers that add to 100 question. They then go on to show that the method suggested by David Schwartz can also be slightly biased and propose a modified algorithm which appears to work.
If you want x numbers that sum to y:
Sample uniformly x-1 random numbers from the range 1 to x+y-1 without replacement
Sort them
Add a zero at the beginning and x+y at the end
Difference them and subtract 1 from each value
If you want to scale them as I do, then divide by y
It took me a while to realise why this works when the original approach didn't, and it comes down to the probability of getting a zero weight (as highlighted by Floris in his answer). In the original version, to get a zero weight for any but the first or last weights, two of your random numbers had to be equal; but for the first and last weights, a random number equal to zero or to the maximum would also produce a zero weight, which is more likely.
In the revised algorithm, zero and the maximum number are not in the set of random choices, and a zero weight occurs only if you select two consecutive numbers, which is equally likely for every position.
I coded it up in Matlab as follows
function weights = unbiased_monkey_weights(num_simulations,num_stocks,min_increment)
    scaling = 1/min_increment;
    sample = NaN(num_simulations,num_stocks-1);
    for i = 1:num_simulations
        allcomb = randperm(scaling+num_stocks-1);
        sample(i,:) = allcomb(1:num_stocks-1);
    end
    temp = [zeros(num_simulations,1),sort(sample,2),ones(num_simulations,1)*(scaling+num_stocks)];
    weights = (diff(temp,[],2)-1)/scaling;
end
Obviously the loop is a bit clunky and, as I'm using the 2009 version, randperm only allows you to generate permutations of the whole set; despite this I can run 10,000 simulations for 1,000 numbers in 5 seconds on my clunky laptop, which is fast enough.
The mean weights are now correct, and as a quick test I replicated woodchips' example of generating 3 numbers that sum to 1 with a minimum increment of 0.01%, and it also looks right.
Thank you all for your help and I hope this solution is useful to somebody else in the future
The simple answer is to use the schemes that work well with NO minimum increment, then transform the problem. As always, be careful. Some methods do NOT yield uniform sets of numbers.
Thus, suppose I want 11 numbers that sum to 100, with a constraint of a minimum increment of 5. I would first find 11 numbers that sum to 45, with no lower bound on the samples (other than zero). I could use a tool from the File Exchange for this, but the simplest approach is to sample 10 numbers uniformly in the interval [0,45], sort them, and take the differences.
X = diff([0,sort(rand(1,10)),1]*45);
The vector X is a sample of 11 numbers that sums to 45. The vector Y below then sums to 100, with a minimum value of 5:
Y = X + 5;
Of course, this is trivially vectorized if you wish to find multiple sets of numbers with the given constraint.
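A sketch of that vectorization (my own, following the same construction): each row of Y is one set of 11 numbers, each at least 5, summing to 100.

nsets = 10000;
X = diff([zeros(nsets,1), sort(rand(nsets,10),2), ones(nsets,1)]*45, [], 2);
Y = X + 5;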
I have a question asking me to estimate the complexity class of an algorithm. The question gives recorded running times for the algorithm. So, do I just average out the times based on how they were computed?
Sorry, I missed a part.
OK, so it recorded times like: N = 100 takes 300, N = 200 takes 604, N = 400 takes 1196, N = 800 takes 2395. So do I calculate 300/100, 604/200, and so on, and find the average? Is that how I'm supposed to estimate the complexity class for the algorithm?
Try plotting running time vs. N and see if you get any insight. (e.g. if running time = f(N), is f(N) about equal to log(N), or sqrt(N), or... ?)
I don't think time alone will help you figure out its complexity class. Times can be very different even on exactly the same task (it depends on the scheduler and other factors).
Look at how many more steps it takes as your input gets larger. So if you had a sorting algorithm that took 100 steps to sort 10 items and 10000 steps to sort 100 items, you'd say it sorts in big O(N^2), since:
Input Steps
10 100 (which equals 10*10)
100 10000 (which equals 100*100)
It's not about averaging but about looking for a function that maps the input size to the number of steps, and then finding which part of that function grows the fastest (N^2 grows faster than N, so if your function were N^2 + N you would classify it as N^2).
At least that's how I remember it, but it's been ages!! :)
EDIT: Now that there are more details in your question, here is what I'd do, with the above in mind.
Don't think about averaging anything; just think about how f(100) = 300, f(200) = 604, and f(400) = 1196.
It doesn't have to be exact, just in the ballpark. A simple linear function such as f(N) = 3*N gives f(100) = 300, f(200) = 600, and f(400) = 1200, which matches those timings well, so you could say the complexity class of the algorithm is linear, or big O(N).
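One concrete way to check this (a MATLAB sketch using the timings from the question; the idea works in any language):

N = [100 200 400 800];
t = [300 604 1196 2395];
t ./ N               % roughly constant (about 3)       -> consistent with O(N)
t ./ (N .* log2(N))  % keeps decreasing                  -> N*log(N) overestimates the growth
t ./ N.^2            % roughly halves at every doubling  -> N^2 overestimates the growth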
Hope that helps!
Does it give you the inputs to the algorithm as well, which produce the recorded times? You can deduce the growth order according to the input size vs output running time.
i.e. input = 1, running time = 10
input = 100, running time = 100000
would appear to be O(N^2)
i.e.
with input N = 1 and running time = 10, this is consistent with C*N where C = 10
(at N = 1, N and N^2 are indistinguishable)
with input N = 100 and running time = 100000, this is consistent with C*N^2 where C = 10,
since N^2 = 100*100 = 10000, and 10000 * 10 = 100000
Hint: Calculate how much time the algorithm spends to process one single item.
How do these calculated times relate to each other?
Does the algorithm always spend the same time to process one item, regardless of how many items there are, or is there a growing factor? Maybe the time rises exponentially?