With k-means, when k=5 I get a Dunn index of 0.05, but when k=6 I get a Dunn index of 0. According to the C-index, k=5 gives good quality, yet the Dunn index is decreasing. How should I interpret this?
Something is likely completely broken in your code.
A Dunn index of 0 can only arise if the minimum inter-cluster distance is 0, i.e. if at least one pair of clusters is at distance 0 from each other. But such clusters should have been merged.
Furthermore, an inter-cluster distance of 0 should only arise when the clusters consist of duplicate points. If every one of your 6 clusters had another cluster at distance 0, the clusters would collapse pairwise and your data set would contain at most 3 distinct points.
So your code is probably bad.
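To cross-check your implementation, here is a minimal sketch of the Dunn index in Python; the single-linkage inter-cluster distance and the use of SciPy are my own choices, not necessarily what your library computes:

    import numpy as np
    from scipy.spatial.distance import cdist, pdist

    def dunn_index(X, labels):
        # Dunn index = (min distance between points of different clusters)
        #            / (max pairwise distance within any single cluster)
        clusters = [X[labels == c] for c in np.unique(labels)]
        min_inter = min(cdist(a, b).min()
                        for i, a in enumerate(clusters)
                        for b in clusters[i + 1:])
        max_intra = max(pdist(c).max() for c in clusters if len(c) > 1)
        return min_inter / max_intra

Under this definition the index is 0 exactly when min_inter is 0, i.e. when two clusters share a duplicate point.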
I am using the polyeig command in MATLAB to solve a polynomial eigenvalue problem of order 2. I know that the system has a single 0 eigenvalue (this is due to the form of the zero-order coefficient matrix, where each diagonal element is -1 times the sum of the other elements in the same row, so the vector (1 1 1 ... 1) is an eigenvector with eigenvalue 0).
The size of the system is about 150 x 150.
When I use the polyeig command, the lowest eigenvalue I get is of the order 1E-4 (which is supposed to be the 0 eigenvalue) and the second lowest is of the order 1E-1. As the system size decreases, the lowest eigenvalue decreases to something of the order 1E-14, which is reasonable, but 1E-4 is too much.
Is there any way to achieve better accuracy, or any other library you would suggest?
I could also turn the polynomial eigenvalue problem into a generalized eigenvalue problem in higher dimensions (2 times the given dimension), but I am not sure how that affects speed and accuracy. I would like to see if there is a simpler solution before reformulating the problem, so I would welcome any suggestions on these matters.
EDIT: The problem is resolved; it was actually about the precision of the INPUT file that I was using, which was printed only up to 4 digits. Having found better input, the precision has increased. Thanks in any case.
The problem turned out to be with the input file I was using, which was printed only up to 4 decimal places. Now, even with matrices of 800x800, I only get accuracy problems up to 1E-11, which is good.
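For reference, the reformulation mentioned in the question is the standard first companion linearization, which turns (lambda^2*A2 + lambda*A1 + A0)x = 0 into a generalized eigenvalue problem of twice the dimension (essentially what polyeig does internally). A minimal sketch in Python/SciPy, with matrix names of my own choosing:

    import numpy as np
    from scipy.linalg import eig

    def quadeig(A0, A1, A2):
        # Solve (lam^2*A2 + lam*A1 + A0) x = 0 via the linearization
        # L z = lam * M z with z = [x; lam*x].
        n = A0.shape[0]
        I, Z = np.eye(n), np.zeros((n, n))
        L = np.block([[Z, I], [-A0, -A1]])
        M = np.block([[I, Z], [Z, A2]])
        return eig(L, M)  # eigenvalues and eigenvectors of size 2n

Whether this beats polyeig on accuracy depends on the conditioning of the linearization; as the edit above notes, in this case the limiting factor was the 4-digit precision of the input data.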
I have a vector, say x = [1 1.5 2]. I want to compute the expected distance between that vector and a random permutation of the vector. The assumption is that all permutations are equally likely.
For the example above, the solution should be 4/9. The first element changes by 1/2 on average, the second element by 1/3, and the last one by 1/2. The average change is therefore (1/2 + 1/3 + 1/2)/3 = 4/9.
The problem is that this vector has about 50-100 entries. Is there a smart way to compute this expected distance?
I am now using mean(mean(abs(bsxfun(@minus,x,x')))) and this seems to do the trick.
One of the rare cases where bsxfun does not provide the fastest solution. If you want to make use of the symmetry, use pdist (note that pdist treats rows as observations, so a row vector x must be reshaped into a column first):
s = sum(pdist(x(:),'cityblock'))/numel(x)^2*2
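As a sanity check, the same computation in Python, comparing the closed-form pairwise mean against a brute-force average over all permutations (only feasible for tiny vectors):

    import numpy as np
    from itertools import permutations

    x = np.array([1, 1.5, 2])

    # Closed form: mean of all pairwise absolute differences, diagonal
    # included; the equivalent of mean(mean(abs(x - x'))) in MATLAB.
    closed_form = np.abs(x[:, None] - x[None, :]).mean()

    # Brute force: average the mean elementwise distance over all n!
    # equally likely permutations.
    brute = np.mean([np.abs(x - np.array(p)).mean() for p in permutations(x)])

    print(closed_form, brute)  # both print 0.444... = 4/9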
Suppose we have a matrix with 6 rows and 10 columns and we have to determine the value of k. If we assume the default k is 5, and we have fewer than 5 columns (with the same 6 rows), can we take k from the number of columns? I.e., with rows = 6 and cols = 4, is k = cols - 1 => k = 3 correct?
k = n^(1/2)
where n is the number of instances, not the number of features (reference 1, reference 2).
Also check this question: value of k in k nearest neighbour algorithm.
Same as the previous one. Usually, the rule of thumb is the square root of the number of features:
k = n^(1/2)
where n is the number of features. In your case the square root of 10 is approximately 3, so the answer should be 3.
k = sqrt(n) does not give optimal results on every dataset; on some datasets the result is quite awful. For example, one paper from the 90's (paper link) reports that the best k is between 5 and 10, while sqrt(n) gives 17. Other papers make interesting suggestions such as a local k value or a weighted k. Clearly, choosing k is not easy: there is no simple formula, and the right value depends on the dataset. The best way to choose an optimal k is to measure which k gives the best accuracy on your dataset, e.g. by cross-validation. Generally, as the dataset gets bigger, the optimal k value also increases.
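A minimal sketch of that last suggestion, using scikit-learn with its iris dataset as a stand-in for your own data:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Score a range of candidate k values with 5-fold cross-validation
    # and keep the one with the best mean accuracy.
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=5).mean()
              for k in range(1, 31)}
    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])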
I have a vector of equidistant values (say values = 0:1e-3:1). I want to get the value from values that is closest to a uniformly random value (say x = rand), along with its corresponding index.
I could do [value,vIdx] = min(abs(values-x)), which is the simplest minimization I could write. Unfortunately, min won't take advantage of one property of the data: abs(values-x) is convex, so I don't need to search all indices; as soon as I find an index whose value is no smaller than the previous one, I have found the global minimum. That said, I don't want to replace MATLAB's min with a loop whose speed would depend on how far the minimum lies from the starting value. There are many methods that could be used, such as golden-section search, but I am not sure whether MATLAB's fmincon would be faster than the min approach.
Does anyone have any hints/ideas on how to get the required value faster than with the described min method? I'll check the time performance when I have time, but if someone knows a good answer a priori, please let me know.
Possible Application: Snap to nearest graphic data
Since your points are equidistant you can use the value x to find the index:
vIdx = 1 + round(x*(numel(values)-1)); % map x in [0,1] onto one-based grid indices
The idea is that you are dividing the interval [0, 1] into numel(values)-1 equally sized intervals. By multiplying x by that number you map [0, 1] onto [0, numel(values)-1], where your grid points land on integer values. round then gives you the closest one, and adding 1 yields the one-based index that MATLAB requires.
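The same idea in Python, generalized to any equidistant grid; reading the start and step off the vector, and the clamping for out-of-range x, are my own additions:

    import numpy as np

    values = np.linspace(0, 1, 1001)  # equidistant grid, like 0:1e-3:1
    x = np.random.rand()

    # O(1) lookup: express x in units of the grid step and round.
    step = values[1] - values[0]
    idx = int(round((x - values[0]) / step))
    idx = min(max(idx, 0), len(values) - 1)  # clamp to valid indices
    value = values[idx]
    print(idx, value)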
I am using the kmeans2 algorithm from scipy to cluster pixel colors in an image, in order to get its top average colors.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html#scipy.cluster.vq.kmeans2
I am confused on the meaning for this parameter:
iter : int
Number of iterations of the k-means algorithm to run. Note that this differs in meaning from the iters parameter to the kmeans function.
If I want the kmeans algorithm to run until the clusters don't change, would I set the iter value high? Is there a way to find a best iter value?
The k-means algorithm works by initializing K points and clustering your data by distance from those points. It then iterates: compute the centroid of each cluster, then reassign points to clusters by distance from the centroids. This isn't guaranteed to converge quickly, though it often does, so the function asks for a maximum iteration value.
Edit: "maximum iteration value" is incorrect, I think; it will literally run iter iterations. The default of 10 is a common value, though.
The higher the iter value, the better the clustering (up to the point of convergence). You could try running k-means on some of your data with various iter values and see where the extra computation time outweighs the gain in cluster quality for your needs.
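For illustration, a sketch of comparing iter values, with random data standing in for your pixel array (the array shape and the distortion measure are my own choices):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    # Stand-in for an image: N pixels as rows of RGB values in [0, 1].
    pixels = np.random.rand(10000, 3)

    for iters in (5, 10, 50):
        np.random.seed(0)  # same initial centroids for a fair comparison
        # Each call runs exactly iters iterations of k-means;
        # minit='points' seeds the centroids from random data points.
        centroids, labels = kmeans2(pixels, k=5, iter=iters, minit='points')
        # Mean distance of each pixel to its assigned centroid: a rough
        # quality measure that stops improving once k-means has converged.
        distortion = np.linalg.norm(pixels - centroids[labels], axis=1).mean()
        print(iters, distortion)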