first centroid on k-means algorithm should be "random"? - cluster-analysis

In the case of k-means.
If 2 initial data sets become centroid value then it will produce cluster 1 and cluster 2 generate value 0 ... why?
Is not this a strange thing?
Measure the distance of the centroid to the distance itself?
Like measuring the distance from america to america itself ...
is there an explanation of the steps in the k-mean algorithm.
Especially in the phrase "in determining the value of the centroid, the initial value of the centroid is done randomly".
Thank you!

Usually all initial centroids are random (today we often use clever random techniques such as k-means++).
Initializing all centroids with the global mean is bad. Never do this!
Because obviously every point will have the same distance to each center. Then usually all points will be assigned to the first cluster (which then will not move), and all the others will be empty. This is the worst possible initialization (for quality, not for runtime)!

Related

Cluster data into good and bad

I need to divide data points into those that are similar to each other("good" points) and everyone else("bad" points).
It looks like some kind of clustering problem and what do I do:
I am assuming that there are at least two "good" points.
Find pairwise distance between all types of points.
Find minimum distance (minDist).
Do hierarchical clustering for all points.
Make a cut at the height of 5*minDist.
Say that all points that are in the same cluster as pair with minDist and under that cut belong to the desired "good" cluster.
And this works pretty well, but if there are two points that are very close to each other. minDist is very small and this 5*minDist cut is also small => only these 2 points are in the desired "good" cluster.
I would think that either I need to change this approach completely and here is question number 1:
[1] "What methods do exist to separate similar points from everyone else?"
Or I need to modify this 5*minDist to some other function of minDist. And question is:
[2] "What may I choose as reasonable alternative to 5*minDist?"
Vladimir
Instead of doing clustering, you want to do outlier detection.
There are dozens of algorithms for this (see ELKI for a large collection). Some very basic methods may solve your problem:
The number of neighbors in radius r. If +i < threshold, the point is an outlier.
Distance to the k nearest neighbor. Choose k>1 to avoid these 2 element clusters you are seeing.
Also, DBSCAN clustering could work for you. Consider all clusters to be good, and only noise to be bad!

Providing Centroids and Then Clustering

I seem to find a lot of documentation based on computing centroids and clustering, but what if I assign centroid values themselves.
Say if I provide 14 different centroid vectors. How would I go about clustering my data to those 14 different centroid values?
Maybe this is an easy question, but I haven't found an answer online, so wanted to make sure.
If the centroids are predefined, then you are doing nearest-neighbor classification, not clustering. It's only clustering if the structure is not predefined.
Not sure this belongs in the python forum, but you just need to compute the distance from each of your points to each centroid, and then assign each point to that centroid that is closest. You then have your clusters, though some may be empty (no guarantee that a centroid will have at least one data point closest to it). You can do this by iterating over all of your points, or do it much more quickly in one step using matrices with numpy. I've got some code lying around somewhere if you need an example to get started.

How do I choose k when using k-means clustering with Silhouette function?

I've been studying about k-means clustering, and one big thing which is not clear is what Silhouette function really tell to me?
i know it shows that what appropriate k should be detemine but i cant understand what mean of silhouette function really say to me?
i read somewhere, if the mean of silhouette is less than 0.5 your clustering is not valid.
thanks for your answers in advance.
From the definition of silhouette :
Silhouette Value
The silhouette value for each point is a measure of how similar that
point is to points in its own cluster compared to points in other
clusters, and ranges from -1 to +1.
The silhouette value for the ith point, Si, is defined as
Si = (bi-ai)/ max(ai,bi) where ai is the average distance from the ith
point to the other points in the same cluster as i, and bi is the
minimum average distance from the ith point to points in a different
cluster, minimized over clusters.
This method just compares the intra-group similarity to closest group similarity. If any data member average distance to other members of the same cluster is higher than average distance to some other cluster members, then this value is negative and clustering is not successful. On the other hand, silhuette values close to 1 indicates a successful clustering operation. 0.5 is not an exact measure for clustering.
#fatihk gave a good citation;
additionally, you may think about the Silhouette value as a degree of
how clusters overlap with each other, i.e. -1: overlap perfectly,
+1: clusters are perfectly separable;
BUT low Silhouette values for a particular algorithm does NOT mean that there are no clusters, rather it means that the algorithm used cannot separate clusters and you may consider tune your algorithm or use a different algorithm (think about K-means for concentric circles, vs DBSCAN).
There is an explicit formula associated with the elbow method to automatically determine the number of clusters. The formula tells you about the strength of the elbow(s) being detected when using the elbow method to determine the number of clusters, see here. See illustration here:
Enhanced Elbow rule

Hierarchical Cluster Analysis in Cluster 3.0

I'm new to this site as well as new to cluster analysis, so I apologize if I violate conventions.
I've been using Cluster 3.0 to perform Hierarchical Cluster Analysis with Euclidean Distance and Average linkage. Cluster 3.0 outputs a .gtr file with a node joining a gene and their similarity score. I've noticed that the first line in the .gtr file always links a gene with another gene followed by the similarity score. But, how do I reproduce this similarity score?
In my data set, I have 8 genes and create a distance matrix where d_{ij} contains the Euclidian distance between gene i and gene j. Then I normalize the matrix by dividing each element by the max value in the matrix. To get the similarity matrix, I subtract all the elements from 1. However, my result does not use the linkage type and differs from the output similarity score.
I am mainly confused how linkages affect the similarity of the first node (the joining of the two closest genes) and how to compute the similarity score.
Thank you!
The algorithm compares clusters using some linkage method, not data points. However, in the first iteration of the algorithm each data point forms its own cluster; this means that your linkage method is actually reduced to the metric you use to measure the distance between data points (for your case Euclidean distance). For subsequent iterations, the distance between clusters will be measured according to your linkage method, which in your case is average link. For two clusters A and B, this is calculated as follows:
where d(a,b) is the Euclidean distance between the two data points. Convince yourself that when A and B contain just one data point (as in the first iteration) this equation reduces itself to d(a,b). I hope this makes things a bit more clear. If not, please provide more details of what exactly you want to do.

Measuring density for three dimensional data (in Matlab)

I have a dataset consisting of a large collection of points in three dimensional euclidian space. In this collection of points, i am trying to find the point that is nearest to the area with the highest density of points.
So my problem consists of two steps:
1: Determine where density of the distribution of points is at its highest
2: Determine which point is nearest to the point found in 1
Point 2 i can manage, but i'm not sure how to solve point 1. I know there are a lot of functions for density estimation in Matlab, but i'm not sure which one would be the most suitable, or straightforward to use.
Does anyone know?
My command of statistics is a little bit rusty, but as far as i can tell, this type of problem calls for multivariate analysis. Someone suggested i use multivariate kernel density estimation, but i'm not really sure if that's the best solution.
Density is a measure of mass per unit volume. On the assumption that your points all have the same mass then you are, I suppose, trying to measure the number of points per unit volume. So one approach is to divide your subset of Euclidean space into lots of little unit volumes (let's call them voxels like everyone does) and count how many points there are in each one. The voxel with the most points is where the density of points is at its highest. This is, of course, numerical integration of a sort. If your points were distributed according to some analytic function (and I guess they are not) you could solve the problem with pencil and paper.
You might make this approach as sophisticated as you like, perhaps initially dividing your space into 2 x 2 x 2 voxels, then choosing the voxel with most points and sub-dividing that in turn until your criteria are satisfied.
I hope this will get you started on your point 1; you seem to be OK with point 2 so I'll stop now.
EDIT
It looks as if triplequad might be what you are looking for.