Given a set of points in Euclidean space, we want to partition them into 2 clusters such that the maximum diameter over all clusters is minimized. Two notions of cluster diameter are given below:
The diameter of a cluster is the diameter of the minimum enclosing circle of the points in that cluster. (2-center clustering)
The diameter of a cluster is the maximum distance between any two points in that cluster. (max-diameter 2-clustering)
Suppose we have an exact solution for max-diameter 2-clustering. We know that these two clusterings are not equivalent. Is there any way to recover the exact 2-center clustering from the exact solution of max-diameter 2-clustering?
There are a number of functions in NetworkX for generating different types of random graphs.
Are there any that allow both the degree of the nodes and the overall network density (or a similar metric) to be specified?
There may be other metrics that can be specified when creating a graph, but for your examples of degree and density, there is exactly one combination of node and edge counts that meets given average degree and density criteria.
For an undirected graph, the density is calculated as 2*m/(n*(n-1)) where m is the number of edges and n is the number of nodes. The average degree is calculated as 2*m/n.
With a bit of substitution (dividing the average degree by the density gives n-1), we can then say that n = (degree/density) + 1 and m = (n*degree)/2.
With NetworkX, you can use nx.gnm_random_graph() to specify the number of nodes and edges to match those calculated above.
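For example, a minimal sketch of that calculation (the target degree and density values here are just illustrative):

```python
import networkx as nx

degree = 4     # desired average degree (illustrative value)
density = 0.1  # desired density (illustrative value)

n = round(degree / density) + 1  # n = degree/density + 1 -> 41
m = round(n * degree / 2)        # m = n*degree/2         -> 82

G = nx.gnm_random_graph(n, m)
print(nx.density(G))                                  # 0.1
print(2 * G.number_of_edges() / G.number_of_nodes())  # 4.0
```

Note that degree/density needs to work out to integer-friendly values; otherwise the rounding above only approximates the targets.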
If you use nx.gnp_random_graph(), note that the p parameter equals the expected density of the graph. Density is defined as the number of edges divided by the maximum possible number of edges, and in G(n, p) each possible edge is included independently with probability p, so setting p to the target density achieves the same thing in expectation. The expected number of edges and the expected average degree can then be calculated from p and the number of nodes.
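For instance (again with illustrative numbers), the expected values follow directly from p and n:

```python
import networkx as nx

n, p = 41, 0.1
G = nx.gnp_random_graph(n, p)

expected_edges = p * n * (n - 1) / 2  # E[m]          = 82
expected_degree = p * (n - 1)         # E[avg degree] = 4.0
print(G.number_of_edges(), expected_edges)
```

Unlike the gnm approach, any single G(n, p) draw only hits the targets on average.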
Hello, I have been assessing clusters in biological genomic data. After performing a hierarchical clustering, I used the silhouette to determine the optimal number of clusters; the silhouette is the following:
As I understand it, if the average silhouette width is close to 1, then the number of clusters for that clustering is near optimal, while if it is close to -1, the elements of the respective clusters may belong to a different one.
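For reference, the silhouette of a single observation $i$ is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

where $a(i)$ is the mean distance from $i$ to the other members of its own cluster and $b(i)$ is the mean distance from $i$ to the members of the nearest other cluster; the average silhouette width is the mean of $s(i)$ over all observations.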
In this case the average silhouette width is 0.87 for k = 2, but I don't know how to interpret the y axis.
I used R's fviz_silhouette() to produce the visualization.
Here is the resulting dendrogram of the agglomerative hierarchical clustering analysis, if needed:
How can I cluster data based on some threshold values?
I have some points, like 2, 1, 0.5. I need to find the nearest elements to each of these 3 points and cluster the data into 3 groups. How can I do that in MATLAB? Please help me.
For instance, if I use some points (like centroids) for clustering: assuming 2 is a centroid, I'd find the elements near that point and cluster the dense region around the point 2 (calculate the distances from all the centroids, then classify the data into the nearest cluster), as in the sketch below.
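A minimal sketch of that nearest-centroid assignment (shown in Python/NumPy rather than MATLAB; the data values are hypothetical, and the same min-over-distances idea translates directly to MATLAB's abs and min):

```python
import numpy as np

data = np.array([0.4, 0.9, 1.1, 1.9, 2.2, 0.55])  # hypothetical 1-D data
centroids = np.array([2.0, 1.0, 0.5])             # the three given points

# distance from every data point to every centroid (6 x 3 matrix)
dist = np.abs(data[:, None] - centroids[None, :])

# classify each point into the nearest cluster (0, 1, or 2)
labels = dist.argmin(axis=1)
print(labels)  # [2 1 1 0 0 2]
```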
I am dealing with k-means clustering of 6 records. I am given the centroids and K = 3, and I have only 2 features. The centroids are known, and since there are only 2 features, I treat the records as (x, y) points and have plotted them.
Having mapped the points on x and y axes and computed Euclidean distances, I found that, say, (8,6) belongs to my first cluster. However, for all the other records, the Euclidean distances to a record's 2 nearest centroids are the same. So should the point (2,6) belong to the centroid (2,4) or (2,8)? And does (5,4) belong to (2,4) or (8,4)?
Thanks for replying.
The objective of k-means is to minimize variance.
Therefore, you should assign the point to the cluster where the variance increases the least. Even when the cluster centers are at the same distance, the increase in variance from assigning the point can differ, because the cluster center will move as a result of the change. This is one of the ideas behind the very fast Hartigan-Wong algorithm for k-means (as opposed to the slow textbook algorithm).
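A sketch of that tie-breaking rule, using the standard fact that adding a point x to a cluster with mean c and size n increases the within-cluster sum of squares by n/(n+1) * ||x - c||^2 (the point and centers are from the question; the cluster sizes are made-up):

```python
import numpy as np

def sse_increase(x, center, size):
    # increase in within-cluster sum of squares when x joins a cluster
    # with the given mean and size: size/(size+1) * ||x - center||^2
    return size / (size + 1) * np.sum((x - center) ** 2)

x = np.array([2.0, 6.0])                               # the tied point
centers = [np.array([2.0, 4.0]), np.array([2.0, 8.0])]
sizes = [3, 1]                                         # hypothetical sizes

deltas = [sse_increase(x, c, n) for c, n in zip(centers, sizes)]
print(deltas, "-> assign to cluster", int(np.argmin(deltas)))
# [3.0, 2.0] -> assign to cluster 1
```

With equal distances, the smaller cluster wins here, because its mean moves further toward the new point, which reduces the added squared error.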
We know that clustering methods in R assign observations to the closest medoids; hence, each observation is supposedly assigned to the closest cluster it can have. So I wonder how it is possible to get negative silhouette values while we supposedly assign each observation to the closest cluster and the formula in the silhouette method seemingly cannot go negative?
Behnam.
Two errors:
Most clustering algorithms do not use medoids; only PAM does.
The silhouette does not use the distance to the medoid, but the average distance to all cluster members. If the closest cluster is very wide, that average distance can be larger than the distance to the medoid. Consider a cluster with one point in the center and all the others on a sphere around it.
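A small numeric illustration of that sphere example (hypothetical coordinates, computed with scikit-learn here):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# cluster 0: one point at the center plus 8 points on a circle of radius 10
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
ring = 10 * np.c_[np.cos(angles), np.sin(angles)]
cluster0 = np.vstack([[0.0, 0.0], ring])

# cluster 1: a tight clump fairly close to the center point
cluster1 = np.array([[3.0, 0.0], [3.5, 0.5], [3.0, -0.5]])

X = np.vstack([cluster0, cluster1])
labels = np.array([0] * len(cluster0) + [1] * len(cluster1))

# the center point *is* the medoid of cluster 0, yet its silhouette is
# negative: its average distance to its own cluster (~10) exceeds its
# average distance to cluster 1 (~3)
print(silhouette_samples(X, labels)[0])  # about -0.68
```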