Center of clusters in RapidMiner - cluster-analysis

I have six features that are clustered using the k-means algorithm in RapidMiner, and I want to detect outliers in this data. RapidMiner produces a centroid table that shows the center of each feature in each cluster. Since I am using the clustering (k-means) to detect outliers, I have the average within-centroid distance per cluster, but what I really want is the distance between each data point and the center of its cluster. How do I calculate a center point for each cluster with 6 features in RapidMiner? And given that each data point also has 6 features, how do I represent it as a point and calculate its distance to the cluster center in RapidMiner?

You can use the Cross Distances operator for this. It calculates the distances between all pairs of examples in two example sets. Use the Extract Cluster Prototype operator to find the cluster centroids and connect its output to one of the inputs of the Cross Distances operator; the original example set is connected to the other input. You can change the distance measure used by this operator, but the default is Euclidean distance.
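RapidMiner does all of this for you, but if you want to sanity-check the numbers outside RapidMiner, here is a minimal sketch of the equivalent computation in Python with scikit-learn (the variable names and the choice of k below are assumptions for illustration, not taken from your process):

import numpy as np
from sklearn.cluster import KMeans

# X is assumed to be the (n_examples, 6) feature matrix from your example set.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Euclidean distance from every example to the centroid of its own cluster;
# this is the per-example distance you are after.
dist_to_own_centroid = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)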

Related

Partitioning criteria/method to improve clustering according to points grouping structure

I am trying to do some clustering on points based on 2d coordinates.
The point coordinates are here:
points.txt
I use Ward clustering for that, but the results are not as expected: the point colors on the image below correspond to the groups calculated by Ward linkage, while I would expect something that takes the structure of the points into account more (like the groups drawn by hand). I tried the different cluster distances available in scipy (single, complete, average, weighted, centroid, median), but the results were not improved.
I am quite new to clustering. Any idea which method/criterion could help group the points similarly to the groups drawn by hand?
Here is an extract of the code that does the clustering using scipy.cluster.hierarchy. Here, points is the array of point coordinates and cutLevel is a tuned parameter based on the highest jump in aggregation distance.
from scipy.cluster.hierarchy import linkage, fcluster

clustering = linkage(points, method="ward")
clusters = fcluster(clustering, cutLevel, criterion="distance")
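For reference, one possible way to derive cutLevel from the highest jump in aggregation distance (a sketch of the idea, not my exact code) is:

import numpy as np

# The third column of the linkage matrix holds the merge (aggregation) distances.
merge_dists = clustering[:, 2]
jumps = np.diff(merge_dists)
biggest = np.argmax(jumps)
# Cut halfway between the two merges around the largest jump.
cutLevel = (merge_dists[biggest] + merge_dists[biggest + 1]) / 2.0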

Retrieve cluster centers / centroids from linkage matrix

In scipy's hierarchical clustering one can build clusters starting from the linkage matrix Z. For instance,
fcluster(Z, 6, criterion='maxclust')
would cut the dendrogram so that there will be 6 clusters in the end. Is there a way to get the coordinates of the center of each of those clusters? The position of the centers will differ depending on the metric and method used to build the dendrogram, but I would like to get the centers corresponding to the particular method that was used to build up Z.
Hierarchical clustering does not use centers.
The centers may even be outside of the cluster.
Because of that, you will simply have to call mean yourself to compute centers, if you want centers.
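For example, a minimal sketch with numpy (assuming X is the data matrix that was passed to linkage to build Z):

import numpy as np
from scipy.cluster.hierarchy import fcluster

labels = fcluster(Z, 6, criterion='maxclust')
# Mean of the member points per cluster label, i.e. the centroid of each cluster.
centers = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])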

How do I choose k when using k-means clustering with Silhouette function?

I've been studying k-means clustering, and one big thing that is not clear is what the silhouette function really tells me.
I know it indicates what the appropriate k should be, but I can't understand what the mean of the silhouette function really says about my clustering.
I read somewhere that if the mean silhouette is less than 0.5, your clustering is not valid.
Thanks for your answers in advance.
From the definition of silhouette:
Silhouette Value
The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1.
The silhouette value for the ith point, Si, is defined as
Si = (bi - ai) / max(ai, bi)
where ai is the average distance from the ith point to the other points in the same cluster as i, and bi is the minimum average distance from the ith point to points in a different cluster, minimized over clusters.
This method just compares the intra-cluster similarity to the similarity of the closest other cluster. If a data point's average distance to the other members of its own cluster is higher than its average distance to the members of some other cluster, then this value is negative and the clustering is not successful. On the other hand, silhouette values close to 1 indicate a successful clustering operation. 0.5 is not an exact threshold for judging a clustering.
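To make choosing k concrete, here is a minimal sketch with scikit-learn (assuming X is your data matrix; the range of candidate k values is just an example):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest mean silhouette value.
best_k = max(scores, key=scores.get)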
@fatihk gave a good citation; additionally, you may think of the silhouette value as a degree of how much clusters overlap with each other, i.e. -1: they overlap perfectly, +1: the clusters are perfectly separable.
BUT low silhouette values for a particular algorithm do NOT mean that there are no clusters; rather, they mean that the algorithm used cannot separate the clusters, and you may consider tuning your algorithm or using a different one (think about k-means for concentric circles vs. DBSCAN).
There is an explicit formula associated with the elbow method for automatically determining the number of clusters. The formula quantifies the strength of the elbow(s) detected when using the elbow method; see here. See the illustration here:
Enhanced Elbow rule
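The linked formula is not reproduced here, but the raw ingredient of any elbow rule is the within-cluster sum of squares per k; a minimal sketch to collect it with scikit-learn (assuming a data matrix X):

from sklearn.cluster import KMeans

# Within-cluster sum of squares (inertia) for each candidate k;
# the elbow is the k after which this curve flattens out.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 11)}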

How to calculate Density in clustering

I am working with a data set having 2 coordinates. Currently I calculate density by first computing the total distance from each point to all other points and then dividing it by the total number of points. I want to know whether this is the correct method to calculate density, as I am not getting the desired result.
This is the cluster file https://dl.dropboxusercontent.com/u/45772222/samp.txt
This data set should have 3 clusters -> 2 ellipses and one pipe connecting them.
Any idea how I can separate them?
Now that is a total toy example.
DBSCAN cannot separate clusters of different densities that touch each other. By definition of density connectedness, they must be separated by an area of low density. In your toy example, the two large clusters are actually connected by an area of higher density.
So essentially, this is an example of non-density-based clusters... If you want density-based clustering to separate these clusters, the connecting bar must have a lower density than the clusters themselves. (But maybe don't bother with such toy examples at all.)
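For reference, a minimal DBSCAN call with scikit-learn looks like this; eps and min_samples set the density threshold discussed above, and the values shown are placeholders, not tuned for the linked file:

import numpy as np
from sklearn.cluster import DBSCAN

# X is the (n_points, 2) array of coordinates loaded from the file.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
# Points labelled -1 are noise, i.e. low-density outliers.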

Hierarchical Cluster Analysis in Cluster 3.0

I'm new to this site as well as new to cluster analysis, so I apologize if I violate conventions.
I've been using Cluster 3.0 to perform hierarchical cluster analysis with Euclidean distance and average linkage. Cluster 3.0 outputs a .gtr file in which each node joins two genes (or previously formed nodes) together with their similarity score. I've noticed that the first line in the .gtr file always links a gene with another gene, followed by the similarity score. But how do I reproduce this similarity score?
In my data set, I have 8 genes and create a distance matrix where d_{ij} contains the Euclidean distance between gene i and gene j. Then I normalize the matrix by dividing each element by the maximum value in the matrix. To get the similarity matrix, I subtract all the elements from 1. However, my result does not use the linkage type and differs from the output similarity score.
I am mainly confused how linkages affect the similarity of the first node (the joining of the two closest genes) and how to compute the similarity score.
Thank you!
The algorithm compares clusters using some linkage method, not data points. However, in the first iteration of the algorithm each data point forms its own cluster; this means that your linkage method is actually reduced to the metric you use to measure the distance between data points (in your case, Euclidean distance). For subsequent iterations, the distance between clusters will be measured according to your linkage method, which in your case is average linkage. For two clusters A and B, this is calculated as follows:
d(A, B) = (1 / (|A| * |B|)) * sum over all a in A and b in B of d(a, b)
where d(a,b) is the Euclidean distance between the two data points. Convince yourself that when A and B each contain just one data point (as in the first iteration) this equation reduces to d(a,b). I hope this makes things a bit clearer. If not, please provide more details of what exactly you want to do.
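As a rough sketch of that computation in Python (whether Cluster 3.0 normalizes its similarity score exactly this way is an assumption on my part, not something I have verified):

import numpy as np
from scipy.spatial.distance import cdist

def average_linkage_distance(A, B):
    # Mean Euclidean distance over all pairs (a, b) with a in A and b in B.
    return cdist(A, B, metric='euclidean').mean()

def similarity(A, B, max_dist):
    # 1 minus the average-linkage distance normalized by the largest pairwise
    # distance in the whole data set, mirroring the normalization you describe.
    return 1.0 - average_linkage_distance(A, B) / max_dist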