How to calculate Density in clustering - cluster-analysis

I am working with a data set having 2 coordinates per point. Currently I calculate density by first computing the total distance from each point to every other point and then dividing by the total number of points. I want to know whether this is a correct method to calculate density, as I am not getting the desired result.
This is the cluster file https://dl.dropboxusercontent.com/u/45772222/samp.txt
This data should contain 3 clusters: 2 ellipses and one pipe connecting them.
Any idea how I can separate them?

Now that is a total toy example.
DBSCAN cannot separate clusters of different densities that touch each other. By definition of density-connectedness, clusters must be separated by an area of low density. In your toy example, the two large clusters are actually connected by an area of higher density.
So essentially, this is an example of non-density-based clusters... If you want density-based clustering to be able to separate these clusters, you must reduce the density of the connecting bar below the density of the clusters. (But maybe don't even bother with such toy examples at all.)
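As a rough illustration of that argument, here is a minimal sketch (assuming scikit-learn, and assuming samp.txt holds one whitespace-separated (x, y) pair per line) of running DBSCAN on such data:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Assumption: samp.txt contains one whitespace-separated (x, y) pair per line.
    X = np.loadtxt("samp.txt")

    # DBSCAN merges everything that is density-connected: if the bar between
    # the two ellipses is at least as dense as the ellipses, no choice of
    # eps / min_samples will split them into separate clusters.
    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
    print("clusters:", len(set(labels) - {-1}), "noise:", int((labels == -1).sum()))

Thinning out the connecting bar (e.g. by subsampling that region) before clustering is one way to let the same parameters separate the two ellipses.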

Related

Weighted clustering in ML.net without training

I'm looking to see if I can use ML.net to cluster a one-off set of weighted points that I don't have any training data for. I just need the points divided into a specified number of clusters, with each weighted cluster size within a specified range.
Is that something ML.net can do? If so, how?
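No ML.net answer here, but to illustrate the underlying idea in Python: scikit-learn's KMeans accepts per-point weights via sample_weight, which covers the one-off weighted clustering part (keeping each weighted cluster size within a range would still need a constrained-clustering or post-processing step, and whether ML.net exposes an equivalent weight option is something to verify in its docs):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical one-off data: points plus importance weights, no training set.
    points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]])
    weights = np.array([1.0, 3.0, 1.0, 1.0, 2.0])  # heavier points pull centroids harder

    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(points, sample_weight=weights)

    # Weighted size of each cluster.
    for c in range(2):
        print(c, float(weights[labels == c].sum()))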

Clustering of 3D points

I have a large dataset of around 20 million points (x,y,z) in a 3-dimensional space. I know these points are organized in dense regions, but that these regions vary in size. I think a standard unsupervised 3D clustering should solve my problem.
Since I can't estimate the number of clusters a priori, I tried using k-means with a wide range of values for k, but it is slow, and I would also have to estimate how significant each k-partition is.
Basically, my question is: how can I extract the most significant partition of my points into clusters?
k-means is probably not the best algorithm for such data.
DBSCAN should be closer to your intuition of dense regions.
Try it on a sample first, then figure out how to scale up.
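A minimal sketch of that sample-first advice, assuming scikit-learn and the points in an (N, 3) NumPy array (the data and parameters here are illustrative stand-ins):

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    points = rng.random((1_000_000, 3))  # stand-in for the real 20M (x, y, z) points

    # Tune eps / min_samples on a manageable random sample first.
    sample = points[rng.choice(len(points), size=50_000, replace=False)]
    labels = DBSCAN(eps=0.02, min_samples=20).fit_predict(sample)
    print("clusters:", len(set(labels) - {-1}), "noise:", int((labels == -1).sum()))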
It is not clear to me from the above whether you're going to use k-means or not, but if you are, you should follow the responses in the post below, which show how to measure the variance of the clusters.
Calculating the percentage of variance measure for k-means?
Additionally, you can get a good fit using the 'elbow method' by trying k from 2 to 15. See the answer from Amro for the process on this.
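For reference, the elbow method boils down to plotting the within-cluster sum of squares against k and looking for the bend; a minimal scikit-learn sketch (illustrative data):

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).random((5000, 3))  # stand-in for your 3-D points

    # Inertia = within-cluster sum of squares; look for the k where it stops
    # dropping sharply.
    for k in range(2, 16):
        print(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)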
One simple idea in this case is to use 3 different clusterings, along each dimension. That might speed things up.
So you find clusters along the X axis (project all points down to the X axis) and then continue to form sub-clusters along the Y axis and then along the Z axis.
I think 1-D k-means can be solved very efficiently using dynamic programming http://www.sciencedirect.com/science/article/pii/0025556473900072.
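To make that concrete, here is a simple O(k·n²) dynamic program for exact 1-D k-means (an illustrative implementation, not the algorithm from the linked paper; all names here are mine):

    import numpy as np

    def kmeans_1d(x, k):
        # Exact 1-D k-means by dynamic programming, O(k * n^2).
        x = np.sort(np.asarray(x, dtype=float))
        n = len(x)
        s1 = np.concatenate(([0.0], np.cumsum(x)))      # prefix sums
        s2 = np.concatenate(([0.0], np.cumsum(x * x)))  # prefix sums of squares

        def sse(j, i):  # within-cluster SSE of x[j:i], j < i
            m = i - j
            mu = (s1[i] - s1[j]) / m
            return (s2[i] - s2[j]) - m * mu * mu

        D = np.full((k + 1, n + 1), np.inf)  # D[m, i]: best cost, m clusters, i points
        D[0, 0] = 0.0
        back = np.zeros((k + 1, n + 1), dtype=int)
        for m in range(1, k + 1):
            for i in range(m, n + 1):
                for j in range(m - 1, i):
                    c = D[m - 1, j] + sse(j, i)
                    if c < D[m, i]:
                        D[m, i] = c
                        back[m, i] = j
        # Recover the cluster boundaries from the backpointers.
        cuts, i = [], n
        for m in range(k, 0, -1):
            i = back[m, i]
            cuts.append(i)
        return float(D[k, n]), sorted(cuts)[1:]  # total SSE, split indices

    print(kmeans_1d([1, 2, 3, 10, 11, 12, 30], 3))  # -> (4.0, [3, 6])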

Center of clusteres in rapidminer

I have six features that are clustered using the k-means algorithm in RapidMiner, and I want to detect outliers in this data. There is a centroid table in RapidMiner that shows the center of each feature in each cluster. I want to detect outliers using the cluster model (k-means); I have the average within-centroid distance per cluster, but I want the distance of each data point from the center of its cluster. How do I obtain a center point for each cluster over 6 features in RapidMiner, and how do I compute the distance of each 6-feature data point to the center of its cluster?
You can use the Cross Distances operator for this. It calculates the distances between all pairs of examples in two example sets. Use the Extract Cluster Prototype operator to find the cluster centroids and connect its output to one input of the Cross Distances operator; the original example set is connected to the other input. You can change the distance measure used by this operator; the default is Euclidean distance.
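Outside RapidMiner, the same computation is short; a minimal Python sketch (scikit-learn, illustrative data) of flagging points that are far from their own cluster center:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).random((500, 6))  # stand-in for the 6-feature data

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
    # Distance of each point to the centroid of its own cluster.
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

    # Simple outlier rule: flag points beyond the 95th percentile of distance.
    outliers = np.where(dists > np.quantile(dists, 0.95))[0]
    print(len(outliers), "candidate outliers")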

How do I choose k when using k-means clustering with Silhouette function?

I've been studying k-means clustering, and one big thing that is not clear to me is what the silhouette function really tells me.
I know it is used to determine what an appropriate k should be, but I can't understand what the silhouette value actually means.
I read somewhere that if the mean silhouette is less than 0.5, your clustering is not valid.
Thanks in advance for your answers.
From the definition of the silhouette:

Silhouette Value
The silhouette value for each point is a measure of how similar that point is to points in its own cluster compared to points in other clusters, and ranges from -1 to +1.
The silhouette value for the i-th point is defined as

    s_i = (b_i - a_i) / max(a_i, b_i)

where a_i is the average distance from the i-th point to the other points in the same cluster as i, and b_i is the minimum average distance from the i-th point to points in a different cluster, minimized over clusters.
This method just compares intra-cluster similarity to the similarity of the closest other cluster. If a point's average distance to the other members of its own cluster is higher than its average distance to the members of some other cluster, its silhouette value is negative and the clustering is not successful. On the other hand, silhouette values close to 1 indicate a successful clustering. 0.5 is not an exact threshold for clustering quality.
#fatihk gave a good citation;
additionally, you may think of the silhouette value as a degree of how much clusters overlap with each other, i.e. -1: they overlap perfectly, +1: clusters are perfectly separable;
BUT low silhouette values for a particular algorithm do NOT mean that there are no clusters; rather, they mean that the algorithm used cannot separate the clusters, and you may consider tuning your algorithm or using a different one (think of k-means on concentric circles vs. DBSCAN).
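A minimal sketch of choosing k by mean silhouette with scikit-learn (illustrative data):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    # Pick the k with the highest mean silhouette over a candidate range.
    scores = {}
    for k in range(2, 10):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    print(max(scores, key=scores.get), scores)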
There is an explicit formula associated with the elbow method that can automatically determine the number of clusters: it quantifies the strength of the elbow(s) detected when using the elbow method, see here. See the illustration here:
Enhanced Elbow rule

MATLAB: Self-Organizing Map (SOM) clustering

I'm trying to cluster some images depending on the angles between body parts.
The features extracted from each image are:
angle1 : torso - torso
angle2 : torso - upper left arm
..
angle10: torso - lower right foot
Therefore the input data is a matrix of size 1057x10, where 1057 stands for the number of images, and 10 stands for angles of body parts with torso.
Similarly, the test set is an 821x10 matrix.
I want all the rows in the input data to be clustered into 88 clusters.
Then I will use these clusters to find which cluster each row of the test data falls into.
In previous work, I used k-means clustering, which is very straightforward: we just ask k-means to cluster the data into 88 clusters, then implement another method that calculates the distance between each row in the test data and the centers of each cluster and picks the smallest value. That cluster is the one the corresponding test-data row belongs to.
I have two questions:
Is it possible to do this using SOM in MATLAB?
AFAIK SOMs are for visual clustering. But I need to know the actual class of each cluster so that I can later label my test data by determining which cluster each row belongs to.
Do you have a better solution?
A Self-Organizing Map (SOM) is a clustering method considered an unsupervised variation of the Artificial Neural Network (ANN). It uses competitive learning techniques to train the network (nodes compete among themselves to display the strongest activation to a given input).
You can think of SOM as if it consists of a grid of interconnected nodes (square shape, hexagonal, ..), where each node is an N-dim vector of weights (same dimension size as the data points we want to cluster).
The idea is simple: given a vector as input to the SOM, we find the node closest to it, then update its weights and the weights of the neighboring nodes so that they approach those of the input vector (hence the name self-organizing). This process is repeated for all input data.
The clusters formed are implicitly defined by how the nodes organize themselves and form a group of nodes with similar weights. They can be easily seen visually.
SOMs are in a way similar to the k-means algorithm, but differ in that we don't impose a fixed number of clusters; instead, we specify the number and shape of nodes in the grid that we want to adapt to our data.
Basically, when you have a trained SOM and you want to classify a new test input vector, you simply assign it to the nearest node on the grid (the Best Matching Unit, or BMU, using distance as the similarity measure), and give as prediction the [majority] class of the vectors belonging to that BMU node.
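As a sketch of how this works end to end, here is a from-scratch toy SOM in Python (not the MATLAB toolboxes mentioned below; the shapes mirror the question, but the data and all names here are illustrative):

    import numpy as np

    def train_som(data, grid=(10, 10), iters=5000, lr0=0.5, sigma0=3.0, seed=0):
        # Minimal SOM trainer: returns a (rows, cols, dim) grid of weight vectors.
        rng = np.random.default_rng(seed)
        rows, cols = grid
        w = rng.random((rows, cols, data.shape[1]))
        gy, gx = np.mgrid[0:rows, 0:cols]  # node coordinates for the neighborhood
        for t in range(iters):
            x = data[rng.integers(len(data))]
            # Best Matching Unit: the node whose weights are closest to x.
            d = np.linalg.norm(w - x, axis=2)
            by, bx = np.unravel_index(np.argmin(d), d.shape)
            # Decaying learning rate and neighborhood radius.
            lr = lr0 * np.exp(-t / iters)
            sigma = sigma0 * np.exp(-t / iters)
            h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
            # Pull the BMU and its neighbors toward the input (self-organization).
            w += lr * h[:, :, None] * (x - w)
        return w

    def bmu(w, x):
        # Grid index of the Best Matching Unit for a new vector x.
        return np.unravel_index(np.argmin(np.linalg.norm(w - x, axis=2)), w.shape[:2])

    train = np.random.default_rng(1).random((1057, 10))  # 1057 images, 10 angles
    som = train_som(train)
    print(bmu(som, train[0]))  # grid cell the first image maps to

To label test vectors, map each training row to its BMU, record the majority label per node, then assign each test row the label of its BMU.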
For MATLAB, you can find a number of toolboxes that implement SOM:
The Neural Network Toolbox from MathWorks can be used for clustering using SOM (see the nctool clustering tool).
Also worth checking out is the SOM Toolbox