Determine Cluster Label in K-means - cluster-analysis

I have a dataset containing 150 data points that are actually divided into 3 groups. Each group has its own label.
I run the K-means algorithm to cluster the data.
I need to assign a label to each group created by the K-means process, so that I can compare the K-means result with the training data.
Could anybody explain how to determine the label of each group?

Read up on cluster evaluation on Wikipedia.
No clustering algorithm will assign a label such as iris_setosa to the cluster, unless you provide the labels to the clustering algorithm somehow (but then it is no longer clustering, actually, but classification).
So you will only have first_cluster, second_cluster, third_cluster type of labels.
Various measures have been proposed to compare the structure of the clusters to the original data set, but usually there will not be a 1:1 correspondence to the original labels.
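If the goal is only evaluation, one common heuristic is to name each K-means cluster after the majority true label among its members. A minimal sketch, assuming scikit-learn and the Iris data the question appears to describe (150 points, 3 classes):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    # 150 points, 3 true classes -- matches the dataset described above.
    X, y = load_iris(return_X_y=True)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # Name each cluster after the most common true label inside it.
    # This is evaluation only: the labels are never shown to K-means.
    mapping = {}
    for cluster_id in range(3):
        members = y[km.labels_ == cluster_id]
        mapping[cluster_id] = np.bincount(members).argmax()

    predicted = np.array([mapping[c] for c in km.labels_])
    print("majority-vote accuracy:", (predicted == y).mean())

Keep in mind this mapping can be ambiguous when two clusters share the same majority label, which is exactly why a 1:1 correspondence is not guaranteed.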

Related

Clustering in Weka

I have some data collected using an online survey, so there are no classes/labels in the data with which to evaluate clustering results. I am trying to cluster participants into groups for another task.
In the data I have 10 attributes (e.g., Age, Gender) and 111 examples, or data points.
It's my first time performing clustering, and it has been difficult to find potential clusters in the data.
Here are the steps I have performed in Weka:
I tried to cluster the data using all attributes, all clustering algorithms available in Weka (Cobweb, EM, etc.) and different numbers of clusters (1-10). When I visualise the clusters, they don't make any sense and the data are spread widely across the x and y axes.
I applied PCA and selected different attribute combinations according to the ranks obtained from PCA. The best clustering result was obtained with k-means using a combination of only 2 attributes, with the number of clusters set to 3 and the seed set to 7 (sorry, I have no idea what the seed is).
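For reference, a rough scikit-learn equivalent of this workflow (standardize, PCA down to 2 components, then k-means with 3 clusters and a fixed seed) might look as follows; the data here is a random stand-in for the survey, and Weka's seed corresponds to random_state:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    # Hypothetical stand-in for the survey data: 111 examples, 10 attributes
    # (assumed numeric here; categorical attributes such as Gender would need
    # encoding first).
    rng = np.random.default_rng(7)
    X = rng.normal(size=(111, 10))

    X_scaled = StandardScaler().fit_transform(X)
    X_2d = PCA(n_components=2).fit_transform(X_scaled)

    # random_state plays the role of Weka's "seed": it fixes the random choice
    # of initial centroids, so repeated runs give reproducible clusters.
    km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(X_2d)
    print(km.labels_[:10])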
My Questions:
Are the steps I performed to cluster the data correct? If not, please advise.
Is this considered a good clustering result?
How can I optimise or enhance my clusters?
What is meant by seed in Weka clustering?

How to decide the number of clusters based on a distance threshold between clusters for agglomerative clustering with sklearn?

With sklearn.cluster.AgglomerativeClustering from sklearn I need to specify the number of resulting clusters in advance. What I would like to do instead is to merge clusters until a certain maximum distance between clusters is reached, and then stop the clustering process.
Accordingly, the number of clusters might vary depending on the structure of the data. I care about neither the number nor the size of the resulting clusters, only that the cluster centroids do not exceed a certain distance from one another.
How can I achieve this?
This pull request for a distance_threshold parameter in scikit-learn's agglomerative clustering may be of interest:
https://github.com/scikit-learn/scikit-learn/pull/9069
It looks like it'll be merged in version 0.22.
EDIT: See my answer to my own question for an example of implementing single linkage clustering with a distance based stopping criterion using scipy.
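For completeness, a minimal sketch of how that parameter is used once available (the 5.0 threshold is arbitrary and the data synthetic):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))

    # n_clusters must be None when distance_threshold is given; merging stops
    # once clusters are farther apart than the threshold.
    agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0).fit(X)
    print("clusters found:", agg.n_clusters_)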
Use scipy directly instead of sklearn. IMHO, it is much better.
Hierarchical clustering is a three-step process:
Compute the dendrogram
Visualize and analyze
Extract branches
But that doesn't fit the supervised-learning-oriented API preference of sklearn, which would like everything to implement a fit/predict API...
SciPy has a function for you:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
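A minimal sketch of the three steps above with SciPy (synthetic data; the 5.0 distance threshold is arbitrary):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))

    Z = linkage(X, method="single")    # 1. compute the dendrogram
    # dendrogram(Z)                    # 2. visualize and analyze (needs matplotlib)
    labels = fcluster(Z, t=5.0, criterion="distance")  # 3. extract branches below a distance
    print(labels)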

Cluster assignment remapping

I have labelled test classification datasets from the UCI Machine Learning Repository.
I am stripping off the labels and using the data to benchmark a few clustering algorithms, after which I plan to use external validation methods. I will run each algorithm with different initial configurations, say 50 times, and then take the mean value. Across those 50 runs, the algorithm labels the data points of a single cluster with different numbers. Since the cluster labels can change in each run, and since each run might produce slightly different cluster assignments, how can I remap the clusters to one uniform numbering?
My primary idea is to remap each cluster to the true class with which it shares the most points, but this can produce incorrect remappings: when the classes contain roughly equal numbers of points, the maximum-intersection rule breaks down.
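If one does want such a remapping, it can be made well defined even for similarly sized clusters by treating it as an assignment problem. A sketch using the Hungarian algorithm from SciPy, assuming integer labels and as many clusters as classes:

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from sklearn.metrics import confusion_matrix

    def remap(true_labels, cluster_labels):
        """Renumber cluster labels to best match the true labels."""
        # Contingency table: rows = true classes, columns = clusters.
        C = confusion_matrix(true_labels, cluster_labels)
        # One-to-one assignment maximizing the total overlap.
        classes, clusters = linear_sum_assignment(-C)
        mapping = dict(zip(clusters, classes))
        return np.array([mapping[c] for c in cluster_labels])

    print(remap([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))  # -> [0 0 1 1 2 2]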
Another idea is to keep the labels during clustering but have the clustering algorithm ignore them; that way every clustered data point carries its label tag. This is doable, but I already have benchmarked cluster-assignment data to process, so I am trying to avoid modifying my implementations of the cluster analysis algorithms (to carry the label tag in the vectors and ignore it) and re-benchmarking them, which would take quite some time and CPU.
Is there any way that I can compute average accuracy from the cluster assignments I have right now?
EDIT:
In the domain I am studying (metaheuristic clustering algorithms), I could not find a paper comparing these indices. The one paper that does compare them appears to have incorrect values. Can anyone point me to a paper where clustering results are compared using any of these indices?
What do you do when the number of clusters doesn't agree?
Do not try to map clusters.
Instead, use proper external validation measures for clustering, which do not require a 1:1 correspondence of clusters. There are plenty; for details see Wikipedia.
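For instance, the adjusted Rand index and normalized mutual information (both in scikit-learn) are invariant to how the clusters are numbered and remain defined even when the number of clusters differs from the number of classes:

    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
    run_a = [2, 2, 2, 0, 0, 0, 1, 1, 1]  # same partition, different numbering
    run_b = [0, 0, 1, 1, 1, 1, 2, 2, 2]  # a genuinely different partition

    print(adjusted_rand_score(truth, run_a))            # 1.0 despite the renumbering
    print(adjusted_rand_score(truth, run_b))            # < 1.0
    print(normalized_mutual_info_score(truth, run_b))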

clustering vs fitting a mixture model

I have a question about using a clustering method vs fitting the same data with a distribution.
Assume that I have a dataset with 2 features (feat_A and feat_B), and that I use a clustering algorithm to divide the data into an optimal number of clusters... say 3.
My goal is to assign to each input point [feat_Ai, feat_Bi] a probability (or something similar) that the point belongs to cluster 1, 2, or 3.
a. First approach, with clustering:
I cluster the data into the 3 clusters and assign to each point a probability of belonging to each cluster based on its distance from that cluster's center.
b. Second approach, using a mixture model:
I fit a mixture model or mixture distribution to the data. Data are fit to the distribution using an expectation maximization (EM) algorithm, which assigns posterior probabilities to each component density with respect to each observation. Clusters are assigned by selecting the component that maximizes the posterior probability.
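To make the contrast concrete, here is a sketch of both approaches on synthetic two-feature data; note the softmax-over-distances score for k-means is an ad hoc choice, one of many possible:

    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    new_point = np.array([[0.0, 0.0]])

    # a. k-means: turn distances to the 3 centers into an ad hoc "probability".
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    d = np.linalg.norm(new_point - km.cluster_centers_, axis=1)
    scores = np.exp(-d) / np.exp(-d).sum()

    # b. mixture model: posterior probabilities come directly from the model.
    gm = GaussianMixture(n_components=3, random_state=0).fit(X)
    posteriors = gm.predict_proba(new_point)

    print("k-means scores:", scores)
    print("GMM posteriors:", posteriors[0])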
In my problem I find the cluster centers (or fit the model, if approach b is used) using a subsample of the data. Then I have to assign probabilities to a lot of other data... I would like to know which approach is better for obtaining meaningful assignments on new data.
I would go for a clustering method, for example k-means, because:
If the new data come from a distribution different from the one used to create the mixture model, the assignment could be incorrect.
With new data the posterior probability changes.
The clustering method minimizes the variance of the clusters in order to find a kind of optimal separation border; the mixture model takes the variance of the data into consideration to create the model (I am not sure the resulting clusters are separated in an optimal way).
More info about the data:
The features should not be assumed to be dependent.
Feat_A represents the duration of a physical activity and Feat_B the step count. In principle we could say that the step count increases with the duration of the activity, but that is not always true.
Please help me think this through, and if you have any other points, please let me know.

Decision on number of clusters in Data Mining

Whenever we want to cluster some data, the user is required to give the number of clusters. For example, with the K-means algorithm we need to specify how many clusters are required.
My question: is it possible for the algorithm to decide by itself how many clusters are feasible for a particular data set?
There are several clustering algorithms that do not require a desired number of clusters as an input to the algorithm. An example of such an algorithm is the mean-shift clustering algorithm. However, you will need to specify a kernel as an input to the algorithm. This kernel selection (e.g., the size and shape of the kernel) will impact the number of clusters that you get as an output.
Some more information:
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
http://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html
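A minimal sketch of mean shift in scikit-learn; no cluster count is given, but the bandwidth (the kernel size) has to be chosen or estimated, and that choice drives how many clusters come out:

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # The bandwidth plays the role that k plays in k-means.
    bw = estimate_bandwidth(X, quantile=0.2)
    ms = MeanShift(bandwidth=bw).fit(X)
    print("clusters found:", len(np.unique(ms.labels_)))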
I'm no expert on this, but to answer your question: yes, there are methods to determine the number of clusters automatically, for k-means for example.
It's quite involved, but given a dataset and a clustering method, you can compute what is called the gap statistic in order to estimate the number of clusters.
If you are an R user, check out the clusGap and maxSE functions.
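For those not using R, a bare-bones Python sketch of the gap statistic, simplified from the original Tibshirani et al. formulation (uniform reference data over the bounding box, and plain within-cluster inertia as the dispersion):

    import numpy as np
    from sklearn.cluster import KMeans

    def gap_statistic(X, k_max=10, n_refs=10, random_state=0):
        """Return the gap value for each k in 1..k_max (larger = better k)."""
        rng = np.random.default_rng(random_state)
        lo, hi = X.min(axis=0), X.max(axis=0)
        gaps = []
        for k in range(1, k_max + 1):
            inertia = KMeans(n_clusters=k, n_init=10,
                             random_state=random_state).fit(X).inertia_
            # Expected dispersion under a uniform "no structure" reference.
            ref_inertias = [
                KMeans(n_clusters=k, n_init=10).fit(
                    rng.uniform(lo, hi, size=X.shape)).inertia_
                for _ in range(n_refs)
            ]
            gaps.append(np.mean(np.log(ref_inertias)) - np.log(inertia))
        return np.array(gaps)

The simplest rule is then to pick the k with the largest gap; clusGap and maxSE in R implement the more careful standard-error-based stopping rules.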