choose the proper clustering method for Latent Semantic Analysis - cluster-analysis

i want to cluster some text document to find the document with the same concept. i've done the semantic similarity using Latent Semantic Analysis (LSA), but i confuse which clustering method that i should choose for my purpose .
Thank you

You can use hierarchical clustering. There is a package in R called RClusterpp which is very efficient for hierarchical clustering of large data (it does a parallel computation). Then you can cut the dendrogram tree for different number of cluster within the possible range and check for cluster profiles using cross-tab.

Related

How to decide the numbers of clusters based on a distance threshold between clusters for agglomerative clustering with sklearn?

With sklearn.cluster.AgglomerativeClustering from sklearn I need to specify the number of resulting clusters in advance. What I would like to do instead is to merge clusters until a certain maximum distance between clusters is reached and then stop the clustering process.
Accordingly, the number of clusters might vary depending on the structure of the data. I also do not care about the number of resulting clusters nor the size of the clusters but only that the cluster centroids do not exceed a certain distance.
How can I achieve this?
This pull request for a distance_threshold parameter in scikit-learn's agglomerative clustering may be of interest:
https://github.com/scikit-learn/scikit-learn/pull/9069
It looks like it'll be merged in version 0.22.
EDIT: See my answer to my own question for an example of implementing single linkage clustering with a distance based stopping criterion using scipy.
Use scipy directly instead of sklearn. IMHO, it is much better.
Hierarchical clustering is a three step process:
Compute the dendrogram
Visualize and analyze
Extract branches
But that doesn't fit the supervised-learning-oriented API preference of sklearn, which would like everything to implement a fit, predict API...
SciPy has a function for you:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster

Using precision recall metric on a hierarchy of recovered clusters

Context: We are two students intending to write a thesis on reverse engineering namespaces using hierarchical agglomerative clustering algorithms. We have a variation of linking methods and other tweaks to the algorithm we want to try out. We will run the algorithm on popular GitHub repositories and compare the created clusters with the originally existent namespaces. Our work will closely follow the works of this paper. In the paper the authors mentions the use of the “precision recall metric” to measure the accuracy of the clustering algorithm. However looking more closely on the metric and its origin, it seems to be dedicated to flat (non-hierarchical) clusters.
Question:
Is there a way to use the precision recall metric to measure the accuracy of a hierarchy of recovered clusters? If not, what other options exists?

Cluster assignment remapping

I have test classification datasets from UCI Machine Learning repository which are labelled.
I am stripping of the labels and using the data to benchmark a few clustering algorithm and then I am planning to use external validation methods. I will run the algorithm with different initial configurations, for say, 50 times and then take the mean value. For 50 iterations the algorithm labels the data points of one single cluster with different numbers. Because in each run the cluster labels can change, also because each iteration might have slightly different cluster assignments, how to somehow remap each of the clusters to one uniform numbering.
Primary idea is to remap by checking how many of the points in the class labels intersect the maximum in the actual labels and then making a remap based on that, but this can get incorrect remappings because when the classes will have more or less equal number of points, this will not work.
Another idea is to keep the labels while clustering, but make the clustering algorithm ignore it. This way all the cluster data will have the label tags. This is doable but I have already have a benchmarked cluster assignment data to be processed therefore I am trying to avoid modifying and re-benchmarking my implementation (which will take quite some time and cpu) of the cluster analysis algorithms and include the label tag to the vectors and then ignore it.
Is there any way that I can compute average accuracy from the cluster assignments I have right now?
EDIT:
The domain in which I am studying (metaheuristic clustering algorithms) I could not find a paper comparing these indexes. The paper which compares seems to be incorrect in their values. Can anyone point me to a paper where clustering results are compared using any of these indexes?
What do you do when the number of clusters doesn't agree?
Do not try to map clusters.
Instead, use the proper external validation measures for clustering, which do not require a 1:1 correspondence of clusters. There are plenty, for details see Wikipedia.

Decision on number of clusters in Data Mining

When ever we want to cluster some data then It is required to give the number of cluster by user. Like K-Means algorithm we need to specify that how cluster are required.
My question is it possible that the algorithm decides itself that how cluster are feasible for particular data set.
There are several clustering algorithms that do not require a desired number of clusters as an input to the algorithm. An example of such an algorithm is the mean-shift clustering algorithm. However, you will need to specify a kernel as an input to the algorithm. This kernel selection (e.g., the size and shape of the kernel) will impact the number of clusters that you get as an output.
Some more information:
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf
http://scikit-learn.org/stable/auto_examples/cluster/plot_mean_shift.html
I'm not expert with that, but to answer to your question, yes there are methods to determine automatically the number of cluster for a kmeans for example.
It's quite complicated but given a dataset and a cluster method you can compute what is called gap statistic in order to estime the number of clusters.
If you are a R user, try to check clusGap and maxSE functions.

Identify clusters in SOM (Self Organizing Map)

Once I have collected and organized data in a SOM how do I identify clusters?
(Items are aggregated and clustered using many traits - upwards of 10)
Specifically I want to find the 'center' of the cluster - therefor giving me the 'center' node(s).
You could use a relative small map and consider each node a cluster, but this is far from optimal. If you want to apply an automated cluster detection method you should definitely read
Clustering of the Self−Organizing Map
and search similar bibliography.
You could also use more sophisticated versions of SOM algorithm (multi leveled, self growing, etc).
In any case, keep in mind that the problem of finding the "correct" number of clusters doesn't have a finite solution.
As far as I can tell, SOM is primarily a data-driven dimensionality reduction and data compression method. So it won't cluster the data for you; it may actually tend to spread clusters in the projection (i.e. split them into multiple cells).
However, it may work well for some data sets to either:
Instead of processing the full data set, work only on the SOM nodes (weighted by the number of elements assigned to them), which should be significantly smaller
Instead of working in the original space, work in the lower-dimensional space that the SOM represents
And then run a regular clustering algorithm on the transformed data.
Though an old question I've encountered the same issue and I've had some success implementing Estimating the Number of Clusters in Multivariate Data by Self-Organizing Maps, so I thought I'd share.
The linked algorithm uses the U-matrix to highlight the boundaries of the individual clusters and then uses an image processing algorithm called watershedding to identify the components. For this to work correctly the regions in the u-matrix are required to be concave within the resolution of your quantization (which when converted to a binary image, simply results in using a floodfill to identify the regions).