Data is not well clustered with any clustering approach - cluster-analysis

When I cluster my data (with any clustering approach) and compute quality metrics (I tried several: silhouette, Dunn, etc.), I get very poor scores.
What I'm interested in is whether my data is clusterable at all. Are there any methods to assess that? Or a method that tells me whether the data contain any useful information?
Thanks,
Hamid

Maybe it just doesn't have clusters?
Or the clusters do not fit the model evaluated by Silhouette, Dunn, etc. These metrics can be quite misleading, in particular when you also have noise in your data set. Don't blindly trust such metrics.
The best way of seeing if your data can be clustered is visualization. If you can't visualize it in a way you see clusters, how can you expect an algorithm to return meaningful clusters?
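To see what "the model evaluated by Silhouette" means in practice, here is a minimal pure-Python sketch of the silhouette score with Euclidean distance. It rewards compact, well-separated groups, so clusters of other shapes can score poorly even when they are real. The function names and the toy points are illustrative only.

```python
import math

def silhouette_scores(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b), Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

def mean_silhouette(points, labels):
    """Average silhouette over all points (the number usually reported)."""
    scores = silhouette_scores(points, labels)
    return sum(scores) / len(scores)

# Toy data: two tight pairs far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])  # labels follow the two pairs
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the pairs
```

On this toy data the first labelling scores close to 1 and the second scores below 0. A poor score therefore tells you the partition does not look like compact, separated blobs, not necessarily that the data has no structure.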

Related

What is the best practice for pre-processing before a clustering algorithm?

My data contain several features at the user level.
I want to cluster the users into several groups based on these features.
The data are skewed, with extreme outliers in some of the features.
My question is: what is the best practice for pre-processing before running the clustering algorithm?
The best practice for clustering is to first figure out how to measure distance reliably. Then many clustering methods can be tried.
Until you can quantify dissimilarity, the data cannot be used with most clustering methods.
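As one possible illustration of making distances more reliable for skewed features with outliers, the sketch below log-transforms the skewed columns and then scales every column by its median and interquartile range before computing Euclidean distance. This is an assumption about what might suit such data, not a recipe from the answer above. The function names, the choice of `log1p`, and the sample rows are hypothetical, and `log1p` only makes sense for non-negative values.

```python
import math
import statistics

def robust_scale(column):
    """Centre on the median and divide by the interquartile range (IQR)."""
    med = statistics.median(column)
    q1, _, q3 = statistics.quantiles(column, n=4)
    iqr = (q3 - q1) or 1.0  # guard against a constant column
    return [(x - med) / iqr for x in column]

def preprocess(rows, skewed_columns=()):
    """Log-transform the listed columns, then robust-scale every column."""
    columns = [list(c) for c in zip(*rows)]
    for j in skewed_columns:
        # log1p compresses long right tails; assumes values >= 0
        columns[j] = [math.log1p(x) for x in columns[j]]
    scaled = [robust_scale(c) for c in columns]
    return [list(r) for r in zip(*scaled)]

# Hypothetical user-level data: column 0 is heavily skewed with an outlier.
rows = [[1, 10], [2, 20], [3, 30], [4, 40], [1000, 50]]
prepared = preprocess(rows, skewed_columns=[0])
distance = math.dist(prepared[0], prepared[4])
```

Whether the resulting distances are meaningful still has to be judged against the domain. The transformation only keeps a single extreme value from dominating every distance.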

How to cluster data using self-organising maps?

Suppose that we train a self-organising map (SOM) with a given dataset. Would it make sense to cluster the neurons of the SOM instead of the original datapoints? This doubt came to me after reading this paper, in which the following is stated:
The most important benefit of this procedure is that computational load decreases considerably, making it possible to cluster large data sets and to consider several different preprocessing strategies in a limited time. Naturally, the approach is valid only if the clusters found using the SOM are similar to those of the original data.
In this answer it is clearly stated that SOMs don't include clustering, but some clustering procedure can be made on the SOM after it has been trained. I thought that this meant the clustering was done on the neurons of the SOM, which are in some sense a mapping of the original data, but I'm not sure about this. So, what I want to know is:
Is it correct to cluster the data by running the clustering algorithm on the trained neuron weights, treating them as datapoints? If not, how is clustering done using a SOM then?
What characteristics should a dataset have, in general, for this approach to be useful?
Yes, the usual approach seems to be either hierarchical clustering or k-means on the neurons (you'll need to dig up how it was originally done; as the paper you linked shows, many variants, including two-level approaches, have been explored since). If you consider SOMs to be a quantization and projection technique, all of these approaches are valid.
It's cheaper because the neurons are just 2-dimensional, Euclidean, and far fewer than the data points. So that is well in line with the source that you have.
Note that a SOM neuron may be empty if it lies in between two extremely well-separated clusters.
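The two-level idea can be sketched in pure Python as follows: train a small SOM, run k-means on the neuron weight vectors (as the question describes), then give each data point the cluster of its best-matching unit. This is a toy sketch under stated assumptions, not the procedure from the linked paper. The grid size, learning-rate and neighbourhood schedules, and all function names are illustrative choices.

```python
import math
import random

def bmu(weights, x):
    """Index of the best-matching unit (closest neuron) for sample x."""
    return min(range(len(weights)), key=lambda i: math.dist(weights[i], x))

def train_som(data, rows=4, cols=4, epochs=30, lr0=0.5, sigma0=2.0, seed=0):
    """Online SOM training on a rows x cols grid with a Gaussian neighbourhood."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [list(rng.choice(data)) for _ in grid]  # start from data samples
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # learning rate decays to 0
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood shrinks
        for x in rng.sample(data, len(data)):    # shuffled pass over the data
            win = grid[bmu(weights, x)]
            for i, pos in enumerate(grid):
                h = math.exp(-math.dist(pos, win) ** 2 / (2 * sigma ** 2))
                weights[i] = [w + lr * h * (xi - w) for w, xi in zip(weights[i], x)]
    return weights

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centers[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels

def som_then_cluster(data, k, **som_args):
    """Cluster the neurons, then label each point via its best-matching unit."""
    weights = train_som(data, **som_args)
    neuron_labels = kmeans(weights, k)
    return [neuron_labels[bmu(weights, x)] for x in data]
```

Here k-means only ever sees `rows * cols` neurons rather than the full data set, which is where the saving comes from. Neurons that end up between clusters and win no data points still receive a label, which is harmless since no point maps to them.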

Using precision recall metric on a hierarchy of recovered clusters

Context: We are two students intending to write a thesis on reverse engineering namespaces using hierarchical agglomerative clustering algorithms. We have a variety of linkage methods and other tweaks to the algorithm that we want to try out. We will run the algorithm on popular GitHub repositories and compare the created clusters with the original namespaces. Our work will closely follow the work in this paper. In the paper, the authors mention the use of the “precision recall metric” to measure the accuracy of the clustering algorithm. However, looking more closely at the metric and its origin, it seems to be intended for flat (non-hierarchical) clusters.
Question:
Is there a way to use the precision recall metric to measure the accuracy of a hierarchy of recovered clusters? If not, what other options exist?

Clustering a data set: which algorithm is the best choice?

I want to cluster a data set, but I do not know how many kinds of items it contains. Which clustering algorithm is better? Can someone give me some suggestions? Thank you very much.
There is no free lunch. And no "best" clustering algorithm.
Cluster analysis is an explorative technique. There are multiple correct answers, and it is up to you to decide which is most useful to you.

What kind of analysis to use in SPSS for finding out groups/grouping?

My research question is about elderly people, and I have to find underlying groups. The data come from a questionnaire. I have thought about cluster analysis, but I would like to study perceived health and which things affect it, e.g. what kinds of groups of elderly people rate their health as bad.
I have some 30 questions I would like to check with the analysis, to see if for example widows have better or worse health than the average. I also have weights in my data so I need to use complex samples.
How can I use an already existing function, or what analysis should I use?
The key challenge you have to solve first is to specify a similarity measure. Once you can measure similarity, various clustering algorithms become available.
But questionnaire data doesn't make a very good vector space, so you can't just use Euclidean distance.
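As one illustration of a dissimilarity that does not treat questionnaire answers as a plain vector space, here is a minimal Gower-style sketch: numeric or ordinal items contribute a range-normalised absolute difference, categorical items contribute 0 when equal and 1 otherwise, and the result is the average. The answer above does not name this measure. It is only one option, and the column kinds, ranges, and respondents below are made up.

```python
def gower_distance(a, b, kinds, ranges):
    """Average per-item dissimilarity in [0, 1] for mixed-type records.

    kinds[i] is "numeric" or "categorical"; ranges[i] is the observed
    range (max - min) of a numeric item and is ignored for categorical ones.
    """
    total = 0.0
    for x, y, kind, span in zip(a, b, kinds, ranges):
        if kind == "numeric":
            # range-normalised absolute difference; 0 if the item is constant
            total += abs(x - y) / span if span else 0.0
        else:
            # simple matching for categorical answers
            total += 0.0 if x == y else 1.0
    return total / len(kinds)

# Hypothetical respondents: (age, marital status, health rating on a 1-5 scale)
kinds = ("numeric", "categorical", "numeric")
ranges = (40, None, 4)
alice = (70, "widowed", 3)
bob = (80, "married", 3)
d = gower_distance(alice, bob, kinds, ranges)  # (10/40 + 1 + 0) / 3
```

Treating a 1-5 rating as numeric is itself an assumption. Whether such items should count as ordinal, numeric, or categorical is part of deciding on the similarity measure.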
If you want to generate clusters using SPSS, standard options include: k-means, hierarchical cluster analysis, or 2-step. I have some general notes on cluster analysis in SPSS here. See from slide 34.
If you want to see if widows differ in their health, then you need to form a measure of health and compare means on that measure between widows and non-widows (presumably using a between groups t-test). If you have 30 questions related to health, then you may want to do a factor analysis to see how the items group together.
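The comparison of means described above can be sketched outside SPSS as follows. The sketch uses Welch's version of the between-groups t-test, which does not assume equal variances, and returns only the statistic and approximate degrees of freedom. It ignores survey weights, so it is not a substitute for a complex-samples procedure. The health scores are made up.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    na, nb = len(group_a), len(group_b)
    va = statistics.variance(group_a) / na  # squared standard error, group A
    vb = statistics.variance(group_b) / nb  # squared standard error, group B
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical health scores (higher = better perceived health)
widows = [1, 2, 3, 4, 5]
non_widows = [2, 3, 4, 5, 6]
t, df = welch_t(widows, non_widows)
```

A negative `t` here means the first group has the lower mean. Turning it into a p-value requires the t distribution, which is left to a statistics package.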
If you are trying to develop a general model of what predicts perceived health, then there is a wide range of modelling options available. Multiple regression would be an obvious starting point. If you have many potential predictors, then you have a lot of choices regarding whether you are going to test particular models or take a more data-driven model building approach.
More generally, it sounds like you need to clarify the aims of your analyses and the particular hypotheses that you want to test.