Clustering a data set: which algorithm is the best choice? - cluster-analysis

I want to cluster a data set, but I do not know how many groups it contains. Which clustering algorithm is better? Can someone give me some suggestions? Thank you very much.

There is no free lunch, and no "best" clustering algorithm.
Cluster analysis is an exploratory technique. There are multiple correct answers, and it is up to you to decide which is most useful to you.

Related

What is the best practice for pre-processing before a clustering algorithm?

My data contain several features at the user level,
and I want to cluster the users into several groups based on these features.
The data is skewed, with extreme outliers in some of the features.
My question is: what is the best practice for pre-processing before the clustering algorithm?
The best practice for clustering is to first figure out how to measure distance reliably. Then many clustering methods can be tried.
But until you can quantify dissimilarity, the data cannot be used with most clustering methods.
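One possible way to get a usable distance out of skewed data with outliers, as a minimal sketch rather than a prescription: compress heavy right tails with a log transform, scale each feature by its median and interquartile range so that outliers do not set the scale, and only then compute distances. The feature names and values below are invented for illustration, and the log step assumes non-negative features.

```python
import math
import statistics

def robust_scale(column):
    """Log-compress a non-negative feature, then center on the median and divide by the IQR."""
    logged = [math.log1p(v) for v in column]
    q1, median, q3 = statistics.quantiles(logged, n=4)
    iqr = (q3 - q1) or 1.0  # fall back to 1.0 if the IQR is exactly zero
    return [(v - median) / iqr for v in logged]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical user-level features: purchases and minutes on site.
purchases = [1, 2, 2, 3, 4, 5, 250]       # one extreme outlier
minutes = [10, 12, 15, 11, 14, 13, 9000]  # one extreme outlier

scaled_columns = [robust_scale(purchases), robust_scale(minutes)]
users = list(zip(*scaled_columns))  # one scaled feature vector per user

print(euclidean(users[0], users[1]))   # two ordinary users
print(euclidean(users[0], users[-1]))  # ordinary user vs. the outlier
```

Whether a log transform and Euclidean distance are appropriate depends on what "similar users" means for the problem at hand; the point is only that the distance is decided first and the clustering method second.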

Data is not well clustered with any clustering approach

When I cluster my data (with any clustering approach) and compute quality metrics (I tried several: silhouette, Dunn, etc.), I get very poor scores.
What I'm interested in is whether my data is clusterable or not. Are there any methods to assess that? Or a method that tells me whether the data contain any useful information?
Thanks,
Hamid
Maybe it just doesn't have clusters?
Or the clusters do not fit the model evaluated by Silhouette, Dunn etc. - these metrics can be quite misleading, in particular when you have noise in your data set, too. Don't blindly trust such metrics.
The best way of seeing if your data can be clustered is visualization. If you can't visualize it in a way you see clusters, how can you expect an algorithm to return meaningful clusters?
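To make the caveat about metrics concrete, here is a minimal from-scratch sketch of the mean silhouette coefficient. It only compares within-cluster distances to distances to the nearest other cluster, so it rewards compact, well-separated groups and says little about anything else. The toy points and labels are invented for illustration.

```python
import math

def mean_silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Two tight, well-separated groups.
separated = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
# Evenly spaced points with no group structure, arbitrarily cut in half.
uniform = [(x, 0) for x in range(8)]

print(mean_silhouette(separated, labels))
print(mean_silhouette(uniform, labels))
```

The evenly spaced points have no cluster structure at all, yet cutting them in half still yields a score clearly above zero. That is one reason to look at the data rather than at a single number.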

Benchmark data set for protein-protein interaction networks

I want to apply my clustering algorithm specifically to protein-protein interaction (PPI) networks. For that I need a benchmark data set with a reference, so that I can validate my results. Any suggestions would be greatly appreciated.
I advise you to read "Evaluation of clustering algorithms for protein-protein interaction networks" by Brohée and van Helden (http://www.ncbi.nlm.nih.gov/pubmed/17087821). I think their benchmark data is available, and they do a very good job of avoiding many of the pitfalls that exist when comparing clustering algorithms and/or gold standards.

Text classification, preprocessing included

Which is the best method for document classification if time is not a factor, and we don't know how many classes there are?
In my (incomplete) knowledge, Hierarchical Agglomerative Clustering is the best approach if you don't know how many classes there are. All of the other clustering algorithms either require prior knowledge of the number of buckets or some sort of cross-validation or other experimentation to determine a sensible number of buckets.
A cross link: see how-do-i-determine-k-when-using-k-means-clustering on SO.
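As a rough illustration of how agglomerative clustering can run without a preset number of classes, the sketch below merges the closest pair of clusters (average linkage) until the closest pair is farther apart than a cut-off. The 2-D points stand in for document vectors and the threshold value is an arbitrary assumption; real text would first need a vector representation and a suitable distance.

```python
import math

def agglomerative(points, threshold):
    """Average-linkage agglomerative clustering that stops merging once the
    closest pair of clusters is farther apart than `threshold`."""
    clusters = [[p] for p in points]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(math.dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters (naive search over all pairs).
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters

# Hypothetical 2-D stand-ins for document vectors.
docs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0), (20, 1)]
groups = agglomerative(docs, threshold=3.0)
print(len(groups), [len(g) for g in groups])
```

Note that the cut-off is itself a parameter, so the choice of granularity is deferred rather than removed. The naive pair search is also slow on large collections, which matters less here since time is stated not to be a factor.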

How do I decide which Neural Network and learning method to use in a particular case?

I am new to neural networks and I need to determine the pattern among a given set of inputs and outputs. So how do I decide which neural network to use for training, or even which learning method to use? I have little idea about the pattern or relation between the given inputs and outputs.
Any sort of help will be appreciated. If you want me to read some material, it would be great if links are provided.
If any more info is needed, please say so.
Thanks.
Choosing the right neural network is something of an art form. It's a bit difficult to give generic suggestions, as the best NN for a situation will depend on the problem at hand. As with many of these problems, neural networks may or may not be the best solution. I'd highly recommend trying out different networks and testing their performance against a test data set. When I did this I usually used the ANN tools through the R software package.
Also keep your mind open to other statistical learning techniques; things like decision trees and Support Vector Machines may be a better choice for some problems.
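The "try several models and compare them on a test set" advice can be sketched as below. To stay dependency-free, the sketch uses two very simple stand-in classifiers (1-nearest-neighbour and nearest centroid) instead of neural networks, and synthetic data invented for illustration; the same loop would apply to any candidate that can be fitted on the training split and asked for predictions.

```python
import math
import random

def one_nearest_neighbor(train):
    """Return a predictor that copies the label of the closest training point."""
    def predict(x):
        return min(train, key=lambda row: math.dist(row[0], x))[1]
    return predict

def nearest_centroid(train):
    """Return a predictor that picks the class whose mean point is closest."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    centroids = {
        y: tuple(sum(col) / len(xs) for col in zip(*xs))
        for y, xs in by_label.items()
    }
    def predict(x):
        return min(centroids, key=lambda y: math.dist(centroids[y], x))
    return predict

def accuracy(predict, test):
    return sum(predict(x) == y for x, y in test) / len(test)

# Synthetic two-class data, invented for illustration.
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(100)]
data += [((rng.gauss(4, 1), rng.gauss(4, 1)), 1) for _ in range(100)]
rng.shuffle(data)
train, test = data[:150], data[150:]  # hold out 50 points for testing

candidates = {"1-NN": one_nearest_neighbor, "nearest centroid": nearest_centroid}
results = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
for name, score in results.items():
    print(f"{name}: {score:.2f}")
```

The important part is that every candidate is scored on data it never saw during fitting; which candidates go into the dictionary is up to the problem.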
I'd suggest the following books:
http://www.amazon.com/Neural-Networks-Pattern-Recognition-Christopher/dp/0198538642
http://www.stats.ox.ac.uk/~ripley/PRbook/#Contents