How to decide whether to split a cluster or not? - cluster-analysis

I am given a cluster. How can I decide whether splitting the cluster into two parts is better than keeping the original cluster?
I have tried k-means with k = 2 and am stuck again. Is it better to split or not to split?
Edit: Well, I don't get the downvotes... A little explanation would be helpful to improve the question :D

The literature proposes different metrics, e.g.,
Bayesian Information Criterion (BIC)
Akaike Information Criterion (AIC)
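
A minimal sketch of how BIC could drive the split decision. It assumes each cluster is modeled as a spherical Gaussian with one shared variance and hard assignments, which is only one possible model; other models give other numbers. The helper name bic_spherical and the toy points are made up for illustration. Lower BIC is better, so the split is preferred only if it lowers the score.

```python
import math

def bic_spherical(clusters):
    """BIC of a hard-assignment spherical Gaussian model (lower is better).

    clusters: list of clusters, each a list of equal-length numeric tuples.
    Assumption: one variance shared by all clusters and all dimensions.
    Fails if all points coincide with their cluster mean (variance 0).
    """
    n = sum(len(c) for c in clusters)
    d = len(clusters[0][0])
    k = len(clusters)
    sse = 0.0
    for c in clusters:
        mean = [sum(p[j] for p in c) / len(c) for j in range(d)]
        sse += sum((p[j] - mean[j]) ** 2 for p in c for j in range(d))
    var = sse / (n * d)  # maximum-likelihood estimate of the shared variance
    log_lik = (
        sum(len(c) * math.log(len(c) / n) for c in clusters)  # mixing weights
        - 0.5 * n * d * math.log(2 * math.pi * var)
        - 0.5 * n * d
    )
    n_params = k * d + (k - 1) + 1  # means + mixing weights + variance
    return n_params * math.log(n) - 2 * log_lik

# Toy data: two tight groups, e.g. as returned by k-means with k = 2.
left = [(0, 0), (0, 1), (1, 0), (1, 1)]
right = [(10, 10), (10, 11), (11, 10), (11, 11)]

bic_one = bic_spherical([left + right])  # keep the original cluster
bic_two = bic_spherical([left, right])   # split into two parts
should_split = bic_two < bic_one
```

AIC works the same way with the penalty 2 * n_params instead of n_params * log(n).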

Related

How to find the most representative/distinguishing features in each cluster after doing k-means clustering?

I tried to use k-means with a high-dimensional dataset (CDR data).
After clustering, I would like to represent each cluster by its most informative features, which show the unique/representative characteristics of the customers in that cluster.
For example,
Cluster 1: [High: call_duration], [Low: number_of_friends], [High: call_at_night]
Cluster 2: [Low: call_duration], [High: use_promotion]
Cluster 3: [High: internet_usage]
I would like to know:
Question 1: How can I find the informative features that represent each cluster?
Question 2: If there are many informative features, how do I measure which one is more representative?
Another problem is how to measure whether a value is high or low.
My current solution is to apply z-normalization to every feature of every cluster centroid, and then I assume that
<-2σ or >2σ is outlier
(-2σ to -1σ) or (1σ to 2σ) is low/high
-1σ to 1σ is medium
Question 3: Does this measurement make sense? Please give me your suggestions.
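Train a decision tree to discriminate the clusters, using the cluster assignments as class labels. The features the tree splits on are the ones that distinguish the clusters.
Or use any other feature selection method for classification, because this is now a classification problem.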
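
A rough sketch of the idea, using a one-level decision tree (a decision stump) per feature as a simplified stand-in for a full tree: each feature is scored by how much a single threshold on it reduces the Gini impurity of "in this cluster" versus "not in this cluster". The function names and the toy rows are invented for illustration. In practice you would more likely train a full decision tree from a library on all clusters at once and inspect its splits.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_stump_gain(values, is_member):
    """Largest impurity decrease achievable with one threshold on one feature."""
    parent = gini(is_member)
    n = len(values)
    best = 0.0
    for t in sorted(set(values))[:-1]:  # candidate thresholds
        left = [m for v, m in zip(values, is_member) if v <= t]
        right = [m for v, m in zip(values, is_member) if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best

def rank_features(rows, cluster_ids, target, names):
    """Rank features by how well a single split isolates the target cluster."""
    is_member = [1 if c == target else 0 for c in cluster_ids]
    gains = {
        name: best_stump_gain([r[j] for r in rows], is_member)
        for j, name in enumerate(names)
    }
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (hypothetical): columns are call_duration and internet_usage.
names = ["call_duration", "internet_usage"]
rows = [(30, 1), (28, 5), (5, 2), (4, 4), (6, 3)]
cluster_ids = [1, 1, 2, 2, 2]

ranking = rank_features(rows, cluster_ids, target=1, names=names)
```

The direction of the best split (members above or below the threshold) also indicates whether the feature is "high" or "low" for that cluster, relative to the other clusters.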

Fuzzy clustering without a specified number of clusters

I'm looking for a fuzzy clustering algorithm that does not need a specified number of clusters. I used hierarchical clustering, but it gives "hard" clusters. I need something similar, but with the possibility that one element can be in more than one cluster.
I searched Google about this a few days ago and found HFRECCA. I know about FRECCA, but I don't know the details of HFRECCA. You can get the paper here: http://www.researchgate.net/publication/261081980_Text_Clustering_Using_HFRECCA_and_Rough_K-Means_Clustering_Algorithm

How to cluster categorical variables?

What's the most appropriate family of Machine Learning algorithms for clustering categorical data? Let's assume that we have the following dataset:
V1 V2 V3 V4
"v1a" "v2b" "v3b" "v4c"
"v1b" "v2f" "v3a" "v4c"
"v1a" "v2e" "v3b" "v4c"
Is there any way to cluster them somehow? I am particularly interested in doing so through Apache Mahout. Any hint or idea is highly appreciated.
The question that you need to answer first is:
What is a cluster?
Obviously, many of the existing cluster definitions (e.g., points connected by steps of Euclidean distance less than epsilon) will not be useful for such data.
There are tricks to vectorize such data so that you can still run k-means on it.
But more often than not, the results will be useless, because people did not consider what they are doing first.
So first try to find out what you want to do, then look for tools to do that.
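
One common vectorization trick of the kind mentioned above is one-hot encoding: every (variable, value) pair becomes its own 0/1 column. The sketch below applies it to the three example rows from the question; the function name one_hot is just for illustration. Whether Euclidean distance on such vectors is a meaningful notion of "cluster" for your data is exactly the question raised above.

```python
def one_hot(rows):
    """Encode categorical rows as 0/1 vectors, one column per (position, value)."""
    columns = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    vectors = [[1 if row[j] == v else 0 for j, v in columns] for row in rows]
    return columns, vectors

rows = [
    ("v1a", "v2b", "v3b", "v4c"),
    ("v1b", "v2f", "v3a", "v4c"),
    ("v1a", "v2e", "v3b", "v4c"),
]
columns, vectors = one_hot(rows)
# columns: (0,'v1a'), (0,'v1b'), (1,'v2b'), (1,'v2e'), (1,'v2f'),
#          (2,'v3a'), (2,'v3b'), (3,'v4c')
# vectors[0] == [1, 0, 1, 0, 0, 0, 1, 1]
```

Note that V4 takes the same value in every row, so its column is constant and contributes nothing to any distance.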

Mixed variables (categorical and numerical) distance function

I want to fuzzy cluster a set of jobs.
The job attributes are:
Categorical: position, diploma, skills
Numerical: salary, years of experience
My question is: how do I calculate the distance between different jobs?
e.g. job1(programmer, bs computer science, (java, .net, responsibility), 1500, 3)
and job2(tester, bs computer science, (black and white box testing), 1200, 1)
PS: I'm a beginner in data mining and clustering; I would highly appreciate your help.
You may take this as your starting point:
http://www.econ.upf.edu/~michael/stanford/maeb4.pdf. Distance between categorical data is nicely explained at the end.
Here is a good walk-through of several different clustering methods and how to use them in R: http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/clustering.pdf
In general, clustering for discrete data is related to either the use of counts (e.g. overlaps in vectors) or related to some statistic derived from counts. As much as I'd like to address the statistical side, I suppose you're interested in the algorithm, so I'll leave it at that.
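
As an illustration of the count-based idea applied to the jobs in the question, here is a Gower-style sketch (not taken from the linked documents): categorical attributes contribute 0 or 1 for match or mismatch, the skill sets are compared by overlap (Jaccard distance), numeric attributes contribute their absolute difference divided by the attribute's range, and the parts are averaged. The ranges passed in (2000 for salary, 10 for years of experience) are invented; in practice they would come from the data.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def job_distance(j1, j2, salary_range, experience_range):
    """Average of per-attribute distances, each scaled to [0, 1]."""
    parts = [
        0.0 if j1["position"] == j2["position"] else 1.0,
        0.0 if j1["diploma"] == j2["diploma"] else 1.0,
        jaccard_distance(j1["skills"], j2["skills"]),
        abs(j1["salary"] - j2["salary"]) / salary_range,
        abs(j1["experience"] - j2["experience"]) / experience_range,
    ]
    return sum(parts) / len(parts)

job1 = {"position": "programmer", "diploma": "bs computer science",
        "skills": {"java", ".net", "responsibility"},
        "salary": 1500, "experience": 3}
job2 = {"position": "tester", "diploma": "bs computer science",
        "skills": {"black and white box testing"},
        "salary": 1200, "experience": 1}

# Assumed ranges (hypothetical): salary spans 2000, experience spans 10 years.
d = job_distance(job1, job2, salary_range=2000, experience_range=10)
```

All attributes are weighted equally here; weighting them differently is a modeling decision that depends on what "similar jobs" should mean.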

Cluster analysis with nominal, ordinal and metric data

I have a data set with nominal, ordinal and metric variables.
I want to perform a cluster analysis.
Since I have mixed scales, it seems that using k-modes clustering is the most appropriate way to explore the data.
Or does anyone have a better way in mind? I am thankful for any advice!
It's not enough to just make the program run.
It needs to answer the right question. K-means, k-medians, k-medoids, k-modes: each optimizes a different function. Math won't tell you which function is the best for you. That is the question you need to answer: which function solves your problem?
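
To make "each optimizes a different function" concrete, here is a small one-dimensional sketch with made-up values: the mean minimizes the sum of squared deviations (the k-means objective), the median minimizes the sum of absolute deviations (k-medians), and the mode minimizes the number of mismatches (k-modes). On the same data they pick three different "centers".

```python
from statistics import mean, median, mode

values = [1, 1, 2, 4, 12]  # toy data, a single cluster in one dimension

def squared_cost(c):
    """Objective behind k-means: sum of squared deviations from c."""
    return sum((v - c) ** 2 for v in values)

def absolute_cost(c):
    """Objective behind k-medians: sum of absolute deviations from c."""
    return sum(abs(v - c) for v in values)

def mismatch_cost(c):
    """Objective behind k-modes: number of values that differ from c."""
    return sum(v != c for v in values)

centers = {"mean": mean(values), "median": median(values), "mode": mode(values)}
for name, c in centers.items():
    print(name, c, squared_cost(c), absolute_cost(c), mismatch_cost(c))
```

Each center wins only under its own cost function, which is why the choice of algorithm is a choice about what you want to measure.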