How to cluster categorical variables? - cluster-analysis

What's the most appropriate family of Machine Learning algorithms for clustering categorical data? Let's assume that we have the following dataset:
V1 V2 V3 V4
"v1a" "v2b" "v3b" "v4c"
"v1b" "v2f" "v3a" "v4c"
"v1a" "v2e" "v3b" "v4c"
Is there any way to cluster them? I am particularly interested in doing so with Apache Mahout. Any hint or idea is highly appreciated.

The question that you need to answer first is:
What is a cluster?
Obviously, many of the existing cluster definitions (e.g. points connected by steps of Euclidean distance less than epsilon) will not be useful for categorical data.
There are tricks to vectorize such data so that you can still run k-means on it.
But more often than not, the results will be useless, because people did not first consider what they were doing.
So first find out what you want to do, then look for tools that do it.
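As an illustration of the vectorization trick mentioned above, here is a minimal sketch of one-hot encoding applied to the three rows from the question. The helper names (one_hot, mismatches, squared_euclidean) are made up for this example and are not part of Mahout or any other library. It also shows why the result needs care: on one-hot vectors, squared Euclidean distance is just twice the number of mismatching attributes, so k-means on such data implicitly optimizes a mismatch count around centroids that are not valid categories.

```python
rows = [
    ["v1a", "v2b", "v3b", "v4c"],
    ["v1b", "v2f", "v3a", "v4c"],
    ["v1a", "v2e", "v3b", "v4c"],
]

def one_hot(rows):
    """Encode each row as a 0/1 vector with one column per (attribute, value) pair."""
    columns = sorted({(j, value) for row in rows for j, value in enumerate(row)})
    vectors = [[1 if row[j] == value else 0 for j, value in columns] for row in rows]
    return columns, vectors

def mismatches(a, b):
    """Simple matching dissimilarity: number of attributes on which two rows differ."""
    return sum(x != y for x, y in zip(a, b))

def squared_euclidean(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

columns, vectors = one_hot(rows)
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        print(i, j, mismatches(rows[i], rows[j]), squared_euclidean(vectors[i], vectors[j]))
```

Whether "few mismatching attributes" is a meaningful notion of similarity for your data is exactly the question you have to answer before running any algorithm on these vectors.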

Related

How to decide whether to split a cluster or not?

I am given a cluster. How can I decide whether splitting the cluster into two parts is better than keeping the original cluster?
I have tried using k-means with k = 2 and am stuck again. Is it better to split or not to split?
Edit: Well, I don't get the downvotes... A little explanation would be helpful to improve the question :D
The literature proposes different metrics, e.g.
Bayesian Information Criterion (BIC)
Akaike Information Criterion (AIC)
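A rough sketch of how such a criterion can drive the split decision, using the Bayesian Information Criterion on 1-D data. Everything here is an assumption made for illustration: the data is one-dimensional, each cluster is modelled as a Gaussian with its own mean and variance, the split comes from a naive 2-means, and the two-cluster likelihood uses hard assignments. The function names are hypothetical and the formula is BIC = p * ln(n) - 2 * ln(L), where lower is better.

```python
import math

def fit_gaussian(xs):
    """Maximum-likelihood mean and variance (variance floored to avoid log(0))."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, max(var, 1e-9)

def log_likelihood(xs, mean, var):
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var) for x in xs)

def bic(loglik, n_params, n):
    return n_params * math.log(n) - 2 * loglik

def two_means(xs, iterations=20):
    """Naive 1-D k-means with k = 2, initialised at the minimum and maximum."""
    c0, c1 = min(xs), max(xs)
    left, right = list(xs), []
    for _ in range(iterations):
        left = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        right = [x for x in xs if abs(x - c0) > abs(x - c1)]
        if not left or not right:
            break
        c0, c1 = sum(left) / len(left), sum(right) / len(right)
    return left, right

def should_split(xs):
    n = len(xs)
    # One cluster: mean + variance = 2 free parameters.
    bic_one = bic(log_likelihood(xs, *fit_gaussian(xs)), 2, n)
    left, right = two_means(xs)
    if not left or not right:
        return False
    # Two clusters: 2 means + 2 variances + 1 mixing weight = 5 free parameters.
    loglik_two = 0.0
    for part in (left, right):
        loglik_two += len(part) * math.log(len(part) / n)
        loglik_two += log_likelihood(part, *fit_gaussian(part))
    bic_two = bic(loglik_two, 5, n)
    return bic_two < bic_one

print(should_split([0.0, 0.1, -0.1, 0.2, -0.2, 10.0, 10.1, 9.9, 10.2, 9.8]))
print(should_split([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]))
```

The extra parameters of the two-cluster model are penalized, so the split is only accepted when the gain in likelihood outweighs that penalty. AIC works the same way with a penalty of 2p instead of p * ln(n).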

Clustering techniques for Binary Data

I want to use clustering techniques for binary data analysis. I collected the data through a survey in which I asked users to select exactly 20 features out of a list of 94 product features. The columns in my data represent the 94 product features and the rows represent the participants. I am trying to cluster similar users into different user groups based on the product features they selected. Each user cluster should also tell me which product features are associated with it. I am using some open source clustering tools like NCSS and JMP. I was trying to use a fuzzy clustering technique to achieve my goal, but unfortunately these tools do not deal with binary data. Can you please suggest which technique would be appropriate for my task, and which online tool I can use to run the cluster analysis on my data? Because of time limitations, I am not looking to write code myself; I am only looking for open source tools that have all the functionality built in and that I can use as they are.
Clustering for binary data is not really well defined.
Rather than looking for some tool or function that may or may not work by trial and error, you should first try to answer a "simple" question:
What is a good cluster, mathematically?
Vague terms are not allowed. The next questions to answer are then: (i) when is clustering A better than clustering B (i.e. how does the computer compute quality), and (ii) how can the best clustering be found efficiently?
You won't get far by just calling random functions if you don't understand what you are doing...
Also, is clustering actually what you are looking for? With binary data, frequent itemset mining, for example, is often the better choice.
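To make the frequent itemset mining suggestion concrete, here is a minimal level-wise (Apriori-style) sketch in plain Python. Each participant is treated as the set of features they selected. The feature names "f1" to "f4" and the support threshold are invented for the example, and a real survey with 94 features would normally be handled by a dedicated tool rather than this toy loop.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset selected by at least min_support rows."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Build size-k candidates by joining frequent (k-1)-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune candidates that contain an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result

# Hypothetical survey answers: each set holds the features one participant picked.
answers = [
    {"f1", "f2", "f3"},
    {"f1", "f2"},
    {"f1", "f3"},
    {"f2", "f4"},
]
for itemset, count in sorted(frequent_itemsets(answers, 2).items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

Instead of forcing every participant into one group, this reports which feature combinations are chosen together often, which may be closer to what the question is really after.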

Fuzzy clustering without number of clusters

I'm looking for a fuzzy clustering algorithm which does not need a specified number of clusters. I used hierarchical clustering, but it produces "hard" clusters. I need something similar, but with the possibility that one element can be in more than one cluster.
I searched Google for this some days ago and found HFRECCA. I know about FRECCA, but I don't know the details of HFRECCA. You can read about it here: http://www.researchgate.net/publication/261081980_Text_Clustering_Using_HFRECCA_and_Rough_K-Means_Clustering_Algorithm

framework for distributed algorithm

I have to do a project where I have a dynamic graph and each node executes my algorithm to calculate the PageRank.
My question is: is there a framework that allows me to run an algorithm at the same time on each node (the algorithm is not centralized)?
Yes, Giraph is probably the most common example of this and can do exactly what you are looking for. However, it isn't trivial to set up; there is a question from yesterday on SO about materials for Giraph: https://stackoverflow.com/questions/22817423/material-related-to-giraph/
Other examples would be GraphX (http://amplab.github.io/graphx/) from Spark and GraphLab (http://graphlab.org/projects/index.html), but I don't have any experience with those. All of these frameworks let you write code for a single node and execute it for each node in a graph. They also allow you to distribute the algorithm across multiple servers for large graphs, but that isn't necessary if your graph is small enough.
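The "write code for a node, run it for every node" model can be sketched in plain Python as a single-process simulation. This is not the Giraph, GraphX or GraphLab API; the function names are hypothetical, and the sketch assumes a static graph in which every vertex has at least one outgoing edge and every edge target is itself a key in the adjacency dictionary.

```python
def compute(messages, n_vertices, damping):
    """Per-vertex logic: new rank from the rank shares received in this superstep."""
    return (1.0 - damping) / n_vertices + damping * sum(messages)

def pagerank_vertex_centric(out_edges, damping=0.85, supersteps=30):
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(supersteps):
        # Message passing: each vertex sends an equal share of its rank along its out-edges.
        inbox = {v: [] for v in vertices}
        for v in vertices:
            share = rank[v] / len(out_edges[v])
            for target in out_edges[v]:
                inbox[target].append(share)
        # Every vertex runs the same compute function on its own inbox.
        rank = {v: compute(inbox[v], n, damping) for v in vertices}
    return rank

# Hypothetical 3-vertex graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank_vertex_centric(graph))
```

In a real framework the loop over vertices is what gets distributed: each worker holds a partition of the vertices, runs the per-vertex function locally, and exchanges the messages over the network between supersteps.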

Cluster analysis with nominal, ordinal and metric data

I have a data set with nominal, ordinal and metric variables.
I want to perform a cluster analysis.
Since I have mixed scales, it seems that k-modes clustering is the most appropriate way to explore the data.
Or does anyone have a better way in mind? I am thankful for any advice!
It's not enough to just make the program run.
It needs to answer the right question. K-means, k-medians, k-medoids, k-modes: each optimizes a different function. Math won't tell you which function is the best for you. That is the question you need to answer: which function solves your problem?
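A small sketch of what "each optimizes a different function" means for a single cluster of 1-D values. The data is made up, and the three cost functions are the usual textbook ones: sum of squared deviations for the mean, sum of absolute deviations for the median, and number of mismatches for the mode.

```python
from statistics import mean, median, mode

def sum_squared(values, center):
    return sum((v - center) ** 2 for v in values)

def sum_absolute(values, center):
    return sum(abs(v - center) for v in values)

def sum_mismatch(values, center):
    return sum(v != center for v in values)

values = [1, 1, 2, 3, 10]  # hypothetical cluster with one outlier
centers = {"mean": mean(values), "median": median(values), "mode": mode(values)}

for name, center in centers.items():
    print(
        name, center,
        sum_squared(values, center),
        sum_absolute(values, center),
        sum_mismatch(values, center),
    )
```

Each center wins under its own cost and loses under the other two, so the choice of algorithm is really a choice of which cost describes a good cluster for your data. Note that the mean and the median are only meaningful for ordered or metric variables, while the mode is the only one of the three that makes sense for nominal ones.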