I have cloud tags A, B, C. Each cloud tag consists of entities (words) e, f, g, ...
I want to find good words that separate the cloud tags into (mostly) independent clusters. For example: word e occurs in cloud tags A and B but not in C, so e is a good separator for getting 2 clusters.
Now there are about 100,000 cloud tags and 1,000,000 words, and I want to do the same to get about K clusters. A cloud tag may belong to two clusters; that is not that important.
I know k-means, but I don't know how to transform the data into numerical multidimensional points. As far as I know, k-means needs numerical points to create clusters.
I would also like to use RapidMiner as the software, but any algorithm or software would be quite useful as a starting point.
Thanks in advance.
What you describe is not clustering, but feature (word) selection for "cloud tag" classification.
Have a look at decision trees, and the metrics used there to identify good features for splitting.
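As an illustration, here is a minimal sketch (toy data, hypothetical names) that scores each candidate word by the binary entropy of the split it induces on the tag collection, in the spirit of decision-tree split metrics: a score near 1.0 means the word splits the tags roughly in half.

```python
import math

def split_entropy(tags, word):
    """Binary entropy of the partition induced by `word`:
    1.0 for a perfect 50/50 split, 0.0 if the word splits nothing off."""
    n = len(tags)
    k = sum(1 for words in tags.values() if word in words)
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# toy cloud tags: each tag is a set of words
tags = {
    "A": {"e", "f"},
    "B": {"e", "g"},
    "C": {"f", "g", "h"},
    "D": {"e", "h"},
}

# rank candidate words: f, g, and h each split the four tags 2/2 (score 1.0),
# while e splits them 3/1 (score ~0.81)
vocab = set().union(*tags.values())
for word in sorted(vocab, key=lambda w: -split_entropy(tags, w)):
    print(word, round(split_entropy(tags, word), 3))
```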
I'm working on gene expression data clustering techniques, and I have downloaded 35 datasets from the web. Each dataset represents a type of cancer and has its own features. Some of the datasets share several features; others share none.
My question is: how do we ultimately cluster these data when many of the datasets do not share the same features?
I think we should run the clustering on all 35 datasets at the same time. Is my idea correct?
Any help is appreciated.
I assume that by heterogeneous you mean things like different gene expression platforms, where different genes are present.
You can use any clustering technique, but you'll need to write your own distance metric that takes the heterogeneity of your dataset into account. For instance, you could use the correlation over the genes that are in common between each pair of samples, build a distance matrix from this, and then run something like hierarchical clustering on that distance matrix.
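A minimal sketch of that idea (toy profiles with hypothetical gene names; scipy assumed available): correlate each pair of samples over their shared genes, fall back to a maximal distance when the overlap is too small to correlate, and feed the resulting matrix to hierarchical clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# hypothetical input: one expression profile per sample, keyed by gene name
samples = {
    "s1": {"TP53": 2.1, "BRCA1": 0.4, "EGFR": 1.7},
    "s2": {"TP53": 1.9, "BRCA1": 0.5, "MYC": 3.0},
    "s3": {"EGFR": 0.2, "MYC": 2.8, "KRAS": 1.1},
}
names = list(samples)

def corr_distance(a, b):
    """1 - Pearson correlation over the genes two profiles share."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:      # not enough overlap to correlate
        return 1.0           # crude fallback: treat as distant
    x = np.array([a[g] for g in common])
    y = np.array([b[g] for g in common])
    return 1.0 - np.corrcoef(x, y)[0, 1]

# build the full symmetric pairwise distance matrix
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = corr_distance(samples[names[i]],
                                                samples[names[j]])

# hierarchical clustering on the precomputed distance matrix
Z = linkage(squareform(dist), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))
```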
I think there is no need to write your own distance metric: plenty of distance metrics already exist that can handle mixed data types. For instance, the Gower distance works well for mixed data types (see this post on the same). But if your data contains only continuous values, you can use k-means. You'll also be better off if the data is preprocessed first.
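As a sketch of the Gower route, assuming the third-party `gower` package (`pip install gower`) and a toy mixed-type table:

```python
import pandas as pd
import gower  # third-party package, assumed installed: pip install gower

# toy mixed-type data: one numeric and one categorical column
df = pd.DataFrame({
    "expression": [2.1, 1.9, 0.2],
    "platform":   ["affy", "affy", "rnaseq"],
})

# pairwise Gower distances handle numeric and categorical columns together
D = gower.gower_matrix(df)
print(D)  # 3x3 symmetric matrix with values in [0, 1]
```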
Context: We are two students intending to write a thesis on reverse engineering namespaces using hierarchical agglomerative clustering algorithms. We have a variety of linkage methods and other tweaks to the algorithm that we want to try out. We will run the algorithm on popular GitHub repositories and compare the resulting clusters with the originally existing namespaces. Our work will closely follow this paper. In the paper, the authors mention using the "precision recall metric" to measure the accuracy of the clustering algorithm. However, looking more closely at the metric and its origin, it seems to be dedicated to flat (non-hierarchical) clusterings.
Question:
Is there a way to use the precision recall metric to measure the accuracy of a hierarchy of recovered clusters? If not, what other options exist?
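For concreteness, here is a minimal sketch (toy data, hypothetical names) of the flat pairwise formulation of precision and recall for clusterings: two items count as a positive if they share a cluster, and the pairs from the recovered clustering are compared against the pairs induced by the original namespaces.

```python
from itertools import combinations

def pair_set(assignment):
    """All unordered pairs of items that share a cluster."""
    clusters = {}
    for item, c in assignment.items():
        clusters.setdefault(c, []).append(item)
    pairs = set()
    for members in clusters.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

truth     = {"a": "ns1", "b": "ns1", "c": "ns2", "d": "ns2"}  # original namespaces
recovered = {"a": 0, "b": 0, "c": 0, "d": 1}                  # flat clustering

tp = pair_set(truth) & pair_set(recovered)
precision = len(tp) / len(pair_set(recovered))  # 1/3 here
recall    = len(tp) / len(pair_set(truth))      # 1/2 here
print(precision, recall)
```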
I want to cluster some text documents to find documents with the same concept. I've computed semantic similarity using Latent Semantic Analysis (LSA), but I'm not sure which clustering method to choose for my purpose.
Thank you
You can use hierarchical clustering. There is a package in R called RClusterpp which is very efficient for hierarchical clustering of large data (it computes in parallel). You can then cut the dendrogram at different numbers of clusters within the plausible range and check the cluster profiles using a cross-tab.
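RClusterpp is an R package; as a rough illustration of the same workflow (cluster the LSA vectors, cut the tree at several candidate k, inspect cluster sizes), here is a sketch using scipy with random stand-in vectors:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# stand-in LSA document vectors (documents x latent topics)
lsa_vectors = np.random.rand(100, 20)

# Ward linkage on the LSA space; with LSA, cosine is also a common choice,
# in which case use method="average" with metric="cosine" instead
Z = linkage(lsa_vectors, method="ward")

# cut the dendrogram at several candidate cluster counts and inspect sizes
for k in (3, 5, 8):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, np.bincount(labels)[1:])  # size of each cluster
```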
I'm using WEKA for my thesis and have over 1000 rows of data. The data includes demographic information (age, location, status, etc.) followed by product names (valued 1 or 0). The end result is a recommender system.
I used two clustering methods, k-means and DBSCAN.
With k-means I tried 3 different numbers of clusters, while with DBSCAN I chose 3 different epsilons (epsilon = 3 gave 48 clusters with 17% of the data ignored; epsilon = 2.5 gave 19 clusters, with cluster 0 holding 229 items and 6% ignored). That means I have 6 different clustering results for the same data.
How do I choose which one best suits my data?
What is "best"?
As some smart people have noticed:
the validity of a clustering is often in the eye of the beholder
There is no objective "better" in clustering; if there were, you would not be doing cluster analysis.
Even when a result actually is "better" on some mathematical measure such as separation or silhouette, or even under a supervised evaluation using labels, it is still only better at optimizing towards some mathematical goal, not towards your use case.
K-means finds a locally optimal sum-of-squares assignment for a given k. (And if you increase k, a better assignment exists!) DBSCAN (which, by the way, is correctly spelled all uppercase) always finds the optimal density-connected components for the given MinPts/epsilon combination. Yet both just optimize with respect to some mathematical criterion. Unless this criterion aligns with your requirements, it is worthless. So there is no best until you know what you need. But if you knew what you need, you would not need to do cluster analysis.
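To illustrate the point with a small sketch (deliberately structureless random data; scikit-learn assumed available): an internal measure such as the silhouette will still crown a "best" k even when no clustering is meaningful, because it only scores its own mathematical criterion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # pure noise: no real cluster structure

# the silhouette still produces a ranking and a "winner" among these k
for k in (2, 3, 5, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```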
So what to do?
Try different algorithms and different parameters, and analyze the output with your domain knowledge to see whether it helps with the problem you are trying to solve. If it helps you solve your problem, it is good. If it does not help, try again.
Over time, you will collect some experience. For example, if the sum-of-squares is meaningless for your domain, don't use k-means. If your data does not have a meaningful density, don't use density-based clustering such as DBSCAN. It's not that these algorithms fail; they just don't solve your problem. They solve a different problem that you are not interested in, and they might be really good at solving that other problem.
I'm having an issue with the OPTICS implementation in the ELKI environment. I have used the same data with the DBSCAN implementation and it worked like a charm. I'm probably missing something with the parameters, but I can't figure it out; everything seems to be right.
The data is a simple 300×2 matrix consisting of 3 clusters with 100 points each.
DBSCAN result (MinPts = 10, Eps = 1): [image: clustering result of DBSCAN]
OPTICS result (MinPts = 10): [image: clustering result of OPTICS]
You apparently already found the solution yourself, but here is the long story:
The OPTICS class in ELKI only computes the cluster order / reachability diagram.
In order to extract clusters, you have different choices, one of which (the one from the original OPTICS publication) is available in ELKI.
So in order to extract clusters in ELKI, you need to use the OPTICSXi algorithm, which will in turn use either OPTICS or the index-based DeLiClu to compute the cluster order.
The reason this is split into two parts in ELKI is probably so that you can, on the one hand, implement a different logic for extracting the clusters, and on the other hand implement different methods, such as DeLiClu, for computing the cluster order. That aligns well with the modular architecture of ELKI.
IIRC there is at least one more method (apparently not yet in ELKI) that extracts clusters by looking for local maxima and then extending them horizontally until they hit the end of the valley. And there was another one that used "inflexion points" of the plot.
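The same two-step split (compute the cluster order, then extract clusters from it) exists in other libraries too. As a rough, non-ELKI illustration, scikit-learn's OPTICS computes the reachability and ordering and then applies a xi-based extraction analogous to OPTICSXi; a sketch on data like the question's three blobs:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# three well-separated 2D blobs, 100 points each, like the question's data
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ((0, 0), (5, 0), (0, 5))])

# step 1: the cluster order / reachability (what ELKI's OPTICS class computes)
optics = OPTICS(min_samples=10, cluster_method="xi", xi=0.05).fit(X)
reachability_plot = optics.reachability_[optics.ordering_]
print(reachability_plot[:5])

# step 2: cluster labels come from the xi extraction (the OPTICSXi analogue);
# -1 marks noise
print(np.unique(optics.labels_))
```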
@AnonyMousse pretty much put it right. I just can't upvote or comment yet.
We hope to have some students contribute the other cluster extraction methods as small student projects over time. They are not essential for our research, but they are good starter tasks for students who want to learn about ELKI.
ELKI is a fast-moving project, and it lives from community contributions. We would be happy to see you contribute some code to it. We know that the codebase is not easy to get started with: it is fairly large, and the generality of the implementation and the support for index structures make it a bit hard to get into. We try to add tutorials to help you get started. And once you are used to it, you will actually benefit from the architecture: your algorithms get the benefits of indexing and arbitrary distance functions, whereas if you implemented from scratch, you would likely support only Euclidean distance and no index acceleration.
Seeing that you struggled with OPTICS, I will try to write an OPTICS tutorial in the new year. In particular, OPTICS can benefit a lot from using an appropriate index structure.